<<

Dual Principal Component Pursuit

and

Filtrated Algebraic Subspace Clustering

by

Manolis C. Tsakiris

A dissertation submitted to The Johns Hopkins University in conformity with the

requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

March, 2017

c Manolis C. Tsakiris 2017

All rights reserved Abstract

Recent years have witnessed an explosion of data across scientific fields enabled by ad- vances in sensor technology and distributed hardware systems. This has given rise to the challenge of efficiently processing such data for performing various tasks, such as face recognition in images and videos. Towards that end, two operations on the data are of fun- damental importance, i.e., dimensionality reduction and clustering, and the idea of learning one or more linear subspaces from the data has proved a fruitful notion in that context. Nev- ertheless, state-of-the-art methods, such as Robust Principal Component Analysis (RPCA) or Sparse Subspace Clustering (SSC), operate under the hypothesis that the subspaces have dimensions that are small relative to the ambient , and fail otherwise.

This thesis attempts to advance the state-of-the-art of subspace learning methods in the regime where the subspaces have high relative dimensions. The first major contribution of this thesis is a single subspace learning method called Dual Principal Component Pursuit

(DPCP), which solves the robust PCA problem in the presence of outliers. Contrary to sparse and low- state-of-the-art methods, the theoretical guarantees of DPCP do not place any constraints on the dimension of the subspace. In particular, DPCP computes the

ii ABSTRACT

of the subspace, thus it is particularly suited for subspaces of low

. This is done by solving a non-convex cosparse problem on the sphere, whose

global minimizers are shown to be vectors normal to the subspace. Algorithms for solving

the non-convex problem are developed and tested on synthetic and real data, showing that

DPCP is able to handle higher subspace dimensions and larger amounts of outliers than

existing methods. Finally, DPCP is extended theoretically and algorithmically to the case

of multiple subspaces.

The second major contribution of this thesis is a subspace clustering method called

Filtrated Algebraic Subspace Clustering (FASC), which builds upon algebraic geometric

ideas of an older method, known as Generalized Principal Component Analysis (GPCA).

GPCA is naturally suited for subspaces of large dimension, but it suffers from two weak-

nesses: sensitivity to noise and large computational complexity. This thesis demonstrates

that FASC addresses successfully the robustness to noise. This is achieved through an

equivalent formulation of GPCA, which uses the idea of filtrations of unions of subspaces.

An algebraic geometric analysis establishes the theoretical equivalence of the two meth-

ods, while experiments on synthetic and real data reveal that FASC not only dramatically

improves upon GPCA, but also upon existing methods on several occasions.

Primary Reader: Rene´ Vidal

Secondary Reader: Daniel P. Robinson

iii Dedication

This thesis is dedicated to my advisor, Rene´ Vidal, for marveling me with his genius, rigor, endurance and enthusiasm.

iv Acknowledgments

First of all, I am thankful to my advisor, Prof. Rene´ Vidal, whose influence has been enor- mous: working with him has been the best academic experience in my life. Secondly, I am thankful to Prof. Daniel P. Robinson, for many useful conversations, for always be- ing supportive and friendly, and, quite importantly, for not being easily convinced: this led to the identification of a few inaccuracies and the improvement of several arguments in this thesis. I also thank Prof. Aldo Conca of the department of the Uni- versity of Genova for enthusiastically answering many questions that i had regarding the

Castelnuovo-Mumford regularity of subspace arrangements, as well as Prof. Glyn Harman for pointing out a Koksma-Hlawka inequality for integration on the unit sphere.

Then I thank Miss Debbie Race of the ECE department for being extremely helpful and patient with several administrative issues. I thank the Center for Imaging Science (CIS) for being a cozy home for the last 4 years and the ECE department for always being supportive and admitting me as a PhD student to begin with.

I thank all people who were nice to me during my stay in Baltimore. Special thanks go to sensei Ebon Phoenix for passionately teaching me martial arts, and to Tony Hatzigeor-

v ACKNOWLEDGMENTS galis and Jose Torres for being like brothers to me. I also thank the JHU newbie, Christos

Sapsanis, from whom, remarkably, the JHU graduate community has a lot to benefit. And

Guilherme Franca and Tao Xiong for being great friends.

Finally, i thank my best friend Dimitris Lountzis for being who he is, and most impor- tantly i thank my family, Chris, Evi and Michalis, for their infinite love.

vi Contents

Abstract ii

Acknowledgments v

List of Tables xiii

List of Figures xv

1 Introduction 1

1.1 Modeling data with linear subspaces ...... 1

1.1.1 Modeling data with a single subspace ...... 1

1.1.2 Modeling data with multiple subspaces ...... 3

1.2 Challenges and the role of dimension ...... 4

1.3 Contributions of this thesis ...... 5

1.3.1 Dual principal component pursuit (DPCP) ...... 6

1.3.2 Filtrated algebraic subspace clustering ...... 8

1.4 Notation ...... 10

vii CONTENTS

2 Prior Art 13

2.1 Learning a single subspace in the presence of outliers ...... 15

2.2 Learning multiple subspaces ...... 20

2.3 Challenges in high relative dimensions ...... 25

3 Dual Principal Component Pursuit 29

3.1 Introduction ...... 29

3.2 Single subspace learning with outliers via DPCP ...... 32

3.2.1 Problem formulation ...... 32

3.2.1.1 Data model ...... 32

3.2.1.2 Conceptual formulation ...... 33

3.2.1.3 pursuit by `1 minimization ...... 34

3.2.2 Theoretical analysis of the continuous problem ...... 37

3.2.2.1 The underlying continuous problem ...... 38

3.2.2.2 Conditions for global optimality and convergence . . . . 42

3.2.3 Theoretical analysis of the discrete problem ...... 48

3.2.3.1 Discrepancy bounds between continuous and discrete prob-

lems ...... 49

3.2.3.2 Conditions for global optimality of the discrete problem 56

3.2.3.3 Conditions for convergence of the discrete recursive al-

gorithm ...... 66

3.2.4 Algorithmic contributions ...... 71

viii CONTENTS

3.2.4.1 Relaxed DPCP and DPCA algorithms ...... 71

3.2.4.2 Relaxed and denoised DPCP ...... 74

3.2.4.3 Denoised DPCP ...... 76

3.2.4.4 DPCP via iteratively reweighted least-squares ...... 79

3.2.5 Experimental evaluation ...... 80

3.2.5.1 Computing a single dual principal component ...... 81

3.2.5.2 Outlier detection using synthetic data ...... 84

3.2.5.3 Outlier detection using real face and object images . . . . 90

3.3 Learning a hyperplane arrangement via DPCP ...... 96

3.3.1 Problem overview ...... 96

3.3.2 Data model ...... 97

3.3.3 Theoretical analysis of the continuous problem ...... 98

3.3.3.1 Derivation, interpretation and basic properties of the con-

tinuous problem ...... 99

3.3.3.2 The cases of i) two and ii) orthogonal hy-

perplanes ...... 102

3.3.3.3 The case of three equiangular hyperplanes ...... 105

3.3.3.4 Conditions of global optimality for an arbitrary hyper-

plane arrangement ...... 114

3.3.4 Theoretical analysis of the discrete problem ...... 120

3.3.5 Algorithms ...... 131

ix CONTENTS

3.3.5.1 Learning a hyperplane arrangement sequentially . . . . . 132

3.3.5.2 K-Hyperplanes via DPCP ...... 132

3.3.6 Experimental evaluation ...... 134

3.3.6.1 Synthetic data ...... 134

3.3.6.2 3D plane clustering of real Kinect data ...... 142

3.4 Conclusions ...... 148

4 Advances in Algebraic Subspace Clustering 155

4.1 Review of algebraic subspace clustering ...... 156

4.1.1 Subspaces of codimension 1 ...... 157

4.1.2 Subspaces of equal dimension ...... 160

4.1.3 Known number of subspaces of arbitrary dimensions ...... 161

4.1.4 Unknown number of subspaces of arbitrary dimensions ...... 165

4.1.5 Computational complexity and recursive ASC ...... 167

4.1.6 Instability in the presence of noise and spectral ASC ...... 168

4.1.7 The challenge ...... 170

4.2 Filtrated algebraic subspace clustering (FASC) ...... 171

4.2.1 Filtrations of subspace arrangements: geometric overview . . . . . 171

4.2.2 Filtrations of subspace arrangements: theory ...... 177

4.2.2.1 Data in general position in a subspace arrangement . . . 177

4.2.2.2 Constructing the first step of a filtration ...... 182

4.2.2.3 Deciding whether to take a second step in a filtration . . . 185

x CONTENTS

4.2.2.4 Taking multiple steps in a filtration and terminating . . . 188

4.2.2.5 The FASC algorithm ...... 193

4.3 Filtrated spectral algebraic subspace clustering ...... 195

4.3.1 Implementing robust filtrations ...... 195

4.3.2 Combining multiple filtrations ...... 198

4.3.3 The FSASC algorithm ...... 199

4.3.4 A distance-based affinity ...... 200

4.3.5 Discussion on the computational complexity ...... 203

4.4 Experiments ...... 205

4.4.1 Experiments on synthetic data ...... 206

4.4.2 Experiments on real motion ...... 217

4.5 Algebraic clustering of affine subspaces ...... 219

4.5.1 Motivation ...... 219

4.5.2 Problem statement and traditional approach ...... 221

4.5.3 Algebraic geometry of unions of affine subspaces ...... 224

4.5.3.1 Affine subspaces as affine varieties ...... 224

4.5.3.2 The projective of affine subspaces ...... 227

4.5.4 Correctness theorems for the homogenization trick ...... 232

4.6 Conclusions ...... 238

4.7 Appendix ...... 239

4.7.1 Notions from commutative algebra ...... 239

xi CONTENTS

4.7.2 Notions from algebraic geometry ...... 241

4.7.3 Subspace arrangements and their vanishing ideals ...... 244

5 Conclusions 255

Bibliography 257

Vita 274

xii List of Tables

3.1 Mean running times in seconds, corresponding to the experiment of Figure 3.13 for data balancing parameter α =1...... 142 3.2 3D plane clustering error for a of the real Kinect dataset NYUdepthV2. n is the number of fitted planes. GC(0) and GC(1) refer to clustering error without or with spatial smoothing, respectively...... 149

4.1 Mean subspace clustering error in % over 100 independent trials for syn- thetic data randomly generated in three random subspaces of R9 of di- mensions (d1,d2,d3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard devia- tion σ =0, 0.01 and support in the orthogonal complement of the subspace. 207 4.2 Mean subspace clustering error in % over 100 independent trials for syn- thetic data randomly generated in three random subspaces of R9 of di- mensions (d1,d2,d3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard de- viation σ = 0.03, 0.05 and support in the orthogonal complement of the subspace...... 208 4.3 Mean intra-cluster connectivity over 100 independent trials for synthetic data randomly generated in three random subspaces of R9 of dimensions (d1,d2,d3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation σ and support in the orthogonal complement of each subspace...... 209 4.4 Mean inter-cluster connectivity in % over 100 independent trials for syn- thetic data randomly generated in three random subspaces of R9 of dimen- sions (d1,d2,d3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation σ and support in the orthogonal complement of each subspace...... 210

xiii LIST OF TABLES

4.5 Mean running time of each method in seconds over 100 independent trials for synthetic data randomly generated in three random subspaces of R9 of dimensions (d1,d2,d3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard devia- tion σ =0.01 and support in the orthogonal complement of each subspace. The reported running time is the time required to compute the affinity ma- trix, and it does not include the spectral clustering step. The experiment is run in MATLAB on a standard Macbook-Pro with a dual core 2.5GHz Processor and a total of 4GB Cache memory...... 211 4.6 Mean subspace clustering error in % over 100 independent trials for syn- thetic data randomly generated in four random subspaces of R9 of di- mensions (8, 8, 5, 3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard devia- tion σ = 0, 0.01, 0.03, 0.05 and support in the orthogonal complement of each subspace...... 215 4.7 Mean clustering error (E) in %, intra-cluster connectivity (C1), and inter- cluster connectivity (C2) in % for the Hopkins155 data ...... 218

xiv List of Figures

3.1 Various vectors and angles appearing in the proof of Theorem 3.4...... 44 3.2 Geometry of the optimality condition (3.98) and (3.114) for the case d = 1,D = 2,M = N = 5. The polytope Mob∗ + Conv( o ) + Span(b∗) ± 1 misses the point N Sign(xˆ>b∗)xˆ and so the optimality condition can not − be true for both b∗ = Span(xˆ) and φ large...... 63 3.3 Geometry of the optimality6⊥ S condition (3.98) and (3.114) for the case d = 1,D = 2,M = N = 5. A critical b∗ exists, but its angle from is 6⊥ S S small, so that the polytope Mob∗ +Conv( o )+Span(b∗) can contain the ± 1 point N Sign(xˆ>b∗)xˆ. However, b∗ can not be a global minimizer, since small− angles from yield large objective values...... 64 3.4 Geometry of the optimalityS condition (3.98) and (3.114) for the case d = 1,D = 2,N << M. Critical points b∗ do exist and moreover they can have large angle from . This is because6⊥ S N is small and so the poly- S tope Mob∗ +Conv( o1)+Span(b∗) contains the point N Sign(xˆ>b∗)xˆ. Moreover, such critical± points can be global minimizers.− Condition (3.97) of Theorem 3.12 prevents such cases from occuring...... 65 3.5 Various quantities associated to the performance of DPCP-r; see 3.2.5.1. Figure 3.5(a) shows whether condition (3.97) is true (white) or not§ (black). Figure 3.5(b) shows the angle from of nˆ 10 after 10 iterations of DPCP- r when (3.97) is true; the other casesS are mapped to black. Figure 3.5(c)

shows whether a random nˆ 0 satisfies φ0 >φ0∗, where φ0∗ is as in Thm. 3.15. Figure 3.5(d) shows the angle from of nˆ 10 for random nˆ 0. Figure 3.5(e) shows the angle from of the rightS singular vector of X˜ corresponding to the smallest singular value,S Figure 3.5(f) shows the corresponding angle of

nˆ 10, and Figure 3.5(g) plots φ0∗...... 83

xv LIST OF FIGURES

3.6 Outlier/Inlier separation in the absence of noise over 10 independent trials. The horizontal axis is the outlier ration defined as M/(N + M), where M is the number of outliers and N is the number of inliers. The vertical axis is the relative inlier subspace dimension d/D; the dimension of the ambient space is D = 30. Success (white) is declared by the existence of a threshold that, when applied to the output of each method, perfectly separates inliers from outliers...... 85 3.7 ROC curves as a of noise standard deviation σ and outlier percent- age R, for subspace dimension d = 25 in ambient dimension D = 30. The horizontal axis is False Positives ratio and the vertical axis is True Positives ratio. The number associated with each curve is the area above the curve; smaller numbers reflect more accurate performance...... 88 3.8 ROC curves as a function of noise standard deviation σ and outlier percent- age R, for subspace dimension d = 29 in ambient dimension D = 30. The horizontal axis is False Positives ratio and the vertical axis is True Positives ratio. The number associated with each curve is the area above the curve; smaller numbers reflect more accurate performance...... 89 3.9 Average ROC curves and areas over the curves for different percentages R of outliers; see 3.2.5.3. Both inliers and outliers come from EYaleB. C1 means data are centered (C0 not centered), N1 means data are normalized (N0 not normalized)...... 91 3.10 Average ROC curves and areas over the curves for different percentages R of outliers; see 3.2.5.3. Inliers come from EYaleB, outliers from Cal- tech101. C1 means data are centered (C0 not centered), N1 means data are normalized (N0 not normalized)...... 92 3.11 ROC curves for three different projection dimensions, when there are 33% face outliers; data are centered but not normalized (C1 N0)...... 93 3.12 ROC curves for three different projection dimensions, when− there are 33% outliers from Caltech101; data are centered but not normalized (C1 N0). . 93 3.13 Clustering accuracy as a function of the number of hyperplanes n vs− rela- tive dimension d/D vs data balancing (α). White corresponds to 1, black to 0...... 135 3.14 Clustering accuracy as a function of the number of hyperplanes n vs rela- tive dimension d/D. Data balancing parameter is set to α =0.8...... 136 3.15 Clustering accuracy as a function of the number of hyperplanes n vs outlier ratio vs data balancing (α). White corresponds to 1, black to 0...... 137 3.16 Clustering accuracy as a function of the number of hyperplanes n vs outlier ratio. Data balancing parameter is set to α =0.8...... 138 3.17 Segmentation into planes of 5 in dataset NYUdepthV2 without spa- tial smoothing. Numbers are segmentation errors...... 150 3.18 Segmentation into planes of image 5 in dataset NYUdepthV2 with spatial smoothing. Numbers are segmentation errors...... 151

xvi LIST OF FIGURES

3.19 Segmentation into planes of image 2 in dataset NYUdepthV2 without spa- tial smoothing. Numbers are segmentation errors...... 152 3.20 Segmentation into planes of image 2 in dataset NYUdepthV2 with spatial smoothing. Numbers are segmentation errors...... 153

4.1 A union of two lines and one plane in general position in R3...... 161

4.2 The geometry of the unique degree-2 polynomial p(x)=(b1>x)(f >x) that vanishes on 1 2 3. b1 is the normal vector to plane 1 and f is the normal vectorS to∪S the∪S plane spanned by lines and .S ...... 163 H23 S2 S3 4.3 (a): The plane spanned by lines 2 and 3 intersects the plane 1 at the line . (b): Intersection of the originalS subspaceS arrangement =S S4 A S1 ∪S2 ∪S3 with the intermediate ambient space (1), giving rise to the intermediate V1 subspace arrangement (1) = . (c): Geometry of the unique A1 S2 ∪S3 ∪S4 degree-3 polynomial p(x)=(b2>x)(b3>x)(b4>x) that vanishes on 2 3 (1) S ∪S ∪ 4 as a variety of the intermediate ambient space 1 . bi i, i =2, 3, 4. . 175 4.4 ClusteringS error ratios for both 2 and 3 motionsV in Hopkins155,⊥S ordered increasingly for each method. Errors start from the 90-th smallest error of each method...... 219

xvii Chapter 1

Introduction

1.1 Modeling data with linear subspaces

1.1.1 Modeling data with a single subspace

In many fields of science and engineering, datasets can naturally be viewed as of

a coordinate RD, with each data point being a vector of D coordinates. For

example, a digital grayscale image can be viewed as a coordinate vector having as many

coordinates as the number of pixels. It is then intuitively expected that if a collection of

images have similar content, then their representation in RD would not be arbitrary, rather it would exhibit a special structure, reflecting the fact that these points correspond to similar images. It is further expected, that the more similar the images are, the simpler the structure

1 CHAPTER 1. INTRODUCTION of their representation in RD will be. Then knowledge of this structure can potentially be used to simplify the representation of the dataset, i.e., for using less than D coordinates, or for comparing the dataset to another dataset, and so on.

Since RD is a geometric object naturally generalizing the three-dimensional space, it is reasonable, in our effort to formalize concepts such as structure and its attributes special or simpler, to consider geometric objects of RD itself. Among various possibilities, linear subspaces have proved both simple and effective. Indeed, the question of finding a linear subspace of dimension d < D, that passes as close as possible to a given collection of points of RD, is an old one, and its classical solution, known as Principal Component Analysis

(PCA) [50, 78], is still one of the most popular techniques in data analysis, in areas as diverse as engineering [74], economics and sociology [116], chemistry [57], physics [68], and genetics [79] to name a few; see [55] for more applications.

The success of PCA is largely due to two facts: First, the hypothesis that data lie close to a proper linear subspace of the ambient space is a valid one in many applications, e.g., face images of the same individual under fixed pose but varying illumination conditions lie close to a 9-dimensional linear subspace [4], trajectories across video frames of image points that belong to the same rigid body lie close to a 3-dimensional affine subspace [94], and point correspondences between two views of the same static scene lie close to an 8- dimensional linear subspace [47]. Second, the error metric used in PCA to penalize the modeling errors is the Euclidean distance. This allows for computation of the optimal subspace in closed form solely via linear algebraic algorithms, in particular via the Singular

2 CHAPTER 1. INTRODUCTION

Value Decomposition (SVD), for which very efficient software has been developed through the years.

1.1.2 Modeling data with multiple subspaces

Even though the data model of a single linear subspace is a fundamental one, there are many occasions, where modeling with multiple subspaces is more precise. Intuitively, this is the case when there are more than one classes present in the data.

As a practical example, consider a moving surveillance camera taking a video of a moving car, with the rest of the background being fixed. As already mentioned, point trajectories corresponding to the car lie close to a 3-dimensional affine subspace, because they belong to the same rigid body motion. On the other hand, the motion of the camera induces a motion to the background, so that in fact there are two linear subspaces associ- ated with the data: one corresponding to the motion of the car (original motion of the car superimposed with the motion of the camera), and one corresponding to the motion of the background. Thus, the right data model for this example is a union of two 3-dimensional affine subspaces. As another example, a collection of face images of n individuals under

fixed pose but varying illumination, is naturally modeled by a union of n 9-dimensional linear subspaces [4].

As it turns out, there is a variety of applications across diverse areas of study, where a union of subspaces, also known as a subspace arrangement, is a more appropriate model than a single subspace. Important applications include computer vision problems, such as

3 CHAPTER 1. INTRODUCTION

motion segmentation [107, 110], structure from motion [46] and multiple view geometry

[47], face clustering [27], and 3D point cloud analysis [81], as well as genomics [103], document clustering [87], and system identification [71]. As a consequence, the problem of fitting a subspace arrangement to a given set of points of RD has gained significant attention over the past 15 years [9,15,27,28,30,31,42,64,66,69,93,102,106,109–111,119], giving rise to the field of subspace clustering [105], also known as Generalized Principal

Component Analysis [112].

1.2 Challenges and the role of dimension

There are many challenges associated with learning linear subspaces from a given dataset.

To begin with, suppose that we are given a dataset that perfectly lies in the union of n linear

subspaces of known dimensions, and suppose that the goal is to find the subspaces and clus-

ter the data according to their dataset membership. If the subspaces themselves were given,

then we could simply cluster the data points based on their distance to the given subspaces

(if a point has zero distance from a subspace, then it belongs to that subspace). If, on the

other hand, the clustering of the data according to their subspace membership were known,

then we could estimate the subspaces by applying PCA to each cluster. Nevertheless, if

both the subspaces and the clustering are unknown, then this is an especially challenging

problem, since it is not clear how to set up a rigorous solution.

On top of the intrinsic difficulty of subspace learning, most real world datasets are

contaminated with outliers and noise, and they may have missing entries. Moreover, both

4 CHAPTER 1. INTRODUCTION the number of subspaces and their dimensions are usually unavailable, and they need to be estimated from the data.

When the underlying subspaces are low-dimensional, additional structures are present in the dataset, such as low-rank components [31, 64, 66, 106, 118] or sparse self-expressive patterns [27, 28, 30, 54, 77, 84, 120], which can be elegantly used to overcome many of the aforementioned challenges. As an example, when the data lie close to a low-dimensional subspace and they are corrupted in an entry-wise fashion, the robust PCA method of [12] detects and corrects these corruptions, by decomposing the observed data to the sum of a low-rank and a . However, when the underlying subspaces have high relative dimensions (the relative dimension of a subspace is the of the subspace dimension over the dimension of the ambient space), these additional structures disappear, which makes dealing with outliers, handling unknown a-priori knowledge of the subspace dimensions, and designing scalable algorithms even more challenging, and in fact these challenges are currently largely unsolved.

1.3 Contributions of this thesis

This thesis addresses several challenges associated with learning subspaces of high relative dimension. The first part of the thesis is devoted to a new subspace learning method called

Dual Principal Component Pursuit, which is shown to advance the state-of-the-art in learn- ing a single subspace of high relative dimension in the presence of outliers, as well as in clustering multiple hyperplanes (subspaces of maximal relative dimension). The second

5 CHAPTER 1. INTRODUCTION part of the thesis is devoted to a new subspace clustering method called Filtrated Algebraic

Subspace Clustering, which is shown to advance the state-of-the-art of Algebraic Subspace

Clustering (ASC) [109–111], the latter being one of the best theoretically justifiable meth- ods for clustering data drawn from a union of subspaces of high relative dimension.

1.3.1 Dual principal component pursuit (DPCP)

The first part of this thesis (Chapter 3) introduces a new subspace learning method called

Dual Principal Component Pursuit (DPCP) [96, 98, 99]. The adjective dual refers to the fact that DPCP searches for a for the orthogonal complement of a subspace. As such, DPCP is naturally suited for subspaces of maximal relative dimension, for which, the orthogonal complement is a low-dimensional subspace. Nevertheless, DPCP can in principle be applied to subspaces of any dimension.

In a noiseless setting, the basic idea of DPCP is to search for a hyperplane that contains as many points of the dataset as possible. Such a hyperplane will be called a maximal hy- perplane. As it turns out, any maximal hyperplane tends to contain all points coming from the same subspace, a property that proves to be valuable for subspace learning, particu- larly in high relative dimensions. For example, if the dataset consists of points lying in a single hyperplane and is corrupted by many outliers, then, if the data points are in general position, there is a unique maximal hyperplane coinciding with the hyperplane associated to the inliers. On the other hand, if the data are drawn from a union of hyperplanes, and once again are in general position (inside their respective hyperplanes), then any maximal

6 CHAPTER 1. INTRODUCTION hyperplane must be one of the hyperplanes associated to the data. This concept general- izes to the case where the subspaces are not hyperplanes, in which case the objective is to compute a basis for the orthogonal complement of each of the underlying subspaces.

We cast the computational search for maximal hyperplanes as an `1 minimization prob- lem on the sphere, whose global minimizer is ideally the normal vector to such a hyper- plane. While the objective function is convex, the constraint is not, which renders the problem non-convex. An extensive mathematical analysis culminates in theorems that de- scribe conditions, under which any global solution of the non-convex problem is orthogonal to the underlying subspace (or orthogonal to the dominant one if there are more than one subspaces), thus serving as a guarantee for a recursive computation of the orthogonal com- plement of the subspace.

As far as the non-convex optimization problem is concerned, the thesis investigates several different techniques for solving it. Among them, emphasis is placed on minimizing the `1 objective on an adaptively learned of tangent spaces to the sphere, which computationally amounts to solving a recursion of linear programs. The mathematical analysis provides theorems that establish convergence of the recursion in a finite number of iterations to a vector orthogonal to the underlying subspace. Other scalable techniques for solving the non-convex problem are also investigated.

The thesis discusses extensive experiments on synthetic data, which demonstrate the superiority of DPCP to the state-of-the-art in learning a single subspace of high relative dimension in the presence of outliers, or clustering data drawn from a union of hyperplanes.

7 CHAPTER 1. INTRODUCTION

Experiments using real data show that DPCP is competitive to the state-of-the-art in fitting

planes to 3D point clouds, or detecting outliers that corrupt a collection of face images of a

single individual.

1.3.2 Filtrated algebraic subspace clustering

The second part of this thesis (Chapter 4) is concerned with Algebraic Subspace Clustering

(ASC) [109–111], which is one of the most theoretically appropriate methods for clustering

data drawn from a union of subspaces of high relative dimension. Moreover, it has very

interesting connections to algebraic geometry [45,48] and commutative algebra [1,25,73],

which are fascinating branches of mathematics very rarely used in machine learning (see

also [67] for another recent such related method). As such, ASC is a very appealing method

to study and improve towards making progress in learning subspaces of high relative di-

mension.

The main idea behind ASC is that a (transversal) union of linear subspaces is uniquely

characterized by a set of vanishing polynomials, which can be estimated from the data.

Once the vanishing polynomials are available (or their estimates), a subspace associated to the data can be obtained as the linear subspace generated by the gradients of all the polynomials evaluated at any (non-singular) point in the subspace.

There are two main challenges that have been preventing ASC from being applicable to modern datasets. First, one needs to estimate the correct number of linearly independent vanishing polynomials of degree equal to the number of the subspaces. Numerically, this

8 CHAPTER 1. INTRODUCTION

corresponds to estimating the rank of a matrix, which is a precarious task in the presence of

even moderate amounts of noise, particularly, since, it is known that the clustering accuracy

is very sensitive to the estimation of this rank. The second main challenge of ASC is

that, even though the vanishing polynomials can be computed using the Singular Value

Decomposition (SVD), the SVD itself is to be applied on a multivariate Vandermode matrix

whose dimension is exponential in the number of subspaces and the ambient dimension. In

other words, ASC is characterized by an exponential computational complexity.

The main contribution of the second part of this thesis is a new algebraic algorithm,

called Filtrated Algebraic Subspace Clustering (FASC) [95, 101]. The key idea behind

FASC is to construct a nested sequence := of subspace arrangements A A0 ⊃ A1 ⊃ ··· , ,... (i.e., a descending filtration), with each arrangement embedded in an inter- A0 A1 Ai

mediate ambient space i of strictly smaller dimension than i 1, where is the original V V − A union of subspaces and = RD is the original ambient space. Given a point in one of V0 the subspaces, say x , the method constructs a nested sequence of subspace ar- ∈S⊂A rangements with the following properties: (1) is contained in A0 ⊃ A1 ⊃···⊃Ac S each intermediate arrangement, i.e., ; (2) the sequence stabilizes at after a finite S⊂Ai S number of steps, i.e., = , and (3) the codimension of the subspace is equal to the num- S Ac ber of steps of the filtration, i.e., c = codim( ). As a consequence, one can identify the S subspace containing each data point by constructing a filtration at that point. The numerical implementation of FASC is based on using the filtration of two distinct points in the dataset to determine whether they lie in the same subspace or not. This is the basic concept behind

9 CHAPTER 1. INTRODUCTION

defining a pairwise affinity between all points in the dataset, and subsequently obtaining

the data clusters by means of standard spectral clustering. Interestingly, the resulting al-

gorithm, called Filtrated Spectral Algebraic Subspace Clustering (FSASC) [97], not only

dramatically improves upon the performance of earlier ASC methods, thus addressing the

challenge of robustness of ASC to noise, but also exhibits state-of-the-art performance in

the Hopkins155 dataset [94], which is a benchmark dataset for motion segmentation.

A second contribution of this second part of this thesis is a rigorous algebraic-geometric

study of the algebraic clustering of affine subspaces [100]. In fact, very little theory exists

in the subspace clustering literature as far as dealing with affine subspaces is concerned.

Usually, affine subspaces are treated in the machine learning literature by reduction to the

case of linear subspaces, by means of appending an extra coordinate in the coordinate rep-

resentation of the data [104]. This is known as projectivization, homogenization or more

casually as the homogenization trick. Even though homogenization has been successful, it has been viewed only as a heuristic, not supported by any actual theory. This thesis es- tablishes the theoretical correctness of dealing with affine subspaces through homogeneous coordinates, when clustering is done algebraically.

1.4 Notation

We begin with some general notation. The notation ∼= stands for isomorphism in whatever category the objects lying to the left and right of the symbol belong to. The notation ' denotes approximation. For any positive n let [n] := 1, 2,...,n . For any positive { } 10 CHAPTER 1. INTRODUCTION number α let α denote the smallest integer that is greater than α. For sets , , the d e A B set is the set of all elements of that do not belong in . The right null space A\B A B of a matrix B is denoted by (B). If is a subspace of RD, then dim( ) denotes the N S S dimension of and π : RD is the orthogonal projection of RD onto . For vectors S S → S S D D b, b0 R we let ∠b, b0 be the angle between b and b0. If b is a vector of R and ∈ S a linear subspace of RD, the principal angle of b from is ∠b,π (b). The symbol S S ⊕ denotes of subspaces. The orthogonal complement of a subspace in RD is S D D ⊥. If y ,..., y are elements of R , we denote by Span(y ,..., y ) the subspace of R S 1 s 1 s D 1 D D spanned by these elements. S − denotes the unit sphere of R . For a vector w R we ∈ define wˆ := w/ w , if w = 0, and wˆ := 0 otherwise. With a mild abuse of notation k k2 6 we will be treating on several occasions matrices as sets, i.e., if X is D N and x a point × of RD, the notation x X signifies that x is a column of X . Similarly, if O is a D M ∈ × matrix, the notation X O signifies the points of RD that are common columns of X ∩ and O. Also, the shorthand RHS stands for Right-Hand-Side, and similarly for LHS. The notation Sign denotes the sign function Sign : R 1, 0, 1 defined as → {− }

x/ x if x =0, Sign(x)=  | | 6 (1.1)   0 if x =0.

 The subdifferential of the `1-norm 

D z =(z ,...,z )> z = z (1.2) 1 D 7→ k k1 | i| i=1 X

11 CHAPTER 1. INTRODUCTION is a set-valued function on RD defined as

Sign(x) if x =0, Sgn(x)=  6 (1.3)   [ 1, 1] if x =0. −  Next, we establish some more specialized notation in support of the algebraic-geometric aspects of the thesis. We let R[x] = R[x1,...,xD] be the polynomial ring over the real numbers in D variables. We use x to denote the vector of variables x = (x1,...,xD),

D while we reserve x to denote a data point x = (χ1,...,χD) of R . We denote by R[x]`

1 the set of all homogeneous polynomials of degree ` and similarly R[x] ` the set of all ≤ homogeneous polynomials of degree less than or equal to `. R[x] is an infinite dimensional real vector space, while R[x]` and R[x] ` are finite dimensional subspaces of R[x] of di- ≤

`+D 1 `+D mensions (D) := − and , respectively. We denote by R(x) the field of M` ` `   all rational functions over R and variables x ,...,x . If p ,...,p is a subset of R[x], 1 D { 1 s} we denote by p ,...,p the ideal generated by p ,...,p (see Definition 4.49). If is a h 1 si 1 s A subset of RD, we denote by the vanishing ideal of , i.e., the set of all elements of R[x] IA A that vanish on and similarly ,` := R[x]` and , ` := R[x] `. Finally, for A IA IA ∩ IA ≤ IA ∩ ≤ D a point x R , and a set R[x] of polynomials, x is the set of gradients of all the ∈ I ⊂ ∇I| elements of evaluated at x. I

1A polynomial in many variables is called homogeneous if all monomials appearing in the polynomial have the same degree.

12 Chapter 2

Prior Art

The importance of modeling data with linear subspaces had been recognized more than 100 years ago with the seminal work of Pearson on “Lines and planes of closest fit to systems of points in space” [78]. This led to the famous Principal Component Analysis (PCA) [50,

78], whose solution is achieved in closed form through the Singular Value Decomposition

D N D (SVD). Specifically, if the data matrix is X R × , consisting of N points in R , then ∈ the linear subspace of dimension d that passes closest to all points of X in the Euclidean sense, is the linear space spanned by the first d left singular vectors of X 1.

D M However, the presence of even a few outliers O R × may significantly bias the ∈ estimated subspace away from the true underlying subspace. Indeed, it is known that this

`2-optimal linear subspace is affected by the contribution of each and every point in the

1In principle, one should first center the data to have zero mean, and then search for a linear subspace that passes close to them.

13 CHAPTER 2. PRIOR ART

˜ D (N+M) now corrupted dataset X = [X O]Γ R × , where Γ is an unknown permutation ∈ indicating that we do not know which point is an inlier and which point is an outlier. Over

the past 36 years many research efforts have been devoted to robustly learning a linear

subspace from the data, while mitigating the effect of outliers. Traditional methods, such as

RANSAC [34], Influence-based Detection, Multivariate Trimming and M-Estimators [52,

55], use techniques from robust statistics; such methods are usually based on non-convex optimization, are sensitive to initialization and admit limited theoretical guarantees. On other hand, over the past 10 years methods based on convex optimization [12, 62, 84, 118] have gained significant popularity, due to their efficient implementations and theoretical guarantees. The reader is referred to 2.1 for a brief review of single subspace learning § methods. This review is by no means intended to be exhaustive; rather it focuses on either highly popular methods, such as RANSAC [34], or methods that are conceptually distinct from other methods, and whose techniques are intellectually related with the techniques used in this thesis.2

When the outliers are themselves well modeled by a linear subspace, as is for ex- ample the case of background subtraction in video surveillance [53], or when there are multiple classes associated to the data, as is the case in motion segmentation [94], it is more precise to learn multiple subspaces from the data, instead of a single subspace. This has led to a research field known as subspace clustering [105]. In its most fundamen- tal form, the subspace clustering problem assumes that the data are drawn from n lin-

2 For an extensive literature review on single subspace learning the reader is referred to [61], or to [3, 33] and references therein for online methods.

14 CHAPTER 2. PRIOR ART

D ˜ D N ear subspaces ,..., of R , and so the data matrix X R × has the structure S1 Sn ∈

˜ D Ni X = [X X X ]Γ, where X R × are N points coming from subspace , 1 2 ··· n i ∈ i Si i =1,...,n, and Γ is an unknown permutation indicating that we do not know which point comes from which subspace. Then the goal is to find a basis for each of the underlying sub- spaces , as well as cluster the data points X˜ according to their subspace membership. By Si now, a large variety of subspace clustering methods has appeared in the literature including algebraic [109–111], statistical [42, 93], information-theoretic [70], iterative [9, 102, 121], geometric and spectral techniques [6,15,17,27,28,30,31,56,64,66,106,114]; see 2.1 for § a brief review.

Overall, when the subspace associated to the data is low-dimensional, methods such as [84,118] are known to successfully detect outliers present in the dataset. Similarly, when the data lie close to a union of low-dimensional subspaces, methods such as [30] have been shown to cluster the data with high accuracy. On the other hand, the case of high relative dimensions is considerably harder and a satisfactory solution remains yet to be established.

This is due to a number of challenges associated to subspaces of high relative dimension, which are described in 2.3. §

2.1 Learning a single subspace in the presence of outliers

RANSAC. One of the oldest and most popular outlier detection methods in PCA is Ran- dom Sampling Consensus (RANSAC) [34]. The idea behind RANSAC is simple: alternate between randomly sampling a small subset of the dataset and computing a subspace model

15 CHAPTER 2. PRIOR ART

for this subset, until a model is found that maximizes the number of points in the entire

dataset that fit to it within some error. RANSAC is usually characterized by high learning

performance. However, it requires a high computational time, since it is often the case

that exponentially many trials are required in order to sample outlier-free subsets, and thus

obtain reliable models. Additionally, RANSAC requires as input an estimate for the di-

mension of the subspace as well as a thresholding parameter, which is used to distinguish

outliers from inliers; naturally the performance of RANSAC is very sensitive to these two

parameters.

`2,1-RPCA. Contrary to the classic principles that underlie RANSAC, modern methods for outlier detection in PCA are primarily based on convex optimization. One of the earliest and most important such methods, to be referred to as `2,1-RPCA, is the method of [118], which is in turn inspired by the Robust PCA algorithm of [12]. `2,1-RPCA computes a

3 (` + `2,1)-norm decomposition of the data matrix, instead of the (` + `1)-decomposition ∗ ∗

in [12]. More specifically, `2,1-RPCA solves the optimization problem

min L + λ E 2,1 , (2.1) L,E: X˜=L+E k k∗ k k which attempts to decompose the data matrix X˜ = [X O]Γ into the sum of a low-rank matrix L, and a matrix E that has only a few non-zero columns. The idea is that L is associated with the inliers, having the form L = [X 0D M ]Γ, and E is associated with ×

the outliers, having the form E = [0D N O]Γ. The optimization problem (2.1) is convex × and admits theoretical guarantees and efficient ADMM [7] implementations. However, it

3 Here `∗ denotes the nuclear norm, which is the sum of the singular values of the matrix. Also, `2,1 is defined as the sum of the euclidean norms of the columns of a matrix.

16 CHAPTER 2. PRIOR ART

is expected to succeed only when the intrinsic dimension d of the inliers is small enough

(otherwise [X 0D M ] will not be low-rank), and the outlier ratio is not too large (otherwise ×

[0D N O] will not be column-sparse). Finally, notice that `2,1-RPCA does not require as × input the subspace dimension d, because it does not directly compute an estimate for the

subspace. Rather, the subspace can be obtained subsequently by doing classic PCA on L, and now one does need an estimate for d.

SE-RPCA. Another state-of-the-art method, referred to as SE-RPCA, is based on the self-expressiveness property of the data matrix, a notion popularized by the work of [29,30] in the area of subspace clustering [105]. More specifically, observe that if a column of X˜ is an inlier, then it can in principle be expressed as a of d other columns of X˜, which are inliers. If the column is instead an outlier, then it will in principle require

D other columns to express it as a linear combination. The self expressiveness matrix C can be obtained as the solution to the convex optimization problem

min C s.t. X˜ = X˜C, Diag(C)= 0. (2.2) C k k1

Having computed the matrix of coefficients C, and under the hypothesis that d/D is small, ˜ a column of X is declared as an outlier, if the `1 norm of the corresponding column of C

is large; see [84] for an explicit formula. SE-RPCA admits theoretical guarantees [84] and

efficient ADMM implementations [30]. However, as is clear from its description, it is ex-

pected to succeed only when the relative dimension d/D is sufficiently small. Nevertheless,

in contrast to `2,1-RPCA, which in principle fails in the presence of a very large number of

outliers, SE-RPCA is still expected to perform well, since the existence of sparse subspace-

17 CHAPTER 2. PRIOR ART

preserving self-expressive patterns does not depend on the number of outliers present. Also,

similarly to `2,1-RPCA, SE-RPCA does not directly require an estimate for the subspace dimension d. Nevertheless, knowledge of d is necessary if one wants to furnish an actual subspace estimate. This would entail removing the outliers (a judiciously chosen threshold would also be necessary here) and doing PCA on the remaining points.

REAPER. A recently proposed single subspace learning method that admits an inter- esting theoretical analysis is the REAPER [62], which is conceptually associated with the optimization problem

L min (I Π)x˜ , s.t. Π is an orthogonal projection, Trace (Π)= d, (2.3) Π k D − jk2 j=1 X ˜ where x˜j is the j-th column of X . The matrix Π appearing in (2.3) can be thought of as

D d the product Π = UU >, where U R × contains in its columns an orthonormal basis ∈ for a d-dimensional linear subspace . As (2.3) is non-convex, [62] relaxes it to the convex S semi-definite program

L

min (ID P )x˜j , s.t. 0 P ID, Trace (P )= d, (2.4) P k − k2 ≤ ≤ j=1 X whose global solution P ∗ is subsequently projected in an ` sense onto the space of rank-d ∗

orthogonal projectors. It is shown in [62] that the orthoprojector Π∗ obtained in this way

is within a neighborhood of the orthoprojector corresponding to the true underlying inlier

subspace. One advantage of REAPER with respect to `2,1-RPCA and SE-RPCA, is that

its theoretical conditions do not explicitly require the inlier dimension d to be small. On

the other hand, contrary to `2,1-RPCA, REAPER does require a-priori knowledge of the

18 CHAPTER 2. PRIOR ART

inlier dimension d. Moreover, the semi-definite program (2.4) may become prohibitively

expensive to solve even for moderate values of the ambient dimension D. As a conse- quence, [62] proposed an Iteratively Reweighted Least Squares (IRLS) scheme to obtain a numerical solution of (2.4). Interestingly, it was shown in [62] that the objective value of this IRLS scheme converges to a neighborhood of the optimal objective value of problem

(2.4); nevertheless no other properties of this scheme seem to be known.

R1-PCA. Another related method is the so-called R1-PCA [23], which attempts to solve the following problem:

˜ min X UV , s.t. U >U = I, (2.5) U,V − 2,1

where U is an orthonormal basis for the estimated subspace, and V contains in its columns the low-dimensional representations of the points. Besides for alternating minimization with a power iteration scheme that converges to a local minimum, little else is known about how to solve the non-convex problem (2.5) to global optimality.

L1-PCA∗. Finally, the method L1-PCA∗ of [11] works with the orthogonal complement

of the subspace, but it is slightly unusual in that it learns `1 hyperplanes, i.e., hyperplanes that minimize the `1 distance to the points, as opposed to the Euclidean distance, e.g., used by the classic PCA, R1-PCA, or REAPER. More specifically, an `1 hyperplane learned by the data is a hyperplane with normal vector b that solves the problem

L ˜ min xj yj s.t. yj>b =0, j [L], (2.6) b SD−1; y RD,j [L] − 1 ∀ ∈ ∈ j ∈ ∈ j=1 X

˜ where yj is the representation of point xj in the hyperplane. Overall, no theoretical guaran-

19 CHAPTER 2. PRIOR ART

tees seem to be known for L1-PCA∗, as far as the subspace learning problem is concerned.

In , L1-PCA∗ requires the solution to quadratically many linear programs of size equal to the ambient dimension, which makes it computationally expensive.

2.2 Learning multiple subspaces

Self-expressiveness-based methods. One of the most popular families of subspace clus- tering methods is based on applying spectral clustering [115] to an N N affinity matrix × W built by exploiting the self-expressiveness property of the data. This latter property says

˜ ˜ D Ni that when the data matrix X has the structure X = [X X X ], where X R × 1 2 ··· n i ∈ are N points coming from subspace , i = 1,...,n, then every column x˜ of X˜ can i Si j ˜ be written as a linear combination x˜j = X cj of other points in the dataset, with the j-th entry of the vector of coefficients cj being zero. Then the idea is that if the relationship

X˜ = X˜C, Diag(C) = 0 is enforced together with a suitable regularization on C, the points that every point x˜j selects will be from the same subspace. As it turns out, defin- ing W jj0 to be the absolute value of the j0-th entry of cj has proved to be a simple yet powerful affinity. Popular choices for the regularization on C are the `1-norm, the nuclear norm, the Frobenius norm, or combinations thereof, leading to what is known as Sparse

Subspace Clustering (SSC) [27, 28, 30], Low-Rank Subspace Clustering [31, 64, 66, 106],

Least-Squares Subspace Clustering [69], and Elastic Net Subspace Clustering [54,77,120].

Under the typical assumption that the subspaces are low-dimensional and sufficiently sep- arated, this family of methods admits theoretical guarantees regarding the correctness of

20 CHAPTER 2. PRIOR ART

the affinity W , and its robustness to outliers and noise. Moreover, efficient algorithmic

implementations are available.

Spectral Curvature Clustering (SCC). Another yet conceptually distinct method from

the ones discussed so far is Spectral Curvature Clustering (SCC) [15], which is theoreti-

cally suitable for clustering subspaces of equal dimension d. The main idea of SCC is to

build a (d+1)-fold as follows. For each (d+1)-tuple of distinct points in the dataset,

say xj1 ,..., xjd+1 , the value of the tensor is set to

2 c (x ,..., x ) A p j1 jd+1 (2.7) (j1,...,jd+1) = exp 2 , − 2σ  !

where cp(xj1 ,..., xjd+1 ) is the polar curvature of the points xj1 ,..., xjd+1 (see [15] for an explicit formula) and σ is a tuning parameter. Intuitively, the polar curvature is a multiple of the volume of the simplex of the d +1 points, which becomes zero if the points lie in the same d-dimensional subspace, and the further the points lie from any such subspace the larger their volume becomes. SCC obtains the subspace clusters by unfolding the tensor

A to an affinity matrix, upon which spectral clustering is applied. As with ASC, the main

N bottleneck of SCC is computational, since in principle all d+1 entries of the tensor need to be computed. Even though the combinatorial complexity of SCC can be reduced, this comes at the cost of significant performance degradation.

RANSAC. On the other end of the spectrum, the classical method of RANSAC, de- scribed in 2.1, is often used to learn more than one linear subspaces. More specifically, § RANSAC attempts to identify a single subspace at a time, by treating points lying in Si other subspaces as outliers. Once a subspace estimate for one subspace, say , is ob- S1 21 CHAPTER 2. PRIOR ART

tained, points lying close to the subspace are removed and a second subspace is sought. As

in the case of a single subspace with outliers, RANSAC is sensitive to its thresholding pa-

rameter, and moreover, its efficiency depends on how big the probability is that d randomly

selected points lie close to the same underlying subpsace. In turn, this probability depends

on how large D is as well as how balanced or unbalanced the clusters are. If D is small,

then RANSAC is likely to succeed with few trials. The same is true if one of the clusters,

say X , is highly dominant, i.e., N >> N , i 2, since in such a case, identifying is 1 1 i ∀ ≥ S1 likely to be achieved with only a few trials. On the other hand, if D is large and the Ni are of the same order of magnitude, then exponentially many trials are required, and RANSAC becomes inefficient.

K-Subspaces. Another classical method for clustering subspaces is the so-called K-

Subspaces, which was proposed in [9, 102]. K-Subspaces attempts to minimize the non-

convex objective function

n N 2 (B ,..., B ; s ,...,s ) := s (i) B>x , (2.8) JKS 1 n 1 N j i j 2 i=1 " j=1 # X X

where s : [n] 0, 1 is the subspace assignment of point x , i.e., s (i)=1 if and j → { } j j

D ci only if point x has been assigned to subspace i, and B R × is an orthonormal basis j i ∈ for the orthogonal complement of subspace .4 Because of the non-convexity of (2.8), Si the typical way to perform the optimization is by alternating between assigning points to

clusters, i.e., given the B assigning x to its closest subspace (in the euclidean sense), { i} j 4It should be noted here that an alternative formulation exists, where one searches directly for the basis of the subspaces together with the representation of each point to its closest subspace; the two formulations are equivalent, with the one chosen in the main text being the more efficient in the case of high relative dimensions.

22 CHAPTER 2. PRIOR ART and fitting subspaces, i.e., given the segmentation s , computing the best ` subspace for { j} 2 each cluster by means of Singular Value Decomposition on each cluster.

The theoretical guarantees of K-Subspaces are limited to convergence to a local mini- mum in a finite number of steps. Moreover, knowledge of the subspace dimensions di (or c = D d ) is required. Finally, even though the alternating minimization i − i in K-Subspaces is computationally efficient, in practice several restarts are typically used, in order to select the best among multiple local minima. In fact, the higher the ambient dimension D is, the more restarts are required, which significantly increases the compu- tational burden of K-Subspaces. Moreover, K-Subspaces is robust to noise, but it is not robust to outliers, since the update of the subspaces is traditionally done in an `2 sense.

For this reason, it has been recently considered to replace the `2-norm with the `1-norm, leading to what is known as Median K-Flats [60, 121].

RANSAC/K-Subspaces Hybrids. In principle, any single subspace learning method can be used to perform subspace clustering, either via a RANSAC-style or a via a K-

Subspaces-style scheme or a combination of both. For example, if is a method that takes M a dataset and fits to it a linear subspace, for instance REAPER ( 2.1), then one can use § M to compute a first subspace, remove the points in the dataset lying close to it, then compute a second subspace and so on (RANSAC-style). Alternatively, one can start with a random guess for n subspaces, cluster the data according to their distance to these subspaces, and then use (instead of the classic SVD) to fit a new subspace to each cluster, and so on M (K-Subspaces-style). Generally speaking, such strategies can be powerful, provided that

23 CHAPTER 2. PRIOR ART

one has a good initialization in the case of K-Subspaces variants, but more importantly,

an accurate estimate for the subspace dimensions. In practice, this latter requirement is

usually a weakness, because estimating the subspace dimensions is a hard problem.

Algebraic Subspace Clustering. ASC was originally proposed in [110] with the goal of solving the subspace clustering problem in closed form, which was first achieved in [110] for the case of a union of hyperplanes (which are subspaces of maximal relative dimension), and later in [111] for subspaces of any dimension.

The idea behind ASC is to fit a polynomial p(x ,...,x ) R[x ,...,x ] of degree n 1 D ∈ 1 D to the data, where n is the number of hyperplanes, and x1,...,xD are polynomial indeter- minates. In the absence of noise, this polynomial can be shown to have the form

p(x)=(b>x) (b>x), x := [x , ,x ]> , (2.9) 1 ··· n 1 ··· D where bi is the normal vector to the i-th hyperplane. This reduces the problem of learn- ing the n hyperplanes to that of factorizing p(x) to the product of linear factors; this was elegantly done in [110]. When the data are contaminated by noise, the fitted polynomial need no longer be factorizable. This difficulty was circumvented in [111], where it was shown that the gradient of the polynomial evaluated at point xj is a good estimate for the normal vector of the hyperplane that xj lies closest to. This idea generalizes to the case

of subspaces of arbitrary dimensions, and an elegant theorem [20, 72, 111] assures that if

p1,...,ps is a basis of polynomials of degree n that vanish on the union of the subspaces

, and y is a point of , then = Span( p y , , p y )⊥; remarkably, S1 ∪···∪Sn i Si Si ∇ 1| i ··· ∇ s| i this formula is valid irrespectively of what the subspace dimensions are.

24 CHAPTER 2. PRIOR ART

Even though the theoretical guarantees of ASC are fascinating, its practical application

has been limited mainly due to two reasons. First, available implementations of ASC are

sensitive to noise or are suitable only for hyperplanes. Second, computing vanishing poly-

nomials of degree n is a task whose complexity is exponential in the number of subspaces n and ambient dimension D.

2.3 Challenges in high relative dimensions

When the subspaces associated to the data have low dimension, the task of learning them can be successfully done by existing methods [12, 30, 84, 118]. Despite this, comparably little is known about the high relative dimension setting, which only a few methods seem able to handle (mainly on an ad-hoc level). In this paragraph some of the key challenges in the high relative dimension setting are summarized.

1. Theoretical guarantees: Subspaces of high relative dimension tend to intersect,

e.g., two general hyperplanes in ambient dimension D always intersect in a (D 2)- − dimensional subspace. As a consequence, clustering points from the union of such

subspaces is intrinsically harder than in low relative dimensions, as the membership

of points lying close to the intersection is difficult to be resolved. This fact vio-

lates a major assumption of most modern subspace clustering methods (e.g., sparse

and low-rank methods), which require the subspaces to be sufficiently separated

[27, 28, 30, 31, 54, 64, 66, 69, 77, 106, 120] in order to provide guarantees of correct-

25 CHAPTER 2. PRIOR ART

ness. On the other hand, methods that do admit theoretical guarantees for the high

relative dimension setting, such as Algebraic Subspace Clustering (ASC) [109–111]

and Spectral Curvature Clustering (SCC) [15], are plagued by non-scalable imple-

mentations. Finally, although the K-Subspaces [9, 102] method is applicable to the

high relative dimension setting and scales reasonably well, it is based on a non-

convex objective function, lacks sufficient theoretical guarantees and is sensitive to

initialization.

2. Robustness to outliers: Informally, the average angle of a random point from a

randomly chosen subspace decreases as the subspace dimension increases. As a

consequence, distinguishing inliers from outliers becomes a challenge in the high

relative dimension regime. For example, state-of-the-art robust PCA methods [84,

118] perform well only when the subspace dimension is sufficiently small. Moreover,

RANSAC [34] can deal with high relative dimensions, but outliers are troublesome

unless the ambient dimension is small. Finally, ASC [109–111] and K-Subspaces

[9,102] are sensitive to outliers since they rely on SVD-based PCA, which is known

to lack robustness to outliers.

3. A priori knowledge of subspace dimensions: When the subspaces have low relative

dimension, the dataset exhibits additional structure, such as low-rank components

[31,64,66,106,118] or sparse self-expressive patterns [27,28,30,54,77,84,120]. This

additional structure makes it possible to detect outliers or cluster the data without

26 CHAPTER 2. PRIOR ART

needing to know the subspace dimensions in advance. On the other hand, when the

subspaces are of high relative dimension, this additional structure disappears. Thus,

methods such as RANSAC [34], K-Subspaces [9,102], SCC [15] and REAPER [62]

require a priori knowledge of the subspace dimensions. Remarkably, ASC [111] does

not require such knowledge, but this advantage comes at the cost of an exponential

computational complexity.

4. Non-convex optimization: Existing methods that are in principle suitable for sub-

spaces of high relative dimension and at the same time scale well, e.g., K-Subspaces

[9, 102] or Median K-Flats (MKF) [121], involve non-convex optimization. As a

consequence, the theoretical guarantees of such methods are usually limited to con-

vergence to a local minimum. In addition, one can empirically observe that their

performance is often very sensitive to initialization; this sensitivity becomes particu-

larly pronounced in large ambient dimensions.

5. Scalability: Some methods overcome several of the above challenges, but this comes

at the cost of high computational complexity. For example, ASC [109–111] has

strong theoretical guarantees but also exponential complexity, due to the fitting of

polynomials of degree n to the data. RANSAC [34] is robust to outliers but may

require an exponentially large number of trials in order to sample outlier-free subsets

of the data. SCC [15] can, in theory, cluster subspaces of equal high relative di-

mension but it requires the computation of exponentially many relations between the

27 CHAPTER 2. PRIOR ART

data. On the other hand, K-Subspaces [9,102] scales fairly well, but it is sensitive to

initialization, outliers, and requires advanced knowledge of the subspace dimensions.

In summary, while there has been much progress in subspace learning and clustering, most of the existing work does not apply to the case of subspaces of high relative dimension.

The goal of this thesis is to address several of the fundamental challenges mentioned above, by introducing Dual Principal Component Pursuit (Chapter 3) and Filtrated Algebraic

Subspace Clustering (Chapter 4) .

28 Chapter 3

Dual Principal Component Pursuit

3.1 Introduction

This chapter contains the first main contribution of this thesis, which is a method for ro-

bust subspace learning and clustering, called Dual Principal Component Pursuit (DPCP).

DPCP aims at learning a subspace by explicit computation of its orthogonal complement; this is a natural approach, when the subspace has high relative dimension, in which case its orthogonal complement is a low-dimensional linear subspace. Even though traditional methods such as PCA [50, 55, 78], RANSAC [34] or K-Subspaces [9, 60, 102, 121] can be configured to estimate the orthogonal complement of a subspace, the advantage of DPCP over such methods is that DPCP computes a basis for the orthogonal complement via solv-

29 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

ing a non-convex problem to global optimality.1

Consider for a moment the case of single subspace learning in the presence of outliers,

and suppose that there is no noise, i.e., that the inliers perfectly lie inside a linear subspace

of dimension d < D (here d can be anything). Then the key idea of DPCP comes from S

the fact that the inliers lie inside any hyperplane = Span(b )⊥ that contains the linear H1 1 subspace associated to the inliers. This suggests that, instead of attempting to fit directly S a low-dimensional linear subspace to the entire dataset, as done e.g. in [118], we can search for a maximal hyperplane that contains as many points of the dataset as possible. When H1 the data are in general position2, such a maximal hyperplane will contain the entire set of inliers together with possibly a few outliers. After removing the points that do not lie in that hyperplane, the robust subspace learning problem is reduced to one with a potentially much smaller outlier percentage than in the original dataset. In fact, the number of outliers in the new dataset will be precisely D d 1 D 2, and this upper bound can be used − − ≤ − to dramatically facilitate the outlier detection process using existing methods.

As an example, suppose we have N = 1000 inliers lying in general position inside a

90-dimensional linear subspace of R100. Suppose that the dataset is corrupted by M =

1000 outliers lying in general position in R100. Then among all hyperplanes of R100, the hyperplanes that contain as many points as possible from the dataset (maximal hyperplanes) must contain all 1000 inliers. On the other hand, since the dimensionality of the inliers is

1A recent method that also works with the orthogonal complement of the subspace is [11], for which, however, no theoretical guarantees seem to be known; see 2.1. 2By general position we mean that there are no relations§ between the data points, other than those implied by the membership of the inliers to the subspace . S

30 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

90 and the dimensionality of the hyperplane is 99, there are only 99 90 = 9 linearly − independent directions left for the hyperplane to fit, i.e., such a hyperplane will contain

exactly 9 outliers (it can not contain more outliers since this would violate the general

position hypothesis). If we remove the points of the dataset that do not lie in the hyperplane,

then we are left with 1000 inliers and only 9 outliers; interestingly, a simple application of

RANSAC will identify the remaining 9 outliers in only a few trials.

Alternatively, one can consider finding a sequence of c = D d orthogonal maximal −

hyperplanes = Span(b )⊥ with b b , i = j, thus leading to a Dual Principal Com- Hi i i ⊥ j 6 ponent Analysis (DPCA) of the dataset, in the sense that the inlier subspace is precisely S equal to c . Finally, when the data come from multiple subspaces, every maximal i=1 Hi hyperplaneT contains all points coming from one of the subspaces and the ideas presented

above can be adapted for the purpose of subspace clustering.

It follows from the above discussion that the problem of searching for maximal hyper-

planes is an important one. As it turns out, we can formalize this task as an `0 cosparsity-

type problem [75], which we will further relax to a non-convex `1 problem on the sphere. A large part of this chapter is devoted to showing that every global solution of this latter DPCP problem is a vector orthogonal to a subspace associated with the data. We further consider solving this problem by a recursion of convex relaxations, each of which is computation- ally equivalent to a linear program. We show that under mild conditions this recursion converges in a finite number of steps to a vector orthogonal to a subspace associated to the data. In particular, for the case of hyperplanes, the recursion converges to the global

31 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

minimum of the `1 non-convex problem. Alternative methods suitable for large-scale data

are also explored for solving the DPCP problem, and extensive experiments are performed

on both synthetic and real data, demonstrating the merit of DPCP.

3.2 Single subspace learning with outliers via DPCP

3.2.1 Problem formulation

In this section we formulate the problem of interest. We begin by establishing our data

model in 3.2.1.1, while in 3.2.1.2 we motivate the problem at a conceptual level. Finally, § § in 3.2.1.3 we formulate our problem as an optimization problem. §

3.2.1.1 Data model

We employ a deterministic noise-free data model, under which the given data is

˜ D L X =[X O]Γ =[x˜ ,..., x˜ ] R × , (3.1) 1 L ∈

D N D 1 where the inliers X =[x ,..., x ] R × lie in the intersection of the unit sphere S − 1 N ∈ with an unknown proper subspace of RD of unknown dimension 1 d D 1, and S ≤ ≤ − D M D 1 the outliers consist of M points O = [o ,..., o ] R × that lie on the sphere S − . 1 M ∈ The unknown permutation Γ indicates that we do not know which point is an inlier and which point is an outlier. Finally, we assume that the points X˜ are in general position, in the sense that there are no relations between the columns of X˜ except the ones implied by the inclusions X and X˜ RD. In particular, every D-tuple of columns of X˜ such ⊂ S ⊂ 32 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

that at most d points come frorm X is linearly independent. Notice that as a consequence

every d-tuple of inliers and every D-tuple of outliers are linearly independent, and also

X O = . Finally, to avoid degenerate situations we will assume that N d +1 and ∩ ∅ ≥ M D d.3 ≥ −

3.2.1.2 Conceptual formulation

Notice that we have made no assumption on the dimension of : indeed, can be anything S S from a line to a (D 1)-dimensional hyperplane. Ideally, we would like to be able to − partition the columns of X˜ into those that lie in and those that don’t. But under such S generality, this is not a well-posed problem since X lies inside every subspace that contains

, which in turn may contain some elements of O. In other words, given X˜ and without S any other a priori knowledge, it may be impossible to correctly partition X˜ into X and O.

Instead, it is meaningful to search for a linear subspace of RD that contains all of the inliers

and perhaps a few outliers. Since we do not know the intrinsic dimension d of the inliers, a natural choice is to search for a hyperplane of RD that contains all the inliers.

Problem 1. Given the dataset X˜ = [X O] Γ, find a hyperplane that contains all the H inliers X .

Notice that hyperplanes that contain all the inliers always exist: any non-zero vector

b inside the orthogonal complement ⊥ of the linear subspace associated to the inliers S S 3If the number of outliers is less than D d, then the entire dataset is degenerate, in the sense that it lies in a proper hyperplane of the ambient space.− In such a case we can reduce the coordinate representation of the data and eventually satisfy the stated condition.

33 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

defines a hyperplane (with normal vector b) that contains all inliers X . Having such a hyperplane at our disposal, we can partition our dataset as X˜ = X˜ X˜ , where X˜ H1 1 ∪ 2 1 are the points of X˜ that lie in and X˜ are the remaining points. Then by definition of H1 2 , we know that X˜ will consist purely of outliers, in which case we can safely replace H1 2 ˜ ˜ our original dataset X with X 1 and reconsider the problem of robust subspace learning on

X˜ . We emphasize that X˜ will contain all the inliers X together with precisely D d 1 1 1 − − outliers 4, a number which may be dramatically smaller than the original number of outliers, especially when d is large. If d is small, then one may apply existing methods such as [118],

[84] or [34] to identify the remaining outliers. Alternatively, if d is known, one may repeat the above process c = codim = D d times, until c linearly independent hyperplanes S − ,..., have been found that contain all the inliers, in which case = c . H1 Hc S k=1 Hk T

3.2.1.3 Hyperplane pursuit by `1 minimization

In this section we propose an optimization framework for computing a hyperplane that contains all the inliers. To proceed, we need a definition.

Definition 3.1. A hyperplane of RD is called maximal with respect to the dataset X˜, if H ˜ it contains a maximal number of data points in X , i.e., if for any other hyperplane † of H D ˜ ˜ R we have that Card(X ) Card(X †). ∩H ≥ ∩H

In principle, hyperplanes that are maximal with respect to X˜ always solve Problem 1.

4This comes from the general position hypothesis.

34 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proposition 3.2. Suppose that N d +1 and M D d, and let be a hyperplane that ≥ ≥ − H is maximal with respect to the dataset X˜. Then contains all the inliers X . H

Proof. By the general position hypothesis on X and O, any hyperplane that does not con- tain X can contain at most D 1 points from X˜. We will show that there exists a hyperplane − that contains more than D 1 points of X˜. Indeed, take d inliers and D d 1 outliers and − − − let be the hyperplane generated by these D 1 points. Denote the normal vector to that H − hyperplane by b. Since contains d inliers, b will be orthogonal to these inliers. Since X H is in general position, every d-tuple of inliers is a basis for Span(X ). As a consequence, b will be orthogonal to Span(X ), and in particular b X . This implies that X and so ⊥ ⊂H will contain N + D d 1 d +1+ D d 1 > D 1 points of X˜. H − − ≥ − − −

In view of Proposition 3.2, we may restrict our search for hyperplanes that contain all

the inliers X to the subset of hyperplanes that are maximal with respect to the dataset X˜.

The advantage of this approach is immediate: the set of hyperplanes that are maximal with

respect to X˜ is in principle computable, since it is precisely the set of solutions of the

following optimization problem

˜ min X >b s.t. b =0. (3.2) b 0 6

The idea behind (3.2) is that a hyperplane = Span(b)⊥ contains a maximal number of H columns of X˜ if and only if its normal vector b has a maximal cosparsity level with respect

to the matrix X˜>, i.e., the number of non-zero entries of X˜>b is minimal.

Since (3.2) is a combinatorial problem, which in general can not be solved efficiently,

35 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

we consider its natural relaxation

X˜> min b s.t. b 2 =1, (3.3) b 1 k k

which we will refer to as Dual Principal Component Pursuit (DPCP). A major question

that arises, to be answered in Theorem 3.12, is under what conditions every global solution

of (3.3) is orthogonal to the inlier subspace Span(X ). A second major question, raised

D 1 by the non-convexity of the constraint b S − , is how to efficiently solve (3.3) with ∈ theoretical guarantees for global optimality.

We emphasize here that the optimization problem (3.3) is far from new; interestingly, its earliest appearance in the literature that we are aware of is in [85], where the authors proposed to solve it by means of the recursion of convex problems given by 5

˜> nk+1 := argmin X b . (3.4) b>nˆ =1 1 k

Notice that the optimization problem in (3.4) is computationally equivalent to a linear pro- gram; this makes the recursion (3.4) a very appealing candidate for solving the non-convex

(3.3). Even though [85] proved the very interesting result that the sequence n generated { k} by (3.4) converges to a critical point of (3.3) in a finite number of steps (see Theorem 3.14), there is no reason to believe that in general this limit point is a global minimum of (3.3).

Interestingly, this thesis adopts the recursion (3.4) of [85] for solving (3.3), and shows that it converges to a normal vector to the subspace (Theorem 3.15), which, in the special case of d = D 1, implies convergence to the global minimum of (3.3). − 5Being unaware of the work of [85], we independently proposed the same recursion in [96].

36 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Other work in which optimization problem (3.3) appears are [80, 86, 88–91]. More specifically, [86] proposes to solve (3.3) by replacing the quadratic constraint b>b = 1 with a linear constraint b>w = 1 for some vector w. In [80, 89] (3.3) is approximately solved by alternating minimization, while a Riemannian trust-region approach is employed in [90]. Finally, we note that problem (3.3) is closely related to the non-convex problem

(2.3) associated with REAPER. To see this, suppose that the REAPER orthoprojector Π appearing in (2.3), represents the orthogonal projection to a hyperplane with unit-` H 2 normal vector b. In such a case I Π = bb>, and it readily follows that problem (2.3) D − becomes identical to problem (3.3).

3.2.2 Theoretical analysis of the continuous problem

In this section we begin our theoretical investigation of the non-convex problem (3.3) as well as the recursion of convex relaxations (3.4), by associating to these two problems certain underlying continuous problems ( 3.2.2.1). We will refer to (3.3) and (3.4) as dis- § crete problems, since they involve a finite set of inliers and outliers. In sharp contrast, the continuous problems depend on uniform distributions on the respective inlier and outlier spaces, which makes them easier to analyze. The analysis of this section reveals that both the continuous analogue of (3.3) as well the continuous recursion corresponding to (3.4) are naturally associated with vectors orthogonal to the inlier subspace ( 3.2.2.2). This S § suggests that under certain conditions on the distribution of the data, the same must be true for the discrete problem (3.3) and recursion (3.4) (to be established in 3.2.3). §

37 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

3.2.2.1 The underlying continuous problem

In this section we show that the problems of interest (3.3) and (3.4) can be viewed as dis-

crete versions of certain continuous problems, which we analyze. To begin with, consider

D 1 D 1 given outliers O = [o ,..., o ] S − and inliers X = [x ,..., x ] S − , 1 M ⊂ 1 N ⊂ S∩ and recall the notation X˜ = [X O]Γ, where Γ is an unknown permutation. Next, for any

D 1 D 1 b S − define the function fb : S − R by fb(z) = b>z . Define also discrete ∈ → D 1 measures µO and µX on S − associated with the outliers and inliers respectively, as

1 M 1 N µO(z)= δ(z o ) and µX (z)= δ(z x ), (3.5) M − j N − j j=1 j=1 X X D 1 where δ( ) is the Dirac function on S − , which is defined through the property ·

g(z)δ(z z0)dµSD−1 = g(z0), (3.6) z SD−1 − Z ∈

D 1 D 1 for every g : S − R and every z S − ; where µ D−1 is the uniform measure on → 0 ∈ S D 1 S − .

With these definitions, we have that the objective function X˜>b appearing in (3.3) 1

and (3.4) is the sum of the weighted expectations of the function under the measures O fb µ

38 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

and µX , i.e.,

M N ˜ X >b = O>b + X >b = b>oj + b>xj (3.7) 1 1 1 j=1 j=1 X X M N = b>z δ(z oj)dµSD−1 + b>z δ(z xj)dµSD−1 (3.8) z SD−1 − z SD−1 − j=1 Z ∈ j=1 Z ∈ X X M N = b>z δ(z oj) dµSD−1 + b>z δ(z xj) dµSD−1 (3.9) z SD−1  −  z SD−1  −  Z ∈ j=1 Z ∈ j=1 X X     = M EµO (fb)+ N EµX (fb). (3.10)

Hence, the optimization problem (3.3), which we repeat here for convenience,

˜ min X >b s.t. b>b =1, (3.11) b 1

is equivalent to the problem

min [M EµO (fb)+ N EµX (fb)] s.t. b>b =1. (3.12) b

Similarly, the recursion (3.4), repeated here for convenience,

˜> nk+1 = argmin X b s.t. b>nˆ k =1, (3.13) b 1

is equivalent to the recursion

nk+1 = argmin [M EµO (fb)+ N EµX (fb)] s.t. b>nˆ k =1. (3.14) b

Now, the discrete measures µO,µX of (3.5), are discretizations of the continuous measures

D 1 µSD−1 , and µSD−1 respectively, where the latter is the uniform measure on S − . ∩S ∩S Hence, for the purpose of understanding the properties of the global minimizer of (3.12) and the limiting point of (3.14), it is meaningful to replace in (3.12) and (3.14) the discrete

39 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

measures µO and µX by their continuous counterparts µSD−1 and µSD−1 , and study the ∩S resulting continuous problems

min M Eµ (fb)+ N Eµ (fb) s.t. b>b =1, (3.15) b SD−1 SD−1∩S  

n = argmin M E (f )+ N E (f ) s.t. b>nˆ =1. (3.16) k+1 µSD−1 b µSD−1∩S b k b   It is important to note that if these two continuous problems have the geometric proper- ties of interest, i.e., if every global solution of (3.15) is a vector orthogonal to the inlier subspace, and similarly, if the sequence of vectors n produced by (3.16) converges { k} to a vector nk∗ orthogonal to the inlier subspace, then this correctness of the continuous problems can be viewed as a first theoretical verification of the correctness of the discrete formulations (3.3) and (3.4). The objective of the rest of this section is to establish that this is precisely the case.

Before stating and proving our main two results in this direction, we note that the con- tinuous objective function

(b)= M Eµ (fb)+ N Eµ (fb) (3.17) J SD−1 SD−1∩S can be re-written in a simple form. Writing b = b bˆ, and letting R be a rotation that k k2 takes bˆ to the first standard basis vector e1, we see that the first expectation in (3.17)

40 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

becomes equal to

D−1 EµSD−1 (fb)= fb(z)dµS (3.18) z SD−1 Z ∈

= b>z dµSD−1 (3.19) z SD−1 Z ∈

ˆ> = b 2 b z dµSD−1 (3.20) k k z SD−1 Z ∈

1 ˆ = b 2 z>R− Rb dµSD−1 (3.21) k k z SD−1 Z ∈

= b 2 z>e1 dµSD−1 (3.22) k k z SD−1 | | Z ∈

= b 2 z1 dµSD−1 = b 2 cD, (3.23) k k z SD−1 | | k k Z ∈ where z = (z1,...,zD)> is the coordinate representation of z, and cD is the mean height of the unit hemisphere of RD, given in closed form by

2 (D 2)!! π if D even, cD = −  (3.24) (D 1)!! ·  −  1 if D odd,  where the double factorial is defined as 

k(k 2)(k 4) 4 2 if k even, k!! :=  − − ··· · (3.25)   k(k 2)(k 4) 3 1 if k odd. − − ··· ·  To see what the second expectation in (3.17) evaluates to, decompose b as b = π (b)+ S

π ⊥ (b), and note that because the support of the measure µSD−1 is contained in , we S ∩S S

41 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

must have that

E (f )= b>z dµ D−1 (3.26) µSD−1∩S b S z SD−1 ∩S Z ∈

= b>z dµSD−1 (3.27) z SD−1 ∩S Z ∈ ∩S

= (π (b))> z dµSD−1 (3.28) z SD−1 S ∩S Z ∈ ∩S

\ > = π (b) 2 π (b) z dµSD−1 . (3.29) k S k z SD−1 S ∩S Z ∈ ∩S  

Writing z0 and b0 for the coordinate representation of z and π\( b) with respect to a basis S

of , and noting that µSD−1 = µSd−1 , we have that S ∩S ∼

\ > π (b) z dµSD−1 = z0>b0 dµSd−1 = cd, (3.30) z SD−1 S ∩S z0 Sd−1 Z ∈ ∩S Z ∈   d where now cd is the average height of the unit hemisphere of R . Finally, noting that

π (b) = b cos(φ), (3.31) k S k2 k k2

where φ [0,π/2] is the principal angle of b from the subspace , we have that ∈ S

Eµ (fb)= b cd cos(φ). (3.32) SD−1∩S k k2

Putting everything together, we arrive at the final form of our continuous objective function:

(b)= M Eµ (fb)+ N Eµ (fb)= b (McD + Ncd cos(φ)) . (3.33) J SD−1 SD−1∩S k k2

3.2.2.2 Conditions for global optimality and convergence

We are now in a position to state and prove our main results about the continuous problems

(3.15) and (3.16).

42 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Theorem 3.3. Any global solution to problem (3.15) must be orthogonal to . S

Proof. Because of the constraint b>b = 1 in (3.15), and using (3.33), problem (3.15) can

be written as

min [McD + Ncd cos(φ)] s.t. b>b =1, (3.34) b where φ [0,π/2] is the principal angle of b from . It is then immediate that the global ∈ S minimum is equal to McD and it is attained if and only if φ = π/2, which corresponds to b . ⊥S

D 1 Theorem 3.4. Consider the sequence nk k 0 generated by recursion (3.16), nˆ 0 S − . { } ≥ ∈ Let φ be the principal angle of n from , and define α := Nc /Mc . Then, as long 0 0 S d D

as φ0 > 0, the sequence nk k 0 converges to a unit `2-norm element of ⊥ in a finite { } ≥ S

number k∗ of iterations, where k∗ = 0 if φ = π/2, k∗ = 1 if tan(φ ) 1/α, and 0 0 ≥ −1 tan (1/α) φ0 k∗ −1 − +1 otherwise. ≤ sin (α sin(φ0))   Proof. At iteration k the optimization problem associated with (3.16) is

min (b)= b 2 (McD + Ncd cos(φ)) s.t. b>nˆ k =1, (3.35) b RD J k k ∈

where φ [0,π/2] is the principal angle of b from the subspace . ∈ S Let φ be the principal angle of nˆ from (Figure 3.1), and let n be a global k k S k+1 minimizer of (3.35), with principal angle from equal to φ [0,π/2]. We show that S k+1 ∈

43 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

⊥ S

nˆ k⊥

nˆ k†

nˆ k ψk φ hˆ S k k

Figure 3.1: Various vectors and angles appearing in the proof of Theorem 3.4.

φ φ . To see this, note that the decrease in the objective function at iteration k is k+1 ≥ k

(nˆ ) (n ) :=Mc nˆ + Nc nˆ cos(φ ) J k −J k+1 D k kk2 d k kk2 k Mc n Nc n cos(φ ). (3.36) − D k k+1k2 − d k k+1k2 k+1

Since n> nˆ = 1, we must have that n 1 = nˆ . Now if φ < φ , k+1 k k k+1k2 ≥ k kk2 k+1 k then cos(φ ) > cos(φ ). But then (3.36) implies that (n ) > (nˆ ), which is a k+1 k J k+1 J k contradiction on the optimality of n . Hence it must be the case that φ φ , and so k+1 k+1 ≥ k the sequence φ is non-decreasing. In particular, since φ > 0 by hypothesis, we must { k}k 0 also have φ > 0, i.e., nˆ , k 0. k k 6∈ S ∀ ≥

Letting ψ be the angle of b from nˆ , the constraint b>nˆ =1 gives 0 ψ < π/2 and k k k ≤ k b =1/ cos(ψ ), and so we can write the optimization problem (3.35) equivalently as k k2 k

McD + Ncd cos(φ) min s.t. b>nˆ k =1. (3.37) b RD cos(ψ ) ∈ k

If nˆ is orthogonal to , i.e., φ = π/2, then (nˆ ) = Mc (b), b : b>nˆ = 1, k S k J k D ≤ J ∀ k with only if b = nˆ . As a consequence, n 0 = nˆ , k0 > k, and in particular if k k k ∀

φ0 = π/2, then k∗ =0.

44 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

So suppose that φk < π/2 and let nˆ k⊥ be the normalized orthogonal projection of nˆ k onto ⊥ (Figure 3.1). We will prove that every global minimizer of problem (3.37) must lie S in the two-dimensional plane := Span(nˆ , nˆ ⊥). To see this, let b have norm 1/ cos(ψ ) H k k k for some ψ [0,π/2). If ψ > π/2 φ , then such a b can not be a global minimizer of k ∈ k − k

(3.37), as the feasible vector nˆ ⊥/ sin(φ ) already gives a smaller objective, since k k ∈H

McD McD McD + Ncd cos(φ) (nˆ ⊥/ sin(φ )) = = < = (b). (3.38) J k k sin(φ ) cos(π/2 φ ) cos(ψ ) J k − k k

Thus, it must be the case that ψ π/2 φ . Denote by hˆ the normalized projection k ≤ − k k of nˆ onto and by nˆ † the vector that is obtained from nˆ by rotating it towards nˆ ⊥ by k S k k k

ψ (Figure 3.1). Note that both hˆ and nˆ † lie in . Letting Ψ [0,π] be the spherical k k k H k ∈

angle between the spherical arc formed by nˆ k, bˆ and the spherical arc formed by nˆ k, hˆ k,

and denoting by ∠b, hˆ k the angle between b and hˆ k, the first spherical law of cosines gives

cos(∠b, hˆ k) = cos(φk)cos(ψk) + sin(φk)sin(ψk) cos(Ψk). (3.39)

Now, Ψ is equal to π if and only if nˆ , hˆ , b are coplanar, i.e., if and only if b . k k k ∈ H Suppose that b . Then Ψ <π, and so cos(Ψ ) > 1, which implies that 6∈ H k k −

cos(∠b, hˆ ) > cos(φ )cos(ψ ) sin(φ )sin(ψ ) = cos(φ + ψ ). (3.40) k k k − k k k k

This in turn implies that the principal angle φ of b from is strictly smaller than φ + ψ , S k k and so

McD + Ncd cos(φ) McD + Ncd cos(φk + ψk) (b)= > = (nˆ k† / cos(ψk)), (3.41) J cos(ψk) cos(ψk) J

45 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

i.e., the feasible vector nˆ † / cos(ψ ) gives strictly smaller objective than b. k k ∈H

To summarize, for the case where φk < π/2, we have shown that any global minimizer b of (3.37) must i) have angle ψ from nˆ less or equal to π/2 φ , and ii) it must lie in k k − k

Span(nˆ k, nˆ k⊥). Hence, we can rewrite (3.37) in the equivalent form

McD + Ncd cos(φk + ψ) min k(ψ) := , (3.42) ψ [ π/2+φk,π/2 φk] J cos(ψ ) ∈ − − k where now ψk takes positive values as b approaches nˆ k⊥ and negative values as it approaches hˆ . The function is continuous and differentiable in the interval [ π/2+ φ ,π/2 φ ], k Jk − k − k with derivative given by

∂ Mc sin(ψ) Nc sin(φ ) Jk = D − d k . (3.43) ∂ψ cos2(ψ)

Setting the derivative to zero gives

sin(ψ)= α sin(φk). (3.44)

If α sin(φ ) sin(π/2 φ ) = cos(φ ), or equivalently tan(φ ) 1/α, then is strictly k ≥ − k k k ≥ Jk decreasing in the interval [ π/2+φ ,π/2 φ ], and so it must attain its minimum precisely − k − k at ψ = π/2 φ , which corresponds to the choice n = nˆ ⊥/ sin(φ ). Then by an earlier − k k+1 k k argument we must have that nˆ 0 , k0 k +1. If, on the other hand, tan(φ ) < 1/α, k ⊥S ∀ ≥ k then the equation (3.44) defines an angle

1 ψ∗ := sin− (α sin(φ )) (0,π/2 φ ), (3.45) k k ∈ − k at which must attain its global minimum, since Jk 2 ∂ k 1 ∗ (3.46) J2 (ψk)= > 0. ∂ψ cos(ψk∗)

46 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

As a consequence, if tan(φk) < 1/α, then

1 φk+1 = φk + sin− (α sin(φk)) < π/2. (3.47)

We then see inductively that as long as tan(φk) < 1/α, φk increases by a quantity which is

1 bounded from below by sin− (α sin(φ0)). Thus, φk will keep increasing until it becomes greater than the solution to the equation tan(φ)=1/α, at which point the global minimizer will be the vector n = nˆ ⊥/ sin(φ ), and so nˆ 0 = nˆ , k0 k +1. Finally, under k+1 k k k k+1 ∀ ≥ 1 the hypothesis that φk < tan− (1/α), we have

k 1 − 1 1 φ = φ + sin− (α sin(φ )) φ + k sin− (α sin(φ )), (3.48) k 0 j ≥ 0 0 j=0 X from where it follows that the maximal number of iterations needed for φk to become

−1 1 tan (1/α) φ0 larger than tan− (1/α) is −1 − , at which point at most one more iteration will sin (α sin(φ0))   be needed to achieve to . S

Notice the remarkable fact that according to Theorem 3.4, the continuous recursion

(3.16) converges to a vector orthogonal to the inlier subspace in a finite number of steps. S Moreover, if the relation

tan(φ ) 1/α, (3.49) 0 ≥ holds true, then this convergence occurs in a single step. One way to interpret (3.49) is to notice that as long as the angle φ0 of the initial estimate nˆ 0 from the inlier subspace is positive, and for any arbitrary but fixed number of outliers M, there is always a sufficiently

large number N of inliers, such that (3.49) is satisfied and thus convergence occurs in

47 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

one step. Conversely, for any fixed number of inliers N and outliers M, there is always a sufficiently large angle φ0 such that (3.49) is true, and thus (3.16) again converges in a

single step. More generally, even when (3.49) is not true, the larger φ0,N are, the smaller

the quantity

1 tan− (1/α) φ0 1 − (3.50) sin− (α sin(φ ))  0  is, and thus according to Theorem 3.3 the faster (3.16) converges.

3.2.3 Theoretical analysis of the discrete problem

In this section we analyze the discrete problem (3.3) and the associated discrete recursion

(3.4), where the adjective discrete refers to the fact that (3.3) and (3.4) depend on a finite set

˜ D 1 of points X =[X O]Γ sampled from the union of the space of outliers S − and the space

D 1 of inliers S − . In 3.2.2 we showed that these problems are discrete versions of the ∩S § continuous problems (3.15) and (3.16), for which we further showed that they possess the geometric property of interest, i.e., every global minimizer of (3.15) must be an element

D 1 of ⊥ S − (Theorem 3.3), and the recursion (3.16) produces a sequence of vectors S ∩ D 1 which converges to an element of ⊥ S − in a finite number of steps (Theorem 3.4). In S ∩ this section we show that under some deterministic conditions on the distribution of X =

[x1,..., xN ] and O = [o1,..., oM ], a similar statement holds for the discrete problems

(3.3) and (3.4). More specifically, in 3.2.3.1 we establish quantitative bounds that relate § the discrete objective function to its continuous counterpart, and in 3.2.3.2 and 3.2.3.3 § § we use these bounds to study the global minimizers of (3.3) and the limit point of (3.4).

48 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

3.2.3.1 Discrepancy bounds between continuous and discrete problems

The heart of our analysis framework is to bound the deviation of some underlying geometric

quantities, which we call the average outlier and the average inlier with respect to b, from

their continuous counterparts. To begin with, recall our discrete objective function

˜ discrete(b)= X >b = O>b + X >b (3.51) J 1 1 1

and its continuous counterpart

(b)= b (Mc + Nc cos(φ)) , (3.52) Jcontinuous k k2 D d

the latter derived in 3.2.2.1, equation (3.33). Now, notice that the term of the discrete § objective that depends on the outliers O can be written as

M M O>b o>b b> o>b o b> o (3.53) 1 = j = Sign( j ) j = M b, j=1 j=1 X X where Sign( ) is the sign function and · 1 M o := Sign(b>o )o (3.54) b M j j j=1 X D 1 D is the average outlier with respect to b. Defining a vector valued function f : S − R b → D 1 f b by z S − Sign(b>z)z, we notice that ∈ 7−→ 1 M o = f (o ), (3.55) b M b j j=1 X and so ob is a discrete approximation to the integral z SD−1 f b(z)dµSD−1 . The value of ∈ that integral is given by the next Lemma. R

49 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

D 1 Lemma 3.5. For any b S − we have ∈

f b(z)dµSD−1 = Sign(b>z)zdµSD−1 = cD b, (3.56) z SD−1 z SD−1 Z ∈ Z ∈

where cD is defined in (3.24).

Proof. Letting R be a rotation that takes b to the first canonical vector e1, i.e., Rb = e1, we have that

Sign(b>z)zdµSD−1 = Sign(b>R>Rz)zdµSD−1 (3.57) z SD−1 z SD−1 Z ∈ Z ∈

= Sign(e1>z)R>zdµSD−1 (3.58) z SD−1 Z ∈

= R> Sign(z1)zdµSD−1 , (3.59) z SD−1 Z ∈

where z1 is the first cartesian coordinate of z. Recalling the definition of cD in equation

(3.24), we see that

Sign(z1)z1dµSD−1 = z1 dµSD−1 = cD. (3.60) z SD−1 z SD−1 | | Z ∈ Z ∈ Moreover, for any i> 1, we have

Sign(z1)zidµSD−1 =0. (3.61) z SD−1 Z ∈ Consequently, the integral in (3.59) becomes

Sign(b>z)zdµSD−1 = R> Sign(z1)zdµSD−1 (3.62) z SD−1 z SD−1 Z ∈ Z ∈

= R> (cDe1)= cDb. (3.63)

50 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Now we define O to be the maximum among all possible approximation errors as b varies

D 1 on S − , i.e.,

O := max cD b ob 2 , (3.64) b SD−1 k − k ∈ and we establish that the more uniformly distributed O is the smaller O becomes.

D 1 The notion of uniformity of O = [o ,..., o ] S − that we employ here is a 1 M ⊂ deterministic one, and is captured by the spherical cap discrepancy of the set O, defined

as [40, 41]

1 M SD(O) := sup I (oj) µSD−1 ( ) . (3.65) M C − C C j=1 X D 1 In (3.65) the supremum is taken over all spherical caps of the sphere S − , where a C D 1 D spherical cap is the intersection of S − with a half-space of R , and I ( ) is the indicator C · function of , which takes the value 1 inside and zero otherwise. The spherical cap C C discrepancy SD(O) is precisely the supremum among all errors in approximating integrals of indicator functions of spherical caps via averages of such indicator functions on the point set O. Intuitively, SD(O) captures how close the discrete measure µO (see equation (3.5))

associated with O is to the measure µSD−1 , and we will be referring to O as being uniformly

D 1 distributed on S − , when SD(O) is small.

Remark 3.6. As a function of the number of points M, SD(O) decreases with a rate

of [5,22]

1 1 log(M)M − 2 − 2(D−1) . (3.66) p 51 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

As a consequence, to show that uniformly distributed points O correspond to small O, it suffices to bound the maximum integration error O from above by a quantity proportional

to the spherical cap discrepancy SD(O). Inequalities that bound from above the approx- imation error of the integral of a function in terms of the variation of the function and the discrepancy of a finite set of points (not necessarily the spherical cap discrepancy; there are several types of discrepancies) are widely known as Koksma-Hlawka inequalites [49, 58].

Even though such inequalities exist and are well-known for integration of functions on the unit hypercube [0, 1]D [44, 49, 58], similar inequalities for integration of functions on the

D 1 unit sphere S − seem not to be known in general [41], except if one makes additional as-

sumptions on the distribution of the finite set of points [10, 40]. Nevertheless, the function

fb : z b>z that is associated to O is simple enough to allow for a Koksma-Hlawka 7−→ | | inequality, as described in the next lemma.6

D 1 Lemma 3.7. Let O =[o1,..., oM ] be a finite subset of S − . Then

√ O O = max cDb ob 2 5SD( ), (3.67) b SD−1 k − k ≤ ∈

where cD, ob and SD(O) are defined in (3.24), (3.54) and (3.65) respectively.

D 1 Proof. For any b S − we can write ∈

c b ob = ρ b + ρ ζ, (3.68) D − 1 2

D 1 2 2 for some vector ζ S − orthogonal to b, and so it is enough to show that ρ + ρ ∈ 1 2 ≤ p √5S (O). Let us first bound from above ρ in terms of S (O). Towards that end, D | 1| D 6The author is grateful to Prof. Glyn Harman, who suggested that the proof of Theorem 1 in [44] can be easily adapted to establish the result of Lemma 3.7.

52 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT observe that

1 M ρ = b>(c b ob)= c b>o (3.69) 1 D − D − M j j=1 X 1 M = fb(z)dµSD−1 fb(oj), (3.70) D−1 − M z S j=1 Z ∈ X where the equality follows from the definition of cD in (3.24) and recalling that fb(z) =

D 1 b>z . In other words, ρ1 is the error in approximating the integral of fb on S − by the

average of fb on the point set O.

D 1 Now, notice that each super-level set z S − : fb(z) α for α [0, 1], is the ∈ ≥ ∈ union of two spherical caps, and also that 

sup fb(z) inf fb(z)=1 0=1. (3.71) D z SD−1 − z S −1 − ∈ ∈

We these in mind, repeating the entire argument of the proof of Theorem 1 in [44] that lead to inequality (9) in [44], but now for a measurable function with respect to µSD−1 (that would be fb), leads directly to

ρ S (O). (3.72) | 1|≤ D

For ρ2 we have that

ρ = ζ> (c b) ζ>ob (3.73) 2 D − 1 M = Sign b>z ζ>zdµSD−1 Sign b>oj ζ>oj (3.74) z SD−1 − M Z ∈ j=1  X  1 M = gb,ζ(z)dµSD−1 gb,ζ(oj), (3.75) D−1 − M z S j=1 Z ∈ X

53 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

D 1 where gb ζ : S − R is defined as gb ζ(z) = Sign b>z ζ>z. Then a similar argument , → ,  as for ρ1, with the difference that now

sup gb,ζ(z) inf gb,ζ(z)=1 ( 1)=2, (3.76) D z SD−1 − z S −1 − − ∈ ∈ leads to

ρ 2S (O). (3.77) | 2|≤ D

In view of (3.72), inequality (3.77) establishes that ρ2 + ρ2 √5S (O), which con- 1 2 ≤ D p cludes the proof of the lemma.

We now turn our attention to the inlier term X˜>b of the discrete objective function 1

(3.51), which is slightly more complicated than the outlier term. We have

N N X >b x>b b> x>b x b> x (3.78) 1 = j = Sign( j ) j = N b, j=1 j=1 X X where

1 N 1 N x := Sign(b>x )x = f (x ) (3.79) b N j j N b j j=1 j=1 X X

is the average inlier with respect to b. Thus, xb is a discrete approximation of the integral

x SD−1 f b(x)dµSD−1 , whose value is given by the next lemma. ∈ ∩S R D 1 Lemma 3.8. For any b S − we have ∈

f b(x)dµSD−1 = Sign(b>x)xdµSD−1 = cd vˆ, (3.80) x SD−1 x SD−1 Z ∈ ∩S Z ∈ ∩S

where cd is given by (3.24) after replacing D with d, and v is the orthogonal projection of

v onto . S 54 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proof. Since x lies in , we have f (x)= f (x)= f (x), so that S b v vˆ

Sign(b>x)xdµSD−1 = Sign(vˆ>x)xdµSD−1 . (3.81) x SD−1 x SD−1 Z ∈ ∩S Z ∈ ∩S Now express x and vˆ on a basis of , use Lemma 3.5 replacing D with d, and then switch S back to the standard basis of RD.

Next, we define X to be the maximum among all possible approximation errors as b

D 1 varies on S − , which is the same as the maximum of all approximation errors as b varies

D 1 on S − , i.e., ∩S

\ X := max cd π (b) xb = max cd b xb 2 . (3.82) b SD−1 S − 2 b SD−1 k − k ∈ ∈ ∩S

Then an almost identical argument as the one that established Lemma 3.7 gives that

X √5S (X ), (3.83) ≤ d

where now the discrepancy Sd(X ) of the inliers X is defined exactly as in (3.65) with the

D 1 d 1 only difference that the supremum is taken over all spherical caps of S − = S − . ∩S ∼

As a consequence of our definitions of O and X we obtain lower and upper bounds of our discrete objective function (3.51) in terms of its continuous counterpart (3.52), that will be repeatedly used in the sequel.

D 1 Lemma 3.9. Let b S − ⊥, and let φ [0,π/2) be its principal angle from . Then ∈ \S ∈ S

M(c + O) O>b M(c O), (3.84) D ≥ 1 ≥ D −

N(c + X ) cos( φ) X >b N(c X ) cos(φ). (3.85) d ≥ 1 ≥ d −

55 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proof. We only prove the second inequality as the first is even simpler. Let v = 0 be the 6

orthogonal projection of b onto . By definition of X , there exists a vector ξ of ` S ∈ S 2 norm less or equal to X , such that

1 N x = x = Sign(b>x )x = c vˆ + ξ. (3.86) v b N j j d j=1 X Taking inner product of both sides with b gives

1 X >b = c cos(φ)+ b>ξ. (3.87) N 1 d

Now, the result follows by noting that b>ξ X cos(φ), since the principal angle of b ≤ from Span(ξ) can not be less then φ.

3.2.3.2 Conditions for global optimality of the discrete problem

In this section we give our main theorem regarding the global minimizers of (3.3). We

begin with a definition.

D 1 Definition 3.10. Given a set Y = [y ,..., y ] S − and an integer K L, define 1 L ⊂ ≤

Y to be the maximum circumradius among all polytopes of the form R ,K

K α y : α [ 1, 1] , (3.88) ji ji ji ∈ − ( i=1 ) X

where j1,...,jK are distinct in [L], and the circumradius of a bounded subset of

RD is the infimum over the radii of all Euclidean balls of RD that contain that subset.

Next, we state a result already known in [85], and for completeness we also give the proof.

56 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proposition 3.11 ( [85]). Let Y = [y ,..., y ] be a D L matrix of rank D. Then any 1 L × global solution b∗ to

Y >b (3.89) min 1 , b>b=1

must be orthogonal to (D 1) linearly independent columns of Y . −

Proof. Let b∗ be an optimal solution of (3.89). Then b∗ must satisfy the first order optimal- ity relation

0 Y Sgn(Y >b∗)+ λb∗, (3.90) ∈ where λ is a Lagrange multiplier parameter, and Sgn is the sub-differential of the

`1 norm. Without loss of generality, let y1,..., yK be the columns of Y to which b∗ is orthogonal. Then equation (3.90) implies that there exist real numbers α ,...,α 1 K ∈ [ 1, 1] such that − K L 0 αjyj + Sign(yj>b∗)yj + λb∗ = . (3.91) j=1 j=K+1 X X Now, suppose that the span of y ,..., y is of dimension less than D 1. Then there exists 1 K − D 1 a unit norm vector ζ S − that is orthogonal to all y ,..., y , b∗, and multiplication of ∈ 1 K

(3.91) from the left by ζ> gives

L

Sign(yj>b∗)ζ>yj =0. (3.92) j=K+1 X Furthermore, we can choose a sufficiently small ε> 0, such that

Sign(y>b∗ + εy>ζ) = Sign(y>b∗), j [L]. (3.93) j j j ∀ ∈

57 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

The above equation is trivially true for all j such that yj>b∗ = 0, because in that case y>ζ =0 by the definition of ζ. On the other hand, if y>b∗ =0, then a small perturbation j j 6

 will not change the sign of yj>b∗. Consequently, we can write

y>(b∗ + εζ) = y>b∗ + ε Sign(y>b∗)y>ζ, j [L] (3.94) j j j j ∀ ∈

and so

L Y > b∗ ζ Y >b∗ y>b∗ ζ>y Y >b∗ (3.95) ( + ε ) 1 = 1 + ε Sign( j ) j = 1 . j=K+1 X

However,

b∗ + εζ = √1+ ε2 > 0, (3.96) k k2

and normalizing b∗ + εζ to have unit `2 norm, we get a contradiction on b∗ being a global

solution.

We are now ready to establish the main theorem of this section. Theorem 3.12 is the

discrete analogue of Theorem 3.3, and it says that if both inliers and outliers are sufficiently

uniformly distributed, then every global solution of (3.3) must be orthogonal to the inlier

subspace . More precisely, S

Theorem 3.12. Suppose that the condition

M c X c X ( O + X ) /N γ := < min d − , d − − R ,K1 R ,K2 , (3.97) N 2 O O  

holds for all non-negative integers K ,K such that K + K = D 1,K d 1. Then 1 2 1 2 − 2 ≤ − any global solution b∗ to (3.3) must be orthogonal to Span(X ).

58 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proof. Let b∗ be an optimal solution of (3.3). Then b∗ must satisfy the first order optimality

relation

˜ ˜ 0 X Sgn(X >b∗)+ λb∗, (3.98) ∈

where λ is a scalar Lagrange multiplier parameter, and Sgn is the sub-differential of the `1 norm. For the sake of contradiction, suppose that b∗ . If b∗ , then using Lemma 6⊥ S ∈ S 3.9 we have

O O>b O>b∗ X >b∗ McD + M min 1 1 + 1 ≥ b ,b>b=1 ≥ ⊥S

Mc MO + Nc NX , (3.99) ≥ D − d −

which violates hypothesis (3.97). Hence, we can assume that b∗ . 6∈ S

Now, by the general position hypothesis as well as Proposition 3.11, b∗ will be orthog-

onal to precisely D 1 points, among which K points belong to O, say o ,..., o , and − 1 1 K1 0 K d 1 points belong to X , say x ,..., x . Then there must exist real numbers ≤ 2 ≤ − 1 K2 1 α , β 1, such that − ≤ j j ≤ K1 M K2 N 0 αjoj + Sign(oj>b∗)oj + βjxj + Sign(xj>b∗)xj + λb∗ = . (3.100) Xj=1 j=XK1+1 Xj=1 j=XK2+1

Since Sign(o>b∗)=0, j K and similarly Sign(x>b∗)=0, j K , we can equiva- j ∀ ≤ 1 j ∀ ≤ 2 lently write

K1 M K2 N 0 αjoj + Sign(oj>b∗)oj + βjxj + Sign(xj>b∗)xj + λb∗ = (3.101) j=1 j=1 j=1 j=1 X X X X or more compactly

ξO + M ob∗ + ξX + N xvˆ∗ + λb∗ = 0, (3.102)

59 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

where

K1 K2

ξO := αjoj and ξX := βjxj, (3.103) j=1 j=1 X X vˆ∗ is the normalized projection of b∗ onto (which is nonzero since b∗ by hypothesis), S 6⊥ S and

1 M 1 N o ∗ := Sign(o>b∗)o , x ∗ := Sign(x>v∗)x . (3.104) b M j j v N j j j=1 j=1 X X

Now, from the definitions of O and X in (3.64) and (3.82) respectively, we have that

ob∗ = c b∗ + ηO, ηO O (3.105) D || ||2 ≤

xv∗ = c vˆ∗ + ηX , ηX X , (3.106) ˆ d || ||2 ≤

and so (3.102) becomes

ξO + McD b∗ + M ηO + ξX + Ncd vˆ∗ + N ηX + λb∗ = 0. (3.107)

Since b∗ , we have that b∗, v∗ are linearly independent. Define := Span(b∗, vˆ∗) and 6∈ S U project (3.107) onto to get U

π (ξO)+ McD b∗ + Mπ (ηO)+ π (ξX )+ Ncd vˆ∗ + Nπ (ηX )+ λb∗ = 0. U U U U (3.108)

Notice that every vector u in the image of π can be written as a linear combination of U

60 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

b∗ and vˆ∗: u =[u]b∗ b∗ +[u]vˆ∗ vˆ∗. Using this decomposition, we can write (3.108) as

[π (ξO)]b∗ b∗ +[π (ξO)]vˆ∗ vˆ∗ + McD b∗ U U

+ M [π (ηO)]b∗ b∗ + M [π (ηO)]vˆ∗ vˆ∗ U U

+[π (ξX )]b∗ b∗ +[π (ξX )]vˆ∗ vˆ∗ + Ncd vˆ∗ U U 0 + N [π (ηX )]b∗ b∗ + N [π (ηX )]vˆ∗ vˆ∗ + λb∗ = . (3.109) U U

Since is a two-dimensional space, there exists a vector ζˆ that is orthogonal to b∗ but U ∈U not orthogonal to vˆ∗. Projecting the above equation onto the line spanned by ζˆ, we obtain the one-dimensional equation

ˆ> ([π (ξO)]vˆ∗ + M [π (ηO)]vˆ∗ +[π (ξX )]vˆ∗ + Ncd + N [π (ηX )]vˆ∗ ) ζ vˆ∗ = 0. (3.110) U U U U ·

Since ζˆ is not orthogonal to vˆ∗, the above equation implies that

[π (ξO)]vˆ∗ + M [π (ηO)]vˆ∗ +[π (ξX )]vˆ∗ + Ncd + N [π (ηX )]vˆ∗ =0, (3.111) U U U U

which, together with ξO O , ξX X , ηO O and ηX X , k k2 ≤ R ,K1 k k2 ≤ R ,K2 k k2 ≤ k k2 ≤ implies that

Nc O + MO + X + NX , (3.112) d ≤R ,K1 R ,K2 which violates the hypothesis (3.97). Consequently, the initial hypothesis of the proof that b∗ can not be true, and the theorem is proved. 6⊥ S

A Geometric View of the Proof of Theorem 3.12. Let us provide some geometric intu- ition that underlies the proof of Theorem 3.12. It is instructive to begin our discussion by

61 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

considering the case d = 1,D = 2, i.e. the inlier space is simply a line and the ambient

space is a 2-dimensional plane. Since all points have unit `2-norm, every column of X will

be of the form xˆ or xˆ for a fixed vector xˆ 1 that spans the inlier space . In this − ∈ S S

setting, let us examine a global solution b∗ of the optimization problem (3.3). We will start by assuming that such a b∗ is not orthogonal to , and intuitively arrive at the conclusion S that this can not be the case as long as there are sufficiently many inliers.

We will argue on an intuitive level that if b∗ , then the principal angle φ of b∗ from 6⊥ S

needs to be small. To see this, suppose b∗ ; then b∗ will be non-orthogonal to every S 6⊥ S inlier, and by Proposition 3.11 orthogonal to D 1=1 outlier, say o . The optimality − 1 condition (3.98) specializes to

M N 0 α1o1 + Sign(oj>b∗)oj + Sign(xj>b∗)xj + λb∗ = , (3.113) j=1 j=1 X X Mob∗

where 1 α 1. Notice| that{z the third} term is simply N Sign(xˆ>b∗)xˆ, and so − ≤ 1 ≤

α o + M ob∗ + λb∗ = N Sign(xˆ>b∗)xˆ. (3.114) 1 1 −

Now, what (3.114) says is that the point N Sign(xˆ>b∗)xˆ must lie inside the set −

Conv( o )+ Mob∗ + Span(b∗)= α o + Mob∗ + λb∗ : α 1,λ R , ± 1 { } { 1 1 | 1|≤ ∈ } (3.115) where the + operator on sets is the Minkowski sum. Notice that the set Conv( o )+Mob∗ ± 1 is the translation of the line segment (polytope) Conv( o ) by Mob∗ . Then (3.114) says ± 1 that if we draw all affine lines that originate from every point of Conv( o )+ Mob∗ and ± 1 62 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

have direction b∗, then one of these lines must meet the point N Sign(xˆ>b∗)xˆ. Let us −

illustrate this for the case where M = N = 5 and say it so happens that b∗ has a rather

large angle φ from , say φ = 45◦. Recall that ob∗ is concentrated around c b∗ and for S D 2 the case D = 2 we have cD = π . As illustrated in Figure 3.2, because φ is large, the

Mob∗ + Conv( o ) + Span(b∗) ± 1

Mob∗

Mob∗ + Conv( o ) ± 1 o1 b∗ φ N Sign(xˆ>b∗)xˆ S −

Figure 3.2: Geometry of the optimality condition (3.98) and (3.114) for the case d =

1,D = 2,M = N = 5. The polytope Mob∗ + Conv( o ) + Span(b∗) misses the point ± 1

N Sign(xˆ>b∗)xˆ and so the optimality condition can not be true for both b∗ = − 6⊥ S Span(xˆ) and φ large.

unbounded polytope Mob∗ + Conv( o ) + Span(b∗) misses the point N Sign(xˆ>b∗)xˆ ± 1 − thus making the optimality equation (3.114) infeasible. This indicates that critical vectors b∗ having large angles from are unlikely to exist. 6⊥ S S

On the other hand, critical points b∗ may exist, but their angle φ from needs 6⊥ S S to be small, as illustrated in Figure 3.3. However, such critical points can not be global

63 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Mob∗ + Conv( o ) + Span(b∗) ± 1 o1 o ∗ b∗ M b N Sign(xˆ>b∗)xˆ S −

Mob∗ + Conv( o ) ± 1

Figure 3.3: Geometry of the optimality condition (3.98) and (3.114) for the case d =

1,D = 2,M = N = 5. A critical b∗ exists, but its angle from is small, so that 6⊥ S S the polytope Mob∗ + Conv( o ) + Span(b∗) can contain the point N Sign(xˆ>b∗)xˆ. ± 1 −

However, b∗ can not be a global minimizer, since small angles from yield large objective S values. minimizers, because small angles from yield large objective values.7 S

Hence the only possibility that critical points b∗ that are also global minimizers 6⊥ S do exist is that the number of inliers is significantly less than the number of outliers, i.e.

N <

We should note here that the picture for the general setting is analogous to what we de- scribed above, albeit harder to visualize: with reference to equation (3.259), the optimality condition says that every feasible point b∗ must have the following property: there 6⊥ S exist 0 K d 1 inliers x ,..., x and 0 K D 1 K outliers o ,..., o ≤ 2 ≤ − 1 K2 ≤ 1 ≤ − − 2 1 K1

K1 to which b∗ is orthogonal, and two points ξO α o : α [ 1, 1] + ob∗ and ∈ j=1 j j j ∈ − n o 7On a more techincal level, it can be verified that if suchP a critical point is a global minimizer, then its angle φ must be large in the sense that cos(φ) 2O, contradicting the necessary condition that φ be small in the first place. ≤

64 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Mob∗ + Conv( o ) + Span(b∗) ± 1

Mob∗

Mob∗ + Conv( o ) ± 1 o1 b∗

N Sign(xˆ>b∗)xˆ −

Figure 3.4: Geometry of the optimality condition (3.98) and (3.114) for the case d =

1,D =2,N <b∗)xˆ. Moreover, such critical points can be global mini- − mizers. Condition (3.97) of Theorem 3.12 prevents such cases from occuring.

K2 ξX α x : α [ 1, 1] + xb∗ that are joined by an affine line that is parallel to ∈ j=1 j j j ∈ − nP o the line spanned by b∗. In fact in our proof of Theorem 3.12 we reduced this general case to

the case d =1,D =2 described above: this reduction is precisely taking place in equation

(3.108), where we project the optimality equation onto the 2-dimensional subspace . The U arguments that follow this projection consist of nothing more than a technical treatment of

the intuition given above.

Discussion of Theorem 3.12. Towards interpreting Theorem 3.12, consider first the asymp-

totic case where we allow N and M to go to infinity, while keeping the ratio γ constant,

and notice that the quantities O , X are always bounded from above by D 1. R ,K1 R ,K2 −

65 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Assuming that both inliers and outliers are perfectly well distributed in the limit, i.e., un- der the hypothesis that limN Sd(X )=0 and limM SD(O)=0, Lemma 3.7 and →∞ →∞ inequality (3.83) give that limN X = 0 and limM O = 0, in which case (3.97) is →∞ →∞ satisfied. This suggests the interesting fact that (3.3) is possible to give a normal to the inliers even for arbitrarily many outliers, and irrespectively of the subspace dimension d.

Along the same lines, for a given γ and under the point set uniformity hypothesis, we can always increase the number of inliers and outliers (thus decreasing X and O), while keep- ing γ constant, until (3.97) is satisfied, once again indicating that (3.3) is possible to yield a normal to the space of inliers irrespectively of their intrinsic dimension. Notice that the intrinsic dimension d of the inliers manifests itself through the quantity cd, which we recall is a decreasing function of d. Consequently, the smaller d is the larger the RHS of (3.97) becomes, and so the easier it is to satisfy (3.97).

3.2.3.3 Conditions for convergence of the discrete recursive algorithm

In this section we study the discrete recursion (3.4). We begin by stating and proving two results already known in [85].

Lemma 3.13 ( [85]). Let Y = [y ,..., y ] be a D L matrix of rank D. Then problem 1 L × min > Y >b admits a computable solution n that is orthogonal to (D 1) b nˆ k=1 1 k+1 − linearly independent points of Y .

Proof. Let n be a solution to min > Y >b that is orthogonal to less than D 1 k+1 b nˆ k=1 1 − linearly independent points of Y . Then we can find a unit norm vector ζ that is orthogonal

66 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT to the same points of Y that n is orthogonal to, and moreover ζ n . In addition, k+1 ⊥ k+1 by similar arguments as in the proof of Proposition 3.11, we can find a sufficiently small

ε> 0 such that

Y >(nk+1 + εζ) = Y >nk+1 + ε Sign(yj>nk+1)ζ>yj, (3.116) 1 1 j: nk+1 yj X6⊥ where

Sign(yj>nk+1)ζ>yj =0. (3.117) j: nk y X+16⊥ j

Since nk+1 is optimal, it must be the case that

Sign(yj>nk+1)ζ>yj =0, (3.118) j: nk y X+16⊥ j and so

Y > n ζ Y >n (3.119) ( k+1 + ε ) 1 = k+1 1 .

By (3.119) we see that as we vary ε the objective remains unchanged. Notice also that varying ε preserves all zero entries appearing in the vector Y >nk+1. Furthermore, because of (3.118), it is always possible to either decrease or increase ε until an additional zero is achieved, i.e., until nk+1 + εζ becomes orthogonal to a point of Y that nk+1 is not orthogonal to. Then we can replace nk+1 with nk+1 + εζ and repeat the process, until we get some n that is orthogonal to D 1 linearly independent points of Y . k+1 −

Theorem 3.14 ( [85]). Let Y = [y ,..., y ] be a D L matrix of rank D. Suppose 1 L × that for each problem min > Y >b , a solution n is chosen such that n is b nˆ k=1 1 k+1 k+1

67 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

orthogonal to D 1 linearly independent points of Y , in accordance with Lemma 3.13. −

Then the sequence n converges to a critical point of problem min > Y >b in a { k} b b=1 1

finite number of steps.

Proof. If nk+1 = nˆ k, then inspection of the first order optimality conditions of the two problems, reveals that nˆ is a critical point of min > Y >b . If n = nˆ , then k b b=1 1 k+1 6 k

n > 1, and so Y >nˆ < Y >nˆ . As a consequence, if n = nˆ , then k k+1k2 k+1 1 k 1 k+1 6 k nˆ k can not arise as a solution for some k0 > k. Now, because of Lemma 3.13, for each k,

there is a finite number of candidate directions nk+1. These last two observations imply

that the sequence n must converge in a finite number of steps to a critical point of { k}

> Y >b . minb b=1 1

We are now ready for the main theorem of this section, Theorem 3.15. Before stating

and proving the theorem, note that according to Theorem 3.3, the continuous recursion

converges in a finite number of iterations to a vector that is orthogonal to Span(X ) = , S as long as the initialization nˆ does not lie in (equivalently φ > 0). Intuitively, one 0 S 0 should expect that in passing to the discrete case, the conditions for the discrete recursion

(3.4) to converge to an element of ⊥ should be at least as strong as the conditions of S

Theorem 3.12, and strictly stronger than the condition φ0 > 0 of Theorem 3.4. Our next result formalizes this intuition.

Theorem 3.15. Suppose that condition (3.97) holds true and consider the vector sequence

nˆ k k 0 generated by the recursion (3.4). Let φ0 be the principal angle of nˆ 0 from { } ≥

68 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Span(X ) and suppose that

1 cd X 2 γO φ0 > cos− − − . (3.120) c + X  d 

Then after a finite number of iterations the sequence nˆ k k 0 converges to a unit `2-norm { } ≥ vector that is orthogonal to Span(X ).

Proof. We start by establishing that nˆ does not lie in the inlier space . For the sake of k S contradiction suppose that nˆ for some k > 0. Note that k ∈S

˜> ˜> ˜> ˜> X nˆ 0 X n1 X nˆ 1 X nˆ k . (3.121) 1 ≥ 1 ≥ 1 ≥···≥ 1

Suppose first that nˆ . Then (3.121) gives 0 ⊥S

O>nˆ O>nˆ + X >vˆ , (3.122) 0 1 ≥ k 1 k 1

where vˆ is the normalized projection of nˆ onto (and since nˆ , these two are k k S k ∈ S equal). Using Lemma 3.9, we take an upper bound of the LHS and a lower bound of the

RHS of (3.122), and obtain

Mc + MO Mc MO + Nc NX , (3.123) D ≥ D − d − or equivalently

c X γ d − , (3.124) ≥ 2 O which contradicts (3.97). Consequently, nˆ . Then (3.121) implies that 0 6⊥ S

O>nˆ + X >nˆ O>nˆ + X >nˆ , (3.125) 0 1 0 1 ≥ k 1 k 1

69 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

or equivalently

O>nˆ + cos(φ ) X >vˆ O>nˆ + X >vˆ , (3.126) 0 1 0 0 1 ≥ k 1 k 1

where vˆ is the normalized projection of nˆ onto . Once again, Lemma 3.9 is used to 0 0 S furnish an upper bound of the LHS and a lower bound of the RHS of (3.126), and yield

Mc + MO +(Nc + NX )cos(φ ) Mc MO + Nc NX , (3.127) D d 0 ≥ D − d −

which contradicts (3.120).

Now let us complete the proof of the theorem. We know by Theorem 3.14 that the

sequence n converges to a critical point n ∗ of problem (3.3) in a finite number of { k} k

steps, and we have already shown that n ∗ (see the beginning of the proof). k 6∈ S

Then an identical argument as in the proof of Theorem 3.12 (with nk∗ in place of b∗) shows

that n ∗ must be orthogonal to . k S

Discussion of Theorem 3.15. First note that if (3.97) is true, then the expression of (3.120)

always defines an angle between 0 and π/2. Moreover, Theorem 3.15 can be interpreted

using the same asymptotic arguments as Theorem 3.12; notice in particular that the lower

bound on the angle φ0 tends to zero as M,N go to infinity with γ constant, i.e., the more uniformly distributed inliers and outliers are, the closer n0 is allowed to be to Span(X ).

We also emphasize that Theorem 3.15 asserts the correctness of the linear program- ming recursions (3.4) as far as recovering a vector n ∗ orthogonal to := Span(X ) is k S concerned. Even though this was our initial motivation for posing problem (3.3), Theorem

3.15 does not assert in general that nk∗ is a global minimizer of problem (3.3). However,

70 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT this is indeed the case, when the inlier subspace is a hyperplane, i.e., d = D 1. This is S − D 1 because, up to a sign, there is a unique vector b S − that is orthogonal to (the normal ∈ S vector to the hyperplane), which, under conditions (3.97) and (3.120), is the unique global minimizer of (3.3), as well as the limit point nk∗ of Theorem 3.15.

3.2.4 Algorithmic contributions

In this section we discuss algorithmic formulations based on the ideas presented so far.

Specifically, 3.2.4.1 contains the basic Dual Principal Component Pursuit and Analysis § algorithms, which can be implemented by linear programming, while 3.2.4.2- 3.2.4.4 dis- § § cuss alternative DPCP algorithms suitable for noisy or high-dimensional data.

3.2.4.1 Relaxed DPCP and DPCA algorithms

Theorem 3.15 suggests a mechanism for obtaining an element b of ⊥, where = 1 S S Span(X ): run the sequence of linear programs (3.4) until the sequence nˆ converges { k} and identify the limit point with b1. Due to computational constraints, in practice one usu-

˜> ally terminates the recursions when the objective value X nˆ k converges within some 1

small ε, or a maximal number Tmax of recursions is reached, and obtains an approximate normal b1. The resulting Algorithm 3.1 is referred to as DPCP-r, standing for relaxed

DPCP.

We emphasize that step 6 of Algorithm 3.1 can be canonically solved by linear pro-

71 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Algorithm 3.1 Relaxed Dual Principal Component Pursuit ˜ 1: procedure DPCP-R(X ,ε,Tmax)

2: k 0;∆ ; ← J ← ∞

X˜> 3: nˆ 0 argmin b =1 b ; ← k k2 2

4: while kε do max J 5: k k +1; ←

X˜> 6: nk argminb>nˆ =1 b ; ← k−1 1

7: nˆ n / n ; k ← k k kk2

˜> ˜> ˜> 9 8: ∆ X nˆ k 1 X nˆ k / X nˆ k 1 + 10− ; J ← − 1 − 1 − 1    

9: end while

10: return nˆ k;

11: end procedure gramming. More specifically, we can rewrite it in the form

u+ min 1 1   , such that (3.128) b,u+,u− 1 N 1 N  × ×  u−     u+ ˜> IN IN X   0N 1 − − × +   u =   , u , u− 0, (3.129)  − ≥ 0 0    1 N 1 N nˆ k>     1   × ×         b        and solve it efficiently with an optimized general purpose linear programming solver, such as Gurobi [43]. Having computed a b1 with Algorithm 3.1, there are two possibilities: either is a hyperplane of dimension D 1 or dim < D 1. In the first case we can S − S −

72 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Algorithm 3.2 Relaxed Dual Principal Component Analysis ˜ 1: procedure DPCA-R(X ,c,ε,Tmax)

2: ; B ← ∅ 3: for i =1: c do

4: k 0;∆ ; ← J ← ∞

X˜> 5: nˆ 0 argminb b ; ← ⊥B 2

6: while k T and ∆ >ε do ≤ max J 7: k k +1; ← ˜ 8: n > X >b ; k argminb nˆ k =1,b ← −1 ⊥B 1

9: nˆ n / n ; k ← k k kk2

˜> ˜> ˜> 9 10: ∆ X nˆ k 1 X nˆ k / X nˆ k 1 + 10− ; J ← − 1 − 1 − 1    

11: end while

12: b nˆ ; i ← k 13: b ; B ← B ∪ { i} 14: end for

15: return ; B 16: end procedure

identify our subspace model with the hyperplane defined by the normal b1. If on the other hand dim < D 1, we can proceed to find a b b that is approximately orthogonal S − 2 ⊥ 1 to , and so on; this naturally leads to the relaxed Dual Principal Component Analysis of S Algorithm 3.2.

73 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

In Algorithm 3.2, c is an estimate for the codimension D d of the inlier subspace − Span( ). If c is rather large, then in the computation of each b , it is more efficient to be X i reducing the coordinate representation of the data to D (i 1) coordinates, by project- − −

ing the data orthogonally onto Span(b1,..., bi 1)⊥ and solving the linear program in the − projected space.

Notice further how the algorithm initializes n0: This is precisely the right singular vector of X˜> that corresponds to the smallest singular value, after projection of X˜ onto

Span(b1,..., bi 1)⊥. As it will be demonstrated in section 3.2.5.1, this choice has the −

effect that the angle of n0 from the inlier subspace is typically large; more precisely, it

is often larger than the smallest initial angle (3.120) required for the success of the linear

programming recursions.

3.2.4.2 Relaxed and denoised DPCP

When the data are corrupted by noise, one no longer expects the product X˜>b to be sparse, even if b is orthogonal to the underlying inlier space. Instead, our expectation is that, for such a b, X˜>b should be the sum of a sparse vector with a dense vector of small euclidean norm. This motivates us to replace the optimization problem

˜> min X b , s.t. b>nˆ k 1 =1, (3.130) b 1 −

which appears in Algorithm 3.1, with the denoised optimization problem

1 2 X˜> min τ y 1 + y b , s.t. b>nˆ k 1 =1, (3.131) y,b k k 2 − 2 −  

74 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

where τ is a positive parameter. Observe, that if the optimal b were known, then y would ˜ be given by the element-wise soft thresholding S (X >b), where the function S : R R τ τ → is defined by S (α) = Sign(α) max(0, α τ). If on the other hand y were known, then τ · | |−

b would be given by b = nˆ k 1 + U k 1z, where U k 1 is a D (D 1) matrix containing − − − × −

in its columns an orthonormal basis for the orthogonal complement of nˆ k 1, and z is the − solution to the standard least-squares problem

2 ˜> ˜> min y X nˆ k 1 X U k 1z . (3.132) z RD−1 − − − − 2 ∈

Observe that solving problem (3.132) requires solving a linear system of equations with

˜> > ˜> coefficient matrix X U k 1 X U k 1. The dependence of this matrix on the iteration − −   index k, may become a computational issue when D is large. This dependence can be

circumvented as follows. First, we treat the computation of b given y as a constrained

problem

2 ˜> min y X b , , s.t. b>nˆ k 1 =1, (3.133) b − 2 −

and thinking in terms of a Lagrange multiplier λ associated to the constraint b>nˆ k 1 = 1, − we see that λ, b must satisfy the relation

˜ ˜ ˜> X y X X b λnˆ k 1 =0. (3.134) − − −

Noting that X˜X˜> is always invertible under our data model, we must have that

1 1 ˜ ˜> − ˜ ˜> − b = X X y λ X X nˆ k 1. (3.135) − −    

75 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Now, multiplying the above equation from the left with nˆ k> 1, we obtain −

1 1 X˜X˜> − X˜X˜> − 1= nˆ k> 1 y λnˆ k> 1 nˆ k 1, (3.136) − − − −     or equivalently

1 X˜X˜> − nˆ k> 1 y 1 λ = − − , (3.137)   1 X˜X˜> − nˆ k> 1 nˆ k 1 − −   which upon substitution into (3.135) gives our final formula for b:

1 X˜X˜> − 1 nˆ k> 1 y 1 1 ˜ ˜> − − − ˜ ˜> − b = X X y X X nˆ k 1. (3.138)    1  − − X˜X˜> −   nˆ k> 1 nˆ k 1    − −      1 1 ˜ ˜> − ˜ ˜> − Notice that the quantities X X y and X X nˆ k 1 can be obtained as solutions −     to linear systems of equations with common coefficient matrix X˜X˜>, which themselves can be solved very efficiently by backward and forward substitution, assuming that a pre- computed Cholesky factorization of X˜X˜> is available.

To summarize, we have shown how to approximately solve problem (3.130) by a very efficient alternating minimization scheme, which involves soft-thresholding and forward- backward substitution; we will be referring to the resulting DPCP algorithm as DPCP-r-d, which stands for relaxed and denoised DPCP.

3.2.4.3 Denoised DPCP

Interestingly, DPCP-r-d is very closely related to the scheme proposed in [80], where the authors study problem (3.3) in the very different context of dictionary learning. To approx-

76 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT imately solve (3.3), the authors of [80] solve its denoised version

1 2 X˜> min τ y 1 + y b , (3.139) b,y: b 2=1 k k 2 − 2 || ||   by alternating minimization. Given b, the optimal y is given by X˜>b , where is the Sτ Sτ   soft-thresholding operator applied element-wise on the vector X˜>b. Given y the optimal b is a solution to the quadratically constrained least-squares problem

2 X˜> min y b , s.t. b 2 =1. (3.140) b RD − 2 k k ∈

In the context of [80], the coefficient matrix of the least-squares problem (X˜> in our no- tation) has orthonormal columns. As a consequence, the solution to (3.140) is obtained in closed form by projecting the solution of the unconstrained least-squares problem

˜ min y X >b (3.141) b RD − 2 ∈ onto the unit sphere. However, in our context the assumption that X˜> has orthonormal columns is strongly violated, so that the optimal b is no longer available in closed form.

In fact, problem (3.140) is well known in the literature [26,35,38], and the standard way to solve it is by means of Lagrange multipliers. This involves solving a non- for the Lagrange multiplier, which is known to be challenging [26]. For this reason we leave exact approaches for solving (3.140) to future investigations, and we instead propose to obtain an approximate b as in [80]. We will call the resulting Algorithm 3.3 DPCP-d, which stands for denoised DPCP.

Notice that DPCP-d is very efficient, since the least-squares problems that appear in the various iterations have the same coefficient matrix X˜X˜>, a factorization of which can be

77 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Algorithm 3.3 Denoised Dual Principal Component Pursuit ˜ 1: procedure DPCP-D(X ,ε,Tmax,δ,τ)

˜ ˜> 2: Compute a Cholesky factorization LL> = X X + δID;

3: k 0;∆ ; ← J ← ∞

X˜> 4: b argminb RD: b =1 b ; ← ∈ k k2 2

˜> 5: 0 τ X b ; J ← 1

6: while kε do max J 7: k k +1; ← ˜ 8: y X >b ; ←Sτ   ˜ 9: b solution of LL>ξ = X y by backward/forward propagation; ← 10: b b/ b ; ← k k2 2 1 X˜> 11: k τ y 1 + 2 y b ; J ← k k − 2

9 12: ∆ ( k 1 k) / ( k 1 + 10− ); J ← J − −J J − 13: end while

14: return (y, b);

15: end procedure

precomputed 8. As we will see in section 3.2.5 , the performance of DPCP-d is remarkably close to that of DPCP-r, for which we have guarantees of global optimality, suggesting that

DPCP-d converges to a global minimum. We leave theoretical investigations of DPCP-d to future research. 8The parameter δ in Algorithm 3.3 is a small positive number, typically 10−6, which helps avoiding solving ill-conditioned linear systems.

78 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Finally, notice that Algorithm 3.3 computes a single normal vector b1. As with DPCP-r, it is trivial to adjust Algorithm 3.3 to compute a second normal vector b2, since one only needs to incorporate the linear constraint b>b1 =0, and so on. We will again refer to such an algorithm that computes c 1 normals as DPCP-d. ≥

3.2.4.4 DPCP via iteratively reweighted least-squares

Even though DPCP-d and DPCP-r-d are very attractive from a computational point of view,

they only produce approximate normal vectors (even in the absence of noise), and have the

additional disadvantage that their performance depends on the parameter τ, whose tuning

is not well understood. On the other hand, DPCP-r was shown to produce exact normal

vectors, yet solving linear programs of the form (3.128)-(3.129) can be inefficient when the

data are high-dimensional. This motivates us to propose a direct IRLS algorithm for solving

the DPCP problem (3.3). In fact, since we are interested in obtaining an orthonormal basis

for the orthogonal complement of the inlier subspace, we propose an Iteratively Reweighted

Least-Squares (IRLS) scheme for solving problem

˜> min X B , s.t. B>B = Ic, (3.142) B RD×c 1,2 ∈

which is a generalization of the DPCP problem for multiple normal vectors. More specifi-

cally, given a D c orthonormal matrix Bk 1, we define for each point x˜j a weight × −

1 wj,k := , (3.143) max δ, Bk> 1x˜j − 2 

79 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

where δ > 0 is a small constant that prevents division by zero. Then we obtain Bk as the solution to the quadratic problem

L 2 min wj,k B>x˜j , s.t. B>B = Ic, (3.144) B RD×c 2 ∈ j=1 X which is readily seen to be the c right singular vectors corresponding to the c smallest

˜> singular values of the weighted data matrix W kX , where W is a diagonal matrix with

√wj,k at position (j,j). We refer to the resulting Algorithm 3.4 as DPCP-IRLS; a study

of its theoretical properties is deferred to future work. We note here that the technique

of solving an optimization problem (convex or non-convex) through IRLS is a common

one. In fact, REAPER [62] solves through IRLS a convex relaxation of precisely problem

(3.142). Other prominent instances of IRLS schemes from compressed sensing are [13,14,

19].

3.2.5 Experimental evaluation

In this section we evaluate the proposed algorithms experimentally. In 3.2.5.1 we inves- § tigate the performance of our principal algorithmic proposal, i.e., DPCP-r, described in

Algorithm 3.1. In 3.2.5.2 we compare DPCP-r, DPCP-d, DPCP-r-d 9 and DPCP-IRLS § with state-of-the-art robust PCA algorithms using synthetic data, and similarly in 3.2.5.3 § using real images.

9To avoid confusion, we will slightly abuse our terminology and refer to DPCA-r (DPCA-d, DPCA-r-d) as DPCP-r (DPCP-d, DPCP-r-d), even when c> 1. The distinction of whether a single versus multiple dual principal components are being computed will be clear from the context.

80 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Algorithm 3.4 Dual Principal Component Pursuit via Iteratively Reweighted Least Squares ˜ 1: procedure DPCP-IRLS(X ,c,ε,Tmax,δ)

2: k 0;∆ ; ← J ← ∞

X˜> 3: B0 argminB RD×c B , s.t. B>B = c; ← ∈ 2 I

4: while kε do max J 5: k k +1; ← X˜ 6: wx˜ 1/ max δ, Bk> 1x˜ , x˜ ; ← − 2 ∈  2 7: B D×c ˜ B>x s.t. B>B ; k argminB R x˜ X wx˜ ˜ 2 = c ← ∈ ∈ I

˜> P ˜> ˜> 9 8: ∆ X Bk 1 X Bk / X Bk 1 + 10− ; J ← − 1 − 1 − 1    

9: end while

10: return Bk;

11: end procedure

3.2.5.1 Computing a single dual principal component

We begin by investigating the behavior of the DPCP-r Algorithm 3.1 in the absence of

noise, for random subspaces of varying dimensions d =1:1:29 and varying outlier S percentages R := M/(M + N)=0.1:0.1:0.9. We fix the ambient dimension D = 30,

D 1 sample N = 200 inliers uniformly at random from S − and M outliers uniformly S ∩ D 1 3 at random from S − . We set  = 10− and Tmax = 10 in Algorithm 3.1. Our main interest is in examining the ability of DPCP-r in recovering a single normal vector to the subspace (c =1). The results over 10 independent trials are shown in Figure 3.5, in which

81 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

the vertical axis denotes the relative dimension of the subspace, defined as d/D.

Figure 3.5(a) shows whether the theoretical condition (3.97) is satisfied (white) or not

(black). In checking this condition, we estimate the abstract quantities

O,X , O , X (3.145) R ,K1 R ,K2

by Monte-Carlo simulation. Whenever the condition is true, we choose nˆ 0 in a controlled fashion, so that its angle φ0 from the subspace is larger than the minimal angle φ0∗ of (3.120); then we run DPCP-r. If on the other hand (3.97) is not true, we do not run DPCP-r and report a 0 (black). Fig 3.5(b) shows the angle of nˆ 10 from the subspace. We see that whenever

(3.97) is true, DPCP-r returns a normal after only 10 iterations. Fig 3.5(c) shows that if we initialize randomly nˆ 0, then its angle φ0 from the subspace tends to become less than the

minimal angle φ0∗, as d increases. Even so, Figure 3.5(d) shows that DPCP-r still yields a

numerical normal, except for the regime where both d and R are very high. Notice that this

is roughly the regime where we have no theoretical guarantees, according to Figure 3.5(a).

˜> Figure 3.5(e) shows that if we initialize nˆ 0 as the right singular vector of X corresponding

to the smallest singular value, then φ0 > φ0∗ is true for most cases, and the corresponding

performance of DPCP-r in Figure 3.5(f) improves further. Finally, Figure 3.5(g) plots φ . 0∗ We see that for very low d this angle is almost zero, i.e. DPCP-r does not depend on the

initialization, even for large R. As d increases though, so does φ0∗, and in the extreme case

o of the upper rightmost regime, where d and R are very high, φ0∗ is close to 90 , verifying

our expectation that DPCP-r will succeed only if nˆ is very close to ⊥. 0 S

82 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

0.97 0.97 0.97 0.83 0.83 0.83

0.67 0.67 0.67

0.5 0.5 0.5

0.33 0.33 0.33 relative dimension relative dimension 0.17 relative dimension 0.17 0.17 0.03 0.03 0.03 0.1 0.5 0.9 0.1 0.5 0.9 0.1 0.5 0.9 outlier ratio outlier ratio outlier ratio

∗ ∗ (a) eq. (3.97) (b) DPCP-r(φ0 >φ0) (c) φ0 >φ0

0.97 0.97 0.97 0.83 0.83 0.83

0.67 0.67 0.67

0.5 0.5 0.5

0.33 0.33 0.33

relative dimension 0.17 relative dimension 0.17 relative dimension 0.17 0.03 0.03 0.03 0.1 0.5 0.9 0.1 0.5 0.9 0.1 0.5 0.9 outlier ratio outlier ratio outlier ratio

∗ (d) DPCP-r(random φ0) (e) φ0,SVD >φ0 (f) DPCP-r(φ0,SVD)

0.97 0.83

0.67

0.5

0.33

relative dimension 0.17 0.03 0.1 0.5 0.9 outlier ratio

∗ (g) φ0

Figure 3.5: Various quantities associated to the performance of DPCP-r; see 3.2.5.1. Fig- § ure 3.5(a) shows whether condition (3.97) is true (white) or not (black). Figure 3.5(b) shows the angle from of nˆ after 10 iterations of DPCP-r when (3.97) is true; the other S 10 cases are mapped to black. Figure 3.5(c) shows whether a random nˆ 0 satisfies φ0 > φ0∗,

where φ∗ is as in Thm. 3.15. Figure 3.5(d) shows the angle from of nˆ for random nˆ . 0 S 10 0 Figure 3.5(e) shows the angle from of the right singular vector of X˜ corresponding to S

the smallest singular value, Figure 3.5(f) shows the corresponding angle of nˆ 10, and Figure

3.5(g) plots φ . 0∗ 83 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

3.2.5.2 Outlier detection using synthetic data

In this section we begin by using the same synthetic experimental set-up as in 3.2.5.1 § (except that now N = 300) to demonstrate the behavior of several methods relative to each other under uniform conditions, in the context of outlier rejection in single subspace learning. In particular, we test DPCP-r, DPCP-d, DPCP-r-d, DPCP-IRLS, the IRLS version of REAPER [62], RANSAC [34], SE-RPCA [84], and `2,1-RPCA [118]; see Chapter 2 for details on existing methods.

For the methods that require an estimate of the subspace dimension d, such as REAPER,

RANSAC, and all DPCP variants, we provide as input the true subspace dimension. The

3 convergence accuracy of all methods is set to 10− . For REAPER we set the regularization

6 parameter δ equal to 10− and the maximal number of iterations equal to 100. For DPCP-

r we set τ = 1/√N + M as suggested in [80] and maximal number of iterations 1000.

3 For RANSAC we set its thresholding parameter equal to 10− , and for fairness, we do not let it terminate earlier than the running time of DPCP-r. Both SE-RPCA and `2,1-

RPCA are implemented with ADMM, with augmented Lagrange parameters 1000 and 100 respectively. For `2,1-RPCA λ is set to 3/(7√M), as suggested in [118]. DPCP variants are initialized as in Algorithm 3.2, and the parameters of DPCP-r are as in 3.2.5.1. § Absence of noise. We investigate the potential of each of the above methods to per- fectly distinguish outliers from inliers in the absence of noise 10. Note that each method

10We do not include the results of DPCP-d and DPCP-r-d for this experiment, since they only approx- imately solve the DPCP optimization problem, and hence they can not be expected to perfectly separate inliers from outliers, even when there is no noise.

84 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

0.97 0.97 0.97 0.83 0.83 0.83

0.67 0.67 0.67

0.5 0.5 0.5

0.33 0.33 0.33 relative dimension relative dimension relative dimension 0.17 0.17 0.17 0.03 0.03 0.03 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 outlier ratio outlier ratio outlier ratio

(a) REAPER (b) RANSAC (c) SE-RPCA

0.97 0.97 0.97 0.83 0.83 0.83

0.67 0.67 0.67

0.5 0.5 0.5

0.33 0.33 0.33 relative dimension relative dimension relative dimension 0.17 0.17 0.17 0.03 0.03 0.03 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 outlier ratio outlier ratio outlier ratio

(d) `2,1-RPCA (e) DPCP-IRLS (f) DPCP-r

Figure 3.6: Outlier/Inlier separation in the absence of noise over 10 independent trials.

The horizontal axis is the outlier ration defined as M/(N + M), where M is the number

of outliers and N is the number of inliers. The vertical axis is the relative inlier subspace

dimension d/D; the dimension of the ambient space is D = 30. Success (white) is declared

by the existence of a threshold that, when applied to the output of each method, perfectly

separates inliers from outliers.

returns a signal α RN+M , which can be thresholded for the purpose of declaring outliers ∈ +

and inliers. For SE-RPCA, α is the `1-norm of the columns of the coefficient matrix C,

while for `2,1-RPCA it is the `2-norm of the columns of E. Since REAPER, RANSAC,

DPCP-r and DPCP-IRLS directly return subspace models, for these methods α is simply

the distances of all points to the estimated subspace.

85 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

In Figure 3.6 we depict success (white) versus failure (black), where success is inter-

preted as the existence of a threshold on α that perfectly separates outliers and inliers.

First observe that, as expected, SE-RPCA and `2,1-RPCA succeed only when the subspace dimension d is small. In particular, the more outliers are present the lower the dimension of the subspace needs to be for the methods to succeed. The same is true for RANSAC, except when there are only very few outliers (10%), in which case the probability of sam- pling outlier-free points is high. Finally notice that SE-RPCA is the best method among these three in dealing with large percentages of outliers (> 70%). This is not a surprise, because the theoretical guarantees of SE-RPCA do not place an explicit upper bound on the number of outliers, in contrast to `2,1-RPCA and RANSAC. Next, notice that REAPER per- forms uniformly better than RANSAC, SE-RPCA and `2,1-RPCA. In particular REAPER can handle higher dimensions and higher outlier percentages; for example it succeeds over all trials for hyperplanes when there are 10% outliers.

In summary, none of REAPER, RANSAC, SE-RPCA, and `2,1-RPCA can deal with hyperplanes with more than 20% outliers or with subspaces of medium relative dimension

(d > 13) for as many as 90% outliers. This gap is filled by the two proposed meth- ods DPCP-r and DPCP-IRLS. Notice that DPCP-r is the only method that succeeds irre- spectively of subspace dimension with almost 70% outliers, while DPCP-IRLS is the only method that succeeds when d 19 and R = 90%. ≤ Presence of Noise. Next, we keep D = 30 and investigate the performance of the

methods, adding DPCP-d and DPCP-r-d in the mix, in the presence of varying levels of

86 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

noise and outliers for two cases of high-dimensional subspaces, i.e., d = 25 and d =

29. The inliers are corrupted by additive white gaussian noise of zero mean and standard deviation σ = 0.02, 0.06, 0.1, with support in the orthogonal complement of the inlier subspace, while the percentage of outliers varies as R = 20%, 33%, 50%. Finally, for

DPCP-d and DPCP-r-d we set τ = max σ, 1/√N + M , while for RANSAC we set its  threshold equal to σ.

We evaluate the performance of each method by its corresponding ROC curve. Each

point of an ROC curve corresponds to a certain value of a threshold, with the vertical

coordinate of the point giving the percentage of inliers being correctly identified as inliers

(True Positives), and the horizontal coordinate giving the number of outliers erroneously

identified as inliers (False Positives). As a consequence, an ideal ROC curve should be

concentrated to the top left of the first quadrant, i.e., the area over the curve should be zero.

The ROC curves for the case d = 25 are given in Figure 3.7, where for each curve we also report the area over the curve. As expected, the low-rank methods RANSAC,

SE-RPCA and `2,1-RPCA perform poorly with RANSAC being the worst method for 50% outliers and SE-RPCA being the weakest method otherwise. On the other hand REAPER,

DPCP-d, DPCP-IRLS, DPCP-r-d and DPCP-r perform almost perfectly well, with DPCP-

IRLS actually giving zero error across all cases. Notice the interesting fact that DPCP-r performs slightly better across all cases than both DPCP-d and DPCP-r-d, despite the fact that DPCP-d and DPCP-r-d are intuitively more suitable for noisy data than DPCP-r.

The ROC curves for the case d = 29 are given in Figure 3.8. As expected, the low-rank

87 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

1 1 1

0.8 0.8 0.8 RANSAC(0.038) (0.147) (0.401) ℓ 0.6 21-RPCA(0.067) 0.6 (0.132) 0.6 (0.202) SE-RPCA(0.109) (0.173) (0.230) REAPER(0) (0.000) (0.010) 0.4 DPCP-r(0) 0.4 (0) 0.4 (0.001) DPCP-r-d(0) (0.000) (0.012) 0.2 DPCP-d(0) 0.2 (0) 0.2 (0.007) DPCP-IRLS(0) (0) (0) 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

(a) σ = 0.02,R = 20% (b) σ = 0.02,R = 33% (c) σ = 0.02,R = 50%

1 1 1

0.8 0.8 0.8 (0.054) (0.143) (0.269) (0.069) 0.6 0.6 (0.138) 0.6 (0.201) (0.112) (0.179) (0.233) (0) (0.000) (0.009) 0.4 (0) 0.4 (0) 0.4 (0.000) (0) (0.000) (0.006) 0.2 (0) 0.2 (0) 0.2 (0.002) (0) (0) (0.000) 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

(d) σ = 0.06,R = 20% (e) σ = 0.06,R = 33% (f) σ = 0.06,R = 50%

1 1 1 0.8 0.8 (0.077) 0.8 (0.163) (0.278) 0.6 (0.080) (0.139) 0.6 (0.207) (0.125) 0.6 (0.181) (0.236) (0.000) (0.001) (0.013) 0.4 0.4 (0.000) 0.4 (0.000) (0.005) (0.000) (0.001) (0.015) (0.000) (0.007) 0.2 0.2 (0.000) 0.2 (0.000) (0.000) (0.000) 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0.8 1

(g) σ = 0.1,R = 20% (h) σ = 0.1,R = 33% (i) σ = 0.1,R = 50%

Figure 3.7: ROC curves as a function of noise standard deviation σ and outlier percentage

R, for subspace dimension d = 25 in ambient dimension D = 30. The horizontal axis

is False Positives ratio and the vertical axis is True Positives ratio. The number associ-

ated with each curve is the area above the curve; smaller numbers reflect more accurate

performance.

88 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

1 1 1

0.8 0.8 0.8 RANSAC(0.276) (0.417) (0.468) ℓ (0.406) 0.6 21-RPCA(0.355) 0.6 0.6 (0.427) SE-RPCA(0.381) (0.421) (0.436) REAPER(0.033) (0.111) (0.175) 0.4 DPCP-r(0.016) 0.4 (0.014) 0.4 (0.016) DPCP-r-d(0.017) (0.019) (0.058) 0.2 DPCP-d(0.016) 0.2 (0.015) 0.2 (0.018) DPCP-IRLS(0.018) (0.016) (0.020) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(a) σ = 0.02,R = 20% (b) σ = 0.02,R = 33% (c) σ = 0.02,R = 50%

1 1 1

0.8 0.8 0.8 (0.302) (0.388) (0.448) (0.424) 0.6 (0.365) 0.6 (0.397) 0.6 (0.387) (0.409) (0.430) (0.053) (0.111) (0.190) 0.4 (0.039) 0.4 (0.042) 0.4 (0.045) (0.038) (0.047) (0.085) 0.2 (0.038) 0.2 (0.040) 0.2 (0.048) (0.040) (0.045) (0.052) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(d) σ = 0.06,R = 20% (e) σ = 0.06,R = 33% (f) σ = 0.06,R = 50%

1 1 1

0.8 0.8 0.8 (0.312) (0.392) (0.456) 0.6 (0.369) 0.6 (0.404) 0.6 (0.428) (0.392) (0.416) (0.434) (0.086) (0.129) (0.188) 0.4 (0.071) 0.4 (0.072) 0.4 (0.077) (0.072) (0.077) (0.105) 0.2 (0.068) 0.2 (0.069) 0.2 (0.077) (0.073) (0.075) (0.084) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(g) σ = 0.1,R = 20% (h) σ = 0.1,R = 33% (i) σ = 0.1,R = 50%

Figure 3.8: ROC curves as a function of noise standard deviation σ and outlier percentage

R, for subspace dimension d = 29 in ambient dimension D = 30. The horizontal axis

is False Positives ratio and the vertical axis is True Positives ratio. The number associ-

ated with each curve is the area above the curve; smaller numbers reflect more accurate

performance.

89 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

methods RANSAC, SE-RPCA and `2,1-RPCA fail even for low noise (σ =0.02) and mod-

erate outliers (20%). Notice that for 50% outliers and for any threshold, there is roughly an

equal chance of identifying a point as inlier or as outlier, i.e., the performance of the meth-

ods is almost the same as that of a random guess. On the other hand, DPCP-r, DPCP-d,

and DPCP-IRLS are very robust to variations of the noise level and outlier percentages and

are the best methods. DPCP-r-d performs almost identically with these methods, except in

the case of 50% outliers, where it performs less accurately. Finally, notice that the perfor-

mance of REAPER degrades significantly as soon as the outlier percentage exceeds 20%,

indicating that REAPER is not the best method for subspaces of very low codimension.

3.2.5.3 Outlier detection using real face and object images

In this section we use the Extended-Yale-B real face dataset [36] as well as the real image

dataset Caltech101 [32] to compare the proposed algorithms DPCP-r, DPCP-d, DPCP-r-d

and DPCP-IRLS to REAPER, RANSAC, `2,1-RPCA, and SE-RPCA. We recall that the

Extended-Yale-B dataset contains 64 face images for each of 38 distinct individuals. We

use the first 19 individuals from the Extended-Yale-B dataset and the first half images of

each category in Caltech101 for gaining intuition and tuning the parameters of each method

(training set), while the remaining part of the datasets serve as a test set.

In the Extended-Yale-B dataset, all face images correspond to the same fixed pose,

while the illumination conditions vary. Such images are known to lie in a 9-dimensional

linear subspace, with each individual having its own corresponding subspace [4]. In this

90 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

1 1 1

0.8 0.8 0.8 RANSAC(0.256) (0.250) (0.260) ℓ -RPCA(0.043) 0.6 21 0.6 (0.069) 0.6 (0.101) SE-RPCA(0.078) (0.095) (0.153) REAPER(0.051) (0.061) (0.072) 0.4 DPCP-r(0.070) 0.4 (0.068) 0.4 (0.080) DPCP-r-d(0.049) (0.065) (0.087) 0.2 DPCP-d(0.049) 0.2 (0.065) 0.2 (0.087) DPCP-IRLS(0.068) (0.069) (0.074) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(a) R = 20%,C0 N0 (b) R = 33%,C0 N0 (c) R = 50%,C0 N0 − − −

1 1 1

0.8 0.8 0.8 (0.233) (0.238) (0.239) 0.6 (0.137) 0.6 (0.157) 0.6 (0.197) (0.165) (0.182) (0.238) (0.125) (0.137) (0.165) 0.4 (0.166) 0.4 (0.151) 0.4 (0.172) (0.127) (0.136) (0.173) 0.2 (0.141) 0.2 (0.145) 0.2 (0.176) (0.149) (0.141) (0.164) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(d) R = 20%,C0 N1 (e) R = 33%,C0 N1 (f) R = 50%,C0 N1 − − −

1 1 1

0.8 0.8 0.8 (0.136) (0.133) (0.161) 0.6 (0.045) 0.6 (0.060) 0.6 (0.091) (0.050) (0.083) (0.150) (0.055) (0.050) (0.064) 0.4 (0.069) 0.4 (0.052) 0.4 (0.075) (0.050) (0.050) (0.076) 0.2 (0.050) 0.2 (0.050) 0.2 (0.076) (0.084) (0.055) (0.067) 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0.8

(g) R = 20%,C1 N0 (h) R = 33%,C1 N0 (i) R = 50%,C1 N0 − − −

1 1 1

0.8 0.8 0.8 (0.129) (0.133) (0.141) 0.6 (0.051) 0.6 (0.065) 0.6 (0.088) (0.047) (0.078) (0.146) (0.059) (0.062) (0.064) 0.4 (0.070) 0.4 (0.065) 0.4 (0.071) (0.063) (0.060) (0.075) 0.2 (0.072) 0.2 (0.062) 0.2 (0.078) (0.079) (0.066) (0.068) 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0.8

(j) R = 20%,C1 N1 (k) R = 33%,C1 N1 (l) R = 50%,C1 N1 − − −

Figure 3.9: Average ROC curves and areas over the curves for different percentages R of outliers; see 3.2.5.3. Both inliers and outliers come from EYaleB. C1 means data are centered (C0 not centered), N1 means data are normalized (N0 not normalized). 91 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

1 1 1

0.8 0.8 0.8 RANSAC(0.184) (0.152) (0.090) ℓ 0.6 21-RPCA(0.008) 0.6 (0.003) 0.6 (0.048) SE-RPCA(0.337) (0.028) (0.033) REAPER(0.059) (0.011) (0.006) 0.4 DPCP-r(0.170) 0.4 (0.036) 0.4 (0.039) DPCP-r-d(0.045) (0.007) (0.043) 0.2 DPCP-d(0.045) 0.2 (0.007) 0.2 (0.043) DPCP-IRLS(0.165) (0.034) (0.011) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8

(a) R = 20%,C0 N0 (b) R = 33%,C0 N0 (c) R = 50%,C0 N0 − − −

1 1 1

0.8 0.8 0.8 (0.212) (0.198) (0.189) 0.6 (0.047) 0.6 (0.100) 0.6 (0.224) (0.395) (0.178) (0.177) (0.093) (0.126) (0.143) 0.4 (0.150) 0.4 (0.156) 0.4 (0.172) (0.090) (0.134) (0.185) 0.2 (0.114) 0.2 (0.143) 0.2 (0.187) (0.118) (0.138) (0.147) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(d) R = 20%,C0 N1 (e) R = 33%,C0 N1 (f) R = 50%,C0 N1 − − −

1 1 1

0.8 0.8 0.8 (0.093) (0.067) (0.042) (0.002) (0.004) (0.040) 0.6 0.6 0.6 (0.003) (0.019) (0.030) (0.052) (0.012) (0.004) 0.4 (0.108) 0.4 (0.037) 0.4 (0.033) (0.035) (0.008) (0.035) 0.2 (0.035) 0.2 (0.008) 0.2 (0.035) (0.159) (0.043) (0.009) 0 0.1 0.2 0.3 0 0.05 0.1 0.15 0 0.05 0.1 0.15

(g) R = 20%,C1 N0 (h) R = 33%,C1 N0 (i) R = 50%,C1 N0 − − −

1 1 1

0.8 0.8 0.8 (0.089) (0.060) (0.043) 0.6 (0.025) 0.6 (0.013) 0.6 (0.082) (0.014) (0.018) (0.043) (0.022) (0.018) (0.012) 0.4 (0.086) 0.4 (0.031) 0.4 (0.064) (0.031) (0.022) (0.070) 0.2 (0.045) 0.2 (0.023) 0.2 (0.070) (0.079) (0.021) (0.016) 0 0.2 0.4 0 0.05 0.1 0.15 0.2 0 0.05 0.1 0.15 0.2 0.25

(j) R = 20%,C1 N1 (k) R = 33%,C1 N1 (l) R = 50%,C1 N1 − − −

Figure 3.10: Average ROC curves and areas over the curves for different percentages R of outliers; see 3.2.5.3. Inliers come from EYaleB, outliers from Caltech101. C1 means data are centered (C0 not centered), N1 means data are normalized (N0 not normalized). 92 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

1 1 1

0.8 0.8 0.8 RANSAC(0.211) (0.133) (0.145) ℓ 0.6 21-RPCA(0.165) 0.6 (0.060) 0.6 (0.492) SE-RPCA(0.181) (0.083) (0.078) REAPER(0.069) (0.050) (0.055) 0.4 DPCP-r(0.113) 0.4 (0.052) 0.4 (0.059) DPCP-r-d(0.137) (0.050) (0.053) 0.2 DPCP-d(0.137) 0.2 (0.050) 0.2 (0.053) DPCP-IRLS(0.100) (0.055) (0.060) 0 0.2 0.4 0.6 0.8 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0.8 1

(a) D = 15 (b) D = 50 (c) D = 150

Figure 3.11: ROC curves for three different projection dimensions, when there are 33% face outliers; data are centered but not normalized (C N ). 1 − 0

1 1 1

0.8 0.8 0.8 RANSAC(0.078) (0.067) (0.059) ℓ21-RPCA(0.089) (0.004) (0.435) 0.6 0.6 0.6 SE-RPCA(0.123) (0.019) (0.008) REAPER(0.003) (0.012) (0.011) 0.4 DPCP-r(0.064) 0.4 (0.037) 0.4 (0.033) DPCP-r-d(0.077) (0.008) (0.008) 0.2 DPCP-d(0.077) 0.2 (0.008) 0.2 (0.008) DPCP-IRLS(0.039) (0.043) (0.046) 0 0.1 0.2 0.3 0.4 0 0.05 0.1 0.15 0 0.2 0.4 0.6 0.8

(a) D = 15 (b) D = 50 (c) D = 150

Figure 3.12: ROC curves for three different projection dimensions, when there are 33% outliers from Caltech101; data are centered but not normalized (C N ). 1 − 0 experiment we use the 42 48 cropped images that were also used in [30]. Thus, the images × 42 48 2016 of each individual lie approximately in a 9-dimensional linear subspace of R × ∼= R . It is then natural to take as inliers all the images of one individual. For outliers, we consider two possibilities: either the outliers consist of images randomly chosen from the rest of the individuals, or they consist of images randomly chosen from Caltech101; we recall here that Caltech101 is a database of 101 image categories, such as images of airplanes or

93 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

images of brains. We will consider three different levels of outliers: 20%, 33% and 50%.

There are three important preprocessing steps that one may or may not chose to per- form. The first such step is dimensionality reduction. In our case, we choose to use pro- jection of the data matrix X˜ onto its first D = 50 principal components. This choice is justified by noting that, since the inliers and outliers are each at most 64, the data matrix is at most of rank 128. The choice d = 50 avoids situations where the entire data matrix is of low-rank to begin with, or cases where the outliers themselves span a low-dimensional sub- space; e.g., this would be true if we were working with the original dimensionality of 2016.

These instances would be particularly unfavorable for methods such as SE-RPCA and `2,1-

RPCA. On the other hand, methods such as REAPER, DPCP-IRLS, DPCP-d, DPCP-r-d and DPCP-r work with the orthogonal complement of the subspace, and so the lower the codimension the more efficient these methods are. We will shortly see how the methods behave for various projection dimensions.

Another pre-processing step is that of centering the data, i.e., forcing the entire dataset ˜ X to have zero mean; we will be writing C0 to denote that no centering takes place and C1 otherwise. Note here that we do not consider centering the inliers separately, as was done in [62]; this is unrealistic, since it requires knowledge of the ground truth, i.e., which image is an inlier and which is an outlier. Finally, one may normalize the columns of X˜ to have unit norm or not. We will denote these two possibilities as N1 and N0 respectively.

Various possibilities for the parameters of all algorithms were considered by experi- menting with the training set, and the ones that minimize the average area under the cor-

94 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

responding ROC curves were chosen, for the case of 33% outliers, for 50 independent experimental instances (which individual plays the role of the inlier subspace is a random event). The results for the two different choices of outliers and all four pre-processing

C N ,C N ,C N and C N , over 50 independent trials on the test set, are 0 − 0 0 − 1 1 − 0 1 − 1 reported in Figs. 3.9 and 3.10.

Our first observation is that, on the average, all methods perform about the same, with

REAPER being the best method and RANSAC the worse. Evidently, normalizing the data

without centering them, leads to uniform performance degradation for all methods, for both

outlier scenarios. On the other hand, all methods seem to be robust to the remaining three

combinations of centering and normalization, with perhaps C N being the best for 1 − 0 this experiment. A second observation is that the ROC curves are better for all methods,

when the outliers come from Caltech101. This is indeed expected, since, in that case, not

only are the outliers chosen from a different dataset, but also their content is very different

from that of the inliers. On the other hand, when the outliers are face images themselves,

it is intuitively expected that the inlier/outlier separation problem becomes harder, simply

because the outliers are of similar nature with the inliers.

Notice also the interesting phenomenon of all methods behaving worse when the out-

liers are minimal. For example, when R = 20%, the area over the curves for all methods in

Figure 3.10(a) is bigger than when R = 33% (Figure 3.10(b)). In fact, REAPER, RANSAC and DPCP-IRLS are becoming better as the number of outliers increases from 20% to 50%

(Figs. 3.10(a)-3.10(c)). This phenomenon is partially explained by our theory: separating

95 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

inliers from outliers is easier when the outliers are uniformly distributed in the ambient

space, and this latter condition is more easily achieved when the outliers are large in num-

ber.

Finally, figures 3.11 and 3.12 show how the methods behave if we vary the projection

dimension to D = 15 or D = 150, without adjusting any parameters. Evidently, REAPER is once again the most robust method, while `2,1-RPCA is the least robust. Interestingly, when going from D = 50 to D = 150 for the case of face outliers, only SE-RPCA shows a slight improvement; the rest of the methods become slightly worse.

3.3 Learning a hyperplane arrangement via DPCP

3.3.1 Problem overview

In 3.2 we saw that when the data consist of a set of points uniformly distributed in a § deterministic sense on the unit sphere of a linear subspace , together with a set of outliers S uniformly distributed on the unit sphere of the ambient space, then the DPCP problem

(3.3) has the remarkable property that every global solution is a vector b orthogonal to the subspace . A natural question that arises is whether the DPCP problem has the same S property when the data comes from a subspace arrangement, i.e., whether every global solution of the DPCP problem is orthogonal to one of the subspaces of the arrangement. If this were to be true, then DPCP could prove a useful tool for clustering subspaces of high relative dimension, in particular for learning arrangements of hyperplanes.

96 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Hyperplane arrangements arise in several applications, with important examples being

projective motion segmentation [108,113], 3D point cloud analysis [81] and hybrid system

identification [2]. In such applications the dataset X consists of N points of RD, such

D Ni that N points of X , say X R × , lie close to a hyperplane = x : b>x =0 , i i ∈ Hi i  where bi is the normal vector to the hyperplane. It is then of interest to learn the underlying hyperplane arrangement, i.e., learn a set of n normal vectors b1,..., bn, and cluster the

data points X according to their hyperplane membership. Clearly, this is a special case of

the subspace clustering problem, which we will be referring to as hyperplane clustering.

The purpose of this section is to establish both theoretically and experimentally that

DPCP a useful new tool for clustering hyperplanes.11 Indeed, a large part of this section

is devoted to understanding under what conditions is every solution of the DPCP problem

the normal vector to one of the hyperplanes associated to the data. As we will see, the

conditions entail the hyperplanes to be sufficiently separated or one of the hyperplanes to

be sufficiently dominant. The rest of the section explores hyperplane clustering algorithms

using synthetic as well as real 3D data.

3.3.2 Data model

The data structure of interest in this section is an arrangement = n RD of A i=1 Hi ⊂ D D S D 1 n hyperplanes of R , given by = x R : x>b =0 , i [n], where b S − Hi ∈ i ∈ i ∈  is the normal vector to hyperplane . We assume that we are given a collection of N Hi 11The general case is left to future investigations.

97 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

D N D 1 data points X = [x ,..., x ] R × that are in general position in S − . By 1 N ∈ A∩ general position we mean two things. First, we mean that there are no relations among the

points other than the ones induced by their membership to the hyperplanes; in particular,

every (D 1) points of X are linearly independent. Second, we mean that the points − X uniquely define the arrangement , in the sense that is the only arrangement of A A n hyperplanes that contains X . We assume that for every i [n], precisely N points ∈ i of X , denoted by X = [x(i),..., x(i)], belong to , with n N = N. With that i 1 Ni Hi i=1 i P notation, X = [X 1,..., X n]Γ, where Γ is an unknown permutation matrix, indicating

that the hyperplane membership of the points is unknown. Finally, we assume an ordering

N N N , and we refer to as the dominant hyperplane. 1 ≥ 2 ≥···≥ n H1

3.3.3 Theoretical analysis of the continuous problem

Similarly to the case of data from a single subspace corrupted with outliers ( 3.2.2.1), § certain important insights regarding the DPCP problem (3.3) with respect to hyperplane

clustering can be gained by examining an associated continuous problem. As it turns out,

this problem has a fascinating geometric interpretation, that is interesting in its own right,

and this section is devoted to its study. In 3.3.3.1 we derive the continuous problem, § discuss its geometric interpretation and give a basic property of its global minimizers. In

3.3.3.2 we completely characterize the global minimizers of the continuous problem for § the special cases of i) two hyperplanes and ii) orthogonal hyperplanes, while in 3.3.3.3 § we completely characterize the global minimizers of an arrangement of three equiangular

98 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

hyperplanes. Finally, in 3.3.3.4 we give optimality conditions for an arbitrary hyperplane § arrangement.

3.3.3.1 Derivation, interpretation and basic properties of the continuous problem

To see what is that the continuous problem associated with DPCP for the case of hyperplane

D 1 D 1 arrangements, let ˆ = S − , and note first that for any b S − we have Hi Hi ∩ ∈

Ni 1 (i) b>x b>x (3.146) j dµ ˆi , Ni ' x ˆi H j=1 Z H X 1 where the LHS of (3.146) is precisely X >b and can be viewed as a discretization Ni i 1

via the point set X of the integral on the RHS of (3.146), with denoting the uniform i µ ˆi H measure on ˆ .12 Letting θ be the principal angle between b and b , for any x we Hi i i ∈ Hi have

ˆ > b>x = b>π i (x)=(π i (b))> x = hi,>bx = sin(θi) hi,bx. (3.147) H H

Hence,

b>x hˆ > x (3.148) dµ ˆi = i,b dµ ˆi sin(θi) x ˆi H x ˆi H Z ∈H Z ∈H 

= x1 dµSD−2 sin(θi) (3.149) x SD−2 | | Z ∈ 

= c sin(θi), (3.150)

D 1 where c is the average height of the unit hemisphere of R − (in the notation of equation

(3.24) we have c = cD 1). As a consequence, we can view the objective function of (3.3), − 12See 3.2.2.1 for a detailed measure-theoretic discussion of 3.146 in the context of data from a single subspace§ corrupted with outliers. Similar arguments apply here but are omitted for the sake of brevity.

99 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

which is given by

n n Ni 1 (i) X >b = X >b = N b>x , (3.151) 1 i 1 i N j i=1 i=1 i j=1 ! X X X

as a discretization via the point set X of the function

n n (3.150) b b>x (3.152) ( ) := Ni dµ ˆi = Ni c sin(θi). J x ˆi H i=1 Z ∈H  i=1 X X

In that sense, the continuous counterpart of problem (3.3) is precisely

D 1 min (b)= N1 c sin(θ1)+ + Nn c sin(θn), s.t. b S − , (3.153) b J ··· ∈ where now the integer N is interpreted as a positive weight assigned to hyperplane , i Hi penalizing the distance of b from bi.

Remark 3.16. Geometrically, a solution b∗ to problem (3.153) can be interpreted as a weighted median of the lines spanned by b1,..., bn. Medians in Riemmannian , and in particular in the , are an active subject of research [24,37].

However, we are not aware of any work in the literature that defines a median by means of

(3.153), nor any work that studies (3.153).

The advantage of working with (3.153) instead of (3.3), is that the solution set of the continuous problem (3.153) depends solely on the weights N N N assigned 1 ≥ 2 ≥···≥ n to the hyperplane arrangement, as well as on the geometry of the arrangement, captured by the principal angles φij between bi and bj. In contrast, the solutions of the discrete problem (3.3) may also depend on the distribution of the points X . From that perspective, understanding when problem (3.153) has a unique solution that coincides with the normal

100 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

b to the dominant hyperplane , is essential for understanding the potential of (3.3) ± 1 H1 for hyperplane clustering. Towards that end, we provide a series of results pertaining to the

continuous problem (3.153).

To begin with, we note that the objective function of (3.153) is everywhere differen-

tiable except at the points b ,..., b , where its partial derivatives do not exist. For any ± 1 ± n D 1 b S − distinct from b , the gradient at b is given by ∈ ± i n bi>b b = 1 bi. (3.154) ∇ J − 2 i=1 1 (b>b)2 X − i

Now let b∗ be a global solution of (3.153) and suppose that b∗ = b , i [n]. Then b∗ 6 ± i ∀ ∈ must satisfy the first order optimality condition

b b∗ + λ∗ b∗ = 0, (3.155) ∇ J| where λ∗ is a Lagrange multiplier. Equivalently, we have

n 1 2 − 2 N b>b∗ 1 b>b∗ b + λ∗ b∗ = 0, (3.156) − i i − i i i=1 X     which implies that

n 1 n 1 2 2 − 2 2 − 2 N b>b∗ 1 b>b∗ b∗ = N b>b∗ 1 b>b∗ b , (3.157) i i − i i i − i i i=1 i=1 X     X     from which the next lemma follows.

Lemma 3.17. Let b∗ be a global solution of (3.153). Then b∗ Span(b ,..., b ). ∈ 1 n

Proof. If b∗ is equal to some b , then the statement of the lemma is certainly true. If ± i b∗ = b , i [n], then b∗ satisfies (3.157), from which again the statement is true. 6 ± i ∀ ∈ 101 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

3.3.3.2 The cases of i) two hyperplanes and ii) orthogonal hyperplanes

In this section we characterize the global minimizers of the continuous problem (3.153) for two special cases of hyperplane arrangements. The first configuration that we examine is that of two hyperplanes. As it turns out in that case, the weighted geometric median of the two lines spanned by the normals to the hyperplanes, always corresponds to one of the normals, as the next theorem shows.

Theorem 3.18. Let b , b be an arrangement of two hyperplanes in RD, with weights N 1 2 1 ≥

N2. Then the set B∗ of global minimizers of (3.153) satisfies:

1. If N = N , then B∗ = b , b . 1 2 {± 1 ± 2}

2. If N >N , then B∗ = b . 1 2 {± 1}

Proof. By Lemma 3.17 any global solution must lie in the plane Span(b1, b2), and so our problem becomes planar, i.e., we may as well assume that the hyperplane arrangement b , b is a line arrangement of R2. Note that b , b S1 partition S1 in two arcs, and 1 2 1 2 ∈ among these, only one arc has length φ strictly less than π; we denote this arc by a. Next, recall that the continuous objective function for two hyperplanes can be written as

1 1 2 2 2 2 1 (b)= N 1 (b>b) + N 1 (b>b) , b S . (3.158) J 1 − 1 2 − 2 ∈   Let b∗ be a global solution, and suppose that b∗ a. If b∗ a, then we can replace b , b 6∈ − ∈ 1 2 by b , b , an operation that does not change neither the arrangement nor the objective. − 1 − 2

After this replacement, we have that b∗ a. Finally suppose that neither b∗ nor b∗ are ∈ − 102 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

inside a. Then replacing either b with b or b with b , leads to b∗ a. Consequently, 1 − 1 2 − 2 ∈

without loss of generality we may assume that b∗ lies in a. Moreover, subject to a rotation

and perhaps exchanging b1 with b2, we can assume that b1 is aligned with the positive x-axis and that the angle φ between b1 and b2, measured counter-clockwise, lies in (0,π).

Then b∗ is a global solution to

1 1 2 2 2 2 1 (b)= N 1 (b>b) + N 1 (b>b) , b S a. (3.159) J 1 − 1 2 − 2 ∈ ∩   Now, for any vector b S1 a, let θ ,θ = φ θ be the angle between b and b , b ∈ ∩ 1 2 − 1 1 2 respectively. Then our objective can be written as

(b)= ˜(θ )= N sin(θ )+ N sin(φ θ ), θ [0,φ]. (3.160) J J 1 1 1 2 − 1 1 ∈

Taking first and second derivatives, we have

∂ ˜ J = N1 cos(θ1) N2 cos(φ θ1) (3.161) ∂θ1 − − ∂2 ˜ J2 = N1 sin(θ1) N2 sin(φ θ1). (3.162) ∂θ1 − − −

Since the second derivative is everywhere negative on [0,φ], ˜(θ ) is strictly concave on J 1

[0,φ] and so its minimum must be achieved at the boundary θ1 =0 or θ1 = φ. This means that either b∗ = b1 or b∗ = b2.

Notice that when N1 > N2, problem (3.153) recovers the normal b1 to the dominant

hyperplane, irrespectively of how separated the two hyperplanes are, since, according to

Proposition 3.18, the principal angle φ1,2 between b1, b2 does not play a role. The continu-

ous problem (3.153) is equally favorable in recovering normal vectors as global minimizers

103 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

in another extreme situation, where the arrangement consists of up to D perfectly separated

(orthogonal) hyperplanes, as asserted by the next theorem.

Theorem 3.19. Let b ,..., b be an orthogonal hyperplane arrangement, i.e, φ = π/2, i = 1 n ij ∀ 6 D j, of R , with n D, and weights N N N . Then the set B∗ of global mini- ≤ 1 ≥ 2 ≥···≥ n mizers of (3.153) can be characterized as follows:

1. If N = N , then B∗ = b ,..., b . 1 n {± 1 ± n}

2. If N = = N >N N , for some ` [n 1], then B∗ = b ,..., b . 1 ··· ` `+1 ≥··· n ∈ − {± 1 ± `}

Proof. For the sake of simplicity we assume n = 3, the general case follows in a similar

2 fashion. Letting x := b>b and y := 1 x , (3.157) can be written as i i i − i 2 2 2 p x1 x2 x3 x1 x2 x3 N + N + N b∗ = N b + N b + N b . (3.163) 1 y 2 y 3 y 1 y 1 2 y 2 3 y 3  1 2 3  1 2 3

Taking inner products of (3.163) with b1, b2, b3 we respectively obtain

2 2 2 x1 x2 x3 x1 x2 x3 N + N + N x = N + N (b>b )+ N (b>b ), (3.164) 1 y 2 y 3 y 1 1 y 2 y 1 2 3 y 1 3  1 2 3  1 2 3 2 2 2 x1 x2 x3 x1 x2 x3 N + N + N x = N (b>b )+ N + N (b>b ), (3.165) 1 y 2 y 3 y 2 1 y 2 1 2 y 3 y 2 3  1 2 3  1 2 3 2 2 2 x1 x2 x3 x1 x2 x3 N + N + N x = N (b>b )+ N (b>b )+ N . (3.166) 1 y 2 y 3 y 3 1 y 3 1 2 y 3 2 3 y  1 2 3  1 2 3

Since by Lemma 3.17 b∗ is a linear combination of b1, b2, b3, we can assume that D = 3.

Suppose that b∗ = b , i [n]. Now, suppose that x =0. Then we can not have either 6 ± i ∀ ∈ 3 x = 0 or x = 0, otherwise b∗ = b or b∗ = b respectively. Hence x ,x = 0. Then 1 2 2 1 1 2 6 equations (3.164)-(3.166) imply that

N1 N2 2 2 = and x1 + x2 =1. (3.167) y1 y2

104 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

2 2 Taking into consideration the relations xi + yi =1, we deduce that

N1 N2 y1 = , y2 = . (3.168) 2 2 2 2 N1 + N2 N1 + N2 p p Then

2 2 (b∗)= N y + N y + N y = N + N + N > (b )= N + N , (3.169) J 1 1 2 2 3 3 1 2 3 J 1 2 3 q which is a contradiction on the optimality of b∗. Similarly, none of the x1,x2 can be zero, i.e. x ,x ,x =0. Then equations (3.164)-(3.166) imply that 1 2 3 6

2 2 2 N1 N2 N3 x1 + x2 + x3 =1, = = , (3.170) y1 y2 y3 which give

Ni√2 yi = , i =1, 2, 3. (3.171) 2 2 2 N1 + N2 + N3

2 p2 2 But then (b∗) = 2(N + N + N ) > (b ) = N + N . This contradiction shows J 1 2 3 J 1 2 3 p that our hypothesis b∗ = b , i [n] is not valid, i.e., B∗ b , b , b . The rest 6 ± i ∀ ∈ ⊂ {± 1 ± 2 ± 3} of the theorem follows by comparing the values (b ), i [3]. J i ∈

3.3.3.3 The case of three equiangular hyperplanes

Theorems 3.18 and 3.19 were not hard to prove, since for two hyperplanes the objective

function is strictly concave, while for orthogonal hyperplanes the objective function is sep-

arable. In contrast, the problem becomes considerably harder for n > 2 non-orthogonal

hyperplanes. Even when n =3, characterizing the global minimizers of (3.153) as a func-

tion of the geometry and the weights seems hard. Nevertheless, when the three hyperplanes

105 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

are equiangular and their weights are equal, the symmetry of the configuration allows us to

fully characterize the geometric median as a function of the angle of the arrangement.

To begin with, without loss of generality, we can describe an equiangular arrangement

of three hyperplanes of RD, with an equiangular arrangement of three planes of R3, with

normals b1, b2, b3 given by

> b1 := µ 1+ α α α (3.172)   > b2 := µ α 1+ α α (3.173)   > b3 := µ α α 1+ α (3.174)   1 µ := (1 + α)2 +2α2 − 2 , (3.175)   with α a positive that determines the angle φ (0,π/2] of the arrangement, ∈ given by

2α(1 + α)+ α2 2α +3α2 cos(φ) := = . (3.176) (1 + α)2 +2α2 1+2α +3α2

Since N1 = N2 = N3, so our objective function essentially becomes

1 1 1 2 2 2 2 2 2 2 (b)= 1 (b>b) + 1 (b>b) + 1 (b>b) , b S (3.177) J − 1 − 2 − 3 ∈    = sin(θ1) + sin(θ2) + sin(θ3), (3.178)

where θi is the principal angle of b from bi. The next lemma shows that any global mini- mizer b∗ must have equal principal angles from at least two of the b1, b2, b3.

106 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

3 Lemma 3.20. Let b1, b2, b3 be an arrangement of equiangular planes in R , with angle

φ and weights N1 = N2 = N3. Let b∗ be a global minimizer of (3.153) and let xi :=

2 b>b∗, y = 1 x i =1, 2, 3. Then either y = y or y = y or y = y . i i − i 1 2 1 3 2 3 p Proof. If b∗ is one of b , b , b , then the statement clearly holds, since if say b∗ = b , ± 1 ± 2 ± 3 1

then y = y = sin(φ). So suppose that b∗ = b , i [3]. Then x ,y must satisfy 2 3 6 ± i ∀ ∈ i i 2 2 equations (3.164)-(3.166), together with xi + yi = 1. Allowing for yi to take the value

zero, the xi,yi must satisfy

p := x y y y + x y [z x x ]+ x y [z x x ]=0, (3.179) 1 1 1 2 3 2 3 − 1 2 3 2 − 1 3 p := x y [z x x ]+ x y y y + x y [z x x ]=0, (3.180) 2 1 3 − 1 2 2 1 2 3 3 1 − 2 3 p := x y [z x x ]+ x y [z x x ]+ x y y y =0 (3.181) 3 1 2 − 1 3 2 1 − 2 3 3 1 2 3

q := x2 + y2 1, (3.182) 1 1 1 − q := x2 + y2 1, (3.183) 2 2 2 − q := x2 + y2 1, (3.184) 3 3 3 − where z := cos(φ). Viewing the above system of equations as polynomial equations in

the variables x1,x2,x3,y1,y2,y3,z, standard Groebner basis computations reveal that the

polynomial

g := (1 z)(y2 y2)(y2 y2)(y2 y2)(y + y + y ) (3.185) − 1 − 2 1 − 3 2 − 3 1 2 3

lies in the ideal generated by pi,qi, i = 1, 2, 3. In simple terms, this means that b∗ must

satisfy g(xi,yi,z = cos(φ))=0. However, the yi are by construction non-negative and can

107 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

not be all zero. Moreover, φ> 0 so 1 z =0. This implies that − 6

(y2 y2)(y2 y2)(y2 y2)=0, (3.186) 1 − 2 1 − 3 2 − 3

which in view of the non-negativity of the yi implies

(y y )(y y )(y y )=0. (3.187) 1 − 2 1 − 3 2 − 3

The next lemma says that a global minimizer of (b) is not far from the arrangement. J

3 Lemma 3.21. Let b1, b2, b3 be an arrangement of equiangular planes in R , with angle

φ and weights N1 = N2 = N3. Let Ci be the spherical cap with center bi and radius φ.

Then any global minimizer of (3.178) must lie (up to a sign) either on the boundary or the

interior of C C C . 1 ∩ 2 ∩ 3

Proof. First of all notice that b , b , b lie on the boundary of C C C . Let b∗ be 1 2 3 1 ∩ 2 ∩ 3

a global minimizer. If φ = π/2, we have already seen in Theorem 3.19 that b∗ has to

be one of the vertices b1, b2, b3 (up to a sign); so suppose that φ < π/2. Let θi∗ be the

principal angle of b∗ from bi. Then at least two of θ1∗,θ2∗,θ3∗ must be less or equal to φ; for if say θ1∗,θ2∗ > φ, then b3 would give a smaller objective than b∗. Hence, without loss of generality we may assume that θ∗,θ∗ φ. In addition, because of Lemma 3.20, we can 1 2 ≤ further assume without loss of generality that θ1∗ = θ2∗. Let ζ be the vector in the small arc

1 that joins 1 and b and has angle from b , b equal to θ∗. Since (b∗) (ζ), it must √3 3 1 2 1 J ≤J

be the case that the principal angle θ3∗ is less or equal to φ (because the angle of ζ from b3

108 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

is φ). We conclude that θ∗,θ∗,θ∗ φ. Consequently, there exist i = j such that up to a ≤ 1 2 3 ≤ 6

sign b∗ C C . Let us assume without loss of generality that b∗ C C , i.e., θ∗,θ∗ ∈ i ∩ j ∈ 1 ∩ 2 1 2

are the angles of b∗ from b1, b2 (notice that now it may no longer be the case that θ1∗ = θ2∗).

Notice that the boundaries of C1 and C2 intersect at two points: b3 and its reflection b˜3 with respect to the plane spanned by b , b . In fact, divides C C in two halves, H1,2 1 2 H1,2 1 ∩ 2 , ˜, with being the reflection of ˜ with respect to . Letting C˜ be the spherical cap Y Y Y Y H1,2 3

of radius φ around b˜3, we can write

C C =(C C C ) (C C C˜ ). (3.188) 1 ∩ 2 1 ∩ 2 ∩ 3 ∪ 1 ∩ 2 ∩ 3

If b∗ C C C we are done, so let us assume that b∗ C C C˜ . Let b˜∗ be the ∈ 1 ∩ 2 ∩ 3 ∈ 1 ∩ 2 ∩ 3

reflection of b∗ with respect to . This reflection preserves the angles from b and b . H1,2 1 2

˜∗ ˜ We will show that b has a smaller principal angle θ3∗ from b3 than b∗. In fact the spherical

˜∗ ˜ ˜ angle of b from b3 is θ3∗ itself, and this is precisely the angle of b∗ from b3. Denote by H3,3˜

˜ ¯∗ the plane spanned by b3 and b3, b the spherical projection of b∗ onto H3,3˜, γ the angle

between b¯∗ and b∗, α the angle between b¯∗ and b3, and α˜ the angle between b¯∗ and b˜3.

Then the spherical law of cosines gives

˜ cos(θ3∗) = cos(˜α)cos(γ), (3.189)

cos(θ3∗) = cos(α)cos(γ). (3.190)

Letting 2ψ be the angle between b3 and b˜3, we have that

α = ψ +(ψ α˜). (3.191) −

109 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

By hypothesis α<ψ˜ and so α>ψ. If 2ψ π/2, then α is an acute angle and cos(˜α) > ≤ cos(α). If 2ψ >π/2, then cos(˜α) cos(α) only when π (2ψ α˜) α˜ ψ π/2. ≤ − − ≤ ⇔ ≥ But by construction ψ π/2 and equality is achieved only when φ = π/2. Hence, we ≤ conclude that cos(˜α) > cos(α) , which implies that cos(θ˜ ) > cos(θ ) . This in turn | | 3 | 3 | means that (b˜∗) < (b∗), which is a contradiction. J J

The next lemma ensures a every global minimizer satisfies a certain symmetry property.

3 Lemma 3.22. Let b1, b2, b3 be an arrangement of equiangular planes in R , with angle

φ and weights N1 = N2 = N3. Let b∗ be a global minimizer of (3.153) and let xi :=

bi>b∗, i =1, 2, 3. Then either x1,x2,x3 are all non-negative or they are all non-positive.

Proof. By Lemma 3.21, we know that either b∗ C C C or b∗ C C C . In ∈ 1 ∩ 2 ∩ 3 − ∈ 1 ∩ 2 ∩ 3

the first case, the angles of b∗ from b , b , b are less or equal to φ π/2. 1 2 3 ≤

Finally, we arrive at our main result about three equiangular hyperplanes.

Theorem 3.23. Let b , b , b be an equiangular hyperplane arrangement of RD, D 3, 1 2 3 ≥ with φ = φ = φ = φ (0,π/2] and weights N = N = N . Let B∗ be the set of 1,2 13 23 ∈ 1 2 3 global minimizers of (3.153). Then B∗ satisfies the following phase transition:

1. If φ> 60◦, then B∗ = b , b , b . {± 1 ± 2 ± 3}

1 2. If φ = 60◦, then B∗ = b , b , b , 1 . ± 1 ± 2 ± 3 ± √3 n o 1 3. If φ< 60◦, then B∗ = 1 . ± √3 n o

110 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proof. Lemmas 3.20 and 3.22 show that b∗ is a global minimizer of problem

2 2 2 min 1 (b1>b) + 1 (b2>b) + 1 (b3>b) (3.192) b S2 − − − ∈ q q q  if and only if it is a global minimizer of problem

2 2 2 min 1 (b1>b) + 1 (b2>b) + 1 (b3>b) , (3.193) b S2 − − − ∈ q q q 

s.t. b>b = b>b, for some i = j [3]. (3.194) i j 6 ∈

So suppose without loss of generality that b∗ is a global minimizer of (3.194) corresponding to indices i =1,j =2. Then b∗ lives in the vector space

1 0     = Span , , (3.195) V1,2 1 0             0 1         which consists of all vectors that have equal angles from b1 and b2. Taking into considera-

2 tion that b∗ also lies in S , we have the parametrization

v 1   b∗ = . (3.196) √ 2 2 v  2v + w       w     The choice v = 0, corresponding to b∗ = e3 (the third standard basis vector), can be excluded, since b3 always results in a smaller objective: moving b from e3 to b3 while staying in the plane results in decreasing angles of b from b , b , b . Consequently, we V1,2 1 2 3 can assume v =1, and our problem becomes an unconstrained one, with objective

2[(2 + w2)(1+2α +3α2) (αw +2α + 1)2]1/2 + √2 a aw +1 (w)= − | − |. (3.197) J [(2 + w2)(1+2α +3α2)]1/2

111 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Now, it can be shown that:

The following quantity is always positive •

u := (2+ w2)(1+2α +3α2) (αw +2α + 1)2. (3.198) −

The choice w =1+1/α corresponds to b∗ = b , and that is precisely the only point • 3 where (w) is non-differentiable. J

1 The choice w =1 corresponds to b∗ = 1. • √3

The choice α =1/3 corresponds to φ = 60◦. •

(b )= 1 1 precisely for α =1/3. • J 3 J √3   Since for α = 0 the theorem has already been proved (orthogonal case), we will assume that α> 0. We proceed by showing that for α (0, 1/3) and for w =1+1/a, it is always ∈ 6 the case that (w) > (1+1/a). Expanding this last inequality, we obtain J J 1/2 2 (2 + w2)(1 + 2α + 3α2) (αw + 2α + 1)2 + √2 a aw + 1 2√1 + 4α + 6α2 − | − | > , 2 2 1/2 α α2  [(2 + w )(1 + 2α + 3α )] 1 + 2 + 3 (3.199) which can be written equivalently as

p < 4√2u1/2 α αw +1 (1+2α +3α2), where (3.200) 1 | − | p := 4(2+ w2)(1+4α +6α2) (1+2α +3α2) 4u + 2(α αw + 1)2 , (3.201) 1 − −   and u has been defined in (3.198). Viewing p1 as a polynomial in w, p1 has two real roots

given by

1 7α + α2 + 15α3 r(1) :=1+1/α > r(2) := − − . (3.202) p1 p1 α(7+22α + 15α2)

112 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Since the leading coefficient of p1 is always a negative function of α (for α > 0), (3.200)

(2) (1) will always be true for w [rp ,rp ], in which interval p is strictly negative. Conse- 6∈ 1 1 1 (2) (1) quently, we must show that as long as α (0, 1/3), (3.200) is true for every w [rp ,rp ). ∈ ∈ 1 1

For such w, p1 is non-negative and by squaring (3.200), we must show that

p >0, w [r(2),r(1)), α (0, 1/3), (3.203) 2 ∀ ∈ p1 p1 ∀ ∈ p :=32u(α αw + 1)2(1+2α +3α2) p2. (3.204) 2 − − 1

Interestingly, p2 admits the following factorization

p = 4( 1 α + αw)2p , (3.205) 2 − − − 3 p := 7 18α 49α2 204α3 441α4 162α5 + 81α6 3 − − − − − − + (30α + 238α2 + 612α3 + 468α4 162α5 162α6)w − − +( 8 48α 111α2 12α3 + 270α4 + 324α5 + 81α6)w2 (3.206) − − − −

The discriminant of p3 is the following 10-degree polynomial in α:

∆(p ) = 32( 7 60α 226α2 312α3 + 782α4 + 5160α5 + 13500α6+ 3 − − − − + 21816α7 + 22761α8 + 14580α9 + 4374α10). (3.207)

By Descartes rule of signs, ∆(p3) has precisely one positive root. In fact this root is equal to

1/3. Since the leading coefficient of ∆(p ) is positive, we must have that ∆(p ) < 0, α 3 3 ∀ ∈

(0, 1/3), and so for such α, p3 has no real roots, i.e. it will be either everywhere negative

or everywhere positive. Since p (α = 1/4,w = 1) = 80327/4096, we conclude that as 3 − long as α (0, 1/3), p is everywhere negative and as long as w =1+1/α, p is positive, ∈ 3 6 2 113 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

i.e. we are done. Moving on to the case α =1/3, we have

128 p (α =1/3,w)= (w 4)2(w 1)2, (3.208) 2 9 − −

which shows that for such α the only global minimizers are b and 1 1. In a similar ± 3 ± √3 fashion, we can proceed to show that (w) > 1 1 , for all w = 1 and all α J J √3 6 ∈   (1/3, ). However, the roots of the polynomials that arise are more complicated functions ∞ of α and establishing the inequality (w) > 1 1 analytically, seems intractable; this J J √3   can be done if one allows for numeric computation of polynomial roots.

3.3.3.4 Conditions of global optimality for an arbitrary hyperplane arrangement

Theorem 3.23 suggests that when the hyperplanes are sufficiently separated, then only nor-

mals can be global minimizers, otherwise the only global minimizer lies at the center of the arrangement. This is in striking similarity with the results regarding the Fermat point of a planar or even spherical triangle [37]. We note that when the symmetry of the ar- rangement in Theorem 3.23 is removed, either by not requiring the principal angles φij

to be equal or/and by not requiring the weights Ni to be equal, then our proof technique

no longer applies, and the problem seems even harder. However, under sufficiently strong

conditions (which are not too strong), b is the unique solution of (3.153). To see what ± 1 these conditions are, we need two lemmas.

D 1 Lemma 3.24. Let b1,..., bn be vectors of S − , with pairwise principal angles φij. Then

1/2 2 max N1 b1>b + + Nn bn>b Ni +2 NiNj cos(φij) . (3.209) b>b=1 ··· ≤ " i i=j #   X X6

114 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Proof. Let b† be a maximizer of N b>b + + N b>b . Then b† must satisfy the first 1 1 ··· n n order optimality condition, which is

λ†b† = Ni Sgn(bi>b†)bi, (3.210) i X where λ† is a Lagrange multiplier and Sgn(bi>b†) is the subdifferential of bi>b† . Then

λ†b† = N s†b + + N s† b , (3.211) 1 1 1 ··· n n n where s† = Sign(b>b†), if b>b† =0, and s† [ 1, 1] otherwise. Recalling that b† =1, i i i 6 i ∈ − 2 and taking equality of 2-norms on both sides of (3.212), we get

N1s1†b1 + + Nnsn† bn b† = ··· . (3.212) N1s1†b1 + + Nnsn† bn ··· 2

Now

> Ni bi>b† = Ni bi>b† = Nisi†bi>b† = b† Nisi†bi † †  †  i i:b bi i:b bi i:b bi X X6⊥ X6⊥  X6⊥  

= b† > N s†b + N s†b = b† > N s†b  i i i i i i i i i † † ! i:b bi i:b bi i  X6⊥ X⊥  X   (3.212) = N1s1†b1 + + Nnsn† bn ··· 2 1/2 2

= si†Ni +2 NiNjsi†sj†bi>bj " i i=j # X   X6 1/2 N 2 +2 N N cos(φ ) . (3.213) ≤ i i j ij " i i=j # X X6

115 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

D Lemma 3.25. Let b1,..., bn be a hyperplane arrangement of R with integer weights

D 1 N ,...,N assigned. For b S − , let θ be the principal angle between b and b . Then 1 n ∈ i i

2 min Ni sin(θi) Ni σmax [NiNj cos(φij)]i,j , (3.214) b SD−1 ≥ − ∈ r X X   where σ [N N cos(φ )] denotes the maximal eigenvalue of the n n matrix, whose max i j ij i,j ×   (i, j) entry is N N cos(φ ) and 1 i, j n. i j ij ≤ ≤

Proof. For any vector ξ we have that ξ ξ . Let ψ [0, 180◦] be the angle between k k1 ≥ k k2 i ∈ b and bi. Then

1 2 N sin(θ )= N sin(ψ ) k·k ≥k·k N 2 sin2(ψ ) (3.215) i i i | i | ≥ i i X X qX = N 2 N 2 cos2(ψ ). (3.216) i − i i qX X 2 2 Hence Ni sin(θi) is minimized when Ni cos (ψi) is maximized. But P 2 2 2 Ni cos (ψi)= b> Ni bibi> b, (3.217) X X  2 2 and the maximum value of Ni cos (ψi) is equal to the maximal eigenvalue of the matrix P 2 > Ni bibi> = N b N b N b N b , (3.218) 1 1 ··· n n 1 1 ··· n n X    which is the same as the maximal eigenvalue of the matrix

> N1b1 Nnbn N1b1 Nnbn =[NiNj cos(ψij)]i,j , (3.219)  ···   ···  where ψ is the angle between b , b . Now, if A is a matrix and we denote by A the ij i j | | matrix that arises by taking absolute values of each element in the matrix A, then it is known that σ ( A ) σ (A). Hence the result follows by recalling that cos(ψ ) = max | | ≥ max | ij | cos(φij).

116 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

The next result gives sufficient conditions under which there is a unique global mini-

mizer of the continuous problem, equal to the normal vector of the dominant hyperplane.

Theorem 3.26. Let b ,..., b be an arrangement of n 3 hyperplanes in RD, with pair- 1 n ≥ wise principal angles φ . Let N N N be positive integer weights assigned ij 1 ≥ 2 ≥···≥ n to the arrangement. Suppose that N1 is large enough, in the sense that

2 2 N1 > α + β , where (3.220) p

α := N sin(φ ) N 2 σ [N N cos(φ )] 0, (3.221) i 1,i − i − max i j ij i,j>1 ≥ i>1 s i>1 X X   2 β := Ni +2 NiNj cos(φij), (3.222) i>1 i=j, i,j>1 sX 6 X with σ [N N cos(φ )] denoting the maximal eigenvalue of the (n 1) (n 1) max i j ij i,j>1 − × −   matrix, whose (i 1,j 1) entry is N N cos(φ ) and 1 < i, j n. Then any global − − i j ij ≤ minimizer b∗ of problem (3.153) must satisfy b∗ = b , for some i [n]. If in addition, ± i ∈

γ := min Ni sin(φi0,i) Ni sin(φ1,i) > 0, (3.223) i0=1 − 6 i=i0 i>1 X6 X then problem (3.153) admits a unique up to sign global minimizer b∗ = b . ± 1

Proof. Let b∗ be a global solution of (3.153). Suppose for the sake of a contradiction that b∗ , i [n], i.e., b∗ = b , i [n]. Consequently, is differentiable at b∗ and so 6⊥ Hi ∀ ∈ 6 ± i ∀ ∈ J b∗ must satisfy (3.156), which we repeat here for convenience:

n 1 2 − 2 N b>b∗ 1 b>b∗ b + λ∗ b∗ = 0. (3.224) − i i − i i i=1 X     117 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Projecting (3.224) orthogonally onto the hyperplane ∗ defined by b∗ we get H n 1 2 − 2 0 Ni bi>b∗ 1 bi>b∗ π ∗ (bi)= . (3.225) − − H i=1 X     Since b∗ = bi, i [n], it will be the case that hi := π ∗ (bi) = 0, i [n]. Since 6 ∀ ∈ H 6 ∀ ∈ 1 2 2 π ∗ (bi) = 1 bi>b∗ > 0, (3.226) k H k2 −    equation (3.225) can be written as

n ˆ 0 Ni bi>b∗ hi = , (3.227) i=1 X  which in turn gives

ˆ N1 b1>b∗ Ni bi>b∗ hi (3.228) ≤ i>1 2 X 

N b>b∗ (3.229) ≤ i i i>1 X

max Ni bi>b (3.230) ≤ b>b=1 i>1 X Lem.3.24 β. (3.231) ≤

Since by hypothesis N1 > β, we can define an angle θ1† by

β cos(θ1†) := , (3.232) N1 and so (3.231) says that θ can not drop below θ†. Hence (b∗) can be bounded from 1 1 J below as follows:

(b∗)= N1 sin(θ1∗)+ Ni sin(θi∗) N1 sin(θ1†)+ min Ni sin(θi) (3.233) J ≥ b>b=1 i>1 i>1 X X Lem.3.25 2 N sin(θ†)+ N σ [N N cos(φ )] . (3.234) ≥ 1 1 i − max i j ij i,j>1 s i>1 X  

118 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

By the optimality of b∗, we must also have (b ) (b∗), which in view of (3.234) gives J 1 ≥J

2 N sin(φ ) N sin(θ†)+ N σ [N N cos(φ )] . (3.235) i 1i ≥ 1 1 i − max i j ij i,j>1 i>1 s i>1 X X   Now, a little algebra reveals that this latter inequality is precisely the negation of hypothesis

N > α2 + β2. This shows that b∗ has to be b , for some i [n]. For the last 1 ± i ∈ p statement of the theorem, notice that condition γ > 0 is equivalent to saying that (b ) < J 1 (b ), i> 1. J i ∀

Let us provide some intuition about the meaning of the quantities α, β and γ in Theorem

3.26. To begin with, the first term in α is precisely equal to (b ), while the second J 1 term in α is a lower bound on the objective function N sin(θ )+ + N sin(θ ), if one 2 2 ··· n n β discards hyperplane 1. Moving on, the quantity admits a nice geometric interpretation: H N1 1 β cos− is a lower bound on how small the principal angle of a critical point b† from b N1 1   can be, if b† = b . Interestingly, the larger N is, the larger this minimum angle is, which 6 ± 1 1 shows that critical hyperplanes † (i.e., hyperplanes defined by critical points b†) that are H distinct from , must be sufficiently separated from . Finally, the second term in γ is H1 H1 (b ), while the first term is the smallest objective value that corresponds to b = b ,i> 1, J 1 i and so (3.223) simply guarantees that (b ) < (b ), i> 1. J 1 J i ∀

2 2 Next, notice that condition N1 > α + β of Theorem 3.26 is easier to satisfy when p is close to the rest of the hyperplanes (which leads to small α), while the rest of the H1 hyperplanes are sufficiently separated (which leads to small α and small β). Regardless,

119 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

one can show that

√2 N α2 + β2, (3.236) i ≥ i>1 X p and so if

N1 > √2 Ni, (3.237) i>1 X then any global minimizer of (3.153) has to be one of the normals, irrespectively of the φij.

Finally, condition (3.223) is consistent with condition (3.220) in that it requires to be H1 close to ,i> 1 and , to be sufficiently separated for i, j > 1. Once again, (3.223) Hi Hi Hj can always be satisfied irrespectively of the φij, by choosing N1 sufficiently large, since only the positive term in the definition of γ depends on N1.

3.3.4 Theoretical analysis of the discrete problem

We now turn our attention to the discrete problem of hyperplane clustering via DPCP, i.e., to

problems (3.3) and (3.4), for the case where X =[X 1,..., X n]Γ, with X i being Ni points

in , as described in section 3.2.1.1. As we did in the case of single subspace learning Hi D 1 with outliers ( 3.2.3), for any i [n] and b S − , we write the quantity X >b as § ∈ ∈ || i ||1

Ni Ni (i) (i) (i) X >b b>x b> b>x x b>x (3.238) i 1 = j = Sign j j = Ni i,b, j=1 j=1   X X

where

Ni 1 (i) (i) x := Sign b>x x (3.239) i,b N j j i j=1 X  

120 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

X is the average point of i with respect to the orthogonal projection hi,b := π i (b) of b H onto . Then by similar arguments as the ones that established Lemma 3.8 and inequality Hi

(3.83), we have that xi,b is a discrete approximation to an integral that evaluates to chi,b, which leads us to define b

i := max xi,b c hi,b (3.240) b SD−1 − 2 ∈ b D 1 as the maximum approximation error as b varies on S − . In turn, i is bounded above as

i √5SD 1(X i), (3.241) ≤ −

where SD 1(X i) is the spherical cap discrepancy of X i; see the defining equation (3.65) − and the discussion surrounding (3.83), which adapts it for points that lie in a proper linear subspace. As a consequence, the more uniformly distributed the points X i are, the smaller

SD 1(X i) is (by definition), and hence the same is true for the uniformity parameter i. − Before stating our main result regarding the properties of problem (3.3) when the data lie in a union of hyperplanes, we need a definition analogous to Definition 3.10.

D 1 Definition 3.27. For a set Y =[y ,..., y ] − and integer K L, define Y to 1 L ⊂S ≤ R ,K be the maximum circumradius among the circumradii of all polytopes of the form

K α y : α [ 1, 1] , (3.242) ji ji ji ∈ − ( i=1 ) X where j1,...,jK are distinct integers in [L]. Using this notation, we now define

n

:= max i,Ki . (3.243) R K1+ +Kn=D 1 RX 0 ···Ki D 2− i=1 ≤ ≤ − X 121 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

We note that it is always the case that i,Ki Ki, with this upper bound achieved when RX ≤ contains K colinear points. Combining this fact with the constraint K = D 1 Xi i i i − in (3.243), we get that D 1, and the more uniformly distributedP are the points X R ≤ − inside the hyperplanes, the smaller is (even though does not go to zero). R R The theorem that follows is the discrete counterpart of Theorem 3.26, and it says that if 1) the dominant hyperplane is sufficiently dominant, 2) the remaining hyperplanes are sufficiently separated, and 3) the points are uniformly distributed in their respective hyper- planes, then there is a unique up to sign global minimizer of problem (3.3), the normal vector to the dominant hyperplane.

Theorem 3.28. Let b∗ be a solution of (3.3) with X = [X 1,..., X n]Γ, and suppose that c> √21. If

2 2 N1 > α¯ + β¯ , where (3.244) q 1 α¯ := α + c− 1N1 +2 iNi , and (3.245) i>1 ! X 1 β¯ := β + c− +  N , (3.246) R i i  X 

with α, β as in Theorem 3.26, then b∗ = b for some i [n]. Furthermore, b∗ = b , if ± i ∈ ± 1

1 γ¯ := γ c−  N +  N +2  N > 0. (3.247) − 1 1 2 2 i i i>2 ! X (1) Proof. Let us first derive an upper bound θmax on how large θ1∗ can be. Towards that end,

we derive a lower bound on the objective function (b) in terms of θ : For any vector J 1

122 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

D 1 b S − we can write ∈

(b)= X >b = X >b = N b>x b (3.248) J 1 i 1 i i, X X = cN sin( θ )+ N b>η , η  (3.249) i i i i,b i,b 2 ≤ i X X c N sin(θ )  N (3.250) ≥ i i − i i X X = cN sin(θ )+ c N sin(θ )  N (3.251) 1 1 i i − i i i>1 X X

cN1 sin(θ1)+ c min Ni sin(θi) iNi (3.252) ≥ b>b=1 − " i>1 # X X Lem.3.25 cN sin(θ )+ c N 2 σ [N N cos(φ )]  N . (3.253) ≥ 1 1 i − max i j ij i,j>1 − i i s i>1 X   X Next, we derive an upper bound on (b ): J 1

(b )= X >b = N b>x b (3.254) J 1 i 1 1 i 1 i, 1 i>1 i>1 X X

= cN sin(φ )+ N b>η , η  (3.255) i 1i i 1 i,b1 i,b1 2 ≤ i i>1 i>1 X X

c N sin(φ )+  N . (3.256) ≤ i 1i i i i>1 i>1 X X Since any vector b for which the corresponding lower bound (3.253) on (b) is strictly J larger than the upper bound (3.256) on (b ), can not be a global minimizer (because it J 1 (1) gives a larger objective than b1), θ1∗ must be bounded above by θmax, where the latter is defined, in view of (3.244), by

1 α + c− 1N1 +2 iNi sin θ(1) := i>1 , (3.257) max N 1 P   where α is as in Theorem 3.28 . Now let b∗ be a global minimizer, and suppose for the sake

(1) of contradiction that b∗ , i [n]. We will show that there exists a lower bound θ 6⊥ Hi ∀ ∈ min 123 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

(1) (1) on θ1, such that θmin >θmax, which is of course a contradiction. Towards that end, the first

order optimality condition for b∗ can be written as

0 X Sgn(X >b∗)+ λb∗, (3.258) ∈

where λ is a Lagrange multiplier and Sgn(α) = Sign(α) if α =0 and Sgn(0) = [ 1, 1], is 6 − the subdifferential of the function . Since the points X are general, any hyperplane of |·| H RD spanned by D 1 points of X such that at most D 2 points come from X , i [n], − − i ∀ ∈

does not contain any of the remaining points of X . Consequently, by Proposition 3.11 b∗

X will be orthogonal to precisely D 1 points ξ1,..., ξD 1 , from which at most − − ⊂  K D 2 lie in . Thus, we can write relation (3.258) as i ≤ − Hi

D 1 n − 0 αjξj + Ni xi,b∗ + λb∗ = , (3.259) j=1 i=1 X X for real numbers 1 α 1, j [D 1]. Using the definition of  , we can write − ≤ j ≤ ∀ ∈ − i

x b∗ = c hˆ b∗ + η ∗ , i [n], (3.260) i, i, i,b ∀ ∈

with η ∗  . Note that since b∗ , i [n], we have hˆ b∗ = 0. Substituting i,b 2 ≤ i 6⊥ Hi ∀ ∈ i, 6

(3.260) in (3.259) we get

D 1 n n − ˆ 0 αjξj + c Ni hi,b∗ + Ni ηi,b∗ + λb∗ = , (3.261) j=1 i=1 i=1 X X X and projecting (3.261) onto the hyperplane b∗ with normal b∗, we obtain H

D 1 n n − ˆ ∗ 0 π b∗ αjξj + c Niπ b∗ hi,b + Ni π b∗ ηi,b∗ = . (3.262) H H H j=1 ! i=1 i=1 X X   X 

124 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

ˆ ∗ Let us analyze the term π b∗ hi,b . We have H   π i (b∗) b∗ bi>b∗ bi ˆ ∗ π b∗ hi,b = π b∗ H = π b∗ − (3.263) H H H π i (b∗) 2 b∗ bi>b∗ bi !   k H k  −  2 2 b∗ bi>b∗ bi b∗ bi>b∗ bi 1 cos (θi) = π b∗ − = − − b∗ (3.264) H sin(θ ) sin(θ ) − sin(θ ) i  ! i   i  b>b∗ b>b∗ b∗ b i i i ˆ = − = bi>b∗ ζi, ζi = π b∗ (bi). (3.265) sin(θi)  − H  Using (3.265), (3.262) becomes

D 1 n n − ˆ 0 π b∗ αjξj Ni c bi>b∗ ζi + Ni π b∗ ηi,b∗ = . (3.266) H − H j=1 ! i=1 i=1 X X  X  Isolating the term that depends on i = 1 to the LHS and moving everything else to the

RHS, and taking norms, we get

n cN cos(θ ) cN cos(θ )+ 1 1 ≤ i i i>1 X K n

+ π b∗ αjξj + Ni π b∗ ηi,b∗ . (3.267) H H 2 j=1 ! i=1 X 2 X  K Since ηi,b∗ i, we have that π b∗ ηi,b∗ i. Next, the quantity αjξj 2 ≤ H 2 ≤ j=1  P can be decomposed along the index i, based on the hyperplane membership of the ξj.

For instance, if ξ , then replace the term α ξ with α(1)ξ(1), where the superscript 1 ∈ H1 1 1 1 1 (1) denotes association to hyperplane . Repeating this for all ξ and after a possible · H1 j re-indexing, we have

D 1 n Ki − (i) (i) αjξj = αj ξj . (3.268) j=1 i=1 j=1 X X X Now, by Definition 3.27 we have that

Ki (i) (i) αj ξj i,Ki . (3.269) ≤R j=1 2 X

125 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

As a consequence, the upper bound (3.267) can be extended to

n cN cos(θ ) cN cos(θ )+  N + . (3.270) 1 1 ≤ i i i i R i>1 i X X Finally, Lemma 3.24 provides a bound

n N cos(θ ) β, (3.271) i i ≤ i>1 X where β is as in Theorem 3.26. In turn, this can be used to extend (3.270) to

1 β + c− ( + iNi) (1) cos(θ1) R =: cos θmin . (3.272) ≤ N1 P   (1) ¯ Note that the angle θmin of (3.272) is well-defined, since by hypothesis N1 > β, and that

(1) what (3.272) effectively says, is that θ1 never drops below θmin. It is then straightforward

2 ¯2 (1) (1) to check that hypothesis N1 > α¯ + β implies θmin > θmax, which is a contradiction. p In other words, b∗ must be equal up to sign to one of the bi, which proves the first part of the theorem. The second part follows from noting that condition γ¯ > 0 guarantees that

(b ) < min (b ). J 1 i>1 J i

2 2 Notice the similarity of conditions N1 > α¯ + β¯ , γ¯ > 0 of Theorem 3.28 with

2 2 p conditions N1 > α + β ,γ > 0 of Theorem 3.26. In fact α¯ >α, β¯ >β, γ<γ¯ , which p implies that the conditions of Theorem 3.28 are strictly stronger than those of Theorem

3.26. This is no surprise since, as we have already remarked, the solution set of (3.3) depends not only on the geometry (φij) and the weights (Ni) of the arrangement, but also on the distribution of the data points (parameters  and ). i R

We note that in contrast to condition (3.220) of Theorem 3.26, N1 now appears in both sides of condition (3.244) of Theorem 3.28 . Nevertheless, under the assumption

126 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

c > √21, (3.244) is equivalent to the positivity of a quadratic polynomial in N1, whose leading coefficient is positive, and hence it can always be satisfied for sufficiently large N1.

Another interesting connection of Theorem 3.26 to Theorem 3.28 , is that the former can be seen as a limit version of the latter: dividing (3.244) and (3.247) by N1, letting

N ,...,N go to infinity while keeping each ratio N /N fixed, and recalling that  0 1 n i 1 i → as N and D 1, we recover the conditions of Theorem 3.26. i → ∞ R≤ − Next, we consider the linear programming recursion (3.4). At a conceptual level, the main difference between the linear programming recursion in (3.4) and the continuous and discrete problems (3.153) and (3.3), respectively, is that the behavior of (3.4) depends highly on the initialization nˆ 0. Intuitively, the closer nˆ 0 is to b1, the more likely the recur- sion will converge to b1, with this likelihood becoming larger for larger N1. The precise technical statement is as follows.

Theorem 3.29. Let nˆ be the sequence generated by the linear programming recursion { k} D 1 (3.4) by means of the simplex method, where nˆ S − is an initial estimate for b , with 0 ∈ 1 principal angle from b equal to θ . Suppose that c> √5 , and let φ(1) = min φ . i i,0 1 min i>1 { 1i}

If θ1,0 is small enough, i.e.,

(1) 1 2 1 sin(θ ) < min sin φ 2 , 1 (c  ) 2c−  , (3.273) 1,0 min − 1 − − 1 − 1 n   p o

and N1 is large enough in the sense that

ν + ν2 +4ρτ N1 > max µ, , where (3.274) ( p2τ )

127 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

1 1 i>1 Ni sin(θi,0)+ c− jNj + i=1,j Ni 2c− i sin(φij) µ := max 6 − , (3.275) j=1 sin(φ ) sin(θ ) 2c 1 6 (P 1j − P1,0 − − 1 )

1 1 1 ν :=2c− 1 β + c− + c− iNi R ! Xi>1 1 1 + 2 sin(θ1,0) + 2c− 1 α + 2c− iNi , (3.276) i>1 !  2  X 2 1 1 1 ρ := α + 2c− iNi + β + c− + c− iNi , (3.277) ! R ! Xi>1 Xi>1 2 1 1 2 τ := cos (θ ) 4c−  sin(θ ) 5(c−  ) , (3.278) 1,0 − 1 1,0 − 1

with α, β as in Theorem 3.26, then n converges to b in a finite number of steps. { k} ± 1

Proof. First of all, it follows from the theory of the simplex method, that if nk+1 is obtained via the simplex method, then it will satisfy the conclusion of Lemma 3.13. Then Theorem

3.14 guarantees that n converges to a critical point of problem (3.3) in a finite number { k} of steps; denote that point by nk∗ . In other words, nk∗ will satisfy equation (3.224) and it will have unit ` norm. Now, if n ∗ = b for some j > 1, then 2 k ± j

(nˆ ) (b ), (3.279) J 0 ≥J j or equivalently

N nˆ >x n N b>x b . (3.280) i 0 i,ˆ 0 ≥ i j i, j i=j X X6 Substituting the concentration model

\ xi,nˆ 0 = cπ i (n0)+ ηi,0, ηi,0 i, (3.281) H 2 ≤

\ xi,bj = cπ i (bj)+ ηij, ηij i, (3.282) H 2 ≤

128 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

into (3.280), we get

N c sin(θ )+ N nˆ >η N c sin(φ )+ N b>η . (3.283) i i,0 i 0 i,0 ≥ i ij i j ij i=j X X X6 X Bounding the LHS of (3.283) from above and the RHS from below, we get

N c sin(θ )+  N N c sin(φ )  N . (3.284) i i,0 i i ≥ i ij − i i i=j X X X6 X But this very last relation is contradicted by hypothesis N > µ, i.e., none of the b for 1 ± j

j > 1 can be n ∗ . We will show that n ∗ has to be b . So suppose for the sake of a k k ± 1

contradiction that that n ∗ is not colinear with b , i.e., n ∗ , i [n]. Since n ∗ k 1 k 6⊥ Hi ∀ ∈ k satisfies (3.224), we can use part of the proof of Theorem 3.28 , according to which the

(1) (1) principal angle θ1, of nk∗ from b1 does not become less than θmin, where θmin is as in ∞ (3.272). Consequently, and using once again the concentration model, we obtain

Ni c sin(θi,0)+ iNi (nˆ 0) (nk∗ ) Ni c sin(θi, ) iNi ≥J ≥J ≥ ∞ − X X X X N c sin θ(1) + c N 2 σ [N N cos(φ )]  N . ≥ 1 min i − max i j ij i,j>1 − i i s i>1   X   X (3.285)

A little algebra reveals that the outermost inequality in (3.285) contradicts (3.274).

The quantities appearing in Theorem 3.29 are harder to interpret than those of Theorem

3.28 , but we can still give some intuition about their meaning. To begin with, the two

inequalities in (3.274) represent two distinct requirements that we enforced in our proof,

which when combined, guarantee that the limit point of (3.4) is b . ± 1

129 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

The first requirement is that no b can be the limit point of (3.4) for i > 1; this is ± i captured by a linear inequality of the form

µN1 +(terms not depending on N1) > 0, (3.286)

which is satisfied either for N1 sufficiently large (if µ > 0) or for N1 sufficiently small (if

µ < 0). To avoid pathological situations where N1 is required to be negative or less than

D 1, it is natural to enforce µ to be positive. This is precisely achieved by inequality − sin(θ ) < sin φ(1) 2 in (3.273), which is a quite natural condition itself: the initial 1,0 min − 1   estimate nˆ 0 needs to be closer to b1 than any other normal bi for i > 1, and the more well-distributed the data X are inside (smaller  ), the further nˆ can be from b . 1 H1 1 0 1 The second requirement that we employed in our proof is that the limit point of (3.4) is one of the b ,..., b ; this is captured by requiring that a certain quadratic polynomial ± 1 ± n

p(N ) := τN 2 νN ρ (3.287) 1 1 − 1 −

in N1 is positive. To avoid situations where the positivity of this polynomial contradicts the relation N1 > µ, it is important that we ask its leading coefficient τ to be positive, so that the second requirement is satisfied for N1 large enough, and thus is compatible with

N1 > µ. As it turns out, τ is positive only if the data X 1 are sufficiently well distributed in , which is captured by condition c > √5 of Theorem 3.29. Even so, this latter H1 1

1 2 1 condition is not sufficient; instead sin(θ ) < 1 (c  ) 2c−  is needed (as in 1,0 − − 1 − 1 p (3.273)), which is once again very natural: the more well-distributed the data X 1 are inside

(smaller  ), the further nˆ from b can be. H1 1 0 1 130 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Next, notice that the conditions of Theorem 3.29 are not directly comparable to those

of Theorem 3.28 . Indeed, it may be the case that b is not a global minimizer of the ± 1

non-convex problem (3.3), yet the recursions (3.4) do converge to b1, simply because nˆ 0 is close to b1. In fact, by (3.273) nˆ 0 must be closer to b1 than bi to b1 for any i > 1, i.e.,

(1) φmin > θ1,0. Similarly to Theorems 3.26 and 3.28 , the more separated the hyperplanes

, are for i, j > 1, the easier it is to satisfy condition (3.274). In contrast, needs Hi Hj H1 to be sufficiently separated from for i > 1, since otherwise µ becomes large. This has Hi an intuitive explanation: the less separated is from the rest of the hyperplanes, the less H1

resolution the linear program (3.4) has in distinguishing b1 from bi,i> 1. To increase this

resolution, one needs to either select nˆ 0 very close to b1, or select N1 very large. The acute

reader may recall that the quantity α appearing in (3.277) becomes larger when becomes H1 separated from ,i> 1. Nevertheless, there are no inconsistency issues in controlling Hi the size of µ and ρ. This is because α is always bounded from above by i>1 Ni, i.e., α P does not increase arbitrarily as the φ1i increase. Another way to look at the consistency

of condition (3.274), is that its RHS does not depend on N1; hence one can always satisfy

(3.274) by selecting N1 large enough.

3.3.5 Algorithms

There are at least two ways in which DPCP can be used to learn a hyperplane arrangement;

either through a sequential (RANSAC-style) scheme, or through a parallel (K-Subspaces-

style) scheme. These two cases are described next.

131 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

3.3.5.1 Learning a hyperplane arrangement sequentially

Since at its core DPCP is a single subspace learning method, we may as well use it to learn

n hyperplanes in the same way that RANSAC [34] is used: learn one hyperplane from

the entire dataset, remove the points close to it, then learn a second hyperplane and so on.

The main weakness of this technique is well known, and consists of its sensitivity to the

thresholding parameter, which is necessary in order to remove points.

To alleviate the need of knowing a good threshold, we propose to replace the process of

removing points by a process of appropriately weighting the points. In particular, suppose

we solve the DPCP problem (3.3) on the entire dataset X and obtain a unit `2-norm vector b1. Now, instead of removing the points of X that are close to the hyperplane with normal vector b1 (which would require a threshold parameter), we weight each and every point xj

X of by its distance b1>xj from that hyperplane. Then to compute a second hyperplane

with normal b2 we apply DPCP on the weighted dataset b1>xj xj . To compute a third  hyperplane, the weight of point xj is chosen as the smallest distance of xj from the already

computed two hyperplanes, i.e., DPCP is now applied to mini=1,2 bi>xj xj . After   n hyperplanes have been computed, the clustering of the points is obtained based on their

distances to the n hypeprlanes. The resulting scheme is listed in Algorithm 3.5.

3.3.5.2 K-Hyperplanes via DPCP

Another way to do hyperplane clustering via DPCP, is to modify the classic K-Subspaces

[9,102,121] (see also Chapter 2) by computing the normal vector of each cluster by DPCP;

132 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Algorithm 3.5 Sequential Hyperplane Learning via DPCP D N 1: procedure SHL-DPCP(X =[x , x ,..., x ] R × ,n) 1 2 N ∈ 2: i 0; ← 3: w 1, j =1,...,N; j ← 4: for i =1: n do

5: Y [w x w x ]; ← 1 1 ··· N N

6: bi argminb RD Y >b , s.t. b>b =1 ; ← ∈ 1   7: w min b> x , j =1,...,N; j ← k=1,...,i k j

8: end for

9: C x X : i = argmin b>x , i =1,...,n; i ← j ∈ k=1,...,n k j  n 10: return (b , C ) ; { i i }i=1 11: end procedure

see Algorithm 3.6. It is worth noting that since DPCP minimizes the `1-norm of the dis- tances of the points to a hyperplane, consistency dictates that the stopping criterion for KH-

DPCP be governed by the sum over all points of the distance of each point to its assigned hyperplane (instead of the traditional sum of squares [9, 102]); in other words the global objective function minimized by KH-DPCP is the same as that of Median K-Flats [121].

Finally, it is important to note that K-Hyperplanes algorithms require in principle a good initialization. Interestingly, such initialization may be provided by the output of Alg. 3.5.

133 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Algorithm 3.6 K-Hyperplanes via Dual Principal Component Pursuit

1: procedure KH-DPCP(X =[x1, x2,..., xN ] , b1,..., bn,ε,Tmax)

2: , ∆ , t 0; Jold ← ∞ J ← ∞ ← 3: while tε do max J 4: 0, t = t +1; Jnew ←

5: C x X : i = argmin b>x , i =1,...,n; i ← j ∈ k=1,...,n k j  n 6: b>x ; new = i=1 xj Ci i j J ∈ P P 9 7: ∆ ( )/ ( + 10− ), ; J ← Jold −Jnew Jold Jold ←Jnew

8: b argmin C>b , s.t. b>b =1 , i =1,...,n; i ← b i 1   9: end while

n 10: return (b , C ) ; { i i }i=1 11: end procedure

3.3.6 Experimental evaluation

In this section we evaluate experimentally Algorithms 3.5 and 3.6 using both synthetic

( 3.3.6.1) and real data ( 3.3.6.2). § §

3.3.6.1 Synthetic data

Dataset design. We begin by evaluating experimentally Algorithm 3.5 using synthetic data. The coordinate dimension D of the data is inspired by major applications where hyperplane arrangements appear. In particular, we recall that

134 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

0.97 0.97 0.97

0.89 0.89 0.89 d/D d/D d/D

0.75 0.75 0.75

2 3 4 2 3 4 2 3 4 n n n

(a) RANSAC, α = 1 (b) RANSAC, α = 0.8 (c) RANSAC, α = 0.6

0.97 0.97 0.97

0.89 0.89 0.89 d/D d/D d/D

0.75 0.75 0.75

2 3 4 2 3 4 2 3 4 n n n

(d) REAPER, α = 1 (e) REAPER, α = 0.8 (f) REAPER, α = 0.6

0.97 0.97 0.97

0.89 0.89 0.89 d/D d/D d/D

0.75 0.75 0.75

2 3 4 2 3 4 2 3 4 n n n

(g) DPCP-r, α = 1 (h) DPCP-r, α = 0.8 (i) DPCP-r, α = 0.6

0.97 0.97 0.97

0.89 0.89 0.89 d/D d/D d/D

0.75 0.75 0.75

2 3 4 2 3 4 2 3 4 n n n

(j) DPCP-IRLS, α = 1 (k) DPCP-IRLS, α = 0.8 (l) DPCP-IRLS, α = 0.6

Figure 3.13: Clustering accuracy as a function of the number of hyperplanes n vs relative dimension d/D vs data balancing (α). White corresponds to 1, black to 0.

135 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

0.97 0.97 0.97

0.89 0.89 0.89 d/D d/D d/D

0.75 0.75 0.75

2 3 4 2 3 4 2 3 4 n n n

(a) RANSAC (b) REAPER (c) DPCP-r

0.97 0.97 0.97

0.89 0.89 0.89 d/D d/D d/D

0.75 0.75 0.75

2 3 4 2 3 4 2 3 4 n n n

(d) DPCP-r-d (e) DPCP-d (f) DPCP-IRLS

Figure 3.14: Clustering accuracy as a function of the number of hyperplanes n vs relative dimension d/D. Data balancing parameter is set to α =0.8.

In 3D point cloud analysis, the coordinate dimension is 3, but since the various planes • do not necessarily pass through a common origin, i.e., they are affine, one may work

with homogeneous coordinates, which increases the coordinate dimension of the data

by 1 (see 4.5), i.e., D =4.

In two-view geometry one works with correspondences between pairs of 3D points. • Each such correspondence is treated as a point itself, equal to the of

the two 3D corresponding points, thus having coordinate dimension D =9.

136 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

0.5 0.5 0.5 ) ) ) M M M

+ 0.3 + 0.3 + 0.3 N N N ( ( ( M/ M/ M/ 0.1 0.1 0.1

2 3 4 2 3 4 2 3 4 n n n

(a) RANSAC, α = 1 (b) RANSAC, α = 0.8 (c) RANSAC, α = 0.6

0.5 0.5 0.5 ) ) ) M M M

+ 0.3 + 0.3 + 0.3 N N N ( ( ( M/ M/ M/ 0.1 0.1 0.1

2 3 4 2 3 4 2 3 4 n n n

(d) REAPER, α = 1 (e) REAPER, α = 0.8 (f) REAPER, α = 0.6

0.5 0.5 0.5 ) ) ) M M M

+ 0.3 + 0.3 + 0.3 N N N ( ( ( M/ M/ M/ 0.1 0.1 0.1

2 3 4 2 3 4 2 3 4 n n n

(g) DPCP-r, α = 1 (h) DPCP-r, α = 0.8 (i) DPCP-r, α = 0.6

0.5 0.5 0.5 ) ) ) M M M

+ 0.3 + 0.3 + 0.3 N N N ( ( ( M/ M/ M/ 0.1 0.1 0.1

2 3 4 2 3 4 2 3 4 n n n

(j) DPCP-IRLS, α = 1 (k) DPCP-IRLS, α = 0.8 (l) DPCP-IRLS, α = 0.6

Figure 3.15: Clustering accuracy as a function of the number of hyperplanes n vs outlier ratio vs data balancing (α). White corresponds to 1, black to 0.

137 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

0.5 0.5 0.5 ) ) ) M M M

+ 0.3 + 0.3 + 0.3 N N N ( ( ( M/ M/ M/ 0.1 0.1 0.1

2 3 4 2 3 4 2 3 4 n n n

(a) RANSAC (b) REAPER (c) DPCP-r

0.5 0.5 0.5 ) ) ) M M M + + 0.3 + 0.3 0.3 N N N ( ( ( M/ M/ M/ 0.1 0.1 0.1

2 3 4 2 3 4 2 3 4 n n n

(d) DPCP-r-d (e) DPCP-d (f) DPCP-IRLS

Figure 3.16: Clustering accuracy as a function of the number of hyperplanes n vs outlier

ratio. Data balancing parameter is set to α =0.8.

As a consequence, we choose D = 4, 9 as well as D = 30, where the choice of 30 is

a-posteriori justified as being sufficiently larger than 4 or 9, so that the clustering problem becomes more challenging. For each choice of D we randomly generate n = 2, 3, 4 hy- perplanes of RD and sample each hyperplane as follows. For each choice of n the total number of points in the dataset is set to 300n, and the number Ni of points sampled from

i 1 hyperplane i> 1 is set to Ni = α − Ni 1, so that −

n n 1 N = (1+ α + + α − )N = 300n. (3.288) i ··· 1 i=1 X Here α (0, 1] is a parameter that controls the balancing of the clusters: α = 1 means ∈

138 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

the clusters are perfectly balanced, while smaller values of α lead to less balanced clusters.

In our experiment we try α = 1, 0.8, 0.6. Having specified the size of each cluster, the points of each cluster are sampled from a zero-mean unit-variance Gaussian distribution with support in the corresponding hyperplane. To make the experiment more realistic, we corrupt points from each hyperplane by adding white Gaussian noise of standard deviation

σ = 0.01 with support in the direction orthogonal to the hyperplane. Moreover, we cor- rupt the dataset by adding 10% outliers sampled from a standard zero-mean unit-variance

Gaussian distribution with support in the entire ambient space.

Algorithms and parameters. In Algorithm 3.5 we solve the DPCP problem by using all four variations introduced in 3.2.4, i.e., DPCP-r, DPCP-r-d, DPCP-d and DPCP-IRLS, § thus leading to four different versions of the algorithm. All DPCP algorithms are set to terminate if either a maximal number of 20 iterations for DPCP-r or 100 iterations for

DPCP-r-d,DPDP-d, DPCP-IRLS is reached, or if the algorithm converges within accuracy

3 of 10− . We also compare with the REAPER analog of Algorithm 3.5, where the computa- tion of each normal vector is done by the IRLS version of REAPER (see chapter 2) instead of DPCP. As with the DPCP algorithms, its maximal number of iterations is 100 and its

3 convergence accuracy is 10− .

Finally, we compare with RANSAC, which is the predominant method for clustering hyperplanes in low ambient dimensions (e.g., for D = 4, 9). For fairness, we implement a version of RANSAC which involves the same weighting scheme as Algorithm 3.5, but instead of weighting the points, it uses the normalized weights as a discrete probability

139 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

distribution on the data points; thus points that lie close to some of the already computed

hyperplanes, have a low probability of being selected. Moreover, we control the running

time of RANSAC so that it is as slow as DPCP-r, the latter being the slowest among the

four DPCP algorithms.

Results. Since not all results can fit in a single figure, we show the mean clustering accuracy over 50 independent experiments in Figure 3.13 only for RANSAC, REAPER,

DPCP-r and DPDP-IRLS (i.e., not including DPCP-r-d and DPCP-d), but for all values

α = 1, 0.8, 0.6, as well as in Figure 3.14 for all methods but only for α = 0.8. The

accuracy is normalized to range from 0 to 1, with 0 corresponding to black, and 1 to white.

First, observe that the performance of almost all methods improves significantly as the

clusters become more unbalanced (α = 1 α = 0.6). This is intuitively expected, → as the smaller α is the more dominant is the i-th hyperplane with respect to hyperplanes i +1,...,n, and so the more likely it is to be identified at iteration i of the sequential

algorithm. The only exception to this intuitive phenomenon is RANSAC, which appears to

be insensitive to the balancing of the data. This is because RANSAC is configured to run

a relatively long amount of time, approximately equal to the running time of DPCP-r, and

as it turns out this compensates for the unbalancing of the data, since very many different

samplings take place, thus leading to approximately constant behavior across different α.

In fact, notice that RANSAC is the best among all methods when D = 4, with mean clustering accuracy ranging from 99% when n = 2, to 97% when n = 4. On the other hand, RANSAC’s performance drops dramatically when we move to higher coordinate

140 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

dimensions and more than 2 hyperplanes. For example, for α = 0.8 and n = 4, the mean clustering accuracy of RANSAC drops from 97% for D = 4, to 44% for D = 9, to 28% for D = 30. This is clearly due to the fact that the probability of sampling D 1 points − from the same hyperplane decreases dramatically as D increases.

Secondly, the proposed Algorithm 3.5 using DPCP-r is uniformly the best method (and using DPCP-IRLS is the second best), with the slight exception of D = 4, where its clus- tering accuracy ranges for α = 0.8 from 99% for n = 2 (same as RANSAC), to 89% for

n = 4, as opposed to the 97% of RANSAC for the latter case. In fact, all DPCP variants

were superior than RANSAC or REAPER in the challenging scenario of D = 30,n = 4, where for α =0.6, DPCP-r, DPCP-IRLS, DPCP-d and DPCP-r-d gave 86%, 81%, 74% and

52% accuracy respectively, as opposed to 28% for RANSAC and 42% for REAPER.

Table 3.1 reports running times in seconds for α = 1 and n = 2, 4. It is readily seen that DPCP-r is the slowest among all methods (recall that RANSAC has been configured to be as slow as DPCP-r). Remarkably, DPCP-d and REAPER are the fastest among all methods with a difference of approximately two orders of magnitude from DPCP-r. How- ever, as we saw above, none of these methods performs nearly as well as DPCP-r. From that perspective, DPCP-IRLS is interesting, since it seems to be striking a balance between running time and performance.

Moving on, we fix D = 9 and vary the outlier ratio as 10%, 30%, 50% (in the pre- vious experiment the outlier ratio was 10%). Then the mean clustering accuracy over 50 independent trials is shown in Figure 3.15 and Figure 3.16. In this experiment the hierar-

141 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Table 3.1: Mean running times in seconds, corresponding to the experiment of Figure 3.13 for data balancing parameter α =1.

n = 2 n = 4

D = 4 D = 9 D = 30 D = 4 D = 9 D = 30

RANSAC 1.18 1.25 1.76 8.27 11.61 16.00 REAPER 0.05 0.04 0.04 0.18 0.17 0.19 DPCP-r 1.18 1.24 1.75 8.21 11.55 15.89 DPCP-d 0.02 0.02 0.05 0.10 0.16 0.42 DPCP-IRLS 0.12 0.14 0.21 0.77 0.81 0.82 chy of the methods is clear: Algorithm 3.5 using DPCP-r and using DPCP-IRLS are the best and second best methods, respectively, while the rest of the methods perform equally poorly. As an example, in the challenging scenario of n = 4,D = 30 and 50% outliers, for α = 0.6, DPCP-r gives 74% accuracy, while the next best method is DPCP-IRLS with

58% accuracy; in that scenario RANSAC and REAPER give 38% and 41% accuracy re- spectively, while DPCP-r-d and DPCP-d give 41% and 40% respectively. Moreover, for n = 2,D = 30 and α = 0.8 DPCP-r and DPCP-IRLS give 95% and 86% accuracy, while all other methods give about 65%.

3.3.6.2 3D plane clustering of real Kinect data

Dataset and objective. In this section we explore various hyperplane clustering algorithms using the benchmark dataset NYUdepthV2 [83]. This dataset consists of 1449 RGBd data instances acquired using the Microsoft kinect sensor. Each instance of NYUdepthV2 corre-

142 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

sponds to an indoor scene, and consists of the 480 640 3 RGB data together with depth × × data for each pixel, i.e., a total of 480 640 depth values. In turn, the depth data can be used · to reconstruct a 3D point cloud associated to the scene. In this experiment we use such 3D point clouds to learn plane arrangements and segment the pixels of the corresponding RGB images based on their plane membership. This is an important problem in robotics, where estimating the geometry of a scene is essential for successful robot navigation.

Manual annotation. While the coarse geometry of most indoor scenes can be approx- imately described by a union of a few ( 9) 3D planes, many points in the scene do not lie ≤ in these planes, and may thus be viewed as outliers. Moreover, it is not always clear how many planes one should select or which these planes are. In fact, NYUdepthV2 does not contain any ground truth annotation based on planes, rather the scenes are annotated seman- tically with the task of object recognition in mind. For this reason, among a total of 1449 scenes, we manually annotated 92 scenes, in which the dominant planes are well-defined and capture most of the scene; see for example Figs. 3.19(a)-3.19(b) and 3.17(a)-3.17(b).

More specifically, for each of the 92 images, at most 9 dominant planar regions were man- ually marked in the image and the set of pixels within these regions were declared inliers, while the remaining pixels were declared outliers. For each planar region a ground truth normal vector was computed using DPCP-r. Finally, two planar regions that were consid- ered distinct during manual annotation, were merged if the absolute inner product of their corresponding normal vector was above 0.999.

Pre-processing. For computational reasons, the hyperplane clustering algorithms that

143 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

we use (to be described in the next paragraph) do not act directly on the original 3D point

cloud, rather on a weighted subset of it, corresponding to a superpixel representation of

each image. In particular, each 480 640 3 RGB image is segmented to about 1000 × × superpixels and the entire 3D point sub-cloud corresponding to each superpixel is replaced

by the point in the geometric center of the superpixel. To account for the fact that the

planes associated with an indoor scene are affine, i.e., they do not pass through a common

origin, we work in homogeneous coordinates, i.e., we append a fourth coordinate to each

3D point representing a superpixel and normalize it to unit `2-norm. Finally, a weight is assigned to each representative 3D point, equal to the number of pixels in the underlying

superpixel. The role of this weight is to regulate the influence of each point in the modeling

error (points representing larger superpixels should have more influence).

Algorithms. The first algorithm that we test is the sequential RANSAC algorithm,

also included in the experiments of 3.3.6.1. Secondly, we explore a family of variations § of the K-Hyperplanes (KH) algorithm based on SVD, DPCP, REAPER and RANSAC. In

particular, KH(2)-SVD indicates the classic KH algorithm which computes normal vectors

through the Singular Value Decomposition (SVD), and minimizes an `2 objective. KH(1)-

DPCP-r-d, KH(1)-DPCP-d and KH(1)-DPCP-IRLS, denote KH variations of DPCP ac- cording to Algorithm 3.6, depending on which method is used to solve the DPCP problem

(3.3) 13. Similarly, KH(1)-REAPER and KH(1)-RANSAC denote the obvious adaptation of KH where the normals are computed with REAPER and RANASC, respectively, and an

13KH(1)-DPCP-r was not included since it was slowing down the experiment considerably, while its per- formance was similar to the rest of the DPCP methods.

144 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

`1 objective is minimized.

A third method that we explore is a hybrid between Algebraic Subspace Clustering

(ASC), RANSAC and KH, (ASC-RANSAC-KH). First, the vanishing polynomial associ- ated to ASC (see Chapter 4) is computed with RANSAC instead of SVD, which is the traditional way; this ensures robustness to outliers. Then spectral clustering applied on the angle-based affinity associated to ASC (see equation (4.17)) yields n clusters. Finally, one iteration of KH-RANSAC refines these clusters and also yields a normal vector per cluster

(the normal vectors are necessary for the post-processing step).

Post-processing. The algorithms described above, are generic hyperplane clustering

algorithms. On the other hand, we know that nearby points in a 3D point cloud have a high

chance of lying in the same plane, simply because indoor scenes are spatially coherent.

Thus to associate a spatially smooth image segmentation to each algorithm, we use the

normal vectors b1,..., bn that the algorithm produced to minimize a Conditional-Random-

Field [92] type of energy function, given by

N E(y ,...,y ) := d(b , x )+ λ w(x , x )δ(y = y ). (3.289) 1 N yj j j k j 6 k j=1 k j X X∈N

In (3.289) y 1,...,n is the plane label of point x , d(by , x ) is a unary term that j ∈ { } j j j measures the cost of assigning 3D point xj to the plane with normal byj , w(xj, xk) is a

pairwise term that measures the similarity between points xj and xk, λ > 0 is a chosen

parameter, indexes the neighbors of x , and δ( ) is the indicator function. The unary Nj j ·

term is defined as d(b , x ) = b> x , which is the Euclidean distance from point x to yj j | yj j| j

145 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

the plane with normal byj , and the pairwise term is defined as

x x 2 k j − kk2 − 2σ2 w(xj, xk) := CBj,k e d , (3.290) where x x is the Euclidean distance from x to x , and CB is the length of the k j − kk2 j k j,k common boundary between superpixels j and k. The minimization of the energy function is done via Graph-Cuts [8].

Parameters. For the thresholding parameter of RANSAC, denoted by τ, we test the values 0.1, 0.01, 0.001. For the parameter τ of KH(1)-DPCP-d and KH(1)-DPCP-r-d we test the values 0.1, 0.01, 0.001. We also use the same values for the thresholding parameter of RANSAC, which we also denote by τ. The rest of the parameters of the DPCP variants and REAPER are set as in 3.3.6.1. § 3 The convergence accuracy of the KH algorithms is set to 10− . Moreover, the KH algorithms are configured to allow for a maximal number of 10 random restarts and 100 iterations per restart, but the overall running time of each KH algorithm should not exceed

5 seconds; this latter constraint is also enforced to the sequential RANSAC and ASC-

RANSAC-KH.

The parameter σd in (3.289) is set to the mean distance between 3D points representing neighboring superpixels. The parameter λ in (3.289) is set to the inverse of twice the maximal row-sum of the pairwise matrix w(x , x ) ; this is to achieve a balance between { j k } unary and pairwise terms.

Evaluation. Recall that none of the algorithms considered in this section is explicitly configured to detect outliers, rather it assigns each and every point to some plane. Thus

146 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

we compute the clustering error as follows. First, we restrict the output labels of each

algorithm to the indices of the dominant ground-truth cluster, and measure how far are

these restricted labels from being identical (identical labels would signify that the algorithm

identified perfectly well the plane); this is done by computing the ratio of the restricted

labels that are different from the dominant label. Then the dominant label is disabled and

a similar error is computed for the second dominant ground-truth plane, and so on. Finally

the clustering error is taken to be the weighted sum of the errors associated with each

dominant plane, with the weights proportional to the size of the ground-truth cluster.

We evaluate the algorithms in several different settings. First, we test how well the

algorithms can cluster the data into the first n dominant planes, where n is 2, 4 or equal to

the total number of annotated planes for each scene. Second, we report the clustering error

before spatial smoothing, i.e., without refining the clustering by minimizing (3.289), and

after spatial smoothing. The former case is denoted by GC(0), indicating that no graph-cuts

takes place, while the latter is indicated by GC(1). Finally, to account for the randomness

in RANSAC as well as the random initialization of KH, we average the clustering errors

over 10 independent experiments.

Results. The results are reported in Table 3.2, where the clustering error of the methods

that depend on the parameter τ are shown for each value of τ individually, as well as averaged over all three values.

As a first observation, notice that spatial smoothing improves the clustering accuracy considerably (GC(0) vs GC(1)); e.g., the clustering error of the traditional KH(2)-SVD for

147 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

all ground-truth planes drops from 26.22% to 16.71%, when spatial smoothing is employed.

Moreover, as it is intuitively expected, the clustering error increases when fitting more

planes (larger n) is required; e.g., for the GC(1) case, the error of KH(2)-SVD increases

from 9.96% for n =2 to 16.71% for all planes (n 9). ≈ Next, we note the remarkable insensitivity of the DPCP-based methods KH(1)-DPCP-d

and KH(1)-DPCP-r-d to variations of the parameter τ. In sharp contrast, RANSAC is very sensitive to τ; e.g., for τ =0.01 and n =2, RANSAC is the best method with 6.27%, while

for τ =0.1, 0.001 its error increases to 16.01% and 15.26% respectively. Interestingly, the

hybrid KH(1)-RANSAC is significantly more robust; in fact, in terms of clustering error

it is the best method. On the other hand, by looking at the lower part of Table 3.2, we

conclude that on average the rest of the methods have very similar behavior.

Figs. 3.17-3.20 show some segmentation results for two scenes, with and without spa-

tial smoothing. It is remarkable that, even though the segmentation in Figure 3.17 contains

artifacts, which are expected due to the lack of spatial smoothing, its quality is actually very

good, in that most of the dominant planes have been correctly identified. Indeed, applying

spatial smoothing (Figure 3.18) further drops the error for most methods only by about 1%.

3.4 Conclusions

In this Chapter of the thesis we introduced a new single subspace learning method termed

Dual Principal Component Pursuit (DPCP), which is based on a non-convex sparse op- timization problem on the sphere. In sharp contrast to current state-of-the-art subspace

148 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

Table 3.2: 3D plane clustering error for a subset of the real Kinect dataset NYUdepthV2. n is the number of fitted planes. GC(0) and GC(1) refer to clustering error without or with spatial smoothing, respectively.

n 2 n 4 all ≤ ≤ GC(0) GC(1) GC(0) GC(1) GC(0) GC(1)

1 (τ = 10− ) ASC-RANSAC-KH 16.25 12.68 28.22 19.56 32.95 19.83 RANSAC 16.01 12.89 29.63 23.96 36.00 28.70 KH(1)-RANSAC 12.73 9.12 22.16 14.56 28.61 18.22 KH(1)-DPCP-r-d 12.25 8.72 19.78 13.15 24.37 15.55 KH(1)-DPCP-d 12.45 8.95 19.91 13.30 24.66 16.04

2 (τ = 10− ) ASC-RANSAC-KH 9.50 5.19 19.72 10.29 25.15 12.18 RANSAC 6.27 3.29 12.84 6.75 19.17 9.81 KH(1)-RANSAC 7.97 5.06 14.37 8.78 20.34 12.43 KH(1)-DPCP-r-d 12.70 9.07 21.46 13.98 25.94 16.20 KH(1)-DPCP-d 12.70 9.08 21.50 14.03 26.04 16.22

3 (τ = 10− ) ASC-RANSAC-KH 8.75 4.80 20.35 10.72 24.46 11.95 RANSAC 15.26 8.34 25.89 11.49 33.08 13.90 KH(1)-RANSAC 7.48 4.79 13.86 8.79 19.39 12.07 KH(1)-DPCP-r-d 12.93 9.33 21.06 13.60 25.65 16.27 KH(1)-DPCP-d 12.93 9.33 21.06 13.59 25.63 16.23

(mean) ASC-RANSAC-KH 11.50 7.56 22.76 13.52 27.52 14.65 (mean) RANSAC 12.51 8.17 22.78 14.07 29.42 17.47 (mean) KH(1)-RANSAC 9.39 6.32 16.80 10.71 22.78 14.24 (mean) KH(1)-DPCP-r-d 12.63 9.04 20.77 13.58 25.32 16.01 (mean) KH(1)-DPCP-d 12.69 9.12 20.83 13.64 25.45 16.16 KH(2)-SVD 13.36 9.96 21.85 14.40 26.22 16.71 KH(1)-REAPER 12.45 8.98 20.94 13.71 25.52 16.27 KH(1)-DPCP-IRLS 12.47 9.01 20.77 13.64 25.38 16.10

149 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

(a) original image (b) annotation

(c) ASC-RANSAC-KH (24.7%) (d) RANSAC (9.11%) (e) KH(1)-RANSAC (11.9%)

(f) KH(1)-DPCP-r-d (9.88%) (g) KH(1)-DPCP-d (10.24%) (h) KH(2)-SVD (9.78%)

(i) KH(1)-REAPER (9.05%) (j) KH(1)-DPCP-IRLS (9.78%)

Figure 3.17: Segmentation into planes of image 5 in dataset NYUdepthV2 without spatial smoothing. Numbers are segmentation errors.

150 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

(a) original image (b) annotation

(c) ASC-RANSAC-KH (7.98%) (d) RANSAC (3.24%) (e) KH(1)-RANSAC (8.36%)

(f) KH-DPCP-ADM (8.36%) (g) KH(1)-DPCP-d (8.057%) (h) KH(2)-SVD (8.36%)

(i) KH(1)-REAPER (8.07%) (j) KH(1)-DPCP-IRLS (8.36%)

Figure 3.18: Segmentation into planes of image 5 in dataset NYUdepthV2 with spatial smoothing. Numbers are segmentation errors.

151 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

(a) original image (b) annotation

(c) ASC-RANSAC-KH (23.6%) (d) RANSAC (7.9%) (e) KH(1)-RANSAC (8.6%)

(f) KH(1)-DPCP-r-d (12.24%) (g) KH(1)-DPCP-d (12.73%) (h) KH(2)-SVD (43.0%)

(i) KH(1)-REAPER (22.82%) (j) KH(1)-DPCP-IRLS (13.9%)

Figure 3.19: Segmentation into planes of image 2 in dataset NYUdepthV2 without spatial smoothing. Numbers are segmentation errors.

152 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT

(a) original image (b) annotation

(c) ASC-RANSAC-KH (12.3%) (d) RANSAC (5.91%) (e) KH(1)-RANSAC (9.39%)

(f) KH(1)-DPCP-r-d (10.05%) (g) KH(1)-DPCP-d (10.05%) (h) KH(2)-SVD (32.37%)

(i) KH(1)-REAPER (13.70%) (j) KH(1)-DPCP-IRLS (10.0%)

Figure 3.20: Segmentation into planes of image 2 in dataset NYUdepthV2 with spatial smoothing. Numbers are segmentation errors.

153 CHAPTER 3. DUAL PRINCIPAL COMPONENT PURSUIT learning methods, which assume low-dimensional subspaces, DPCP was shown to be theo- retically applicable irrespectively of subspace dimension, and even more surprisingly, to be naturally suited for subspaces of high relative dimensions. DPCP was further extended in the context of multiple subspaces, leading to new DPCP-based hyperplane clustering algo- rithms. The correctness of DPCP was established by both extensive theoretical analysis as well as experiments on synthetic and real data; in the case of synthetic data the performance of DPCP-based algorithms was seen to be dramatically superior to existing methods.

There are many open research directions regarding DPCP across all levels of theory, algorithms and applications. For example, on the theoretical level, it is of interest to per- form a probabilistic analysis, which is expected to lead to tighter bounds on the tolerable level of outliers. In terms of optimization, it is important to prove convergence of the It- eratively Reweighted Least-Squares (IRLS) implementation of DPCP. As far as algorithms are concerned, it is crucial to have scalable algorithms suitable for high-dimensional data.

Finally, it is interesting to further explore potential applications, particularly in connection with deep networks.

154 Chapter 4

Advances in Algebraic Subspace

Clustering

From a theoretical point of view, Algebraic Subspace Clustering (ASC), also known as

Generalized Principal Component Analysis (GPCA) [110, 111], is one of the most ap- propriate methods for learning linear subspaces of high relative dimension. However, its practical applicability has been limited due to mainly two reasons. First, most existing im- plementations of ASC are sensitive to noise, and those which are not, are suitable only for hyperplanes. Second, fitting polynomials to the data is a task of exponential complexity, which becomes cumbersome even more moderately scaled data.

The main contribution of this chapter is to address the lack of robustness to noise in

ASC, which primarily stems from the need to estimate the dimension of the space of poly-

155 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING nomials of degree n that vanish on the data. The solution that this chapter proposes is based on the idea of filtrations of subspace arrangements, which can be used to compute an underlying linear subspace of the original subspace arrangement, in a sequential fashion, as the intersection of certain hyperplanes. A rigorous theory is developed leading to Fil- trated Algebraic Subspace Clustering (FASC). As it turns out, determining whether to take one more step in the filtration or to terminate, which is equivalent to deciding whether to add one more hyperplane in the intersection or not, is a numerically much easier task than estimating the dimension of vanishing polynomials of degree n. Such robust filtrations are then used as a means of computing a pairwise affinity between the data points, upon which spectral clustering gives the clusters. The resulting algorithm, called Filtrated Spectral Al- gebraic Subspace Clustering (FSASC), not only dramatically improves the performance of earlier ASC methods, but it exhibits state-of-the-art performance on real and synthetic data.

As a second contribution in this chapter, a rigorous algebraic-geometric study of the algebraic clustering of affine subspaces is performed. Traditionally, affine subspaces have been dealt with algebraically by reduction to the linear subspace case, by working with homogeneous coordinates. This chapter establishes the correctness of this approach.

4.1 Review of algebraic subspace clustering

This section reviews the main ideas behind ASC, which solves the following problem:

Definition 4.1 (Algebraic subspace clustering problem). Given a finite set of points X

156 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

= x ,..., x lying in general position1 inside a transversal subspace arrangement2 { 1 N } = n , decompose into its irreducible components, i.e., find the number of sub- A i=1 Si A spacesSn and a basis for each subspace , i =1,...,n. Si

For the sake of simplicity, we first discuss ASC in the case of hyperplanes ( 4.1.1) and § subspaces of equal dimension ( 4.1.2), for which a closed form solution can be found using § a single polynomial. In the case of subspaces of arbitrary dimensions, the picture becomes

more involved, but a closed form solution from multiple polynomials is still available when

the number of subspaces n is known ( 4.1.3) or an upper bound m for n is known ( 4.1.4). § § In 4.1.5 we discuss one limitation of ASC due to computational complexity and a partial § solution based on a recursive ASC algorithm. In 4.1.6 we discuss another limitation of § ASC due to sensitivity to noise and a practical solution based on spectral clustering. We

conclude in 4.1.7 with the main challenge that this chapter aims to address. §

4.1.1 Subspaces of codimension 1

The basic principles of ASC can be introduced more smoothly by considering the case

where the union of subspaces is the union of n hyperplanes = n in RD. Each A i=1 Hi hyperplane is uniquely defined by its unit length normal vectorS b RD as = Hi i ∈ Hi D x R : b>x = 0 . In the language of algebraic geometry this is equivalent to saying { ∈ i }

that is the zero set of the polynomial b>x or equivalently is the algebraic variety Hi i Hi

defined by the polynomial equation b>x =0, where b>x = b x + b x with b := i i i,1 1 ··· i,D D i 1We will define formally the notion of points in general position in Definition 4.12. 2We will define formally the notion of a transversal subspace arrangement in Definition 4.4.

157 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

(b ,...,b )>,x := (x ,...,x )>. We write this more succinctly as = (b>x). i,1 i,D 1 D Hi Z i We then observe that a point x of RD belongs to n if and only if x is a root of i=1 Hi S the polynomial p(x)=(b>x) (b>x), i.e., the union of hyperplanes is the algebraic 1 ··· n A variety = (p) (the zero set of p). Notice the important fact that p is homogeneous A Z of degree equal to the number n of distinct hyperplanes and moreover it is the product of linear homogeneous polynomials bi>x, i.e., a product of linear forms, each of which defines a distinct hyperplane via the corresponding normal vector b . Hi i Given a set of points X = x N in general position in the union of hyperplanes, { j}j=1 ⊂A the classic polynomial differentiation algorithm proposed in [109,111] recovers the correct number of hyperplanes as well as their normal vectors by

1. embedding the data into a higher-dimensional space via a polynomial map,

2. finding the number of subspaces by analyzing the rank of the embedded data matrix,

3. finding the polynomial p from the null space of the embedded data matrix,

4. finding the hyperplane normal vectors from the derivatives of p at a nonsingular point

x of .3 A

More specifically, observe that the polynomial p(x)=(b>x) (b>x) can be written as a 1 ··· n linear combination of the set of all monomials of degree n in D variables,

n n 1 n 1 n 1 n x ,x − x ,x − x ...,x x − ,...,x , i.e., (4.1) { 1 1 2 1 3 1 D D} 3A nonsingular point of a subspace arrangement is a point that lies in one and only one of the subspaces that constitute the arrangement.

158 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

n1 n2 nD p(x)= c x x x = c>ν (x). (4.2) n1,n2,...,nD 1 2 ··· D n n1+n2+ nD=n X··· In the above expression, c RMn(D) is the vector of all coefficients c , and ν is ∈ n1,n2,...,nD n the Veronese or Polynomial embedding of degree n, as it is known in the algebraic geometry and machine learning literature, respectively. It is defined by taking a point of RD to a point

n(D) of RM under the rule

νn n n 1 n 1 n 1 n (x ,...,x )> x ,x − x ,x − x ...,x x − ,...,x > , (4.3) 1 D 7−→ 1 1 2 1 3 1 D D  where (D) is the dimension of the space of homogeneous polynomials of degree n in Mn D indeterminates. The image of the data set X under the Veronese embedding is used to form the so-called embedded data matrix

ν (X ):=[ν (x ) ν (x )]> . (4.4) ` ` 1 ··· ` N

It is shown in [111] that when there are sufficiently many data points that are sufficiently well distributed in the subspaces, the correct number of hyperplanes is the smallest degree

` for which ν`(X ) drops rank by 1: n = min` 1 ` : Rank(ν`(X )) = M`(D) 1 . ≥ { − } Moreover, it is shown in [111] that the polynomial vector of coefficients c is the unique up to scale vector in the one-dimensional null space of νn(X ).

It follows that the task of identifying the normals to the hyperplanes from p is equivalent to extracting the linear factors of p. This is achieved4 by observing that if we have a point

4A direct factorization has been shown to be possible as well [110]; however this approach has not been generalized yet to the case of subspaces of different dimensions.

159 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

x i i0=i i0 , then the gradient p x of p evaluated at x ∈H −∪ 6 H ∇ |

n

p x = b (b>0 x) (4.5) ∇ | j j j=1 j0=j X Y6

is equal to bi up to a scale factor because bi>x =0 and hence all the terms in the sum vanish

except for the ith (see Proposition 4.76 for a more general statement). Having identified the

normal vectors, the task of clustering the points in X is straightforward.

4.1.2 Subspaces of equal dimension

Let us now consider a more general case, where we know that the subspaces are of equal

and known dimension d. Such a case can be reduced to the case of hyperplanes, by noticing

that a union of n subspaces of dimension d of RD becomes a union of hyperplanes of Rd+1 after a generic projection π : RD Rd+1. We note that any random orthogonal projection d → will almost surely preserve the number of subspaces and their dimensions, as the set of

projections πd that do not have this preserving property is a zero measure subset of the set

(d+1) D of orthogonal projections πd R × : πdπd> = I(d+1) (d+1) . ∈ ×  When the common dimension d is unknown, it can be estimated exactly by analyz-

ing the right null space of the embedded data matrix, after projecting the data generically

onto subspaces of dimension d0 +1, with d0 = D 1,D 2,... [104]. More specifi- − −

cally, when d0 > d, we have that dim (ν (π 0 (X ))) > 1, while when d0 < d we have N n d

dim (ν (π 0 (X ))) = 0. On the other hand, the case d0 = d is the only case for which the N n d

null space is one-dimensional, and so d = d0 : dim (ν (π 0 (X ))) = 1 . { N n d }

160 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

Finally, when both n and d are unknown, one can first recover d as the smallest d0 such that there exists an ` for which dim (ν (π 0 (X ))) > 0, and subsequently recover n as the N ` d smallest ` such that dim (ν (π (X ))) > 0; see [104] for further details. N ` d

4.1.3 Known number of subspaces of arbitrary dimensions

When the dimensions of the subspaces are unknown and arbitrary, the problem becomes

much more complicated, even if the number n of subspaces is known, which is the case

examined in this subsection. In such a case, a union of subspaces = A S1 ∪···∪Sn of RD, henceforth called a subspace arrangement, is still an algebraic variety. The main

difference with the case of hyperplanes is that, in general, multiple polynomials of degree

n are needed to define , i.e., is the zero set of a finite collection of homogeneous A A polynomials of degree n in D indeterminates.

Example 4.2. Consider the union of a plane and two lines , in general position A S1 S2 S3 in R3 (Figure 4.1). Then = is the zero set of the degree-3 homogeneous A S1 ∪S2 ∪S3

2 S3 S

b1

S1

Figure 4.1: A union of two lines and one plane in general position in R3.

161 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

polynomials

p1 := (b1>x)(b2>,1x)(b3>,1x), p2 := (b1>x)(b2>,1x)(b3>,2x), (4.6)

p3 := (b1>x)(b2>,2x)(b3>,1x), p4 := (b1>x)(b2>,2x)(b3>,2x), (4.7)

where b is the normal vector to the plane and b , j = 1, 2, are two linearly indepen- 1 S1 i,j dent vectors that are orthogonal to the line , i = 2, 3. These polynomials are linearly Si independent and form a basis for the vector space ,3 of the degree-3 homogeneous poly- IA nomials that vanish on .5 A

In contrast to the case of hyperplanes, when the subspace dimensions are different, there

may exist vanishing polynomials of degree strictly less than the number of subspaces.

Example 4.3. Consider the setting of Example 4.2. Then there exists a unique up to scale

vanishing polynomial of degree 2, which is the product of two linear forms: one form is b>x, where b is the normal to the plane , and the other is f >x, where f is 1 1 S1 the normal to the plane defined by the lines and (Figure 4.2). S2 S3

As Example 4.2 shows, all the relevant geometric information is still encoded in the

6 factors of some special basis of ,n, that consists of degree-n homogeneous polynomials IA that factorize into the product of linear forms. However, computing such a basis remains, to

the best of our knowledge, an unsolved problem. Instead, one can only rely on computing

5The interested reader is encouraged to prove this claim. 6Strictly speaking, this is not always true. However, it is true if the subspace arrangement is general enough, in particular if it is transversal; see Definition 4.4 and Theorem 4.78.

162 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

2 3 S S H23 f

b1

S1

Figure 4.2: The geometry of the unique degree-2 polynomial p(x)=(b1>x)(f >x) that vanishes on . b is the normal vector to plane and f is the normal vector to S1 ∪S2 ∪S3 1 S1 the plane spanned by lines and . H23 S2 S3

(or be given) a general basis for the vector space ,n. In our example such basis could be IA

p + p , p p , p + p , p p (4.8) 1 4 1 − 4 2 3 2 − 3

and it can be seen that none of these polynomials is factorizable into the product of linear

forms. This difficulty was not present in the case of hyperplanes, because there was only

one vanishing polynomial (up to scale) of degree n and it had to be factorizable.

In spite of this difficulty, a solution can still be achieved in an elegant fashion by re-

sorting to polynomial differentiation. The key fact that allows this approach is that any

homogeneous polynomial p of degree n that vanishes on the subspace arrangement is a A linear combination of vanishing polynomials, each of which is a product of linear forms,

with each distinct subspace contributing a vanishing linear form in every product (Theo-

rem 4.78). As a consequence (Proposition 4.76), the gradient of p evaluated at some point

x i i0=i i0 lies in i⊥ and the of the gradients at x of all such p is pre- ∈ S −∪ 6 S S

cisely equal to ⊥. We can thus recover , remove it from and then repeat the procedure Si Si A

163 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

to identify all the remaining subspaces. As stated in Theorem 4.6, this process is provably

correct as long as the subspace arrangement is transversal, as defined next. A

Definition 4.4 (Transversal subspace arrangement [20]). A subspace arrangement = A n D i=1 i of R is called transversal, if for any subset I of [n], the codimension of i I i is S ∈ S Sthe minimum between D and the sum of the codimensions of all , i I. T Si ∈

Remark 4.5. Transversality is a geometric condition on the subspaces, which in particular requires the dimensions of all possible intersections among subspaces to be as small as the dimensions of the subspaces allow (see Appendix 4.7.3 for a discussion).

Theorem 4.6 (ASC by polynomial differentiation when n is known, [72, 111]). Let = A n D i=1 i be a transversal subspace arrangement of R , let x i i0=i i0 be a nonsin- S ∈S − 6 S S S gular point in , and let ,n be the vector space of all degree-n homogeneous polynomials A IA that vanish on . Then is the orthogonal complement of the subspace spanned by all A Si vectors of the form p x, where p ,n, i.e., i = Span( ,n x)⊥. ∇ | ∈IA S ∇IA |

Theorem 4.6 and its proof are illustrated in the next example.

Example 4.7. Consider Example 4.2 and recall that p1 =(b1>x)(b2>,1x)(b3>,1x), p2 = (b1>x)(b2>,1x)(b3>,2x), p3 = (b1>x)(b2>,2x)(b3>,1x), and p4 = (b1>x)(b2>,2x)(b3>,2x). Let x be a generic point in . Then 2 S2 −S1 ∪S3

p x = p x = b , p x = p x = b . (4.9) ∇ 1| 2 ∼ ∇ 2| 2 ∼ 2,1 ∇ 3| 2 ∼ ∇ 4| 2 ∼ 2,2

164 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

Hence b2,1, b2,2 Span( ,3 x2 ) and so 2 Span( ,3 x2 )⊥. Conversely, let p ∈ ∇IA | S ⊃ ∇IA | ∈ 4 ,3. Then there exist αi R, i =1,..., 4, such that p = αipi and so IA ∈ i=1

4 P

p x = α p x Span(b , b )= ⊥. (4.10) ∇ | 2 i∇ i| 2 ∈ 2,1 2,2 S2 i=1 X

Hence ,3 x2 2⊥, and so Span( ,3 x2 )⊥ 2. ∇IA | ⊂S ∇IA | ⊃S

4.1.4 Unknown number of subspaces of arbitrary dimensions

As it turns out, when the number of subspaces n is unknown, but an upper bound m n is ≥ given, one can obtain the decomposition of the subspace arrangement from the gradients of

the vanishing polynomials of degree m, precisely as in Theorem 4.6, simply by replacing

n with m.

Theorem 4.8 (ASC when an upper bound on n is known, [72,111]). Let = n be a A i=1 Si D S transversal subspace arrangement of R , let x i i0=i i0 be a nonsingular point in ∈S − 6 S S , and let ,m be the vector space of all degree-m homogeneous polynomials that vanish A IA on , where m n. Then is the orthogonal complement of the subspace spanned by all A ≥ Si

vectors of the form p x, where p ,m, i.e., i = Span( ,m x)⊥. ∇ | ∈IA S ∇IA |

Example 4.9. Consider the setting of Examples 4.2 and 4.3. Suppose that we have the

upper bound m =4 on the number of underlying subspaces (n = 3). It can be shown that

165 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

7 the vector space ,4 has dimension 8 and is spanned by the polynomials IA

3 2 q1 := (b1>x)(f >x) , q5 := (b1>x)(f >x)(b3>x) , (4.11)

2 2 2 q2 := (b1>x) (f >x) q6 := (b1>x)(b2>x) (f >x), (4.12)

3 2 q3 := (b1>x) (f >x), q7 := (b1>x)(b2>x) (b3>x), (4.13)

2 2 q4 := (b1>x)(f >x) (b3>x), q8 := (b1>x)(b2>x)(b3>x) , (4.14) where b is the normal to , f is the normal to the plane defined by lines and , and 1 S1 S2 S3 b is a normal to line that is linearly independent from f, for i = 2, 3. Hence = i Si S1

Span(b )⊥ and = Span(f, b )⊥, i =2, 3. Then for a generic point x , 1 Si i 2 ∈S2 −S1 ∪S3 we have that

q x = q x = q x = q x = q x =0, (4.15) ∇ 1| 2 ∇ 2| 2 ∇ 4| 2 ∇ 6| 2 ∇ 7| 2

q x = q x = f, q x = b . (4.16) ∇ 3| 2 ∼ ∇ 5| 2 ∼ ∇ 8| 2 ∼ 2

Hence f, b2 Span( ,4 x2 ) and so 2 Span( ,4 x2 )⊥. Similarly to Example ∈ ∇IA | S ⊃ ∇IA |

4.7, since every element of ,4 is a linear combination of the q`,` = 1,..., 8, we have IA

2 = Span( ,4 x2 )⊥. S ∇IA |

Remark 4.10. Notice that both Theorems 4.6 and 4.8 are statements about the abstract subspace arrangement , i.e., no finite subset X of is explicitly considered. To pass A A from to X and get similar theorems, we need to require X to be in general position in A , in some suitable sense. As one may suspect, this notion of general position must entail A 7This can be verified by applying the dimension formula of Corollary 3.4 in [20].

166 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING that polynomials of degree n for Theorem 4.6, or of degree m for Theorem 4.8, that vanish on X must also vanish on and vice versa. In that case, we can compute the required A basis for ,n, simply by computing a basis for X ,n, by means of the Veronese embedding IA I described in 4.1.1, and similarly for ,m. We will make the notion of general position § IA precise in Definition 4.12.

4.1.5 Computational complexity and recursive ASC

Although Theorem 4.8 is quite satisfactory from a theoretical point of view, using an upper bound m n for the number of subspaces comes with the practical disadvantage that the ≥ dimension of the Veronese embedding, Mm(D), grows exponentially with m. In addition, increasing m also increases the number of polynomials in the null space of νm(X ), some which will eventually, as m becomes large, be polynomials that simply fit the data X but do not vanish on . To reduce the computational complexity of the polynomial differentiation A algorithm, one can consider vanishing polynomials of smaller degree, m

167 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING the data, computing the gradients of these vanishing polynomials to cluster the data into m0 n groups, and then repeating the procedure for each group until the data from each ≤ group can be fit by polynomials of degree 1, in which case each group lies in single linear subspace. While this recursive ASC algorithm is very intuitive, no rigorous proof of its correctness has appeared in the literature. In fact, there are examples where this recursive method provably fails in the sense of producing ghost subspaces in the decomposition of

. For instance, when partitioning the data from Example 4.3 into two planes and , A S1 H23 we may assign the data from the intersection of the two planes to . If this is the case, H23 when trying to partition further the data of , we will obtain three lines: , and the H23 S2 S3 ghost line = (see Figure 4.3(a)). S4 S1 ∩H23

4.1.6 Instability in the presence of noise and spectral ASC

Another important issue with Theorem 4.8 from a practical standpoint is its sensitivity to noise. More precisely, when implementing Theorem 4.8 algorithmically, one is required to estimate the dimension of the null space of νm(X ), which is an extremely challenging prob- lem in the presence of noise. Moreover, small errors in the estimation of dim (ν (X )) N m have been observed to have dramatic effects in the quality of the clustering, thus render- ing algorithms that are directly based on Theorem 4.8 unstable. While the recursive ASC algorithm of [51, 111] is more robust than such algorithms, it is still sensitive to noise, as considerable errors may occur in the partitioning process. Moreover, the performance of the recursive algorithm is always subject to degradation due to the potential occurrence of

168 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

ghost subspaces.

To enhance the robustness of ASC in the presence of noise and obtain a stable working

algebraic algorithm, the standard practice has been to apply a variation of the polynomial

differentiation algorithm based on spectral clustering. More specifically, given noisy data

X lying close to a union of n subspaces , one computes an approximate vanishing poly- A nomial p whose coefficients are given by the right singular vector of νn(X ) corresponding

to its smallest singular value. Given p, one computes the gradient of p at each point in

X (which gives a normal vector associated with each point in X ), and builds an affinity

matrix between points xj and xj0 as the cosine of the angle between their corresponding

normal vectors, i.e.,

p xj p xj0 Cjj0,angle = ∇ | , ∇ | . (4.17) p xj p xj0 D||∇ | || ||∇ | ||E

This affinity is then used as input to any spectral clustering algorithm (see [115] for a

X n tutorial on spectral clustering) to obtain a clustering = i=1 xi. We call this Spectral

ASC method with angle-based affinity as SASC-A. S

To gain some intuition about C, suppose that is a union of n hyperplanes and that A

there is no noise in the data. Then p must be of the form p(x)=(b>x) (b>x). In 1 ··· n

this case Cjj0 is simply the cosine of the angle between the normals to the hyperplanes

that are associated with points xj and xj0 . If both points lie in the same hyperplane, their

normals must be equal, and hence Cjj0 = 1. Otherwise, Cjj0 < 1 is the cosine of the

angles between the hyperplanes. Thus, assuming that the smallest angle between any two

hyperplanes is sufficiently large and that the points are well distributed on the union of the

169 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

hyperplanes, applying spectral clustering to the affinity matrix C will in general yield the correct clustering.

Even though SASC-A is much more robust in the presence of noise than purely alge- braic methods for the case of a union of hyperplanes, it is fundamentally limited by the fact that, theoretically, it applies only to unions of hyperplanes. Indeed, if the orthogonal complement of a subspace has dimension greater than 1, there may be points x, x0 inside §

such that the angle between p x and p x0 is as large as 90◦. In such instances, points S ∇ | ∇ | associated to the same subspace may be weakly connected and thus there is no guarantee for the success of spectral clustering.

4.1.7 The challenge

As the discussion so far suggests, the state of the art in ASC can be summarized as follows:

1. A complete closed form solution to the abstract subspace clustering problem (Prob-

lem 4.1) exists and can be found using the polynomial differentiation algorithm im-

plied by Theorem 4.8.

2. All known algorithmic variants of the polynomial differentiation algorithm are sen-

sitive to noise, especially for subspaces of arbitrary dimensions.

3. The recursive algorithm of 4.1.5 does not in general solve the abstract subspace § clustering problem (Problem 4.1), and is in addition sensitive to noise.

4. The spectral algebraic algorithm described in 4.1.6 is less sensitive to noise, but is § 170 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

theoretically justified only for unions of hyperplanes.

The above list reveals the challenge that we will be addressing next: Develop an ASC algorithm, that solves the abstract subspace clustering problem for perfect data, while at the same time it is robust to noisy data.

4.2 Filtrated algebraic subspace clustering (FASC)

4.2.1 Filtrations of subspace arrangements: geometric overview

In this paragraph we provide an overview of the proposed Filtrated Algebraic Subspace

Clustering (FASC) algorithm, which conveys the geometry of the key idea of this chapter while keeping technicalities at a minimum. To that end, let us pretend for a moment that we have access to the entire set , so that we can manipulate it via set operations such A as taking its intersection with some other set. Then the idea behind FASC is to construct a descending filtration of the given subspace arrangement RD, i.e., a sequence of A ⊂ inclusions of subspace arrangements, that starts with and terminates after a finite number A of c steps with one of the irreducible components of :8 S A

=: = . (4.18) A A0 ⊃A1 ⊃A2 ⊃···⊃Ac S 8 We will also be using the notation =: 0 1 2 , where the arrows denote embeddings. A A ←A ←A ←···

171 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

The mechanism for generating such a filtration is to construct a strictly descending filtration

of intermediate ambient spaces, i.e.,

, (4.19) V0 ⊃V1 ⊃V2 ⊃··· such that = RD, dim( ) = dim( ) 1, and each contains the same fixed V0 Vs+1 Vs − Vs irreducible component of . Then the filtration of subspace arrangements is obtained by S A intersecting with the filtration of ambient spaces, i.e., A

:= := := . (4.20) A0 A⊃A1 A∩V1 ⊃A2 A∩V2 ⊃···

This can be seen equivalently as constructing a descending filtration of pairs ( , ), Vs As where is a subspace arrangement of : As Vs

D D 1 D 2 (R , ) ( = R − , ) ( = R − , ) . (4.21) A ← V1 ∼ A1 ← V2 ∼ A2 ←···

But how can we construct a filtration of ambient spaces (4.19), that satisfies the ap-

parently strong condition , s? The answer lies at the heart of ASC: to construct Vs ⊃ S ∀ pick a suitable polynomial p vanishing on and evaluate its gradient at a nonsingu- V1 1 A

lar point x of . Notice that x will lie in some irreducible component x of . Then A S A take to be the hyperplane of RD defined by the gradient of p at x. We know from V1 1

Proposition 4.76 that must contain x. To construct we apply essentially the same V1 S V2 procedure on the pair ( , ): take a suitable polynomial p that vanishes on , but does V1 A1 2 A1

not vanish on 1, and take 2 to be the hyperplane of 1 defined by π 1 ( p2 x). As we V V V V ∇ |

will show in Section 4.2.2, it is always the case that π 1 ( p2 x) x and so 2 x. V ∇ | ⊥ S V ⊃ S 172 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

D ∼= R x V0 A0 S

D 1 ∼= R − x V1 A1 S

D 2 ∼= R − x V2 A2 S (4.22) . . . .

D c+1 ∼= R − c 1 c 1 x V − A − S

D c ∼= ∼= ∼= R − x Vc Ac S

Now notice, that after precisely c such steps, where c is the codimension of x, will be S Vc D a (D c)-dimensional linear subspace of R that by construction contains x. But x is − S S

also a (D c)-dimensional subspace and the only possibility is that = x. Observe − Vc S also that this is precisely the step where the filtration naturally terminates, since there is

no polynomial that vanishes on x but does not vanish on . The relations between the S Vc intermediate ambient spaces and subspace arrangements are illustrated in the commutative diagram of (4.22). The filtration in (4.22) will yield the irreducible component := x of S S that contains the nonsingular point x that we started with. We will be referring to A ∈ A

such a point as the reference point. We can also take without loss of generality x = . S S1

Having identified , we can pick a nonsingular point x0 x and construct a filtration S1 ∈ A−S of as above with reference point x0. Such a filtration will terminate with the irreducible A component x0 of containing x0, which without loss of generality we take to be . Pick- S A S2

173 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

Algorithm 4.1 Filtrated Algebraic Subspace Clustering (FASC) - Geometric Version 1: procedure FASC( ) A 2: L ; ; ← ∅ L ← ∅ 3: while = do A − L 6 ∅ 4: pick a nonsingular point x in ; A−L 5: RD; V ← 6: while ( do V∩A V

7: find polynomial p that vanishes on but not on , s.t. p x = 0; A∩V V ∇ | 6

8: let be the orthogonal complement of π ( p x) in ; V V ∇ | V 9: end while

10: L L ; ; ← ∪ {V} L←L∪V 11: end while

12: return L;

13: end procedure

ing a new reference point x00 x x0 and so on, we can identify the entire list of ∈A−S ∪S irreducible components of , as described in Algorithm 4.1. A

Example 4.11. Consider the setting of Examples 4.2 and 4.3. Suppose that in the first

filtration the algorithm picks as reference point x . Suppose further that the ∈S2 −S1 ∪S3 algorithm picks the polynomial p(x)=(b>x)(f >x), which vanishes on but certainly 1 A not on R3. Then the first ambient space of the filtration associated to x is constructed V1 3 as = Span( p x)⊥. Since p x = f, this gives that is precisely the plane of R V1 ∇ | ∇ | ∼ V1

174 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

2 2 S3 S S3 S 3 2 S 23 S b2 Hf S4 S4 b 1 4 b4 S (1) (1) 1 V1 V b3 S1

(a) (b) (c)

Figure 4.3: (a): The plane spanned by lines and intersects the plane at the S2 S3 S1 line . (b): Intersection of the original subspace arrangement = S4 A S1 ∪ S2 ∪ S3 with the intermediate ambient space (1), giving rise to the intermediate subspace ar- V1 rangement (1) = . (c): Geometry of the unique degree-3 polynomial A1 S2 ∪ S3 ∪ S4

p(x)=(b>x)(b>x)(b>x) that vanishes on as a variety of the intermedi- 2 3 4 S2 ∪S3 ∪S4 ate ambient space (1). b , i =2, 3, 4. V1 i ⊥Si

with normal vector f. Then is = , and consists of the union of three lines A1 A1 A∩V1 , where is the intersection of with (see Figs. 4.3(a) and 4.3(b)). S2 ∪S3 ∪S4 S4 V1 S1 Since ( , the algorithm takes one more step in the filtration. Suppose that the A1 V1

algorithm picks the polynomial q(x)=(b2>x)(b3>x)(b4>x), where bi is the unique normal

vector of that is orthogonal to , for i =2, 3, 4 (see Fig 4.3(c)). Because of the general V1 Si position assumption, none of the lines , , is orthogonal to another. Consequently, S2 S3 S4

q x =(b3>x)(b4>x)b2 =0. Moreover, since b2 1, we have that π 1 ( q x)= q x = ∇ | 6 ∈V V ∇ | ∇ | ∼ b defines a line in that must contain . Intersecting with we obtain = 2 V1 S2 A1 V2 A2 A1 ∩

= and the filtration terminates with output the irreducible component x = = V2 V2 S S2 V2

175 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

of associated to reference point x. A

Continuing, the algorithm now picks a new reference point x0 x, say x0 . ∈A−S ∈S1

A similar process as above will identify as the intermediate ambient space = x0 of S1 V1 S the filtration associated to x0 that arises after one step. Then a third reference point will be chosen as x00 x x0 and will be identified as the intermediate ambient ∈ A−S ∪S S3 space = x00 of the filtration associated to x00 that arises after two steps. Since the V2 S set x x0 x00 is empty, the algorithm will terminate and return x, x0 , x00 , A−S ∪S ∪S {S S S } which is up to a permutation a decomposition of the original subspace arrangement into its constituent subspaces.

Strictly speaking, Algorithm 4.1 is not a valid algorithm in the computer-science the- oretic sense, since it takes as input an infinite set , and it involves operations such as A checking equality of the infinite sets and . Moreover, one may reasonably ask: V A∩V

1. Why is it the case that through the entire filtration associated with reference point x

we can always find polynomials p such that p x =0? ∇ | 6

2. Why is it true that even if p x =0 then π ( p x) =0? ∇ | 6 V ∇ | 6

We address all issues above and beyond in the next Section, which is devoted to rigorously establishing the theory of the FASC algorithm.9

9At this point the reader unfamiliar with algebraic geometry is encouraged to read the appendices before proceeding.

176 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

4.2.2 Filtrations of subspace arrangements: theory

This section formalizes the concepts outlined in 4.2. 4.2.2.1 formalizes the notion of a § § set X being in general position inside a subspace arrangement . 4.2.2.2-4.2.2.4 estab- A § lish the theory of a single filtration of a finite subset X lying in general position inside a transversal subspace arrangement , and culminate with the Algebraic Descending Filtra- A tion (ADF) algorithm for identifying a single irreducible component of (Algorithm 4.2) A and the theorem establishing its correctness (Theorem 4.28). The ADF algorithm naturally leads us to the core theoretical contribution of this chapter in 4.2.2.5, which is the FASC § algorithm for identifying all irreducible components of (Algorithm 4.3) and the theorem A establishing its correctness (Theorem 4.29).

4.2.2.1 Data in general position in a subspace arrangement

From an algebraic geometric point of view, a union of linear subspaces is the same as A the set of polynomial functions that vanish on . However, from a computer-science- IA A theoretic point of view, and are quite different: is an infinite set and hence it can A IA A not be given as input to any algorithm. On the other hand, even though is also an infinite IA set, it is generated as an ideal by a finite set of polynomials, which can certainly serve as

input to an algorithm.That said, from a machine-learning point of view, both and are A IA often unknown, and one is usually given only a finite set of points X in , from which we A wish to compute its irreducible components ,..., . S1 Sn To lend ourselves the power of the algebraic-geometric machinery, while providing an

177 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

algorithm of interest to the machine learning and computer science communities, we adopt

the following setting. The input to our algorithm will be the pair (X ,m), where X is a

finite subset of an unknown union of linear subspaces := n of RD, and m is an A i=1 Si upper bound on n. To make the problem of recovering the decompositionS = n A i=1 Si from X well-defined, it is necessary that be uniquely identifiable form X . InS other A words, X must be in general position inside , as defined next. A

Definition 4.12 (Points in general position). Let X = x ,..., x be a finite subset of a { 1 N } subspace arrangement = . We say that X is in general position in with A S1 ∪···∪Sn A

respect to degree m, if m n and = ( X ), i.e., if is precisely the zero locus of all ≥ A Z I ,m A homogeneous polynomials of degree m that vanish on X .

The intuitive geometric condition = ( X ) of Definition 4.12 guarantees that A Z I ,m there are no spurious polynomials of degree less or equal to m that vanish on X .

Proposition 4.13. Let X be a finite subset of an arrangement of n linear subspaces A of RD. Then X lies in general position inside with respect to degree m if and only if A

,k = X ,k, k m. IA I ∀ ≤

Proof. ( ) We first show that ,m = X ,m. Since X , every homogeneous polyno- ⇒ IA I A ⊃

mial of degree m that vanishes on must vanish on X , i.e., ,m X ,m. Conversely, the A IA ⊂I

hypothesis = ( X ) implies that every polynomial of X must vanish on , i.e., A Z I ,m I ,m A

,m X ,m. IA ⊃I

Now let k

178 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

direction, suppose for the sake of contradiction that there exists some p X that does ∈ I ,k not vanish on . This means that there must exist an irreducible component of , say , A A S1 such that p does not vanish on . Let ζ be a vector of RD non-orthogonal to , i.e., the S1 S1 linear form g(x) = ζ>x does not vanish on . Since p vanishes on X so will the degree S1 m k m k m polynomial g − p, i.e., g − p X ,m. But we have already shown that X ,m = ,m, ∈I I IA m k m k and so it must be the case that g − p ,m. Since g − p vanishes on , it must vanish ∈ IA A m k on 1, i.e., g − p 1 . Since by hypothesis p 1 , and since 1 is a prime ideal (see S ∈ IS 6∈ IS IS m k 4.73), it must be the case that g − 1 . Because 1 is a prime ideal, we must have that ∈IS IS

g 1 . But this is true if and only if ζ 1⊥, contradicting the definition of ζ. ∈IS ∈S

( ) Suppose ,k = X ,k, k m. We will show that = ( X ,m). But this is the ⇐ IA I ∀ ≤ A Z I same as showing that = ( ,m), which is true, by Proposition 4.75. A Z IA

The next Proposition ensures the existence of points in general position with respect to

any degree m n. ≥

Proposition 4.14. Let be an arrangement of n linear subspaces of RD and let m be any A integer n. Then there exists a finite subset X that is in general position inside ≥ ⊂ A A with respect to degree m.

Proof. By Proposition 4.75 is generated by polynomials of degree m. Then by IA ≤

Theorem 2.9 in [72], there exists a finite set X such that ,k = X ,k, k m, ⊂ A IA I ∀ ≤ which concludes the proof in view of Proposition 4.13.

Notice that there is a price to be paid by requiring X to be in general position, which is

179 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

that we need the cardinality of X to be artificially large, especially when m n is large. In −

particular, since the dimension of X ,m must match the dimension of ,m, the cardinality I IA of X must be at least Mm(D) dim( ,m). − IA

Example 4.15. Let Φ = be the union of two planes of R3 with normal vectors S1 ∪S2 b , b , and let X = x , x , x , x be four points of Φ, such that, x , x and 1 2 { 1 2 3 4} 1 2 ∈ S1 −S2 x , x . Let and be the planes spanned by x , x and x , x respectively, 3 4 ∈S2−S1 H13 H24 1 3 2 4 and let b13, b24 be the normals to these planes. Then the polynomial q(x)=(b13> x)(b24> x) certainly vanishes on X . But q does not vanish on Φ, because the only (up to a scalar)

X homogeneous polynomial of degree 2 that vanishes on Φ is p(x)=(b1>x)(b2>x). Hence is not in general position in Φ. The geometric reasoning is that two points per plane are not enough to uniquely define the union of the two planes; instead a third point in one of the planes is required.

The next result will be useful in the sequel.

Lemma 4.16. Suppose that X is in general position inside with respect to degree m. Let A (n0) n0 n0 < n. Then the set X := X X lies in general position inside the subspace − i=1 i (n0) S arrangement := 0 with respect to degree m n0. A Sn +1 ∪···∪Sn −

Proof. We begin by noting that m n0 is an upper bound on the number of subspaces − of the arrrangement (n0). According to Proposition 4.13, it is enough to prove that a A (n0) homogeneous polynomial p of degree less or equal than m n0 vanishes on X if and − only if it vanishes on (n0). So let p be a homogeneous polynomial of degree less or equal A

180 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

(n0) (n0) than m n0. If p vanishes on , then it certainly vanishes on X . It remains to prove − A (n0) the converse. So suppose that p vanishes on X . Suppose that for each i = 1,...,n0 we have a vector ζ , such that ζ 0 ,..., . Next, define the polynomial i ⊥ Si i 6⊥ Sn +1 Sn r(x)=(ζ>x) (ζ>0 x)p(x). Then r has degree m and vanishes on X . Since X is in 1 ··· n ≤ general position inside , r must vanish on . For the sake of contradiction suppose that A A p does not vanish on (n0). Then p does not vanish say on . On the other hand r does A Sn

vanish on n, hence r n or equivalently (ζ1>x) (ζn>0 x)p(x) n . Since n is a S ∈ IS ··· ∈ IS IS

prime ideal we must have either ζi>x n for some i [n0] or p n . Now, the latter ∈ IS ∈ ∈ IS

can not be true by hypothesis, thus we must have ζi>x n for some i [n0]. But this ∈ IS ∈ implies that ζ , which contradicts the hypothesis on ζ . Hence it must be the case i ⊥ Sn i that p vanishes on (n0). A

To complete the proof we show that such vectors ζi, i = 1,...,n0 always exist. It is enough to prove the existence of ζ . If every vector of RD orthogonal to were orthogonal 1 S1 to, say 0 , then we would have that ⊥ ⊥0 , or equivalently, 0 . But this Sn +1 S1 ⊂ Sn +1 S1 ⊃ Sn +1 contradicts the transversality of . A

Remark 4.17. Notice that the notion of points X lying in general position inside a sub- space arrangement is independent of the notion of transversality of (Definition 4.4). A A Nevertheless, to facilitate the technical analysis by avoiding degenerate cases of subspace arrangements, in the rest of Section 4.2.2 we will assume that is transversal. For a ge- A ometric interpretation of transversality as well as examples, the reader is encouraged to consult Appendix 4.7.3.

181 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

4.2.2.2 Constructing the first step of a filtration

We will now show how to construct the first step of a descending filtration associated with a

single irreducible component of , as in (4.22). Once again, we are given the pair (X ,m), A where X is a finite set in general position inside with respect to degree m, is transver- A A sal, and m is an upper bound on the number n of irreducible components of ( 4.2.2.1). A § To construct the first step of the filtration, we need to find a first hyperplane of RD V1 that contains some irreducible component of . According to Proposition 4.76, it would Si A be enough to have a polynomial p that vanishes on the irreducible component together 1 Si

with a point x . Then p x would be the normal to a hyperplane containing . ∈ Si ∇ 1| V1 Si Since every polynomial that vanishes on necessarily vanishes on , i = 1,...,n, a A Si ∀

reasonable choice is a vanishing polynomial of minimal degree k, i.e., some 0 = p1 ,k, 6 ∈IA where k is the smallest degree at which is non-zero. Since X is assumed in general IA

position in with respect to degree m, by Proposition 4.13 we will have ,k = X ,k, A IA I

and so our p1 can be computed as an element of the right null space of the embedded data

matrix νk(X ). The next lemma ensures that given any such p1, there is always a point x in

X such that p x =0. ∇ 1| 6

Lemma 4.18. Let 0 = p X be a vanishing polynomial of minimal degree. Then 6 1 ∈ I ,k

there exists 0 = x X such that p x = 0, and moreover, without loss of generality 6 ∈ ∇ 1| 6 x . ∈S1 − i>1 Si S Proof. We first establish the existence of a point x X such that p x = 0. For the ∈ ∇ 1| 6

182 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

sake of contradiction, suppose that no such x X exists. Since 0 = p X , p can ∈ 6 1 ∈ I ,k 1 not be a constant polynomial, and so there exists some j [D] such that the degree k 1 ∈ −

∂p1 polynomial is not the zero polynomial. Now, by hypothesis p1 = 0, x X , ∂xj ∇ x ∀ ∈

∂p1 ∂p1 hence = 0, x X . But then, 0 = X ,k 1 and this would contradict the ∂xj x ∀ ∈ 6 ∂xj ∈ I − hypothesis that k is the smallest index such that X =0. Hence there exists x X such I ,k 6 ∈ that p x = 0. To show that x can be chosen to be non-zero, note that if k =1, then p ∇ 1| 6 ∇ 1 is a constant vector and we can take x to be any non- of X . If k > 1 then

p 0 = 0 and so x must necessarily be different from zero. ∇ 1| Next, we establish that x . Without loss of generality we can assume ∈ S1 − i>1 Si that x X := X . For the sakeS of contradiction, suppose that x for ∈ 1 ∩S1 ∈ S1 ∩Si some i > 1. Since x = 0, there is some index j [D] such that the jth coordinate of x, 6 ∈ n k denoted by χ , is different from zero. Define g(x) := x − p (x). Then g X and by the j j 1 ∈I ,n general position assumption we also have that g ,n. Since is assumed transversal, ∈ IA A by Theorem 4.78, g can be written in the form

g = c l l , (4.23) r1,...,rn r1,1 ··· rn,n ri [ci],i [n] ∈ X∈ where c R is a scalar coefficient, l is a linear form vanishing on , and the r1,...,rn ∈ ri,i Si summation runs over all multi-indices (r ,...,r ) [c ] [c ]. Then evaluating 1 n ∈ 1 ×···× n the gradient of the expression on the right of (4.23) at x, and using the hypothesis that x for some i > 1, we see that g x = 0. However, evaluating the gradient ∈ S1 ∩Si ∇ | n ` n k of g at x from the formula g(x) := x − p (x), we get g x = χ − p x = 0. This j 1 ∇ | j ∇ 1| 6 contradiction implies that the hypothesis x for some i > 1 can not be true, i.e., ∈ S1 ∩Si 183 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING x lies only in the irreducible component . S1

D Using the notation established so far and setting b = p x, the hyperplane of R 1 ∇ 1| given by = Span(b )⊥ = (b>x) contains the irreducible component of associated V1 1 Z 1 A with the reference point x, i.e., . Then we can define a subspace sub-arrangement V1 ⊃S1 of by A1 A

:= = ( ) ( ). (4.24) A1 A∩V1 S1 ∪ S2 ∩V1 ∪···∪ Sn ∩V1

Observe that can be viewed as a subspace arrangement of , since (see A1 V1 A1 ⊂ V1 also the commutative diagram of eq. (4.22)). Certainly, our algorithm can not manipulate directly the infinite sets and . Nevertheless, these sets are algebraic varieties and as A V1 a consequence we can perform their intersection in the algebraic domain. That is, we can obtain a set of polynomials defining , as shown next.10 A∩V1

Lemma 4.19. $\mathcal{A}_1 := \mathcal{A} \cap \mathcal{V}_1$ is the zero set of the ideal generated by $\mathcal{I}_{\mathcal{X},m}$ and $\mathbf{b}_1^\top x$, i.e.,
$$\mathcal{A}_1 = \mathcal{Z}(\mathfrak{a}_1), \qquad \mathfrak{a}_1 := \langle \mathcal{I}_{\mathcal{X},m} \rangle + \langle \mathbf{b}_1^\top x \rangle. \tag{4.25}$$

Proof. $(\Rightarrow)$: We will show that $\mathcal{A}_1 \subset \mathcal{Z}(\mathfrak{a}_1)$. Let $w$ be a polynomial of $\mathfrak{a}_1$. Then by definition of $\mathfrak{a}_1$, $w$ can be written as $w = w_1 + w_2$, where $w_1 \in \langle \mathcal{I}_{\mathcal{X},m} \rangle$ and $w_2 \in \langle \mathbf{b}_1^\top x \rangle$. Now take any point $\mathbf{y} \in \mathcal{A}_1$. Since $\mathbf{y} \in \mathcal{A}$, and $\mathcal{I}_{\mathcal{X},m} = \mathcal{I}_{\mathcal{A},m}$, we must have $w_1(\mathbf{y}) = 0$. Since $\mathbf{y} \in \mathcal{V}_1$, we must have that $w_2(\mathbf{y}) = 0$. Hence $w(\mathbf{y}) = 0$, i.e., every point of $\mathcal{A}_1$ is inside the zero set of $\mathfrak{a}_1$. $(\Leftarrow)$: We will show that $\mathcal{A}_1 \supset \mathcal{Z}(\mathfrak{a}_1)$. Let $\mathbf{y} \in \mathcal{Z}(\mathfrak{a}_1)$, i.e., every element of $\mathfrak{a}_1$ vanishes on $\mathbf{y}$. Hence every element of $\mathcal{I}_{\mathcal{X},m}$ vanishes on $\mathbf{y}$, i.e., $\mathbf{y} \in \mathcal{Z}(\mathcal{I}_{\mathcal{X},m}) = \mathcal{A}$. In addition, every element of $\langle \mathbf{b}_1^\top x \rangle$ vanishes on $\mathbf{y}$, in particular $\mathbf{b}_1^\top \mathbf{y} = 0$, i.e., $\mathbf{y} \in \mathcal{V}_1$.

In summary, the computation of the vector $\mathbf{b}_1 \perp \mathcal{S}_1$ completes algebraically the first step of the filtration, which gives us the hyperplane $\mathcal{V}_1$ and the sub-variety $\mathcal{A}_1$. Then, there are two possibilities: $\mathcal{A}_1 = \mathcal{V}_1$ or $\mathcal{A}_1 \subsetneq \mathcal{V}_1$. In the first case, we need to terminate the filtration, as explained in § 4.2.2.3, while in the second case we need to take one more step in the filtration, as explained in § 4.2.2.4.

4.2.2.3 Deciding whether to take a second step in a filtration

If $\mathcal{A}_1 = \mathcal{V}_1$, we should terminate the filtration because in this case $\mathcal{V}_1 = \mathcal{S}_1$, as Lemma 4.20 shows, and so we have already identified one of the subspaces. Lemma 4.21 will give us an algebraic procedure for checking if the condition $\mathcal{A}_1 = \mathcal{V}_1$ holds true, while Lemma 4.22 will give us a computationally more friendly procedure for checking the same condition.

Lemma 4.20. $\mathcal{V}_1 = \mathcal{A}_1$ if and only if $\mathcal{V}_1 = \mathcal{S}_1$.

Proof. $(\Rightarrow)$: Suppose $\mathcal{V}_1 = \mathcal{A}_1 = \mathcal{S}_1 \cup (\mathcal{S}_2 \cap \mathcal{V}_1) \cup \cdots \cup (\mathcal{S}_n \cap \mathcal{V}_1)$. Taking the vanishing-ideal operator on both sides, we obtain
$$\mathcal{I}_{\mathcal{V}_1} = \mathcal{I}_{\mathcal{S}_1} \cap \mathcal{I}_{\mathcal{S}_2 \cap \mathcal{V}_1} \cap \cdots \cap \mathcal{I}_{\mathcal{S}_n \cap \mathcal{V}_1}. \tag{4.26}$$
Since $\mathcal{V}_1$ is a linear subspace, $\mathcal{I}_{\mathcal{V}_1}$ is a prime ideal by Proposition 4.73, and so by Proposition 4.52 $\mathcal{I}_{\mathcal{V}_1}$ must contain one of the ideals $\mathcal{I}_{\mathcal{S}_1}, \mathcal{I}_{\mathcal{S}_2 \cap \mathcal{V}_1}, \ldots, \mathcal{I}_{\mathcal{S}_n \cap \mathcal{V}_1}$. Suppose that $\mathcal{I}_{\mathcal{V}_1} \supset \mathcal{I}_{\mathcal{S}_i \cap \mathcal{V}_1}$ for some $i > 1$. Taking the zero-set operator on both sides, and using Proposition 4.64 and the fact that linear subspaces are closed in the Zariski topology, we obtain $\mathcal{V}_1 \subset \mathcal{S}_i \cap \mathcal{V}_1$, which implies that $\mathcal{V}_1 \subset \mathcal{S}_i$. Since $\mathcal{S}_1 \subset \mathcal{V}_1$, we must have that $\mathcal{S}_1 \subset \mathcal{S}_i$, which contradicts the assumption of transversality on $\mathcal{A}$. Hence it must be the case that $\mathcal{I}_{\mathcal{V}_1} \supset \mathcal{I}_{\mathcal{S}_1}$. Taking the zero-set operator on both sides we get $\mathcal{V}_1 \subset \mathcal{S}_1$, which implies that $\mathcal{V}_1 = \mathcal{S}_1$, since $\mathcal{S}_1 \subset \mathcal{V}_1$. $(\Leftarrow)$: Suppose $\mathcal{V}_1 = \mathcal{S}_1$. Then $\mathcal{V}_1 = \mathcal{S}_1 \subset \mathcal{A}_1 \subset \mathcal{V}_1$ and so $\mathcal{S}_1 = \mathcal{A}_1 = \mathcal{V}_1$.

Knowing that a filtration terminates if $\mathcal{A}_1 = \mathcal{V}_1$, we need a mechanism for checking this condition. The next lemma shows how this can be done in the algebraic domain.

Lemma 4.21. $\mathcal{V}_1 = \mathcal{A}_1$ if and only if $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$.

Proof. $(\Rightarrow)$: Suppose $\mathcal{A}_1 = \mathcal{V}_1$. Then $\mathcal{A} \supset \mathcal{V}_1$ and by taking vanishing ideals on both sides we get $\mathcal{I}_{\mathcal{A}} \subset \mathcal{I}_{\mathcal{V}_1} = \langle \mathbf{b}_1^\top x \rangle$. Since $\mathcal{I}_{\mathcal{X},m} = \mathcal{I}_{\mathcal{A},m} \subset \mathcal{I}_{\mathcal{A}}$, it follows that $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$. $(\Leftarrow)$: Suppose $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$ and for the sake of contradiction suppose that $\mathcal{A}_1 \subsetneq \mathcal{V}_1$. In particular, from Lemma 4.20 we have that $\mathcal{S}_1 \subsetneq \mathcal{V}_1$. Hence, there exists a vector $\boldsymbol{\zeta}_1$ linearly independent from $\mathbf{b}_1$ such that $\boldsymbol{\zeta}_1 \perp \mathcal{S}_1$. Now for any $i > 1$, there exists $\boldsymbol{\zeta}_i$ linearly independent from $\mathbf{b}_1$ such that $\boldsymbol{\zeta}_i \perp \mathcal{S}_i$. For if not, then $\mathcal{I}_{\mathcal{S}_i} \subset \mathcal{I}_{\mathcal{V}_1}$ and so $\mathcal{S}_i \supset \mathcal{V}_1$, which leads to the contradiction $\mathcal{S}_i \supset \mathcal{S}_1$. Then the polynomial $(\boldsymbol{\zeta}_1^\top x) \cdots (\boldsymbol{\zeta}_n^\top x)$ is an element of $\mathcal{I}_{\mathcal{A},n} = \mathcal{I}_{\mathcal{X},n}$, and by the hypothesis that $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$ we must have that $(\boldsymbol{\zeta}_1^\top x)^{m-n+1}(\boldsymbol{\zeta}_2^\top x) \cdots (\boldsymbol{\zeta}_n^\top x) \in \langle \mathbf{b}_1^\top x \rangle$. But $\langle \mathbf{b}_1^\top x \rangle$ is a prime ideal and so one of the factors of $(\boldsymbol{\zeta}_1^\top x) \cdots (\boldsymbol{\zeta}_n^\top x)$ must lie in $\langle \mathbf{b}_1^\top x \rangle$. So suppose $\boldsymbol{\zeta}_j^\top x \in \langle \mathbf{b}_1^\top x \rangle$, for some $j \in [n]$. This implies that there must exist a polynomial $h$ such that $\boldsymbol{\zeta}_j^\top x = h \cdot (\mathbf{b}_1^\top x)$. By degree considerations, we conclude that $h$ must be a constant, in which case the above equality implies $\boldsymbol{\zeta}_j \sim \mathbf{b}_1$. But this contradicts the definition of $\boldsymbol{\zeta}_j$, thus it can not be that $\mathcal{A}_1 \subsetneq \mathcal{V}_1$.

Notice that checking the condition $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$ in Lemma 4.21 requires computing a basis of $\mathcal{I}_{\mathcal{X},m}$ and checking whether each element of the basis is divisible by the linear form $\mathbf{b}_1^\top x$. Equivalently, to check the inclusion of finite dimensional vector spaces $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$ we need to compute a basis $B_{\mathcal{X},m}$ of $\mathcal{I}_{\mathcal{X},m}$ as well as a basis $B$ of $\langle \mathbf{b}_1^\top x \rangle_m$ and check whether the rank equality $\mathrm{Rank}([B_{\mathcal{X},m}\;\; B]) = \mathrm{Rank}(B)$ holds true. Note that a basis of $\langle \mathbf{b}_1^\top x \rangle_m$ can be obtained in a straightforward manner by multiplying all monomials of degree $m-1$ with the linear form $\mathbf{b}_1^\top x$. On the other hand, computing a basis of $\mathcal{I}_{\mathcal{X},m}$ by computing a basis for the right nullspace of $\nu_m(\mathcal{X})$ can be computationally expensive, particularly when $m$ is large. If however, the points $\mathcal{X} \cap \mathcal{S}_1$ are in general position in $\mathcal{S}_1$ with respect to degree $m$, then checking the condition $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x \rangle_m$ can be done more efficiently, as we now explain. Let $V_1 = [\mathbf{v}_1, \ldots, \mathbf{v}_{D-1}]$ be a basis for the vector space $\mathcal{V}_1$. Then $\mathcal{V}_1$ is isomorphic to $\mathbb{R}^{D-1}$ under the linear map $\sigma_{V_1} : \mathcal{V}_1 \to \mathbb{R}^{D-1}$ that takes a vector $\mathbf{v} = \alpha_1 \mathbf{v}_1 + \cdots + \alpha_{D-1}\mathbf{v}_{D-1}$ to its coordinate representation $(\alpha_1, \ldots, \alpha_{D-1})^\top$. Then the next result says that checking the condition $\mathcal{V}_1 = \mathcal{A}_1$ is equivalent to checking whether the embedded data matrix $\nu_m(\sigma_{V_1}(\mathcal{X} \cap \mathcal{V}_1))$ is rank-deficient, which is computationally a simpler task than computing the right nullspace of $\nu_m(\mathcal{X})$.

Lemma 4.22. Suppose that $\mathcal{X}_1$ is in general position inside $\mathcal{S}_1$ with respect to degree $m$. Then $\mathcal{V}_1 = \mathcal{A}_1$ if and only if the embedded data matrix $\nu_m(\sigma_{V_1}(\mathcal{X} \cap \mathcal{V}_1))$ is full rank.

Proof. The statement is equivalent to the statement "$\mathcal{V}_1 = \mathcal{A}_1$ if and only if $\mathcal{I}_{\mathcal{X}\cap\mathcal{V}_1,m} = \langle \mathbf{b}_1^\top x \rangle_m$", which we now prove. $(\Rightarrow)$: Suppose $\mathcal{V}_1 = \mathcal{A}_1$. Then by Lemma 4.20 $\mathcal{V}_1 = \mathcal{S}_1$, which implies that $\mathcal{I}_{\mathcal{S}_1} = \langle \mathbf{b}_1^\top x \rangle$. This in turn implies that $\mathcal{I}_{\mathcal{S}_1,m} = \langle \mathbf{b}_1^\top x \rangle_m$. Now $\mathcal{I}_{\mathcal{X}\cap\mathcal{V}_1,m} = \mathcal{I}_{\mathcal{X}\cap\mathcal{S}_1,m} = \mathcal{I}_{\mathcal{X}_1,m}$. By the general position hypothesis on $\mathcal{X}_1$ we have $\mathcal{I}_{\mathcal{X}_1,m} = \mathcal{I}_{\mathcal{S}_1,m}$. Hence $\mathcal{I}_{\mathcal{X}\cap\mathcal{V}_1,m} = \langle \mathbf{b}_1^\top x \rangle_m$. $(\Leftarrow)$: Suppose that $\mathcal{I}_{\mathcal{X}\cap\mathcal{V}_1,m} = \langle \mathbf{b}_1^\top x \rangle_m$. For the sake of contradiction, suppose that $\mathcal{A}_1 \subsetneq \mathcal{V}_1$. Since $\mathcal{A}_1$ is an arrangement of at most $m$ subspaces, there exists a homogeneous polynomial $p$ of degree at most $m$ that vanishes on $\mathcal{A}_1$ but does not vanish on $\mathcal{V}_1$. Since $\mathcal{X} \cap \mathcal{V}_1 \subset \mathcal{A}_1$, $p$ will vanish on $\mathcal{X} \cap \mathcal{V}_1$, i.e., $p \in \mathcal{I}_{\mathcal{X}\cap\mathcal{V}_1,m}$, or equivalently $p \in \langle \mathbf{b}_1^\top x \rangle_m$ by hypothesis. But then $p$ vanishes on $\mathcal{V}_1$, which is a contradiction; hence it must be the case that $\mathcal{V}_1 = \mathcal{A}_1$.
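A minimal MATLAB sketch of the rank test of Lemma 4.22 is given below; it is an illustration only (the helper veronese is the hypothetical one sketched after the introduction of Lemma 4.18, and the function name is an assumption).

    function stop = filtration_should_stop(X1, b1, m)
    % X1: D x N points currently in the filtration; b1: normal vector of V1.
    % Express the points in coordinates of V1 = span(b1)^perp and test whether
    % their degree-m Veronese embedding is full column rank (Lemma 4.22).
        B1 = null(b1');                     % D x (D-1) orthonormal basis of V1
        Y  = B1' * X1;                      % coordinates sigma_{V1}(X ∩ V1)
        Vm = veronese(Y, m);                % N x M_m(D-1) embedded data matrix
        stop = (rank(Vm) == size(Vm, 2));   % full rank  <=>  V1 = A1
    end

In the noiseless setting, stop being true signals that $\mathcal{V}_1 = \mathcal{S}_1$ has been identified.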

4.2.2.4 Taking multiple steps in a filtration and terminating

If $\mathcal{A}_1 \subsetneq \mathcal{V}_1$, then it follows from Lemma 4.20 that $\mathcal{S}_1 \subsetneq \mathcal{V}_1$. Therefore, subspace $\mathcal{S}_1$ has not yet been identified in the first step of the filtration and we should take a second step.

As before, we can start constructing the second step of our filtration by choosing a suitable vanishing polynomial p2, such that its gradient at the reference point x is not colinear with b1. The next lemma shows that such a p2 always exists.

Lemma 4.23. $\mathcal{X}$ admits a homogeneous vanishing polynomial $p_2$ of degree $\ell \leq n$, such that $p_2 \notin \mathcal{I}_{\mathcal{V}_1}$ and $\nabla p_2|_{\mathbf{x}} \notin \mathrm{Span}(\mathbf{b}_1)$.

Proof. Since $\mathcal{A}_1 \subsetneq \mathcal{V}_1$, Lemma 4.20 implies that $\mathcal{S}_1 \subsetneq \mathcal{V}_1$. Then there exists a vector $\boldsymbol{\zeta}_1$ that is orthogonal to $\mathcal{S}_1$ and is linearly independent from $\mathbf{b}_1$. Since $\mathbf{x} \in \mathcal{S}_1 - \bigcup_{i>1}\mathcal{S}_i$, for each $i > 1$ we can find a vector $\boldsymbol{\zeta}_i$ such that $\boldsymbol{\zeta}_i \not\perp \mathbf{x}$ and $\boldsymbol{\zeta}_i \perp \mathcal{S}_i$. Notice that the pairs $\mathbf{b}_1, \boldsymbol{\zeta}_i$ are linearly independent for $i > 1$, since $\mathbf{b}_1 \perp \mathbf{x}$ but $\boldsymbol{\zeta}_i \not\perp \mathbf{x}$. Now, the polynomial $p_2 := (\boldsymbol{\zeta}_1^\top x) \cdots (\boldsymbol{\zeta}_n^\top x)$ has degree $n$ and vanishes on $\mathcal{A}$, hence $p_2 \in \mathcal{I}_{\mathcal{X},\leq m}$. Moreover, $\nabla p_2|_{\mathbf{x}} = (\boldsymbol{\zeta}_2^\top \mathbf{x}) \cdots (\boldsymbol{\zeta}_n^\top \mathbf{x})\, \boldsymbol{\zeta}_1 \neq 0$, since by hypothesis $\boldsymbol{\zeta}_i^\top \mathbf{x} \neq 0$, $\forall i > 1$. Since $\boldsymbol{\zeta}_1$ is linearly independent from $\mathbf{b}_1$, we have $\nabla p_2|_{\mathbf{x}} \notin \mathrm{Span}(\mathbf{b}_1)$. Finally, $p_2$ does not vanish on $\mathcal{V}_1$, by a similar argument to the one used in the proof of Lemma 4.21.

Remark 4.24. Note that if $\ell$ is the degree of $p_2$ as in Lemma 4.23, and if $q_1, \ldots, q_s$ is a basis for $\mathcal{I}_{\mathcal{X},\ell}$, then at least one of the $q_i$ satisfies the conditions of the lemma. This is important algorithmically, because it implies that the search for our $p_2$ can be done sequentially. We can start by first computing a minimal-degree polynomial in $\mathcal{I}_{\mathcal{A},k}$, and see if it satisfies our requirements. If not, then we can compute a second linearly independent polynomial and check again. We can continue in that fashion until we have computed a full basis for $\mathcal{I}_{\mathcal{X},k}$. If no suitable polynomial has been found, we can repeat the process for degree $k+1$, and so on, until we have reached degree $n$, if necessary.
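A hedged MATLAB sketch of this sequential search is shown below; it is not the thesis' implementation, and it relies on the illustrative veronese helper (with its exponent matrix E) introduced earlier and on the poly_gradient helper defined here.

    function q = pick_next_vanishing_poly(X, x, B, nmax)
    % Scan degrees l = 1, ..., nmax and return the coefficients q of a vanishing
    % polynomial whose gradient at the reference point x escapes Span(B).
        for l = 1:nmax
            [Vl, E] = veronese(X, l);        % embedded data and exponent vectors
            Q = null(Vl);                    % basis of I_{X,l} (coefficient vectors)
            for idx = 1:size(Q, 2)
                g = poly_gradient(Q(:, idx), E, x);
                r = g - B * (B \ g);         % residual of g w.r.t. Span(B)
                if norm(r) > 1e-10 * norm(g)
                    q = Q(:, idx);           % gradient not in Span(B): accept
                    return;
                end
            end
        end
        error('no suitable polynomial found up to degree nmax');
    end

    function g = poly_gradient(c, E, x)
    % Gradient at x of the polynomial with coefficients c over the monomial basis
    % whose exponent vectors are the rows of E.
        [~, D] = size(E);
        g = zeros(D, 1);
        for d = 1:D
            Ed = E; Ed(:, d) = max(E(:, d) - 1, 0);      % lower the d-th exponent
            g(d) = sum(c .* E(:, d) .* prod(x(:)'.^Ed, 2));
        end
    end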

By using a polynomial $p_2$ as in Lemma 4.23, Proposition 4.76 guarantees that $\nabla p_2|_{\mathbf{x}}$ is orthogonal to $\mathcal{S}_1$. Recall though that for the purpose of the filtration, we want to construct a hyperplane $\mathcal{V}_2$ of $\mathcal{V}_1$. Since there is no guarantee that $\nabla p_2|_{\mathbf{x}}$ is inside $\mathcal{V}_1$ (thus defining a hyperplane of $\mathcal{V}_1$), we must project $\nabla p_2|_{\mathbf{x}}$ onto $\mathcal{V}_1$ and guarantee that this projection is still orthogonal to $\mathcal{S}_1$. The next lemma ensures that this is always the case.

Lemma 4.25. Let $0 \neq p_2 \in \mathcal{I}_{\mathcal{X},\leq m} - \mathcal{I}_{\mathcal{V}_1}$ be such that $\nabla p_2|_{\mathbf{x}} \notin \mathrm{Span}(\mathbf{b}_1)$. Then $0 \neq \pi_{\mathcal{V}_1}(\nabla p_2|_{\mathbf{x}}) \perp \mathcal{S}_1$.

Proof. For the sake of contradiction, suppose that $\pi_{\mathcal{V}_1}(\nabla p_2|_{\mathbf{x}}) = 0$. Setting $\mathbf{b}_{11} := \mathbf{b}_1$, let us augment $\mathbf{b}_{11}$ to a basis $\mathbf{b}_{11}, \mathbf{b}_{12}, \ldots, \mathbf{b}_{1c}$ for the orthogonal complement of $\mathcal{S}_1$ in $\mathbb{R}^D$. In fact, we can choose the vectors $\mathbf{b}_{12}, \ldots, \mathbf{b}_{1c}$ to be a basis for the orthogonal complement of $\mathcal{S}_1$ inside $\mathcal{V}_1$. By Proposition 4.72, $p_2$ must have the form
$$p_2(x) = q_1(x)(\mathbf{b}_{11}^\top x) + q_2(x)(\mathbf{b}_{12}^\top x) + \cdots + q_c(x)(\mathbf{b}_{1c}^\top x), \tag{4.27}$$
where $q_1, \ldots, q_c$ are homogeneous polynomials of degree $\deg(p_2) - 1$. Then
$$\nabla p_2|_{\mathbf{x}} = q_1(\mathbf{x})\mathbf{b}_{11} + q_2(\mathbf{x})\mathbf{b}_{12} + \cdots + q_c(\mathbf{x})\mathbf{b}_{1c}. \tag{4.28}$$
Projecting the above equation orthogonally onto $\mathcal{V}_1$ we get
$$\pi_{\mathcal{V}_1}(\nabla p_2|_{\mathbf{x}}) = q_2(\mathbf{x})\mathbf{b}_{12} + \cdots + q_c(\mathbf{x})\mathbf{b}_{1c}, \tag{4.29}$$
which is zero by hypothesis. Since $\mathbf{b}_{12}, \ldots, \mathbf{b}_{1c}$ are linearly independent vectors of $\mathcal{V}_1$, it must be the case that $q_2(\mathbf{x}) = \cdots = q_c(\mathbf{x}) = 0$. But this implies that $\nabla p_2|_{\mathbf{x}} = q_1(\mathbf{x})\mathbf{b}_{11}$, which contradicts the non-colinearity of $\nabla p_2|_{\mathbf{x}}$ with $\mathbf{b}_{11}$. Hence it must be the case that $0 \neq \pi_{\mathcal{V}_1}(\nabla p_2|_{\mathbf{x}})$. The fact that $\pi_{\mathcal{V}_1}(\nabla p_2|_{\mathbf{x}}) \perp \mathcal{S}_1$ follows from (4.29) and the fact that by definition $\mathbf{b}_{12}, \ldots, \mathbf{b}_{1c}$ are orthogonal to $\mathcal{S}_1$.

At this point, letting $\mathbf{b}_2 := \pi_{\mathcal{V}_1}(\nabla p_2|_{\mathbf{x}})$, we can define $\mathcal{V}_2 = \mathrm{Span}(\mathbf{b}_1, \mathbf{b}_2)^\perp$, which is a subspace of codimension $1$ inside $\mathcal{V}_1$ (and hence of codimension $2$ inside $\mathcal{V}_0 := \mathbb{R}^D$). As before, we can define a subspace sub-arrangement $\mathcal{A}_2$ of $\mathcal{A}_1$ by intersecting $\mathcal{A}_1$ with $\mathcal{V}_2$. Once again, this intersection can be realized in the algebraic domain as $\mathcal{A}_2 = \mathcal{Z}(\mathcal{I}_{\mathcal{X},m}, \mathbf{b}_1^\top x, \mathbf{b}_2^\top x)$. Next, we have a similar result as in Lemmas 4.20 and 4.21, which we now prove in general form:


Lemma 4.26. Let $\mathbf{b}_1, \ldots, \mathbf{b}_s$ be $s$ vectors orthogonal to $\mathcal{S}_1$ and define the intermediate ambient space $\mathcal{V}_s := \mathrm{Span}(\mathbf{b}_1, \ldots, \mathbf{b}_s)^\perp$. Let $\mathcal{A}_s$ be the subspace arrangement obtained by intersecting $\mathcal{A}$ with $\mathcal{V}_s$. Then the following are equivalent:

(i) $\mathcal{V}_s = \mathcal{A}_s$

(ii) $\mathcal{V}_s = \mathcal{S}_1$

(iii) $\mathcal{S}_1 = \mathrm{Span}(\mathbf{b}_1, \ldots, \mathbf{b}_s)^\perp$

(iv) $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x, \ldots, \mathbf{b}_s^\top x \rangle_m$.

Proof. (i) $\Rightarrow$ (ii): By taking vanishing ideals on both sides of $\mathcal{V}_s = \mathcal{S}_1 \cup \bigcup_{i>1}(\mathcal{S}_i \cap \mathcal{V}_s)$ we get $\mathcal{I}_{\mathcal{V}_s} = \mathcal{I}_{\mathcal{S}_1} \cap \bigcap_{i>1} \mathcal{I}_{\mathcal{S}_i \cap \mathcal{V}_s}$. By using Proposition 4.52 in a similar fashion as in the proof of Lemma 4.20, we conclude that $\mathcal{V}_s = \mathcal{S}_1$. (ii) $\Rightarrow$ (iii): This is obvious from the definition of $\mathcal{V}_s$. (iii) $\Rightarrow$ (iv): Let $h \in \mathcal{I}_{\mathcal{X},m}$. Then $h$ vanishes on $\mathcal{A}$ and hence on $\mathcal{S}_1$, and by Proposition 4.72 we must have that $h \in \mathcal{I}_{\mathcal{S}_1} = \langle \mathbf{b}_1^\top x, \ldots, \mathbf{b}_s^\top x \rangle$. (iv) $\Rightarrow$ (i): $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x, \ldots, \mathbf{b}_s^\top x \rangle_m$ can be written as $\mathcal{I}_{\mathcal{X},m} \subset \mathcal{I}_{\mathcal{V}_s}$. By the general position assumption $\mathcal{I}_{\mathcal{A},m} = \mathcal{I}_{\mathcal{X},m}$ and so we have $\mathcal{I}_{\mathcal{A},m} \subset \mathcal{I}_{\mathcal{V}_s}$. Taking zero sets on both sides we get $\mathcal{A} \supset \mathcal{V}_s$, and intersecting both sides of this relation with $\mathcal{V}_s$, we get $\mathcal{A}_s \supset \mathcal{V}_s$. Since $\mathcal{A}_s \subset \mathcal{V}_s$, this implies that $\mathcal{V}_s = \mathcal{A}_s$.

Similarly to Lemma 4.22 we have:

Lemma 4.27. Let $V_s = [\mathbf{v}_1, \ldots, \mathbf{v}_{D-s}]$ be a basis for $\mathcal{V}_s$, and let $\sigma_{V_s} : \mathcal{V}_s \to \mathbb{R}^{D-s}$ be the linear map that takes a vector $\mathbf{v} = \alpha_1\mathbf{v}_1 + \cdots + \alpha_{D-s}\mathbf{v}_{D-s}$ to its coordinate representation $(\alpha_1, \ldots, \alpha_{D-s})^\top$. Suppose that $\mathcal{X}_1$ is in general position inside $\mathcal{S}_1$ with respect to degree $m$. Then $\mathcal{V}_s = \mathcal{A}_s$ if and only if $\nu_m(\sigma_{V_s}(\mathcal{X} \cap \mathcal{V}_s))$ is full rank.

Algorithm 4.2 Algebraic Descending Filtration (ADF)
 1: procedure ADF($p$, $\mathbf{x}$, $\mathcal{X}$, $m$)
 2:   $B \leftarrow \nabla p|_{\mathbf{x}}$;
 3:   while $\mathcal{I}_{\mathcal{X},m} \not\subset \langle \mathbf{b}^\top x : \mathbf{b} \in B \rangle$ do
 4:     find $p \in \mathcal{I}_{\mathcal{X},\leq m} - \langle \mathbf{b}^\top x : \mathbf{b} \in B \rangle$ s.t. $\nabla p|_{\mathbf{x}} \notin \mathrm{Span}(B)$;
 5:     $B \leftarrow B \cup \{\pi_{\mathrm{Span}(B)^\perp}(\nabla p|_{\mathbf{x}})\}$;
 6:   end while
 7:   return $B$;
 8: end procedure

By Lemma 4.26, if $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x, \mathbf{b}_2^\top x \rangle$, the algorithm terminates the filtration with output the orthogonal basis $\{\mathbf{b}_1, \mathbf{b}_2\}$ for the orthogonal complement of the irreducible component $\mathcal{S}_1$ of $\mathcal{A}$. If on the other hand $\mathcal{I}_{\mathcal{X},m} \not\subset \langle \mathbf{b}_1^\top x, \mathbf{b}_2^\top x \rangle$, then the algorithm picks a basis element $p_3$ of $\mathcal{I}_{\mathcal{X},m}$ such that $p_3 \notin \mathcal{I}_{\mathcal{V}_2}$ and $\nabla p_3|_{\mathbf{x}} \notin \mathrm{Span}(\mathbf{b}_1, \mathbf{b}_2)$, and defines a subspace $\mathcal{V}_3$ of codimension $1$ inside $\mathcal{V}_2$ using $\pi_{\mathcal{V}_2}(\nabla p_3|_{\mathbf{x}})$.$^{11}$ Setting $\mathbf{b}_3 := \pi_{\mathcal{V}_2}(\nabla p_3|_{\mathbf{x}})$, the algorithm uses Lemma 4.26 to determine whether to terminate the filtration or take one more step, and so on.

The principles established in the previous paragraphs formally lead us to the Algebraic Descending Filtration (Algorithm 4.2) and its correctness result, Theorem 4.28.

11 The proof of existence of such a p3 is similar to the proof of Lemma 4.23 and is omitted.


Theorem 4.28 (Correctness of Algorithm 4.2). Let $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be a finite set of points in general position (Definition 4.12) with respect to degree $m$ inside a transversal (Definition 4.4) arrangement $\mathcal{A}$ of at most $m$ linear subspaces of $\mathbb{R}^D$. Let $p$ be a polynomial of minimal degree that vanishes on $\mathcal{X}$. Then there always exists a nonsingular $\mathbf{x} \in \mathcal{X}$ such that $\nabla p|_{\mathbf{x}} \neq 0$, and for such an $\mathbf{x}$, the output $B$ of Algorithm 4.2 is an orthogonal basis for the orthogonal complement in $\mathbb{R}^D$ of the irreducible component of $\mathcal{A}$ that contains $\mathbf{x}$.

4.2.2.5 The FASC algorithm

In § 4.2.2.2-4.2.2.4 we established the theory of a single filtration, according to which one starts with a nonsingular point $\mathbf{x} := \mathbf{x}_1 \in \mathcal{X} \cap \mathcal{A}$ and obtains an orthogonal basis $\mathbf{b}_{11}, \ldots, \mathbf{b}_{1c_1}$ for the orthogonal complement of the irreducible component $\mathcal{S}_1$ of $\mathcal{A}$ that contains the reference point $\mathbf{x}_1$. To obtain an orthogonal basis $\mathbf{b}_{21}, \ldots, \mathbf{b}_{2c_2}$ corresponding to a second irreducible component $\mathcal{S}_2$ of $\mathcal{A}$, our approach is the natural one: remove $\mathcal{X}_1$ from $\mathcal{X}$ and run a filtration on the set $\mathcal{X}^{(1)} := \mathcal{X} - \mathcal{X}_1$. All we need for the theory of § 4.2.2.2-4.2.2.4 to be applicable to the set $\mathcal{X}^{(1)}$ is that $\mathcal{X}^{(1)}$ be in general position inside the arrangement $\mathcal{A}^{(1)} := \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_n$. But this has been proved in Lemma 4.16. With Lemma 4.16 establishing the correctness of recursive application of a single filtration, the correctness of the FASC Algorithm 4.3 follows at once, as in Theorem 4.29. Note that in Algorithm 4.3, $n$ is the number of subspaces, while $D$ and $L$ are ordered sets, such that, up to a permutation, the $i$-th element of $D$ is $d_i = \dim \mathcal{S}_i$, and the $i$-th element of $L$ is an orthogonal basis for $\mathcal{S}_i^\perp$.


Algorithm 4.3 Filtrated Algebraic Subspace Clustering (FASC)
 1: procedure FASC($\mathcal{X} \in \mathbb{R}^{D \times N}$, $m$)
 2:   $n \leftarrow 0$; $D \leftarrow \emptyset$; $L \leftarrow \emptyset$;
 3:   while $\mathcal{X} \neq \emptyset$ do
 4:     find polynomial $p$ of minimal degree that vanishes on $\mathcal{X}$;
 5:     find $\mathbf{x} \in \mathcal{X}$ s.t. $\nabla p|_{\mathbf{x}} \neq 0$;
 6:     $B \leftarrow$ ADF($p$, $\mathbf{x}$, $\mathcal{X}$, $m$);
 7:     $L \leftarrow L \cup \{B\}$;
 8:     $D \leftarrow D \cup \{D - \mathrm{Card}(B)\}$;
 9:     $\mathcal{X} \leftarrow \mathcal{X} - \mathrm{Span}(B)^\perp$;
10:     $n \leftarrow n + 1$; $m \leftarrow m - 1$;
11:   end while
12:   return $n$, $D$, $L$;
13: end procedure

Theorem 4.29 (Correctness of Algorithm 4.3). Let $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be a set in general position with respect to degree $m$ (Definition 4.12) inside a transversal (Definition 4.4) arrangement $\mathcal{A}$ of at most $m$ linear subspaces of $\mathbb{R}^D$. For such an $\mathcal{X}$ and $m$, Algorithm 4.3 always terminates with output a set $L = \{B_1, \ldots, B_n\}$, such that, up to a permutation, $B_i$ is an orthogonal basis for the orthogonal complement of the $i$th irreducible component $\mathcal{S}_i$ of $\mathcal{A}$, i.e., $\mathcal{S}_i = \mathrm{Span}(B_i)^\perp$, $i = 1, \ldots, n$, and $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$.


4.3 Filtrated spectral algebraic subspace clustering

In this section we show how FASC (Sections 4.2-4.2.2) can be adapted to a working subspace clustering algorithm that is robust to noise. As we will soon see, the success of such an algorithm depends on being able to 1) implement a single filtration in a robust fashion, and 2) combine multiple robust filtrations to obtain the clustering of the points.

4.3.1 Implementing robust filtrations

Recall that the filtration component ADF (Algorithm 4.2) of the FASC Algorithm 4.3 is based on computing a descending filtration of ambient spaces $\mathcal{V}_1 \supset \mathcal{V}_2 \supset \cdots$. Recall that $\mathcal{V}_1$ is obtained as the hyperplane of $\mathbb{R}^D$ with normal vector $\nabla p|_{\mathbf{x}}$, where $\mathbf{x}$ is the reference point associated with the filtration, and $p$ a polynomial of minimal degree $k$ that vanishes on $\mathcal{X}$. In the absence of noise, the value of $k$ can be characterized as the smallest $\ell$ such that $\nu_\ell(\mathcal{X})$ drops rank (see Section 4.1.1 for notation). In the presence of noise, and assuming that $\mathcal{X}$ has cardinality at least $\binom{m+D-1}{m}$, there will in general be no vanishing polynomial of degree $\leq m$, i.e., the embedded data matrix $\nu_\ell(\mathcal{X})$ will have full column rank for any $\ell \leq m$. Hence, in the presence of noise we do not know a-priori what the minimal degree $k$ is. On the other hand, we do know that $m \geq n$, which implies that the underlying subspace arrangement $\mathcal{A}$ admits vanishing polynomials of degree $m$. Thus a reasonable choice for an approximate vanishing polynomial $p_1 := p$ is the polynomial whose coefficients are given by the right singular vector of $\nu_m(\mathcal{X})$ that corresponds to the smallest singular value.


Recall also that in the absence of noise we chose our reference point $\mathbf{x} \in \mathcal{X}$ such that $\nabla p_1|_{\mathbf{x}} \neq 0$. In the presence of noise this condition will almost surely be true for every point $\mathbf{x} \in \mathcal{X}$; then one can select the point that gives the largest gradient, i.e., we can pick as reference point an $\mathbf{x}$ that maximizes the norm of the gradient $\|\nabla p_1|_{\mathbf{x}}\|_2$.

Moving on, ADF constructs the filtration of $\mathcal{X}$ by intersecting $\mathcal{X}$ with the intermediate ambient spaces $\mathcal{V}_1 \supset \mathcal{V}_2 \supset \cdots$. In the presence of noise in the dataset $\mathcal{X}$, such intersections will almost surely be empty. As it turns out, we can replace the operation of intersecting $\mathcal{X}$ with the intermediate spaces $\mathcal{V}_s$, $s = 1, 2, \ldots$, by projecting $\mathcal{X}$ onto $\mathcal{V}_s$. In the absence of noise, the norm of the points of $\mathcal{X}$ that lie in $\mathcal{V}_s$ will remain unchanged after projection, while points that lie outside $\mathcal{V}_s$ will witness a drop in their norm upon projection onto $\mathcal{V}_s$. Points whose norm is reduced can then be removed, and the end result of this process is equivalent to intersecting $\mathcal{X}$ with $\mathcal{V}_s$. In the presence of noise one can choose a threshold $\delta > 0$, such that if the distance of a point from the subspace $\mathcal{V}_s$ is less than $\delta$, then the point is maintained after projection onto $\mathcal{V}_s$, otherwise it is removed. But how to choose $\delta$? One reasonable way to proceed is to consider the polynomial $p$ that corresponds to the right singular vector of $\nu_m(\mathcal{X})$ of smallest singular value, and then consider the quantity
$$\beta(\mathcal{X}) := \frac{1}{N}\sum_{j=1}^{N} \frac{\mathbf{x}_j^\top \nabla p|_{\mathbf{x}_j}}{\|\mathbf{x}_j\|_2\, \|\nabla p|_{\mathbf{x}_j}\|_2}. \tag{4.30}$$
Notice that in the absence of noise $\dim\mathcal{N}(\nu_m(\mathcal{X})) > 0$ and subsequently $\beta(\mathcal{X}) = 0$. In the presence of noise however, $\beta(\mathcal{X})$ represents the average distance of a point $\mathbf{x}$ in the dataset to the hyperplane that it produces by means of $\nabla p|_{\mathbf{x}}$ (in the absence of noise this distance is zero by Proposition 4.76). Hence intuitively, $\delta$ should be of the same order of magnitude as


$\beta(\mathcal{X})$; a natural choice is to set $\delta := \gamma \cdot \beta(\mathcal{X})$, where $\gamma$ is a user-defined parameter taking values close to $1$. Having projected $\mathcal{X}$ onto $\mathcal{V}_1$ and removed points whose distance from $\mathcal{V}_1$ is larger than $\delta$, we obtain a second approximate polynomial $p_2$ from the right singular vector of smallest singular value of the embedded data matrix of the remaining projected points, and so on.
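A minimal MATLAB sketch of this threshold selection is shown below; it is only an illustration of eq. (4.30), assuming the points are the columns of X and that G(:,j) already holds $\nabla p|_{\mathbf{x}_j}$ (computed, e.g., with the hypothetical poly_gradient helper above), with gamma the user-defined parameter.

    function [delta, beta] = choose_threshold(X, G, gamma)
    % Evaluate beta(X) of eq. (4.30) and set the filtration threshold delta.
        N = size(X, 2);
        beta = 0;
        for j = 1:N
            beta = beta + (X(:, j)' * G(:, j)) / (norm(X(:, j)) * norm(G(:, j)));
        end
        beta = beta / N;          % average normalized point-to-hyperplane term
        delta = gamma * beta;     % threshold used when projecting onto V_s
    end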

It remains to devise a robust criterion for terminating the filtration. Recall that the criterion for terminating the filtration in ADF is $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x, \ldots, \mathbf{b}_s^\top x \rangle_m$, where $\mathcal{V}_s = \mathrm{Span}(\mathbf{b}_1, \ldots, \mathbf{b}_s)^\perp$. Checking this criterion is equivalent to checking the inclusion $\mathcal{I}_{\mathcal{X},m} \subset \langle \mathbf{b}_1^\top x, \ldots, \mathbf{b}_s^\top x \rangle_m$ of finite dimensional vector spaces. In principle, this requires computing a basis for the vector space $\mathcal{I}_{\mathcal{X},m}$. Now recall from Section 4.1.6 that it is precisely this computation that renders the classic polynomial differentiation algorithm unstable to noise; the main difficulty being the correct estimation of $\dim(\mathcal{I}_{\mathcal{X},m})$, and the dramatic dependence of the quality of the clustering on this estimate. Consequently, for the purpose of obtaining a robust algorithm, it is imperative to avoid such a computation. But we know from Lemma 4.27 that, if $\mathcal{X}_i := \mathcal{X} \cap \mathcal{S}_i$ is in general position inside $\mathcal{S}_i$ with respect to degree $m$ for every $i \in [n]$, then the criterion for terminating the filtration is equivalent to checking whether, in the coordinate representation of $\mathcal{V}_s$, the points $\mathcal{X} \cap \mathcal{V}_s$ admit a vanishing polynomial of degree $m$. But this is computationally equivalent to checking whether $\mathcal{N}(\nu_m(\sigma_{V_s}(\mathcal{X} \cap \mathcal{V}_s))) \neq 0$; see the notation in Lemma 4.27. This is a much easier problem than estimating $\dim(\mathcal{I}_{\mathcal{X},m})$, and we solve it implicitly as follows. Recall that in the absence of noise, the norm of the reference point remains unchanged as it passes through the filtration. Hence, it is natural to terminate the filtration at step $s$ if the distance from the projected reference point$^{12}$ to $\mathcal{V}_{s+1}$ is more than $\delta$, i.e., if the projected reference point is among the points that are being removed upon projection from $\mathcal{V}_s$ to $\mathcal{V}_{s+1}$. To guard against overestimating the number of steps in the filtration, we enhance the termination criterion by additionally deciding to terminate at step $s$ if the number of points that survived the projection from $\mathcal{V}_s$ to $\mathcal{V}_{s+1}$ is less than a pre-defined integer $L$, which is to be thought of as the minimum number of points in a cluster.

$^{12}$Here by projected reference point we mean the image of the reference point under all projections up to step $s$.

4.3.2 Combining multiple filtrations

Having determined a robust algorithmic implementation for a single filtration, we face the following issue: In general, two points lying approximately in the same subspace $\mathcal{S}$ will produce different hyperplanes that approximately contain $\mathcal{S}$ with different levels of accuracy. In the noiseless case any point would be equally good. In the presence of noise though, the choice of the reference point $\mathbf{x}$ becomes significant. How should $\mathbf{x}$ be chosen? To deal with this problem in a robust fashion, it is once again natural to construct a single filtration for each point in $\mathcal{X}$ and define an affinity between points $j$ and $j'$ as
$$C_{jj',\mathrm{FSASC}} = \begin{cases} \big\|\pi^{(j)}_{s_j} \circ \cdots \circ \pi^{(j)}_{1}(\mathbf{x}_{j'})\big\| & \text{if } \mathbf{x}_{j'} \text{ remains,} \\ 0 & \text{otherwise,} \end{cases} \tag{4.31}$$
where $\pi^{(j)}_{s}$ is the projection from $\mathcal{V}_s$ to $\mathcal{V}_{s+1}$ associated to the filtration of point $\mathbf{x}_j$ and $s_j$ is the length of that filtration. This affinity captures the fact that if points $\mathbf{x}_j$ and $\mathbf{x}_{j'}$ are


in the same subspace, then the norm of $\mathbf{x}_{j'}$ should not change from step $0$ to step $c$ of the filtration computed with reference point $\mathbf{x}_j$, where $c = D - \dim(\mathcal{S})$ is the codimension of the irreducible component $\mathcal{S}$ associated to reference point $\mathbf{x}_j$. Otherwise, if $\mathbf{x}_j$ and $\mathbf{x}_{j'}$ are in different subspaces, the norm of $\mathbf{x}_{j'}$ is expected to be reduced by the time the filtration reaches step $c$. In the case of noiseless data, only the points in the correct subspace survive step $c$ and their norms are precisely equal to one. In the case of noisy data, the affinity defined above will only be approximate.

4.3.3 The FSASC algorithm

Having an affinity matrix as in eq. (4.31), standard spectral clustering techniques can be

applied to obtain a clustering of X into n groups. We emphasize that in contrast to the

abstract case of Algorithm 4.3, the number n of clusters must be given as input to the

algorithm. On the other hand, the algorithm does not require the subspace dimensions to

be given: these are implicitly estimated by means of the filtrations. Finally, one may choose

to implement the above scheme for $M$ distinct values of the parameter $\gamma$ and choose the affinity matrix that leads to the largest $n$-th eigengap (a minimal sketch of this selection step is given after the notation list below). The above discussion leads to the Filtrated Spectral Algebraic Subspace Clustering (FSASC) Algorithm 4.4, in which

• SPECTRUM(NL($C + C^\top$)) denotes the spectrum of the normalized Laplacian matrix of $C + C^\top$,

• SPECCLUST($C^* + (C^*)^\top$, $n$) denotes spectral clustering being applied to $C^* + (C^*)^\top$ to obtain $n$ clusters,

• VANISHING($\nu_n(\mathcal{X})$) is the polynomial whose coefficients are the right singular vector of $\nu_n(\mathcal{X})$ corresponding to the smallest singular value,

• $\pi \leftarrow [\mathbb{R}^d \rightarrow \mathcal{H} \xrightarrow{\sim} \mathbb{R}^{d-1}]$ is to be read as "$\pi$ is assigned the composite linear transformation $\mathbb{R}^d \rightarrow \mathcal{H} \xrightarrow{\sim} \mathbb{R}^{d-1}$, where the first arrow is the orthogonal projection of $\mathbb{R}^d$ to hyperplane $\mathcal{H}$, and the second arrow is the linear isomorphism that maps a basis of $\mathcal{H}$ in $\mathbb{R}^d$ to the standard coordinate basis of $\mathbb{R}^{d-1}$".
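The MATLAB sketch below illustrates one way the eigengap-based model selection and the final spectral clustering step could be realized; the function names and the use of kmeans (Statistics and Machine Learning Toolbox) are illustrative assumptions, not the thesis' code.

    function labels = select_and_cluster(Cs, n)
    % Cs: cell array of candidate affinity matrices (one per value of gamma).
        best_gap = -inf;
        for k = 1:numel(Cs)
            W = Cs{k} + Cs{k}';                        % symmetrize the affinity
            lam = sort(eig(normalized_laplacian(W)));
            if lam(n+1) - lam(n) > best_gap            % keep the largest n-th eigengap
                best_gap = lam(n+1) - lam(n);
                Wbest = W;
            end
        end
        L = normalized_laplacian(Wbest);
        [V, lamL] = eig(L);
        [~, idx] = sort(diag(lamL));
        U = V(:, idx(1:n));                            % n bottom eigenvectors
        U = U ./ max(sqrt(sum(U.^2, 2)), eps);         % row-normalize
        labels = kmeans(U, n, 'Replicates', 10);       % spectral clustering step
    end

    function L = normalized_laplacian(W)
        d = sum(W, 2);
        Dinv = diag(1 ./ sqrt(max(d, eps)));
        L = eye(size(W, 1)) - Dinv * W * Dinv;
        L = (L + L') / 2;                              % enforce exact symmetry
    end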

4.3.4 A distance-based affinity

Observe that the FSASC affinity (4.31) between points $\mathbf{x}_j$ and $\mathbf{x}_{j'}$,$^{13}$ can be interpreted as the distance of point $\mathbf{x}_{j'}$ to the orthogonal complement of the final ambient space $\mathcal{V}_{s_j}$ of the filtration corresponding to reference point $\mathbf{x}_j$. If all irreducible components of $\mathcal{A}$ were hyperplanes, then the optimal length of each filtration would be $1$. Inspired by this observation, we may define a simple distance-based affinity, alternative to the angle-based affinity of eq. (4.17), by
$$C_{jj',\mathrm{dist}} := 1 - \frac{\mathbf{x}_{j'}^\top \nabla p|_{\mathbf{x}_j}}{\|\nabla p|_{\mathbf{x}_j}\|_2}. \tag{4.32}$$
The affinity of eq. (4.32) is theoretically justified only for hyperplanes, as $C_{jj',\mathrm{angle}}$ is; yet as we will soon see in the experiments, $C_{jj',\mathrm{dist}}$ is much more robust than $C_{jj',\mathrm{angle}}$ in the case of subspaces of different dimensions. We attribute this phenomenon to the fact that, in the absence of noise, it is always the case that $C_{jj',\mathrm{dist}} = 1$ whenever $\mathbf{x}_j, \mathbf{x}_{j'}$ lie in the same

$^{13}$We will henceforth be assuming that all points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ are normalized to unit $\ell_2$-norm.


Algorithm 4.4 Filtrated Spectral Algebraic Subspace Clustering (FSASC)
 1: procedure FSASC($\mathcal{X}$, $D$, $n$, $L$, $\{\gamma_m\}_{m=1}^{M}$)
 2:   if $N < \mathcal{M}_n(D)$ then
 3:     return ('Not enough points');
 4:   else
 5:     eigengap $\leftarrow 0$; $C^* \leftarrow 0_{N \times N}$;
 6:     $\mathbf{x}_j \leftarrow \mathbf{x}_j / \|\mathbf{x}_j\|$, $\forall j \in [N]$;
 7:     $p \leftarrow$ VANISHING($\nu_n(\mathcal{X})$);
 8:     $\beta \leftarrow \frac{1}{N}\sum_{j=1}^{N} \big\langle \mathbf{x}_j,\, \nabla p|_{\mathbf{x}_j} / \|\nabla p|_{\mathbf{x}_j}\| \big\rangle$;
 9:     for $k = 1 : M$ do
10:       $\delta \leftarrow \beta \cdot \gamma_k$; $C \leftarrow 0_{N \times N}$;
11:       for $j = 1 : N$ do
12:         $C_{j,:} \leftarrow$ FILTRATION($\mathcal{X}$, $\mathbf{x}_j$, $p$, $L$, $\delta$, $n$);
13:       end for
14:       $\{\lambda_s\}_{s=1}^{N} \leftarrow$ SPECTRUM(NL($C + C^\top$));
15:       if (eigengap $< \lambda_{n+1} - \lambda_n$) then
16:         eigengap $\leftarrow \lambda_{n+1} - \lambda_n$; $C^* \leftarrow C$;
17:       end if
18:     end for
19:     $\{\mathcal{Y}_i\}_{i=1}^{n} \leftarrow$ SPECCLUST($C^* + (C^*)^\top$, $n$);
20:     return $\{\mathcal{Y}_i\}_{i=1}^{n}$;
21:   end if
22: end procedure

23: function FILTRATION($\mathcal{X}$, $\mathbf{x}$, $p$, $L$, $\delta$, $n$)
24:   $d \leftarrow D$; $\mathcal{J} \leftarrow [N]$; $q \leftarrow p$; $\mathbf{c} \leftarrow 0_{1 \times N}$;
25:   flag $\leftarrow 1$;
26:   while ($d > 1$) and (flag $= 1$) do
27:     $\mathcal{H} \leftarrow \langle \nabla q|_{\mathbf{x}} \rangle^\perp$; $\pi \leftarrow [\mathbb{R}^d \rightarrow \mathcal{H} \xrightarrow{\sim} \mathbb{R}^{d-1}]$;
28:     if $(\|\mathbf{x}\| - \|\pi(\mathbf{x})\|)/\|\mathbf{x}\| > \delta$ then
29:       if $d = D$ then
30:         $\mathbf{c}(j') \leftarrow \|\pi(\mathbf{x}_{j'})\|$, $\forall j' \in [N]$;
31:       end if
32:       flag $\leftarrow 0$;
33:     else
34:       $\mathcal{J} \leftarrow \big\{ j' \in [N] : \big(\|\mathbf{x}_{j'}\| - \|\pi(\mathbf{x}_{j'})\|\big)/\|\mathbf{x}_{j'}\| \leq \delta \big\}$;
35:       if $|\mathcal{J}| < L$ then
36:         flag $\leftarrow 0$;
37:       else
38:         $\mathbf{c}(j') \leftarrow \|\pi(\mathbf{x}_{j'})\|$, $\forall j' \in \mathcal{J}$;
39:         $\mathbf{c}(j') \leftarrow 0$, $\forall j' \in [N] - \mathcal{J}$;
40:         if $|\mathcal{J}| < \mathcal{M}_n(d)$ then
41:           flag $\leftarrow 0$;
42:         else
43:           $d \leftarrow d - 1$; $\mathbf{x} \leftarrow \pi(\mathbf{x})$;
44:           $\mathbf{x}_{j'} \leftarrow \pi(\mathbf{x}_{j'})$, $\forall j' \in \mathcal{J}$;
45:           $\mathcal{X} \leftarrow \{\mathbf{x}_{j'} : j' \in \mathcal{J}\}$;
46:           $q \leftarrow$ VANISHING($\nu_n(\mathcal{X})$);
47:         end if
48:       end if
49:     end if
50:   end while
51:   return ($\mathbf{c}$);
52: end function

irreducible component; as mentioned in Section 4.1.6, this need not be the case for $C_{jj',\mathrm{angle}}$.

We will be referring to the Spectral ASC method that uses affinity (4.32) as SASC-D.
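A minimal MATLAB sketch of the SASC-D affinity computation is given below; it assumes unit-norm data columns and precomputed gradients, and is only an illustration of eq. (4.32), not the code used for the experiments.

    function C = sasc_d_affinity(X, G)
    % X: D x N unit-norm data points; G(:,j): gradient of the fitted vanishing
    % polynomial at x_j. Returns the distance-based affinity of eq. (4.32).
        N = size(X, 2);
        C = zeros(N, N);
        for j = 1:N
            g = G(:, j) / norm(G(:, j));      % unit normal of the hyperplane at x_j
            C(j, :) = 1 - (X' * g)';          % eq. (4.32) for all j'
        end
    end

The resulting C can then be symmetrized and fed to spectral clustering exactly as in the FSASC sketch above.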

4.3.5 Discussion on the computational complexity

As mentioned in § 4.1, the main object that needs to be computed in algebraic subspace clustering is a vanishing polynomial $p$ in $D$ variables of degree $n$, where $D$ is the ambient dimension of the data and $n$ is the number of subspaces. This amounts to computing a right null-vector of the $N \times \mathcal{M}_n(D)$ embedded data matrix $\nu_n(\mathcal{X})$, where $\mathcal{M}_n(D) := \binom{n+D-1}{n}$ and $N \geq \mathcal{M}_n(D)$. In practice, the data are noisy and there are usually no vanishing polynomials of degree $n$; instead one needs to compute the right singular vector of the embedded data matrix that corresponds to the smallest singular value. Approximate iterative methods for performing this task do exist [63, 82, 117], and in this work we use the MATLAB function svds.m, which is based on an inverse-shift iteration technique; see, e.g., the introduction of [63]. Even though svds.m is in principle more efficient than computing the full SVD of $\nu_n(\mathcal{X})$ via the MATLAB function svd.m, the complexity of both functions is of the same order
$$N\,\mathcal{M}_n(D)^2 = N \binom{n+D-1}{n}^2, \tag{4.33}$$
which is the well-known complexity of SVD [39] adapted to the dimensions of $\nu_n(\mathcal{X})$. This is because svds.m requires at each iteration the solution to a linear system of equations whose coefficient matrix has size of the same order as the size of $\nu_n(\mathcal{X})$.

Evidently, the complexity of (4.33) is prohibitive for large $D$ even for moderate values of $n$. If we discount the spectral clustering step, this is precisely the complexity of SASC-A of § 4.1.6 as well as of SASC-D of § 4.3.4. On the other hand, FSASC (Algorithm 4.4) is even more computationally demanding, as it requires the computation of a vanishing polynomial at each step of every filtration, and there are as many filtrations as the total number of points. Assuming for simplicity that there is no noise and that the dimensions of all subspaces are equal to $d < D$, then the complexity of a single filtration in FSASC is of the order of
$$\sum_{i=0}^{D-d} N\,\big(\mathcal{M}_n(D-i)\big)^2 = N \sum_{i=0}^{D-d} \binom{n+D-i-1}{n}^2. \tag{4.34}$$
Since FSASC computes a filtration for each and every point, its total complexity (discounting the spectral clustering step and assuming that we are using a single value for the parameter $\gamma$) is
$$N \sum_{i=0}^{D-d} N\,\big(\mathcal{M}_n(D-i)\big)^2 = N^2 \sum_{i=0}^{D-d} \binom{n+D-i-1}{n}^2. \tag{4.35}$$
Even though the filtrations are independent of each other, and hence fully parallelizable, the

complexity of FSASC is still prohibitive for large scale applications even after paralleliza-

tion. Nevertheless, when the subspace dimensions are small, then FSASC is applicable

after one reduces the dimensionality of the data by means of a projection, as will be done

in § 4.4.2. In any case, we hope that the complexity issue of FSASC will be addressed in future research.
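For concreteness, the back-of-the-envelope MATLAB snippet below evaluates the estimates (4.33) and (4.35) for example sizes; the chosen values of N, D, n, d are illustrative assumptions, not taken from the experiments.

    N = 600; D = 9; n = 3; d = 3;          % example problem sizes (assumed)
    Mn = @(D) nchoosek(n + D - 1, n);      % M_n(D): number of degree-n monomials in D variables
    cost_single = N * Mn(D)^2;             % eq. (4.33): one SVD of nu_n(X)
    cost_fsasc  = N^2 * sum(arrayfun(@(i) Mn(D - i)^2, 0:(D - d)));   % eq. (4.35)
    fprintf('single vanishing polynomial: %.2e,  FSASC total: %.2e\n', ...
            cost_single, cost_fsasc);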

4.4 Experiments

In this section we evaluate experimentally the proposed methods FSASC (Algorithm 4.4) and SASC-D (§ 4.3.4) and compare them to other state-of-the-art subspace clustering methods, using synthetic data (§ 4.4.1), as well as real motion segmentation data (§ 4.4.2).


4.4.1 Experiments on synthetic data

We begin by randomly generating $n = 3$ subspaces of various dimension configurations $(d_1, d_2, d_3)$ in $\mathbb{R}^9$. The choice $D = 9$ for the ambient dimension is motivated by applications in two-view geometry [47, 113]. Once the subspaces are randomly generated, we use a zero-mean unit-variance Gaussian distribution with support on each subspace to randomly sample $N_i = 200$ points per subspace. The points of each subspace are then corrupted by additive zero-mean Gaussian noise with standard deviation $\sigma \in \{0, 0.01, 0.03, 0.05\}$ and support in the orthogonal complement of the subspace. All data points are subsequently normalized to have unit euclidean norm.
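A minimal MATLAB sketch of this data-generation protocol is shown below; it is an illustration of the stated procedure, and the variable names and the specific dimension configuration are assumptions.

    D = 9; n = 3; dims = [2 3 4]; Ni = 200; sigma = 0.01;   % one example configuration
    X = []; labels = [];
    for i = 1:n
        U  = orth(randn(D, dims(i)));            % random d_i-dimensional subspace basis
        Y  = U * randn(dims(i), Ni);             % zero-mean unit-variance samples on the subspace
        Nc = null(U');                           % basis of the orthogonal complement
        E  = Nc * (sigma * randn(D - dims(i), Ni));   % noise supported in the orthogonal complement
        Xi = Y + E;
        Xi = Xi ./ sqrt(sum(Xi.^2, 1));          % normalize each point to unit euclidean norm
        X = [X, Xi];  labels = [labels, i * ones(1, Ni)];
    end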

Using data as above, we compare the proposed methods FSASC (Algorithm 4.4) and SASC-D (§ 4.3.4) to the state-of-the-art SASC-A (§ 4.1.6) from algebraic subspace clustering methods, as well as to state-of-the-art self-expressiveness-based methods, such as Sparse Subspace Clustering (SSC) [30], Low-Rank Representation (LRR) [64, 66], Low-Rank Subspace Clustering (LRSC) [106] and Least-Squares Regression subspace clustering (LSR) [69]. For FSASC we use $L = 10$ and $\gamma = 0.1$. For SSC we use the Lasso version with $\alpha_z = 20$, where $\alpha_z$ is defined above equation (14) in [30], and $\rho = 0.7$, where $\rho$ is the thresholding parameter of the SSC affinity (see MATLAB function thrC.m provided by the authors of [30]). For LRR we use the ADMM version provided by its first author with $\lambda = 4$ in equation (7) of [65]. For LRSC we use the ADMM method proposed by its authors with $\tau = 420$ and $\alpha = 4000$, where $\alpha$ and $\tau$ are defined at problem (P) of


Table 4.1: Mean subspace clustering error in % over 100 independent trials for synthetic data randomly generated in three random subspaces of $\mathbb{R}^9$ of dimensions $(d_1, d_2, d_3)$. There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation $\sigma = 0, 0.01$ and support in the orthogonal complement of the subspace.

method   (2,3,4) (4,5,6) (6,7,8) (2,5,8) (3,3,3) (6,6,6) (7,7,7) (8,8,8)

σ = 0
FSASC       0      0      0      0      0      0      0      0
SASC-D      0      0      0      0      0      0      0      0
SASC-A     42     39      6     14     37     24     12      0
SSC         0      1     18     49      0      3     14     55
LRR         0      3     39      5      0      9     42     51
LRR-H       0      3     36      6      0      8     38     51
LRSC        0      3     39      5      0      9     42     51
LSR         0      3     39      5      0      9     42     51
LSR-H       0      3     32      6      0      8     38     51

σ = 0.01
FSASC       0      0      0      1      0      0      0      5
SASC-D      0      0      1      1      0      0      0      3
SASC-A     54     45      8     24     57     36     13      3
SSC         2      2     18     49      0      3     13     55
LRR         0      3     38      5      0      9     42     51
LRR-H       0      3     36      7      0      8     38     51
LRSC        0      3     38      5      0      9     42     51
LSR         0      3     39      5      0      9     42     51
LSR-H       0      3     32      6      0      8     38     51


Table 4.2: Mean subspace clustering error in % over 100 independent trials for synthetic data randomly generated in three random subspaces of $\mathbb{R}^9$ of dimensions $(d_1, d_2, d_3)$. There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation $\sigma = 0.03, 0.05$ and support in the orthogonal complement of the subspace.

method   (2,3,4) (4,5,6) (6,7,8) (2,5,8) (3,3,3) (6,6,6) (7,7,7) (8,8,8)

σ = 0.03
FSASC       0      0      1      2      0      0      1     10
SASC-D      0      0      4      3      0      1      2      6
SASC-A     57     46     13     31     58     37     15      7
SSC         0      1     20     48      0      3     13     55
LRR         0      3     39      5      0      9     42     51
LRR-H       0      3     36      8      0      8     38     51
LRSC        0      3     39      5      0     10     42     51
LSR         0      3     39      5      0     10     42     51
LSR-H       0      3     32      6      0      8     37     51

σ = 0.05
FSASC       1      0      2      3      1      0      2     14
SASC-D      1      1      7      5      1      2      5     10
SASC-A     58     46     17     36     60     39     17     11
SSC         0      2     20     49      0      3     15     55
LRR         1      3     39      6      0     10     42     51
LRR-H       1      3     36     13      0      8     38     52
LRSC        1      3     39      6      0     10     42     51
LSR         1      3     39      6      0     10     42     51
LSR-H       1      3     32      7      0      8     38     51


Table 4.3: Mean intra-cluster connectivity over 100 independent trials for synthetic data randomly generated in three random subspaces of $\mathbb{R}^9$ of dimensions $(d_1, d_2, d_3)$. There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation $\sigma$ and support in the orthogonal complement of each subspace.

method   (2,3,4)  (4,5,6)  (6,7,8)  (2,5,8)  (3,3,3)  (6,6,6)  (7,7,7)  (8,8,8)

σ = 0
FSASC       1        1        1        1        1        1        1        1
SASC-D      1        1        1        1        1        1        1        1
SASC-A    0.37     0.37     0.37     0.39     0.34     0.41     0.37       1
SSC      1e-3      0.01     1e-4     1e-3     0.01     0.02     1e-3     1e-7
LRR      0.59      0.37     0.43     0.31     0.64     0.41     0.45     0.50
LRR-H    0.28      0.23     0.23     0.19     0.31     0.24     0.24     0.26
LRSC     0.59      0.37     0.43     0.31     0.64     0.41     0.45     0.50
LSR      0.59      0.37     0.42     0.31     0.64     0.41     0.45     0.50
LSR-H    0.28      0.24     0.24     0.21     0.31     0.25     0.25     0.27

σ = 0.01
FSASC    0.05      0.35     0.43     0.10     0.09     0.43     0.42     0.43
SASC-D   0.91      0.93     0.85     0.84     0.94     0.91     0.87     0.85
SASC-A   0.32      0.30     0.12     0.14     0.30     0.29     0.24     0.07
SSC      1e-3      0.01     1e-4     1e-3     0.01     0.02     1e-3     1e-7
LRR      0.42      0.37     0.43     0.31     0.51     0.41     0.45     0.50
LRR-H    0.13      0.23     0.23     0.17     0.22     0.24     0.24     0.26
LRSC     0.42      0.37     0.43     0.31     0.52     0.41     0.45     0.50
LSR      0.41      0.37     0.42     0.31     0.51     0.41     0.45     0.50
LSR-H    0.11      0.24     0.24     0.18     0.21     0.25     0.25     0.27


Table 4.4: Mean inter-cluster connectivity in % over 100 independent trials for synthetic data randomly generated in three random subspaces of $\mathbb{R}^9$ of dimensions $(d_1, d_2, d_3)$. There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation $\sigma$ and support in the orthogonal complement of each subspace.

method   (2,3,4) (4,5,6) (6,7,8) (2,5,8) (3,3,3) (6,6,6) (7,7,7) (8,8,8)

σ = 0
FSASC       0      0      1      1      0      0      0      2
SASC-D     60     60     60     60     60     60     60     60
SASC-A     55     55     38     43     55     50     42     35
SSC         0      2     22      2      0      7     23     46
LRR         1     49     60     45      0     55     60     63
LRR-H       0     18     43      9      0     32     44     55
LRSC        2     49     60     45      2     55     60     63
LSR         2     49     60     43      2     56     60     64
LSR-H       0     11     24      6      0     19     25     30

σ = 0.01
FSASC       2      4     22     18      2      6     15     35
SASC-D     62     61     60     61     62     60     60     60
SASC-A     63     58     46     51     64     55     47     39
SSC       0.1      1     23      3    0.1      7     23     46
LRR        17     49     60     45     16     55     60     63
LRR-H       1     18     43      9      1     32     44     55
LRSC       17     49     60     45     16     55     60     63
LSR        17     49     60     46     16     55     60     64
LSR-H     0.1     11     24      6    0.1     19     25     30


Table 4.5: Mean running time of each method in seconds over 100 independent trials for synthetic data randomly generated in three random subspaces of $\mathbb{R}^9$ of dimensions $(d_1, d_2, d_3)$. There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation $\sigma = 0.01$ and support in the orthogonal complement of each subspace. The reported running time is the time required to compute the affinity matrix, and it does not include the spectral clustering step. The experiment is run in MATLAB on a standard Macbook-Pro with a dual core 2.5GHz Processor and a total of 4GB Cache memory.

method   (2,3,4)  (4,5,6)  (6,7,8)  (2,5,8)  (3,3,3)  (6,6,6)  (7,7,7)  (8,8,8)

σ = 0.01
FSASC    13.57    12.11     8.34    13.90    13.69    10.67     8.55     6.01
SASC-D    0.03     0.03     0.03     0.03     0.03     0.03     0.03     0.03
SASC-A    0.03     0.03     0.03     0.03     0.03     0.03     0.03     0.03
SSC       5.01     4.84     5.06     6.59     4.90     4.71     4.80     5.03
LRR       0.54     0.36     0.34     0.45     0.53     0.34     0.34     0.34
LRR-H     0.65     0.48     0.45     0.61     0.65     0.46     0.46     0.45
LRSC      0.01     0.01     0.01     0.01     0.01     0.01     0.01     0.01
LSR       0.05     0.05     0.05     0.07     0.05     0.05     0.05     0.05
LSR-H     0.25     0.25     0.24     0.32     0.24     0.24     0.24     0.24


page 2 in [106]. Finally, for LSR we use equation (16) in [69] with λ = 0.0048. For both

LRR and LSR we also report results with the heuristic post-processing of the affinity matrix proposed by the first author of [65] in their MATLAB function lrr_motion_seg.m; we denote these versions of LRR and LSR by LRR-H and LSR-H respectively.

Notice that all compared methods are spectral methods, i.e., they produce a pairwise affinity matrix C upon which spectral clustering is applied. To evaluate the quality of the produced affinity, besides reporting the standard subspace clustering error, which is the percentage of misclassified points, we also report the intra-cluster and inter-cluster con-

nectivities of the affinity matrices C. As an intra-cluster connectivity we use the minimum

algebraic connectivity among the subgraphs corresponding to the ground truth clusters.

The algebraic connectivity of a subgraph is the second smallest eigenvalue of its normal-

ized Laplacian, and measures how well connected the graph is. In particular, values close to

1 indicate that the subgraph is indeed well-connected (single connected component), while

values close to 0 indicate that the subgraph tends to split to at least two connected compo-

nents. Clearly, from a clustering point of view, the latter situation is undesirable, since it

may lead to over-segmentation. Finally, as inter-cluster connectivity we use the percentage

of the $\ell_1$-norm of the affinity matrix $C$ that corresponds to erroneous connections, i.e., the quantity $\sum_{\mathbf{x}_j \in \mathcal{S}_i,\, \mathbf{x}_{j'} \in \mathcal{S}_{i'},\, i \neq i'} |C_{jj'}| \,/\, \|C\|_1$. The smaller the inter-cluster connectivity is, the fewer erroneous connections the affinity contains. To summarize, a high-quality affinity

matrix is characterized by high intra-cluster and low inter-cluster connectivity, which is

then expected to lead to small spectral clustering error.


Tables 4.1-4.4 show the clustering error, and the intra-cluster and inter-cluster connectivities associated with each method, averaged over 100 independent experiments. Inspection of Table 4.1 reveals that, in the absence of noise ($\sigma = 0$), FSASC gives exactly zero error across all dimension configurations. This is in agreement with the theoretical results of § 4.2.2, which guarantee that, in the absence of noise, the only points that survive the filtration associated with some reference point are precisely the points lying in the same subspace as the reference point. Indeed, notice that in Table 4.3 and for $\sigma = 0$ the connectivity attains its maximum value 1, indicating that the subgraphs corresponding to the ground truth clusters are fully connected. Moreover, in Table 4.4 we see that for $\sigma = 0$ the erroneous connections are either zero or negligible. This practically means that each point is connected to each and every other point from the same subspace, while not connected to any other points, which is the ideal structure that an affinity matrix should have.

Remarkably, the proposed SASC-D, which is much simpler than FSASC, also gives zero error for zero noise. Table 4.3 shows that SASC-D achieves perfect intra-cluster connectivity, while Table 4.4 shows that the inter-cluster connectivity associated with SASC-D is very large. This is clearly an undesirable feature, which nevertheless seems not to be affecting the clustering error in this experiment, perhaps because the intra-cluster connectivity is very high. As we will see later (§ 4.4.2), the situation is different for real data, for which SASC-D performs worse than FSASC.

Going back to Table 4.1 and σ = 0, we see that the improvement in performance of the proposed FSASC and SASC-D over the existing SASC-A is dramatic: indeed, SASC-A


succeeds only in the case of hyperplanes, i.e., when d1 = d2 = d3 =8. This is theoretically

expected, since in the case of hyperplanes there is only one normal direction per subspace,

and the gradient of the vanishing polynomial at a point in the hyperplane is guaranteed to

recover this direction. However, when the subspaces have lower-dimensions, as is the case,

e.g., for the dimension configuration (4, 5, 6), then there are infinitely many orthogonal

directions to each subspace. Hence a priori, the gradient of a vanishing polynomial may

recover any such direction, and such directions could be dramatically different even for

points in the same subspace (e.g., they could be orthogonal), thus leading to a clustering

error of 39%.

As far as the rest of the self-expressiveness methods are concerned, Table 4.1 (σ = 0)

shows what we expect: the methods give a perfect clustering when the subspace dimensions

are small, e.g., for dimension configurations (2, 3, 4) and (3, 3, 3), they start to degrade as

the subspace dimensions increase ((4, 5, 6), (6, 6, 6)), and eventually they fail when the

subspace dimensions become large enough ((6, 7, 8),(7, 7, 7),(8, 8, 8)). To examine the ef-

fect of the subspace dimension on the connectivity, let us consider SSC and the dimension

configurations (2, 3, 4) and (2, 5, 8): Table 4.3 (σ = 0) shows that for both of these con-

figurations the intra-cluster connectivity has a small value of $10^{-3}$. This is expected, since

SSC computes sparse affinities and it is known to produce weakly connected clusters. Now,

Table 4.4 (σ =0) shows that the inter-cluster connectivity of SSC for (2, 3, 4) is zero, i.e.,

there are no erroneous connections, and so, even though the intra-cluster connectivity is as

small as $10^{-3}$, spectral clustering can still give a zero clustering error. On the other hand,


Table 4.6: Mean subspace clustering error in % over 100 independent trials for synthetic data randomly generated in four random subspaces of $\mathbb{R}^9$ of dimensions (8, 8, 5, 3). There are 200 points associated to each subspace, which are corrupted by zero-mean additive white noise of standard deviation $\sigma = 0, 0.01, 0.03, 0.05$ and support in the orthogonal complement of each subspace.

method / σ      0      0.01     0.03     0.05

FSASC           0      2.19     5.08     7.65
SASC-D      22.88     17.83    15.93    17.44
SASC-A      22.88     27.21    31.43    36.36
SSC         64.39     64.17    64.36    64.13
LRR         42.86     42.88    43.04    42.91
LRR-H       42.08     42.06    42.23    42.21
LRSC        42.85     42.88    43.05    42.90
LSR         42.84     42.85    43.00    42.93
LSR-H       38.72     38.74    38.96    39.86

for the case (2, 5, 8) the inter-cluster connectivity is 2%, which, even though small, when coupled with the small intra-cluster connectivity of $10^{-3}$, leads to a spectral clustering error of 49%. Finally, notice that for the case of (8, 8, 8) the intra-cluster connectivity is $10^{-7}$ and the inter-cluster connectivity is 46%, indicating that the quality of the produced affinity is very poor, thus explaining the corresponding clustering error of 55%.

When the data are corrupted by noise (σ =0.01, 0.03, 0.05), the rest of the Tables 4.1-

4.4 show that FSASC is the best method, with the exception of the case of hyperplanes. In this latter case, i.e., when d1 = d2 = d3 =8, the best method is SASC-D with a clustering


error of 6% when $\sigma = 0.03$, as opposed to 10% for FSASC. This is expected, since for the case of codimension-1 subspaces the length of each filtration should be precisely 1, since in theory the length of the filtration is equal to the codimension of the subspace associated to the reference point. Since FSASC automatically determines this length based on the data and the value of the parameter $\gamma$, it is expected that when the data are noisy, errors will be made in the estimation of the filtration length. On the other hand, SASC-D is equivalent to FSASC with an a priori configured filtration length equal to 1, thus performing better than FSASC. Certainly, giving as input to FSASC more than one value for $\gamma$, as shown in Algorithm 4.4, is expected to address this issue, but also to increase the running time of

FSASC (see Table 4.5 for average running times of the methods in the current experiment).

We conclude this section by demonstrating the interesting property of FSASC of being able to give the correct clustering by using vanishing polynomials of degree strictly less than the true number of subspaces. Towards that end, we consider a similar situation as above, except that now we have n = 4 subspaces of dimensions (8, 8, 5, 3). Contrary to

SASC-D and SASC-A, for which the theory requires degree-4 polynomials, FSASC is still applicable if one works with polynomials of degree 3: the crucial observation is that for the dimension configuration (8, 8, 5, 3), the corresponding subspace arrangement always admits vanishing polynomials of degree 3, and the same is true for every intermediate arrangement occurring in a filtration. For example, if one lets b1 be a normal vector to one of the 8-dimensional subspaces, and b2 a normal vector to the other, and b3 a normal vector to the 8-dimensional subspace spanned by both the 5-dimensional and 3-dimensional


subspace, then the polynomial $p(x) = (\mathbf{b}_1^\top x)(\mathbf{b}_2^\top x)(\mathbf{b}_3^\top x)$ has degree 3 and vanishes on the

entire arrangement of the four subspaces. Interestingly, Table 4.6 shows that FSASC gives

zero error in the absence of noise and 7.65% error for the worst case σ = 0.05, while

all other methods fail. In particular, the other two algebraic methods, i.e., SASC-D and

SASC-A, are not able to cluster the data using a single vanishing polynomial of degree 3.

4.4.2 Experiments on real motion sequences

We evaluate different methods on the Hopkins155 motion segmentation data set [94], which

contains 155 videos of n =2, 3 moving objects, each one with N = 100-500 feature point

trajectories of dimension D = 56-80. While SSC, LRR, LRSC and LSR can operate di-

rectly on the raw data, algebraic methods require $\mathcal{M}_n(D) \leq N$. Hence, for algebraic methods, we project the raw data onto the subspace spanned by their $D$ principal components, where $D$ is the largest integer $\leq 8$ such that $\mathcal{M}_n(D) \leq N$, and then normalize each point to have unit norm. We apply SSC to i) the raw data (SSC-raw) and ii) the raw points projected onto their first 8 principal components and normalized to unit norm (SSC-proj). For FSASC we use $L = 10$ and $\gamma = 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10$. LRR, LRSC and LSR use the same parameters as in § 4.4.1, while for SSC the parameters are $\alpha = 800$ and $\rho = 0.7$.

The clustering errors and the intra/inter-cluster connectivities are reported in Table 4.7 and Fig. 4.4. Notice the clustering errors of about 5% and 37% for SASC-A for two and three motions respectively. Notice how replacing the angle-based with the distance-based


Table 4.7: Mean clustering error (E) in %, intra-cluster connectivity (C1), and inter-cluster connectivity (C2) in % for the Hopkins155 data set.

                2 motions            3 motions            all motions
method        E     C1    C2       E     C1    C2       E     C1    C2

FSASC       0.80  0.18     4     2.48  0.10    10     1.18  0.16     5
SASC-D      5.65  0.82    26     14.0  0.80    46     7.59  0.81    31
SASC-A      4.99  0.35     5     36.8  0.09    35     12.2  0.29    12
SSC-raw     1.53  0.05     2     4.40  0.04     3     2.18  0.05     2
SSC-proj    5.87  0.04     3     5.70  0.03     3     5.83  0.03     3
LRR         4.26  0.25    19     7.78  0.25    28     5.05  0.25    21
LRR-H       2.25  0.05     2     3.40  0.04     3     2.51  0.05     2
LRSC        3.38  0.25    19     7.42  0.24    28     4.29  0.25    21
LSR         3.60  0.24    18     7.77  0.23    28     4.54  0.23    21
LSR-H       2.73  0.04     1     2.60  0.03     2     2.70  0.04     1

affinity, SASC-D already gives errors of around 5.5% and 14%. But most dramatically,

notice how FSASC further reduces those errors to 0.8% and 2.48%. Moreover, even though

the dimensions of the subspaces ($d_i \in \{1, 2, 3, 4\}$ for motion segmentation) are low relative to the ambient space dimension ($D = 56$-$80$) - a case that is specifically suited for SSC, LRR, LRSC, LSR - projecting the data to $D \leq 8$, which makes the subspace dimensions comparable to the ambient dimension, is sufficient for FSASC to get superior performance relative to the best performing algorithms on Hopkins 155. We believe that this is because, overall, FSASC produces a much higher intra-cluster connectivity, without increasing the inter-cluster connectivity too much.


[Figure 4.4 plots, for each method, the sorted clustering error rates over the Hopkins155 sequences (x-axis: sequence index, 90-150; y-axis: clustering error rate, 0-0.6), with mean errors: FSASC (1.18%), SASC-D (7.59%), SASC-A (12.2%), SSC-raw (2.18%), SSC-proj (5.83%), LRR (5.05%), LRR-H (2.51%), LRSC (4.29%), LSR (4.54%), LSR-H (2.70%).]

Figure 4.4: Clustering error ratios for both 2 and 3 motions in Hopkins155, ordered increasingly for each method. Errors start from the 90-th smallest error of each method.

4.5 Algebraic clustering of affine subspaces

4.5.1 Motivation

In several important applications, such as motion segmentation, the underlying subspaces do not pass through the origin, i.e., they are affine. Subspace clustering methods such as K-subspaces [9, 102] and mixtures of probabilistic PCA [93] can trivially handle this case. Likewise, the spectral clustering method of [122] can handle affine subspaces by constructing an affinity that depends on the distance from a point to a subspace. However, these methods do not come with theoretical conditions under which they are guaranteed to


give the correct clustering.

One existing work that comes with theoretical guarantees, albeit for a very restricted

class of unions of affine subspaces, is Sparse Subspace Clustering (SSC) [27, 28, 30].

Specifically, [27] exploits the fact that after embedding the data $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\} \subset \mathbb{R}^D$ into homogeneous coordinates
$$\begin{bmatrix} 1 & 1 & \cdots & 1 \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \end{bmatrix}, \tag{4.36}$$
the embedded points live in a union of linear subspaces (see § 4.5.2 for details). The work of [27] shows that when the linear subspaces are independent, the sparse representation of the embedded points produced by SSC is subspace preserving, i.e., points from different subspaces lie in distinct connected components of the affinity graph. Even so, this is not enough to guarantee the correct clustering, since the intra-cluster connectivity could be weak, which could lead to oversegmentation [76].

Returning to ASC, the traditional way to handle points from a union of affine subspaces

(see [104] for details) is to use homogeneous coordinates as in (4.36), and subsequently apply ASC to the embedded data. We will refer to this two-step approach as Affine ASC (AASC). Although AASC has been observed to perform well in practice, it lacks a sufficient theoretical justification. On one hand, while it is true that the embedded points live in a union of associated linear subspaces, it is obvious that they have a very particular structure inside these subspaces. In particular, even if the original points are generic, in the sense that they are randomly sampled from the affine subspaces, the embedded points are


clearly non-generic, in the sense that they always lie in the zero-measure intersection of the

union of the associated linear subspaces with the hyperplane x0 =1.

Thus, even in the absence of noise, one may wonder whether this non-genericity of the

embedded points will affect the behavior of AASC and to what extent. On the other hand,

even if the affine subspaces are transversal, there is no guarantee that the associated linear

subspaces are also transversal. Thus, it is natural to ask for conditions on the affine sub-

spaces and the data points under which AASC is guaranteed to give the correct clustering.

4.5.2 Problem statement and traditional approach

In this section we define the problem of clustering unions of affine subspaces, and analyze

the traditional algebraic approach, whose correctness is far from obvious.

Let $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be a finite set of points living in a union $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ of $n$ affine subspaces of $\mathbb{R}^D$, where for simplicity we assume that $n$ is known. Each affine subspace $\mathcal{A}_i$ is the translation by some vector $\boldsymbol{\mu}_i \in \mathbb{R}^D$ of a $d_i$-dimensional linear subspace $\mathcal{S}_i$, i.e., $\mathcal{A}_i = \mathcal{S}_i + \boldsymbol{\mu}_i$. The affine subspace clustering problem involves clustering the points $\mathcal{X}$ according to their subspace membership, and finding a parametrization of each affine subspace $\mathcal{A}_i$ by finding a translation vector $\boldsymbol{\mu}_i$ and a basis for its linear part $\mathcal{S}_i$, for all $i = 1, \ldots, n$. Note that there is an inherent ambiguity in determining the translation vectors $\boldsymbol{\mu}_i$, since if $\mathcal{A}_i = \mathcal{S}_i + \boldsymbol{\mu}_i$, then $\mathcal{A}_i = \mathcal{S}_i + (\mathbf{s}_i + \boldsymbol{\mu}_i)$ for any vector $\mathbf{s}_i \in \mathcal{S}_i$. Consequently, the best we can hope for is to determine the unique component of $\boldsymbol{\mu}_i$ in the orthogonal complement $\mathcal{S}_i^\perp$ of $\mathcal{S}_i$.


Since the inception of ASC, the standard algebraic approach to cluster points living in a

union of affine subspaces has been to embed the points into $\mathbb{R}^{D+1}$ and subsequently apply ASC [104]. The precise embedding $\phi_0 : \mathbb{R}^D \hookrightarrow \mathbb{R}^{D+1}$ is given by
$$\alpha = (\alpha_1, \ldots, \alpha_D) \overset{\phi_0}{\longmapsto} \tilde{\alpha} = (1, \alpha_1, \ldots, \alpha_D). \tag{4.37}$$
To understand the effect of this embedding and why it is meaningful to apply ASC to the embedded points, let $\mathcal{A} = \mathcal{S} + \boldsymbol{\mu}$ be a $d$-dimensional affine subspace of $\mathbb{R}^D$, with $\mathbf{u}_1, \ldots, \mathbf{u}_d$ being a basis for its linear part $\mathcal{S}$. As noted earlier, we can also assume that $\boldsymbol{\mu} \in \mathcal{S}^\perp$. For $\mathbf{x} \in \mathcal{A}$, there exists $\mathbf{y} \in \mathbb{R}^d$ such that
$$\mathbf{x} = U\mathbf{y} + \boldsymbol{\mu}, \qquad U := [\mathbf{u}_1, \ldots, \mathbf{u}_d] \in \mathbb{R}^{D \times d}. \tag{4.38}$$
Then the embedded point $\tilde{\mathbf{x}} := \phi_0(\mathbf{x})$ can be written as
$$\tilde{\mathbf{x}} = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix} = \tilde{U}\begin{bmatrix} 1 \\ \mathbf{y} \end{bmatrix}, \qquad \tilde{U} := \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \boldsymbol{\mu} & \mathbf{u}_1 & \cdots & \mathbf{u}_d \end{bmatrix}. \tag{4.39}$$
Equation (4.39) clearly indicates that the embedded point $\tilde{\mathbf{x}}$ lies in the linear $(d+1)$-dimensional subspace $\tilde{\mathcal{S}} := \mathrm{Span}(\tilde{U})$ of $\mathbb{R}^{D+1}$, and the same is true for the entire affine subspace $\mathcal{A}$. From (4.39) one sees immediately that $(\mathbf{u}_1, \ldots, \mathbf{u}_d, \boldsymbol{\mu})$ can be used to construct a basis of $\tilde{\mathcal{S}}$. The converse is also true: given any basis of $\tilde{\mathcal{S}}$ one can recover a basis for the linear part $\mathcal{S}$ and the translation vector $\boldsymbol{\mu}$ of $\mathcal{A}$. Hence, the embedding $\phi_0$$^{14}$ takes a union of affine subspaces $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ into a union of linear subspaces $\tilde{\Phi} = \bigcup_{i=1}^n \tilde{\mathcal{S}}_i$ of

222 CHAPTER 4. ADVANCES IN ALGEBRAIC SUBSPACE CLUSTERING

RD+1, in a way that there is a 1 1 correspondence between the parameters of (a basis − Ai for the linear part and the translation vector) and the parameters of ˜ (a basis) i [n]. Si ∀ ∈ To the best of the author’s knowledge, the correspondence between and ˜ has been Ai Si the sole theoretical justification so far in the subspace clustering literature for the traditional

Affine ASC (AASC) approach for dealing with affine subspaces, which consists of

1. applying the embedding $\phi_0$ to the points $\mathcal{X}$ in $\Psi$,

2. computing a basis $p_1, \ldots, p_s$ for the vector space $\mathcal{I}_{\tilde{\mathcal{X}},n}$ of homogeneous polynomials of degree $n$ that vanish on the embedded points $\tilde{\mathcal{X}} := \phi_0(\mathcal{X})$,

3. for $\tilde{x}_i \in \tilde{\mathcal{X}} \cap \tilde{\mathcal{S}}_i - \bigcup_{i' \neq i} \tilde{\mathcal{S}}_{i'}$, estimating $\tilde{\mathcal{S}}_i$ via the formula

$$\tilde{\mathcal{S}}_i = \operatorname{Span}\big( \nabla p_1|_{\tilde{x}_i}, \ldots, \nabla p_s|_{\tilde{x}_i} \big)^\perp, \quad \text{and} \qquad (4.40)$$

4. extracting the translation vector of $\mathcal{A}_i$, and a basis for its linear part, from a basis of $\tilde{\mathcal{S}}_i$ (a minimal numerical sketch of these four steps is given below).
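To make the four steps above concrete, here is a minimal numerical sketch for two affine lines in $\mathbb{R}^2$. The toy data, the degree-$2$ Veronese construction used to obtain the vanishing polynomials, and all variable names are assumptions of this sketch (using numpy), not an implementation from the thesis.

```python
# A minimal numerical sketch of the four AASC steps above, for two affine
# lines in R^2.  Toy data and names are illustrative assumptions.
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(0)
D, n = 2, 2                                      # ambient dimension, number of subspaces
X1 = np.stack([rng.uniform(-3, 3, 30), np.ones(30)])      # A_1: the line {y = 1}
X2 = np.stack([2 * np.ones(30), rng.uniform(-3, 3, 30)])  # A_2: the line {x = 2}
X = np.hstack([X1, X2])                          # D x N data matrix

# Step 1: the embedding phi_0 of (4.37), prepending the coordinate x_0 = 1.
Xt = np.vstack([np.ones(X.shape[1]), X])         # (D+1) x N

# Step 2: a basis of the degree-n homogeneous polynomials vanishing on the
# embedded points, via the left null space of the degree-n Veronese map.
monos = list(combinations_with_replacement(range(D + 1), n))
V = np.stack([np.prod(Xt[list(m), :], axis=0) for m in monos])
U, s, _ = np.linalg.svd(V)
coeffs = U[:, s < 1e-8 * s[0]]                   # one column per vanishing polynomial

# Step 3: gradients of the vanishing polynomials at a point of A_1 only,
# cf. (4.40).  A degree-2 form is x~^T C x~ with C symmetric, so grad = 2 C x~.
x1t = Xt[:, 0]                                   # an embedded point from A_1
grads = []
for c in coeffs.T:
    C = np.zeros((D + 1, D + 1))
    for cm, (i, j) in zip(c, monos):
        C[i, j] += cm if i == j else cm / 2
        if i != j:
            C[j, i] += cm / 2
    grads.append(2 * C @ x1t)
B_tilde = np.stack(grads, axis=1)                # spans the normal space of S~_1

# Step 4: split each normal as (gamma_k, b_k) and recover A_1 via (4.67).
gamma1, B1 = B_tilde[0], B_tilde[1:]
mu1 = -B1 @ np.linalg.lstsq(B1.T @ B1, gamma1, rcond=None)[0]
print("normal of the linear part of A_1:", (B1 / np.linalg.norm(B1)).ravel())
print("recovered translation of A_1:", mu1)      # expect approximately (0, 1)
```

With this construction, the recovered translation is the component of $\mu_1$ in $\mathcal{S}_1^\perp$, in agreement with the ambiguity discussed in §4.5.2.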

According to Theorem 4.6, the above process will succeed, if i) the embedded points $\tilde{\mathcal{X}}$ are in general position in $\tilde{\Phi}$ (with respect to degree $n$ in the sense of Definition 4.12; since $n$ is known, we will henceforth omit the "with respect to degree $n$" part), and ii) the union of linear subspaces $\tilde{\Phi}$ is transversal. Note that these conditions need not be satisfied a-priori because of the particular structure of both the embedded data in (4.36) and the basis in (4.39). This gives rise to the following reasonable questions:

Question 4.30. Under what conditions on $\mathcal{X}$ and $\Psi$ will $\tilde{\mathcal{X}}$ be in general position in $\tilde{\Phi}$?


Question 4.31. Under what conditions on $\Psi$ will $\tilde{\Phi}$ be transversal?

In what follows, we rigorously answer these two questions. Our insights are drawn from

the algebraic geometric properties of unions of affine subspaces, which we study next.

4.5.3 Algebraic geometry of unions of affine subspaces

In §4.5.3.1 we describe the basic algebraic geometry of affine subspaces and unions thereof, in analogy to the case of linear subspaces. In particular, we show that a single affine subspace is the zero-set of polynomial equations of degree $1$, and a union $\Psi$ of affine subspaces is the zero-set of polynomial equations of degree $n$. In §4.5.3.2 we study more closely the embedding $\mathcal{A} \overset{\phi_0}{\longrightarrow} \tilde{\mathcal{S}}$ of an affine subspace $\mathcal{A} \subset \mathbb{R}^D$ into its associated linear subspace $\tilde{\mathcal{S}} \subset \mathbb{R}^{D+1}$, which will lead to a deeper understanding of the embedding $\Psi \overset{\phi_0}{\longrightarrow} \tilde{\Phi}$ of a union of affine subspaces $\Psi \subset \mathbb{R}^D$ into its associated union of linear subspaces $\tilde{\Phi} \subset \mathbb{R}^{D+1}$. As we will see, $\Psi$ is dense in $\tilde{\Phi}$ in a very precise sense, and the algebraic manifestation of this relation (Proposition 4.41) will be of frequent use later in §4.5.4.

4.5.3.1 Affine subspaces as affine varieties

Let $\mathcal{A} = \mathcal{S} + \mu$ be an affine subspace of $\mathbb{R}^D$ and let $b_1, \ldots, b_c$ be a basis for the orthogonal complement $\mathcal{S}^\perp$ of $\mathcal{S}$. The first important observation is that a vector $x$ belongs to $\mathcal{S}$ if and only if $x \perp b_k$, $\forall k = 1, \ldots, c$. In the language of algebraic geometry this is the same as saying that $\mathcal{S}$ is the zero set of $c$ linear polynomials:

$$\mathcal{S} = \mathcal{Z}\big(b_1^\top x, \ldots, b_c^\top x\big), \qquad x := [x_1, \ldots, x_D]^\top. \qquad (4.41)$$

One may wonder if the linear polynomials $b_i^\top x$, $i = 1, \ldots, c$, form some sort of basis for the vanishing ideal $\mathcal{I}_{\mathcal{S}}$ of $\mathcal{S}$ (see Definition 4.56). In fact this is true (see Proposition 4.72 for a proof) and can be formalized by saying that these linear polynomials are generators of $\mathcal{I}_{\mathcal{S}}$ over the polynomial ring $\mathcal{R} = \mathbb{R}[x_1, \ldots, x_D]$. This means that every polynomial that belongs to $\mathcal{I}_{\mathcal{S}}$ can be written as a linear combination of $b_1^\top x, \ldots, b_c^\top x$ with polynomial coefficients, i.e.,

$$p(x) = p_1(x)\,(b_1^\top x) + \cdots + p_c(x)\,(b_c^\top x), \qquad (4.42)$$

where $p_1, \ldots, p_c$ are some polynomials in $\mathcal{R}$. More compactly,

$$\mathcal{I}_{\mathcal{S}} = \langle b_1^\top x, \ldots, b_c^\top x \rangle, \qquad (4.43)$$

which reads as: $\mathcal{I}_{\mathcal{S}}$ is the ideal generated by the polynomials $b_1^\top x, \ldots, b_c^\top x$ as in (4.42). Moving on, the second important observation is that $x \in \mathcal{A}$ if and only if $x - \mu \in \mathcal{S}$. Equivalently,

$$x \in \mathcal{A} \;\Leftrightarrow\; b_k \perp x - \mu, \quad \forall k = 1, \ldots, c, \qquad (4.44)$$

or in algebraic geometric terms

$$\mathcal{A} = \mathcal{Z}\big(b_1^\top x - b_1^\top \mu, \ldots, b_c^\top x - b_c^\top \mu\big). \qquad (4.45)$$

In other words, the affine subspace $\mathcal{A}$ is an algebraic variety of $\mathbb{R}^D$. In fact, we say that $\mathcal{A}$ is an affine variety, since it is defined by non-homogeneous polynomials. To describe the

vanishing ideal $\mathcal{I}_{\mathcal{A}}$ of $\mathcal{A}$, note that a polynomial $p(x)$ vanishes on $\mathcal{A}$ if and only if $p(x + \mu)$ vanishes on $\mathcal{S}$. This, together with (4.43), gives

$$\mathcal{I}_{\mathcal{A}} = \langle b_1^\top x - b_1^\top \mu, \ldots, b_c^\top x - b_c^\top \mu \rangle. \qquad (4.46)$$

Next, we consider a union $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ of affine subspaces $\mathcal{A}_i = \mathcal{S}_i + \mu_i$, $i \in [n]$, of $\mathbb{R}^D$. The next result describes $\Psi$ as the zero-set of non-homogeneous polynomials of degree $n$, showing that $\Psi$ is an affine variety of $\mathbb{R}^D$.

Proposition 4.32. Let $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ be a union of affine subspaces of $\mathbb{R}^D$, where each affine subspace $\mathcal{A}_i$ is the translation of a linear subspace $\mathcal{S}_i$ of codimension $c_i$ by a translation vector $\mu_i$. For each $\mathcal{A}_i = \mathcal{S}_i + \mu_i$, let $b_{i1}, \ldots, b_{i c_i}$ be a basis for $\mathcal{S}_i^\perp$. Then $\Psi$ is the zero set of all degree-$n$ polynomials of the form

$$\prod_{i=1}^n \big(b_{i j_i}^\top x - b_{i j_i}^\top \mu_i\big) \;:\; (j_1, \ldots, j_n) \in [c_1] \times \cdots \times [c_n]. \qquad (4.47)$$

Proof. Denote the set of all polynomials of the form (4.47) by $\mathcal{P}$.

First, we show that $\Psi \subset \mathcal{Z}(\mathcal{P})$. Take $x \in \Psi$; we will show that $x \in \mathcal{Z}(\mathcal{P})$. Since $\Psi = \mathcal{A}_1 \cup \cdots \cup \mathcal{A}_n$, $x$ belongs to at least one of the affine subspaces, say $x \in \mathcal{A}_i$, for some $i$. For every polynomial $p$ of $\mathcal{P}$, there is a linear factor $b_{i j_i}^\top x - b_{i j_i}^\top \mu_i$ of $p$ that vanishes on $\mathcal{A}_i$ and thus on $x$. Hence $p$ itself will vanish on $x$. Since $p$ was an arbitrary element of $\mathcal{P}$, this shows that every polynomial of $\mathcal{P}$ vanishes on $x$, i.e., $x \in \mathcal{Z}(\mathcal{P})$.

Next, we show that $\mathcal{Z}(\mathcal{P}) \subset \Psi$. Let $x \in \mathcal{Z}(\mathcal{P})$; we will show that $x \in \Psi$. If $x$ is a root of all polynomials $p_{1j}(x) = b_{1j}^\top x - b_{1j}^\top \mu_1$, then $x \in \mathcal{A}_1$ and we are done. Otherwise, one of these linear polynomials does not vanish on $x$, say $p_{11}(x) \neq 0$. Now suppose that


$x \notin \Psi$. By the above argument, for every affine subspace $\mathcal{A}_i$ there must exist some linear polynomial $b_{i1}^\top x - b_{i1}^\top \mu_i$, which does not vanish on $x$. As a consequence, the polynomial

$$p(x) = \prod_{i=1}^n \big(b_{i1}^\top x - b_{i1}^\top \mu_i\big) \qquad (4.48)$$

does not vanish on $x$, i.e., $p(x) \neq 0$. But because of the definition of $\mathcal{P}$, we must have that $p \in \mathcal{P}$. Since $x$ was selected to be an element of $\mathcal{Z}(\mathcal{P})$, we must have that $p(x) = 0$, which is a contradiction, as we just saw that $p(x) \neq 0$. Consequently, the hypothesis that $x \notin \Psi$ must be false, i.e., $\mathcal{Z}(\mathcal{P}) \subset \Psi$, and the proof is concluded.
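As a quick sanity check of Proposition 4.32, the following sketch (toy arrangement and names are illustrative assumptions, not from the thesis) evaluates the single product polynomial of the form (4.47) for two affine lines in $\mathbb{R}^2$: it vanishes on points of $\Psi$ and not on a generic point off $\Psi$.

```python
# Numerical check of Proposition 4.32 for two affine lines in R^2.
import numpy as np

rng = np.random.default_rng(1)
b1, mu1 = np.array([0.0, 1.0]), np.array([0.0, 1.0])   # A_1 = {y = 1}, c_1 = 1
b2, mu2 = np.array([1.0, 0.0]), np.array([2.0, 0.0])   # A_2 = {x = 2}, c_2 = 1

def p(x):
    # The single product polynomial of the form (4.47) for this arrangement.
    return (b1 @ x - b1 @ mu1) * (b2 @ x - b2 @ mu2)

on_A1 = np.array([rng.uniform(-3, 3), 1.0])            # a point of A_1
on_A2 = np.array([2.0, rng.uniform(-3, 3)])            # a point of A_2
off   = np.array([0.5, 0.5])                           # a point off Psi
print(p(on_A1), p(on_A2), p(off))                      # 0.0, 0.0, nonzero
```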

The reader may wonder what the vanishing ideal $\mathcal{I}_\Psi$ of $\Psi$ is and what its relation is to the linear polynomials whose products generate $\Psi$, as in Proposition 4.32. In fact, this question

is still partially open even in the simpler case of a union of linear subspaces [16, 20, 21].

As it turns out, $\mathcal{I}_\Psi$ is intimately related to $\mathcal{I}_{\tilde{\Phi}}$, where $\tilde{\Phi} = \bigcup_{i=1}^n \tilde{\mathcal{S}}_i$ is the union of linear subspaces associated to $\Psi$ under the embedding $\phi_0$ of (4.37). It is precisely this relation that

will enable us to prove Theorem 4.44, and to elucidate it we need the notion of projective

closure that we introduce next.$^{15}$

4.5.3.2 The projective closure of affine subspaces

Let $\phi_0(\mathcal{A})$ be the image of $\mathcal{A} = \mathcal{S} + \mu$ under the embedding $\phi_0 : \mathbb{R}^D \hookrightarrow \mathbb{R}^{D+1}$ in (4.37). Let $\tilde{\mathcal{S}}$ be the $(d+1)$-dimensional linear subspace of $\mathbb{R}^{D+1}$ spanned by the columns of $\tilde{U}$

$^{15}$Of course, the notion of projective closure is a well-known concept in algebraic geometry; here we introduce it in a self-contained fashion in the context of unions of affine subspaces, dispensing with unnecessary abstractions.


(see (4.39)). A basis for the orthogonal complement of $\tilde{\mathcal{S}}$ in $\mathbb{R}^{D+1}$ is

$$\tilde{b}_1 := \begin{bmatrix} -b_1^\top \mu \\ b_1 \end{bmatrix}, \; \ldots, \; \tilde{b}_c := \begin{bmatrix} -b_c^\top \mu \\ b_c \end{bmatrix}, \qquad (4.49)$$

since $\operatorname{codim}(\tilde{\mathcal{S}}) = \operatorname{codim}(\mathcal{S})$, and the $\tilde{b}_i$ are linearly independent because the $b_i$ are. In algebraic geometric terms

$$\tilde{\mathcal{S}} = \mathcal{Z}\big(b_1^\top x - (b_1^\top \mu)x_0, \ldots, b_c^\top x - (b_c^\top \mu)x_0\big) = \mathcal{Z}\big(\tilde{b}_1^\top \tilde{x}, \ldots, \tilde{b}_c^\top \tilde{x}\big), \qquad \tilde{x} := [x_0, x_1, \ldots, x_D]^\top. \qquad (4.50)$$

By inspecting equations (4.45) and (4.50), we see that every point of $\phi_0(\mathcal{A})$ satisfies the equations (4.50) of $\tilde{\mathcal{S}}$. Since these equations are homogeneous, it will in fact be true that for any point $\tilde{x} \in \phi_0(\mathcal{A})$ the entire line of $\mathbb{R}^{D+1}$ spanned by $\tilde{x}$ will still lie in $\tilde{\mathcal{S}}$. Hence, we may as well think of the embedding $\phi_0$ as mapping a point $x \in \mathbb{R}^D$ to a line of $\mathbb{R}^{D+1}$. To formalize this concept, we need the notion of the real projective space [18, 48]:

Definition 4.33. The real projective space $\mathbb{P}^D$ is defined to be the set of all lines through the origin in $\mathbb{R}^{D+1}$. Each non-zero vector $\alpha$ of $\mathbb{R}^{D+1}$ defines an element $[\alpha]$ of $\mathbb{P}^D$, and two elements $[\alpha], [\beta]$ of $\mathbb{P}^D$ are equal in $\mathbb{P}^D$ if and only if there exists a nonzero $\lambda \in \mathbb{R}$ such that we have an equality $\alpha = \lambda \beta$ of vectors in $\mathbb{R}^{D+1}$. For each point $[\alpha] \in \mathbb{P}^D$, we call the point $\alpha \in \mathbb{R}^{D+1}$ a representative of $[\alpha]$.

Now we can define a new embedding $\hat{\phi}_0 : \mathbb{R}^D \rightarrow \mathbb{P}^D$, that behaves exactly as $\phi_0$ in (4.37), except that it now takes points of $\mathbb{R}^D$ to lines of $\mathbb{R}^{D+1}$, or more precisely, to elements of $\mathbb{P}^D$:

$$(\alpha_1, \alpha_2, \ldots, \alpha_D) \overset{\hat{\phi}_0}{\longmapsto} [(1, \alpha_1, \alpha_2, \ldots, \alpha_D)]. \qquad (4.51)$$

A point $x$ of $\mathcal{A}$ is mapped by $\hat{\phi}_0$ to a line inside $\tilde{\mathcal{S}}$, or more specifically, to the point $[\tilde{x}]$ of $\mathbb{P}^D$, whose representative $\tilde{x}$ satisfies the equations (4.50) of $\tilde{\mathcal{S}}$. The set of all lines of $\mathbb{R}^{D+1}$ that live in $\tilde{\mathcal{S}}$, viewed as elements of $\mathbb{P}^D$, is denoted by $[\tilde{\mathcal{S}}]$, i.e.,

$$[\tilde{\mathcal{S}}] = \big\{ [\alpha] \in \mathbb{P}^D : \alpha \in \tilde{\mathcal{S}} \big\}. \qquad (4.52)$$

The representative $\alpha$ of every element $[\alpha] \in [\tilde{\mathcal{S}}]$ satisfies by definition the equations (4.50) of $\tilde{\mathcal{S}}$, and so $[\tilde{\mathcal{S}}]$ naturally has the structure of an algebraic variety of $\mathbb{P}^D$, which is called a projective variety. We emphasize that even though the varieties $\tilde{\mathcal{S}}$ and $[\tilde{\mathcal{S}}]$ live in different spaces, $\mathbb{R}^{D+1}$ and $\mathbb{P}^D$ respectively, they are defined by the same equations. In fact, every algebraic variety $\mathcal{Y}$ of $\mathbb{R}^{D+1}$ that is a union of lines, which is true if and only if $\mathcal{Y}$ is defined by homogeneous equations, gives rise to a projective variety $[\mathcal{Y}]$ of $\mathbb{P}^D$ defined by the same equations.

Example 4.34. Recall from Section 4.1 that a union $\tilde{\Phi}$ of linear subspaces is defined as the zero-set of homogeneous polynomials. Then $\tilde{\Phi}$ gives rise to a projective variety $[\tilde{\Phi}]$ of $\mathbb{P}^D$ defined by the same equations as $\tilde{\Phi}$, which can be thought of as the set of lines through the origin in $\mathbb{R}^{D+1}$ that live in $\tilde{\Phi}$.

Returning to our embedding $\hat{\phi}_0$, the formal relation between $\hat{\phi}_0(\mathcal{A})$ and $\tilde{\mathcal{S}}$ is revealed with the help of the Zariski topology [18, 48] (Definition 4.57):


Proposition 4.35. In the Zariski topology, the set $\hat{\phi}_0(\mathcal{A})$ is open and dense in $[\tilde{\mathcal{S}}]$; in particular, $[\tilde{\mathcal{S}}]$ is the closure$^{16}$ of $\hat{\phi}_0(\mathcal{A})$ in $\mathbb{P}^D$.

The projective variety $[\tilde{\mathcal{S}}]$ is called the projective closure of $\mathcal{A}$: it is the smallest projective variety that contains $\hat{\phi}_0(\mathcal{A})$. We now characterize the projective closure of a union of affine subspaces.

Proposition 4.36. Let $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ be a union of affine subspaces of $\mathbb{R}^D$. Then the projective closure of $\Psi$ in $\mathbb{P}^D$, i.e., the smallest projective variety that contains $\hat{\phi}_0(\Psi)$, is

$$\bigcup_{i=1}^n [\tilde{\mathcal{S}}_i] = \left[ \bigcup_{i=1}^n \tilde{\mathcal{S}}_i \right] = [\tilde{\Phi}], \qquad (4.53)$$

where $\tilde{\mathcal{S}}_i$ is the linear subspace of $\mathbb{R}^{D+1}$ corresponding to $\mathcal{A}_i$ under the embedding $\phi_0$ of (4.37).

The geometric fact that $[\tilde{\Phi}] \subset \mathbb{P}^D$ is the smallest projective variety of $\mathbb{P}^D$ that contains $\hat{\phi}_0(\Psi)$ manifests itself algebraically in $\mathcal{I}_\Psi$ being uniquely defined by $\mathcal{I}_{\tilde{\Phi}}$ and vice versa, in a very precise fashion. To describe this relation, we need a definition.

Definition 4.37 (Homogenization - Dehomogenization). Let $p \in \mathcal{R} = \mathbb{R}[x_1, \ldots, x_D]$ be a polynomial of degree $n$. The homogenization of $p$ is the homogeneous polynomial

$$p^{(h)} = x_0^n \, p\!\left( \frac{x_1}{x_0}, \frac{x_2}{x_0}, \ldots, \frac{x_D}{x_0} \right) \qquad (4.54)$$

of $\tilde{\mathcal{R}} = \mathbb{R}[x_0, x_1, \ldots, x_D]$ of degree $n$. Conversely, if $P \in \tilde{\mathcal{R}}$ is homogeneous of degree $n$, its dehomogenization is $P_{(d)} = P(1, x_1, \ldots, x_D)$, a polynomial of $\mathcal{R}$ of degree $\leq n$.

$^{16}$It can further be shown that $[\tilde{\mathcal{S}}] = \hat{\phi}_0(\mathcal{A}) \cup [\mathcal{S}]$: intuitively, the set that we need to add to $\hat{\phi}_0(\mathcal{A})$ to get a closed set is the slope $[\mathcal{S}]$ of $\mathcal{A}$.


Example 4.38. Let $P = x_0^2 x_1 + x_0 x_2^2 + x_1 x_2 x_3$ be a homogeneous polynomial of degree $3$. Its dehomogenization is the degree-$3$ polynomial $P_{(d)} = x_1 + x_2^2 + x_1 x_2 x_3$, and the homogenization of $P_{(d)}$ is

$$\big(P_{(d)}\big)^{(h)} = x_0^3 \left( \frac{x_1}{x_0} + \frac{x_2^2}{x_0^2} + \frac{x_1 x_2 x_3}{x_0^3} \right) = P.$$
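Definition 4.37 and Example 4.38 can be reproduced symbolically. The sketch below uses sympy; the helper functions homogenize and dehomogenize are illustrative assumptions of this sketch that implement (4.54) and the dehomogenization directly.

```python
# A small sympy illustration of Definition 4.37, reproducing Example 4.38.
import sympy as sp

x0, x1, x2, x3 = sp.symbols('x0 x1 x2 x3')

def homogenize(p, n, x0, xs):
    # p^(h) = x0^n * p(x1/x0, ..., xD/x0), cf. (4.54)
    return sp.expand(x0**n * p.subs({x: x / x0 for x in xs}))

def dehomogenize(P, x0):
    # P_(d) = P(1, x1, ..., xD)
    return P.subs(x0, 1)

P = x0**2 * x1 + x0 * x2**2 + x1 * x2 * x3            # homogeneous of degree 3
Pd = dehomogenize(P, x0)                              # x1 + x2**2 + x1*x2*x3
print(Pd)
print(sp.simplify(homogenize(Pd, 3, x0, [x1, x2, x3]) - P))   # 0, i.e. (P_(d))^(h) = P
```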

The next result from algebraic geometry is crucial for our purpose.

Theorem 4.39 (Chapter 8 in [18]). Let $\mathcal{Y}$ be an affine variety of $\mathbb{R}^D$ and let $\bar{\mathcal{Y}}$ be its projective closure in $\mathbb{P}^D$ with respect to the embedding $\hat{\phi}_0$ of (4.51). Let $\mathcal{I}_{\mathcal{Y}}, \mathcal{I}_{\bar{\mathcal{Y}}}$ be the vanishing ideals of $\mathcal{Y}, \bar{\mathcal{Y}}$ respectively. Then $\mathcal{I}_{\bar{\mathcal{Y}}} = \mathcal{I}_{\mathcal{Y}}^{(h)}$, i.e., every element of $\mathcal{I}_{\bar{\mathcal{Y}}}$ arises as a homogenization of some element of $\mathcal{I}_{\mathcal{Y}}$, and every element of $\mathcal{I}_{\mathcal{Y}}$ arises as the dehomogenization of some element of $\mathcal{I}_{\bar{\mathcal{Y}}}$.

We have already seen that $\tilde{\Phi}$ and $[\tilde{\Phi}]$ are given as algebraic varieties by identical equations.

It is also not hard to see that the vanishing ideals of these varieties are identical as well.

Lemma 4.40. Let $\tilde{\Phi} = \bigcup_{i=1}^n \tilde{\mathcal{S}}_i$ be a union of linear subspaces of $\mathbb{R}^{D+1}$, and let $[\tilde{\Phi}] = \bigcup_{i=1}^n [\tilde{\mathcal{S}}_i]$ be the corresponding projective variety of $\mathbb{P}^D$. Then $\mathcal{I}_{\tilde{\Phi},k} = \mathcal{I}_{[\tilde{\Phi}],k}$, i.e., a degree-$k$ homogeneous polynomial vanishes on $\tilde{\Phi}$ if and only if it vanishes on $[\tilde{\Phi}]$.

As a Corollary of Theorem 4.39 and Lemma 4.40, we obtain the key result of this section:

Proposition 4.41. Let $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ be a union of affine subspaces of $\mathbb{R}^D$. Let $\tilde{\Phi} = \bigcup_{i=1}^n \tilde{\mathcal{S}}_i$ be the union of linear subspaces of $\mathbb{R}^{D+1}$ associated to $\Psi$ under the embedding $\phi_0$ of (4.37). Then $\mathcal{I}_{\tilde{\Phi}}$ is the homogenization of $\mathcal{I}_\Psi$.


4.5.4 Correctness theorems for the homogenization trick

We are now in a position to address Questions 4.30 and 4.31. Regarding Question 4.30, one may be tempted to conjecture that $\tilde{\mathcal{X}}$ is in general position in $\tilde{\Phi}$ if the components of the points $\mathcal{X}$ along the union $\Phi := \bigcup_{i=1}^n \mathcal{S}_i$ of the linear parts of the affine subspaces are in general position inside $\Phi$. However, this conjecture is not true, as illustrated by the next example.

Example 4.42. Suppose that $\Psi = \mathcal{A}_1 \cup \mathcal{A}_2$ is a union of two affine planes $\mathcal{A}_i = \mathcal{S}_i + \mu_i$ of $\mathbb{R}^3$. Then $\Phi = \mathcal{S}_1 \cup \mathcal{S}_2$ is a union of $2$ planes in $\mathbb{R}^3$ and as argued in Example 4.15, we can find $5$ points in general position in $\Phi$. However, $\tilde{\Phi} = \tilde{\mathcal{S}}_1 \cup \tilde{\mathcal{S}}_2$ is a union of $2$ hyperplanes in $\mathbb{R}^4$ and any subset of $\tilde{\Phi}$ in general position must consist of at least $\mathcal{M}_2(4) - 1 = \binom{2+3}{2} - 1 = 9$ points.$^{17}$

Thanks to Proposition 4.32 we can define points $\mathcal{X}$ to be in general position in $\Psi$, in analogy to Definition 4.12.

Definition 4.43. Let $\Psi$ be a union of $n$ affine subspaces of $\mathbb{R}^D$ and $\mathcal{X}$ a finite subset of $\Psi$. We will say that $\mathcal{X}$ is in general position in $\Psi$, if $\Psi$ can be recovered as the zero set of all polynomials of degree $n$ that vanish on $\mathcal{X}$. Equivalently, a polynomial of degree $n$ vanishes on $\Psi$ if and only if it vanishes on $\mathcal{X}$.

We are now ready to answer our Question 4.30.

$^{17}$Otherwise one can fit a polynomial of degree $2$ to the points which does not vanish on $\tilde{\Phi}$.


Theorem 4.44. Let $\mathcal{X}$ be a finite subset of a union of $n$ affine subspaces $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ of $\mathbb{R}^D$, where $\mathcal{A}_i = \mathcal{S}_i + \mu_i$, with $\mathcal{S}_i$ a linear subspace of $\mathbb{R}^D$ of codimension $0 < c_i < D$. Then $\mathcal{X}$ is in general position in $\Psi$ if and only if $\tilde{\mathcal{X}}$ is in general position in $\tilde{\Phi}$.

Proof. ($\Rightarrow$) Suppose that $\mathcal{X}$ is in general position in $\Psi$. We need to show that $\tilde{\mathcal{X}}$ is in general position in $\tilde{\Phi}$. In view of Proposition 4.13, and the fact that $\mathcal{I}_{\tilde{\Phi},n} \subset \mathcal{I}_{\tilde{\mathcal{X}},n}$, it is sufficient to show that $\mathcal{I}_{\tilde{\Phi},n} \supset \mathcal{I}_{\tilde{\mathcal{X}},n}$. To that end, let $P$ be a homogeneous polynomial of degree $n$ in $\mathbb{R}[x_0, x_1, \ldots, x_D]$ that vanishes on the points $\tilde{\mathcal{X}}$, i.e., $P \in \mathcal{I}_{\tilde{\mathcal{X}},n}$. Then for every point $\tilde{\alpha} = (1, \alpha_1, \ldots, \alpha_D)$ of $\tilde{\mathcal{X}}$, we have

$$P(\tilde{\alpha}) = P(1, \alpha_1, \ldots, \alpha_D) = P_{(d)}(\alpha_1, \ldots, \alpha_D) = 0, \qquad (4.55)$$

i.e., the dehomogenization $P_{(d)}$ of $P$ vanishes on all points of $\mathcal{X}$, i.e., $P_{(d)} \in \mathcal{I}_{\mathcal{X}}$. There are two possibilities: either $P_{(d)}$ has degree $n$, in which case $P = \big(P_{(d)}\big)^{(h)}$, or $P_{(d)}$ has degree strictly less than $n$, say $n - k$, $k \geq 1$, in which case $P = x_0^k \big(P_{(d)}\big)^{(h)}$. If $P_{(d)}$ has total degree $n$, by the general position assumption on $\mathcal{X}$, $P_{(d)}$ must vanish on $\Psi$. Then by Proposition 4.41, $\big(P_{(d)}\big)^{(h)} \in \mathcal{I}_{\tilde{\Phi},n}$, and so $P \in \mathcal{I}_{\tilde{\Phi},n}$. If $\deg P_{(d)} = n - k$, $k \geq 1$, suppose we can find a linear form $G = \tilde{\zeta}^\top \tilde{x}$ that does not vanish on any of the $\tilde{\mathcal{S}}_i$, $i \in [n]$, and is not divisible by $x_0$. Then $G_{(d)}$ will have degree $1$ and will not vanish on any of the $\mathcal{A}_i$, $i \in [n]$. Also, $G_{(d)}^k P_{(d)}$ has degree $n$ and vanishes on $\mathcal{X}$. Since $\mathcal{X}$ is in general position in $\Psi$, we will have that $G_{(d)}^k P_{(d)}$ vanishes on $\Psi$. Then by Proposition 4.41, $\big(G_{(d)}^k P_{(d)}\big)^{(h)} \in \mathcal{I}_{\tilde{\Phi},n}$.

Since $\mathcal{I}_{\tilde{\Phi}} = \bigcap_{i=1}^n \mathcal{I}_{\tilde{\mathcal{S}}_i}$, we must have that $G^k \big(P_{(d)}\big)^{(h)} \in \mathcal{I}_{\tilde{\mathcal{S}}_i}$, $\forall i \in [n]$. Since $\mathcal{I}_{\tilde{\mathcal{S}}_i}$ is a prime ideal (Proposition 4.73) and $G \notin \mathcal{I}_{\tilde{\mathcal{S}}_i}$, it must be the case that $\big(P_{(d)}\big)^{(h)} \in \mathcal{I}_{\tilde{\mathcal{S}}_i}$, $\forall i \in [n]$, i.e., $\big(P_{(d)}\big)^{(h)} \in \mathcal{I}_{\tilde{\Phi}}$. But $P = x_0^k \big(P_{(d)}\big)^{(h)}$, which shows that $P \in \mathcal{I}_{\tilde{\Phi},n}$.

It remains to be shown that there exists a linear form $G$ non-divisible by $x_0$ that does not vanish on any of the $\tilde{\mathcal{S}}_i$. Suppose this is not true; thus if $G = b^\top x + \alpha x_0$ is a linear form non-divisible by $x_0$, i.e., $b \neq 0$, then $G$ must vanish on some $\tilde{\mathcal{S}}_i$. In particular, for any non-zero vector $b$ of $\mathbb{R}^D$, $b^\top x = b^\top x + 0\,x_0$ must vanish on some $\tilde{\mathcal{S}}_i$. Recall from §4.5.2 that if $u_{i1}, \ldots, u_{i d_i}$ is a basis for $\mathcal{S}_i$, the linear part of $\mathcal{A}_i = \mathcal{S}_i + \mu_i$, then

$$\begin{bmatrix} 1 & 0 & \cdots & 0 \\ \mu_i & u_{i1} & \cdots & u_{i d_i} \end{bmatrix} \qquad (4.56)$$

is a basis for $\tilde{\mathcal{S}}_i$. Since $b^\top x$ vanishes on $\tilde{\mathcal{S}}_i$, it must vanish on each basis vector of $\tilde{\mathcal{S}}_i$.

In particular, $b^\top u_{i1} = \cdots = b^\top u_{i d_i} = 0$, which implies that the linear form $b^\top x$, now viewed as a function on $\mathbb{R}^D$, vanishes on $\mathcal{S}_i$, i.e., $b^\top x \in \mathcal{I}_{\mathcal{S}_i}$. To summarize, we have shown that for every $0 \neq b \in \mathbb{R}^D$, there exists an $i \in [n]$ such that $b^\top x \in \mathcal{I}_{\mathcal{S}_i}$. Taking $b$ equal to the standard basis vector $e_1$ of $\mathbb{R}^D$, we see that the linear form $x_1$ must vanish on some $\mathcal{S}_i$, and similarly for the linear forms $x_2, \ldots, x_D$. This in turn means that the ideal $\mathfrak{m} := \langle x_1, \ldots, x_D \rangle$ generated by the linear forms $x_1, \ldots, x_D$ must lie in the union $\bigcup_{i=1}^n \mathcal{I}_{\mathcal{S}_i}$. But it is known from Proposition 1.11(i) in [1] that if an ideal $\mathfrak{a}$ lies in the union of finitely many prime ideals, then $\mathfrak{a}$ must lie in one of these prime ideals. Applying this result to our case, we see that, since the $\mathcal{I}_{\mathcal{S}_i}$ are prime ideals, $\mathfrak{m} \subset \mathcal{I}_{\mathcal{S}_i}$ for some $i \in [n]$. But this says that for any vector in $\mathcal{S}_i$ all of its coordinates must be zero, i.e., $\mathcal{S}_i = 0$, which


violates the assumption $d_i > 0$, $\forall i \in [n]$. This contradiction proves the existence of our linear form $G$.

($\Leftarrow$) Now suppose that $\tilde{\mathcal{X}}$ is in general position in $\tilde{\Phi}$. We need to show that $\mathcal{X}$ is in general position in $\Psi$. To that end, let $p$ be a vanishing polynomial of $\Psi$ of degree $n$; then clearly $p \in \mathcal{I}_{\mathcal{X}}$. Conversely, let $p \in \mathcal{I}_{\mathcal{X}}$ of degree $n$. Then for each point $\alpha \in \mathcal{X}$

$$0 = p(\alpha) = p(\alpha_1, \ldots, \alpha_D) = p^{(h)}(1, \alpha_1, \ldots, \alpha_D) = p^{(h)}(\tilde{\alpha}), \qquad (4.57)$$

i.e., the homogenization $p^{(h)}$ vanishes on $\tilde{\mathcal{X}}$. By hypothesis $\tilde{\mathcal{X}}$ is in general position in $\tilde{\Phi}$, hence $p^{(h)} \in \mathcal{I}_{\tilde{\Phi},n}$. Then by Proposition 4.41, the dehomogenization of $p^{(h)}$ must vanish on $\Psi$. But notice that $\big(p^{(h)}\big)_{(d)} = p$, and so $p$ vanishes on $\Psi$. $\square$

Our second theorem answers Question 4.31.

Theorem 4.45. Let $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ be a union of $n$ affine subspaces of $\mathbb{R}^D$, with $\mathcal{A}_i = \mathcal{S}_i + \mu_i$ and $\mu_i = B_i a_i$, where $B_i \in \mathbb{R}^{D \times c_i}$ is a basis for $\mathcal{S}_i^\perp$ with $c_i = \operatorname{codim} \mathcal{S}_i$. If $\Phi = \bigcup_{i=1}^n \mathcal{S}_i$ is transversal and $a_1, \ldots, a_n$ do not lie in the zero-measure set of a proper algebraic variety of $\mathbb{R}^{c_1} \times \cdots \times \mathbb{R}^{c_n}$, then $\tilde{\Phi}$ is transversal.

Proof. Let $b_{i1}, \ldots, b_{i c_i}$ be an orthonormal basis for $\mathcal{S}_i^\perp$; then

$$\tilde{b}_{i1}, \ldots, \tilde{b}_{i c_i}, \qquad \tilde{b}_{i j_i} := [\,-b_{i j_i}^\top B_i a_i \;\; b_{i j_i}^\top\,]^\top, \qquad (4.58)$$

is a basis for $\tilde{\mathcal{S}}_i^\perp$. Suppose that $\tilde{\Phi}$ is not transversal. Then there exists some index set


$J \subset [n]$, say without loss of generality $J = \{1, \ldots, \ell\}$, $\ell \leq n$, such that (see §4.7.3)

$$\operatorname{Rank}(\tilde{B}_J) < \min\Big\{ D+1, \sum_{i \in J} c_i \Big\}, \qquad (4.59)$$
$$\tilde{B}_J := \big[ \tilde{B}_1, \ldots, \tilde{B}_\ell \big], \qquad \tilde{B}_i := [\tilde{b}_{i1}, \ldots, \tilde{b}_{i c_i}], \qquad (4.60)$$

where we have used the fact that $\operatorname{codim} \tilde{\mathcal{S}}_i = \operatorname{codim} \mathcal{S}_i = c_i$, $\forall i \in [n]$. Since $\Phi$ is transversal, we must have either $\operatorname{Rank}(B_J) = D$ or $\operatorname{Rank}(B_J) = \sum_{i \in J} c_i$. Suppose the latter condition is true; then $\sum_{i \in J} c_i \leq D$. Then all columns of $B_J$ are linearly independent, which implies that the same will be true for the columns of $\tilde{B}_J$, and so $\operatorname{Rank}(\tilde{B}_J) = \sum_{i \in J} c_i$. Since by hypothesis $\sum_{i \in J} c_i \leq D$, we must have

$$\operatorname{codim} \bigcap_{i \in J} \tilde{\mathcal{S}}_i = \operatorname{Rank}(\tilde{B}_J) = \min\Big\{ D+1, \sum_{i \in J} c_i \Big\}, \qquad (4.61)$$

and so the transversality condition is satisfied for $J$, which is a contradiction to the hypothesis (4.59). Consequently, it must be the case that $\operatorname{Rank}(B_J) = D < \sum_{i \in J} c_i$. Since $B_J$ is a submatrix of $\tilde{B}_J$, we must have that $\operatorname{Rank}(\tilde{B}_J) \geq D$. On the other hand, because of (4.59) we must have $\operatorname{Rank}(\tilde{B}_J) \leq D$, i.e., $\operatorname{Rank}(\tilde{B}_J) = D$. Now $\tilde{B}_J$ is a $(D+1) \times \big(\sum_{i \in J} c_i\big)$ matrix, with the smaller dimension being $(D+1)$. Since its rank is $D$, it must be the case that all $(D+1) \times (D+1)$ minors of $\tilde{B}_J$ vanish. The vanishing of these minors defines an algebraic variety $\mathcal{W}_J$ of the parametric space $\prod_{i=1}^n \mathbb{R}^{c_i}$, and $\tilde{\Phi}$ is non-transversal if and only if $(a_1, \ldots, a_n) \in \mathcal{W} := \bigcup_{J \subset [n]} \mathcal{W}_J$. Since $\mathcal{W}$ is a finite union of algebraic varieties, it must be an algebraic variety itself, i.e., defined by a set of polynomial equations in the variables $a_1, \ldots, a_n$.


One may wonder if some of the $\mu_i$ can be zero and $\tilde{\Phi}$ still be transversal. This depends on the $c_i$, as the next example shows.

Example 4.46. Let $\mathcal{A}_1 = \operatorname{Span}(b_{11}, b_{12})^\perp + \mu_1$ be an affine line and $\mathcal{A}_2 = \operatorname{Span}(b_2)^\perp + \mu_2$ an affine plane of $\mathbb{R}^3$. Suppose that $\Phi = \operatorname{Span}(b_{11}, b_{12})^\perp \cup \operatorname{Span}(b_2)^\perp$ is transversal. Then $\tilde{\Phi} = \tilde{\mathcal{S}}_1 \cup \tilde{\mathcal{S}}_2$ is transversal if and only if the matrix

$$\tilde{B}_{[3]} = \begin{bmatrix} -b_{11}^\top \mu_1 & -b_{12}^\top \mu_1 & -b_2^\top \mu_2 \\ b_{11} & b_{12} & b_2 \end{bmatrix} \in \mathbb{R}^{4 \times 3} \qquad (4.62)$$

has rank $3$. But $\operatorname{Rank}\big(\tilde{B}_{[3]}\big) = 3$ irrespectively of what the $\mu_i$ are, simply because the matrix $B_{[3]} = [b_{11} \; b_{12} \; b_2]$ is full rank (by the transversality assumption on $\Phi$). Now let us

replace the affine plane $\mathcal{A}_2$ with a second affine line $\mathcal{A}_2 = \operatorname{Span}(b_{21}, b_{22})^\perp + \mu_2$. Then $\tilde{\Phi}$ is transversal if and only if

$$\tilde{B}_{[3]} = \begin{bmatrix} -b_{11}^\top \mu_1 & -b_{12}^\top \mu_1 & -b_{21}^\top \mu_2 & -b_{22}^\top \mu_2 \\ b_{11} & b_{12} & b_{21} & b_{22} \end{bmatrix} \in \mathbb{R}^{4 \times 4} \qquad (4.63)$$

has rank $4$, which is impossible if both $\mu_1, \mu_2$ are zero.
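The rank test of Example 4.46 is easy to check numerically. The sketch below (a toy numpy construction analogous to the second case of the example; the specific lines and all names are assumptions of this sketch) builds the matrix of embedded normals for two affine lines in $\mathbb{R}^3$ and compares generic translations with $\mu_1 = \mu_2 = 0$.

```python
# Checking transversality of the embedded arrangement via the rank of the
# matrix of embedded normals, cf. (4.58) and (4.63).  Toy data assumed.
import numpy as np

rng = np.random.default_rng(2)

def B_tilde(pairs):
    # Stack the embedded normals b~ = [-b^T mu ; b] for each (b, mu), cf. (4.49).
    return np.column_stack([np.concatenate([[-b @ mu], b]) for b, mu in pairs])

b11, b12 = np.array([1., 0, 0]), np.array([0., 1, 0])   # A_1: an affine line along e3
b21, b22 = np.array([0., 1, 0]), np.array([0., 0, 1])   # A_2: an affine line along e1
mu1, mu2 = rng.standard_normal(3), rng.standard_normal(3)

Bt = B_tilde([(b11, mu1), (b12, mu1), (b21, mu2), (b22, mu2)])   # 4 x 4
print(np.linalg.matrix_rank(Bt))     # 4 for generic translations: transversal
Bt0 = B_tilde([(b11, 0 * mu1), (b12, 0 * mu1), (b21, 0 * mu2), (b22, 0 * mu2)])
print(np.linalg.matrix_rank(Bt0))    # 3 when mu_1 = mu_2 = 0: not transversal
```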

As a corollary of Theorems 4.6, 4.44 and 4.45, we get the correctness theorem of ASC for the case of affine subspaces, whose proof is straightforward and is thus omitted.

Theorem 4.47. Let $\Psi = \bigcup_{i=1}^n \mathcal{A}_i$ be a union of affine subspaces of $\mathbb{R}^D$, with $\mathcal{A}_i = \mathcal{S}_i + \mu_i$ and $\mu_i = B_i a_i$, where $B_i \in \mathbb{R}^{D \times c_i}$ is a basis for $\mathcal{S}_i^\perp$ with $c_i = \operatorname{codim} \mathcal{S}_i$. Let $\tilde{\Phi} = \bigcup_{i=1}^n \tilde{\mathcal{S}}_i$ be the union of $n$ linear subspaces of $\mathbb{R}^{D+1}$ induced by the embedding $\phi_0 : \mathbb{R}^D \hookrightarrow \mathbb{R}^{D+1}$ of (4.37). Let $\mathcal{X}$ be a finite subset of $\Psi$ and denote by $\tilde{\mathcal{X}} \subset \tilde{\Phi}$ the image of $\mathcal{X}$ under $\phi_0$.


Let $p_1, \ldots, p_s$ be a basis for $\mathcal{I}_{\tilde{\mathcal{X}},n}$, the vector space of homogeneous polynomials of degree $n$ that vanish on $\tilde{\mathcal{X}}$. Let $x \in \mathcal{X} \cap \mathcal{A}_1 - \bigcup_{i>1} \mathcal{A}_i$, and denote $\tilde{x}_1 = \phi_0(x)$. Define

$$\tilde{b}_k := \nabla p_k |_{\tilde{x}_1} \in \mathbb{R}^{D+1}, \qquad k = 1, \ldots, s, \qquad (4.64)$$

and without loss of generality, let $\tilde{b}_1, \ldots, \tilde{b}_\ell$ be a maximal linearly independent subset of $\tilde{b}_1, \ldots, \tilde{b}_s$. Define further $(\gamma_k, b_k) \in \mathbb{R} \times \mathbb{R}^D$ and $(\gamma_1, B_1) \in \mathbb{R}^\ell \times \mathbb{R}^{D \times \ell}$ as

$$\tilde{b}_k =: \begin{bmatrix} \gamma_k \\ b_k \end{bmatrix}, \quad k = 1, \ldots, \ell, \qquad (4.65)$$
$$\gamma_1 := [\gamma_1, \ldots, \gamma_\ell]^\top, \qquad B_1 := [b_1, \ldots, b_\ell]. \qquad (4.66)$$

If $\mathcal{X}$ is in general position in $\Psi$, $\Phi = \bigcup_{i=1}^n \mathcal{S}_i$ is transversal, and $a_1, \ldots, a_n$ do not lie in the zero-measure set of a proper algebraic variety of $\mathbb{R}^{c_1} \times \cdots \times \mathbb{R}^{c_n}$, then

$$\mathcal{A}_1 = \operatorname{Span}(B_1)^\perp - B_1 \big(B_1^\top B_1\big)^{-1} \gamma_1. \qquad (4.67)$$

Remark 4.48. The acute reader may notice that we still need to answer the question of whether $\Psi$ admits a finite subset $\mathcal{X}$ in general position, to begin with. This answer is affirmative: if $\Psi$ satisfies the hypothesis of Theorem 4.45, then $\tilde{\Phi}$ will be transversal, and so by Proposition 4.41 $\mathcal{I}_\Psi$ is generated in degree $\leq n$, in which case the existence of $\mathcal{X}$ follows from Theorem 2.9 in [72].

4.6 Conclusions

In this chapter of the thesis we revisited the algebraic algorithm for learning a union of linear subspaces and advanced its state-of-the-art both theoretically and algorithmically. Our main theoretical contribution was to introduce the idea of filtrations of subspace arrangements in the context of subspace clustering, and to use it to develop a new algebraic algorithm called Filtrated Algebraic Subspace Clustering (FASC). The main advantage of FASC over the classic polynomial differentiation algebraic algorithm is that its implementation does not require estimating the rank of a matrix. As a consequence, the numerical adaptation of FASC, called Filtrated Spectral Algebraic Subspace Clustering (FSASC), not only dramatically improves over the performance of earlier algebraic clustering algorithms, but on several occasions involving both synthetic and real data it is also competitive with state-of-the-art subspace clustering algorithms. As a second theoretical contribution, we demonstrated theoretically the correctness of algebraically clustering affine subspaces by means of homogeneous coordinates.

Overall, this chapter addressed one of the two main issues associated with Algebraic Subspace Clustering (ASC), i.e., its robustness to noise. Still, the second issue of ASC, i.e., its exponential complexity, remains an open problem, which we hope will be resolved by future research, perhaps by means of ideas such as those developed in Chapter 3 of this thesis.

4.7 Appendix

4.7.1 Notions from commutative algebra

A central concept in the theory of polynomial algebra is that of an ideal:


Definition 4.49 (Ideal). A subset $\mathcal{I}$ of the ring $\mathbb{R}[x] := \mathbb{R}[x_1, \ldots, x_D]$ of polynomials is called an ideal if for every $p, q \in \mathcal{I}$ and every $r \in \mathbb{R}[x]$ we have that $p + q \in \mathcal{I}$ and $rp \in \mathcal{I}$.

If $p_1, \ldots, p_n$ are elements of $\mathbb{R}[x]$, then the ideal generated by these elements is the set of all linear combinations of the $p_i$ with coefficients in $\mathbb{R}[x]$.

A polynomial $f \in \mathbb{R}[x]$ is called homogeneous of degree $r$ if all the monomials that appear in $f$ have degree $r$. An ideal $\mathcal{I}$ is called homogeneous if it is generated by homogeneous elements, i.e., $\mathcal{I} = \langle f_1, \ldots, f_s \rangle$ where $f_i$ is a homogeneous polynomial of degree $r_i$. The reader can check that an ideal $\mathcal{I}$ is homogeneous if and only if $\mathcal{I} = \bigoplus_{k \geq 0} \mathcal{I}_k$, where $\mathcal{I}_k = \mathcal{I} \cap \mathbb{R}[x]_k$. It is not hard to see that the intersection and the sum of two (homogeneous) ideals is a (homogeneous) ideal. In performing algebraic operations with ideals it is also

useful to have a notion of product of ideals:

Definition 4.50 (Product of ideals). Let $\mathcal{I}_1, \mathcal{I}_2$ be ideals of $\mathbb{R}[x]$. The product $\mathcal{I}_1 \mathcal{I}_2$ of $\mathcal{I}_1, \mathcal{I}_2$ is defined to be the set of all elements of the form $p_1 q_1 + \cdots + p_m q_m$ for any $m \in \mathbb{N}$, $p_i \in \mathcal{I}_1$, $q_i \in \mathcal{I}_2$.

The notion of a prime ideal is a natural generalization of the notion of a prime number.

Prime ideals play a fundamental role in the study of the structure of general ideals, in

analogy to the role that prime numbers have in the structure of integers.

Definition 4.51 (Prime ideal). An ideal $\mathfrak{p}$ of $\mathbb{R}[x]$ is called prime if whenever $pq \in \mathfrak{p}$ for some $p, q \in \mathbb{R}[x]$, then either $p \in \mathfrak{p}$ or $q \in \mathfrak{p}$.

We note that if $\mathfrak{p}$ is a homogeneous ideal, then in order to check whether $\mathfrak{p}$ is prime, it is enough to consider homogeneous polynomials $f, g$ in the above definition.

Proposition 4.52. Let $\mathfrak{p}, \mathcal{I}_1, \ldots, \mathcal{I}_n$ be ideals of $\mathbb{R}[x]$ with $\mathfrak{p}$ being prime. If $\mathfrak{p} \supset \mathcal{I}_1 \cap \cdots \cap \mathcal{I}_n$, then $\mathfrak{p} \supset \mathcal{I}_i$ for some $i \in [n]$.

Proof. Suppose $\mathfrak{p} \not\supset \mathcal{I}_i$ for all $i$. Then for every $i$ there exists $x_i \in \mathcal{I}_i - \mathfrak{p}$. But then $\prod_{i=1}^n x_i \in \bigcap_{i=1}^n \mathcal{I}_i \subset \mathfrak{p}$ and, since $\mathfrak{p}$ is prime, some $x_j \in \mathfrak{p}$, a contradiction.

A final notion that we need is that of a radical ideal:

Definition 4.53. An ideal $\mathcal{I}$ of $\mathbb{R}[x]$ is called radical if whenever some $p \in \mathbb{R}[x]$ satisfies $p^\ell \in \mathcal{I}$ for some $\ell$, then it must be the case that $p \in \mathcal{I}$.

Radical ideals have a very nice structure:

Theorem 4.54. Every radical ideal $\mathcal{I}$ of $\mathbb{R}[x]$ can be written uniquely as the finite intersection of prime ideals. Conversely, the intersection of a finite number of prime ideals is

always a radical ideal.

For further information on commutative algebra we refer the reader to [1] and [25] or

to the more advanced treatment of [73].

4.7.2 Notions from algebraic geometry

The central object of algebraic geometry is that of an algebraic variety:


Definition 4.55 (Algebraic variety). A subset $\mathcal{Y}$ of $\mathbb{R}^D$ is called an algebraic variety or algebraic set if it is the zero-locus of some ideal $\mathfrak{a}$ of $\mathbb{R}[x]$, i.e.,

$$\mathcal{Y} = \big\{ y \in \mathbb{R}^D : p(y) = 0, \; \forall p \in \mathfrak{a} \big\}. \qquad (4.68)$$

A standard notation is to write $\mathcal{Y} = \mathcal{Z}(\mathfrak{a})$, where the operator $\mathcal{Z}(\cdot)$ denotes zero set.

If $\mathcal{Y} = \mathcal{Z}(\mathfrak{a})$ is an algebraic variety, then certainly every polynomial of $\mathfrak{a}$ vanishes on the entire $\mathcal{Y}$ (by definition). However, there may be more polynomials with that property, and they have a special name:

Definition 4.56 (Vanishing ideal). The vanishing ideal of a subset $\mathcal{Y}$ of $\mathbb{R}^D$, denoted $\mathcal{I}_{\mathcal{Y}}$, is the set of all polynomials of $\mathbb{R}[x]$ that vanish on every point of $\mathcal{Y}$, i.e.,

$$\mathcal{I}_{\mathcal{Y}} = \{ p \in \mathbb{R}[x] : p(y) = 0, \; \forall y \in \mathcal{Y} \}. \qquad (4.69)$$

It can be shown that the algebraic varieties induce a topology on $\mathbb{R}^D$:

Definition 4.57 (Zariski topology). The Zariski topology on $\mathbb{R}^D$ is the topology generated by defining the closed sets to be all the algebraic varieties.

Applying the definition of an irreducible topological space in the context of the Zariski topology, we obtain:

Definition 4.58 (Irreducible algebraic variety). An algebraic variety $\mathcal{Y}$ is called irreducible if it cannot be written as the union of two proper subsets of $\mathcal{Y}$ that are closed in the induced topology of $\mathcal{Y}$.$^{18}$

$^{18}$We note that certain authors (e.g. [48]) reserve the term algebraic variety to refer to an irreducible closed set.


The following theorem is one of many interesting connections between geometry and

algebra:

Theorem 4.59. An algebraic variety $\mathcal{Y} = \mathcal{Z}(\mathfrak{a})$ is irreducible if and only if its vanishing ideal $\mathcal{I}_{\mathcal{Y}}$ is prime.

Perhaps not surprisingly, irreducible varieties are the fundamental building blocks of

general varieties:

Theorem 4.60 (Irreducible decomposition). Every algebraic variety $\mathcal{Y}$ of $\mathbb{R}^D$ can be uniquely written as $\mathcal{Y} = \mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_n$, where the $\mathcal{Y}_i$ are irreducible varieties and there are no inclusions $\mathcal{Y}_i \subset \mathcal{Y}_j$ for $i \neq j$. The varieties $\mathcal{Y}_i$ are referred to as the irreducible components of $\mathcal{Y}$.

Proposition 4.61. If $\mathcal{Y}_1 = \mathcal{Z}(\mathfrak{a}_1)$, $\mathcal{Y}_2 = \mathcal{Z}(\mathfrak{a}_2)$ are algebraic varieties such that $\mathfrak{a}_1 \subset \mathfrak{a}_2$, then $\mathcal{Y}_1 \supset \mathcal{Y}_2$.

Theorem 4.62. If two subsets $\mathcal{Y}_1, \mathcal{Y}_2$ of $\mathbb{R}^D$ satisfy the inclusion $\mathcal{Y}_1 \supset \mathcal{Y}_2$, then their vanishing ideals will satisfy the reverse inclusion $\mathcal{I}_{\mathcal{Y}_1} \subset \mathcal{I}_{\mathcal{Y}_2}$.

Proposition 4.63. Let $\mathcal{Y}_1 = \mathcal{Z}(\mathfrak{a}_1)$, $\mathcal{Y}_2 = \mathcal{Z}(\mathfrak{a}_2)$ be varieties of $\mathbb{R}^D$. Then $\mathcal{Y}_1 \cap \mathcal{Y}_2 = \mathcal{Z}(\mathfrak{a}_1 + \mathfrak{a}_2)$.

The final theorem that we present characterizes the set of all points that arise as the zero set of the vanishing ideal of an arbitrary subset $\mathcal{Y}$ of $\mathbb{R}^D$.

Proposition 4.64. Let $\mathcal{Y}$ be a subset of $\mathbb{R}^D$ and $\mathcal{I}_{\mathcal{Y}}$ its vanishing ideal. Then $\mathcal{Z}(\mathcal{I}_{\mathcal{Y}}) = \mathcal{Y}^{cl}$, where $\mathcal{Y}^{cl}$ is the topological closure of $\mathcal{Y}$ in the Zariski topology.


Finally, it should be noted that most of classic and modern algebraic geometry [48] assumes that the underlying field (in this thesis $\mathbb{R}$) is algebraically closed [59]. An example of an algebraically closed field is the complex numbers $\mathbb{C}$. Consequently, one should be careful when using results such as Hilbert's Nullstellensatz.

4.7.3 Subspace arrangements and their vanishing ideals

We begin by defining the main mathematical object of interest in this chapter.

Definition 4.65 (Subspace arrangement). A union $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$ of linear subspaces $\mathcal{S}_1, \ldots, \mathcal{S}_n$ of $\mathbb{R}^D$, with $D \geq 1$, $n \geq 1$, is called a subspace arrangement.

It is often technically convenient to work with subspace arrangements that are as general as possible. One way to capture this notion is by the following definition.

Definition 4.66 (Transversal subspace arrangement [20]). A subspace arrangement $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i \subset \mathbb{R}^D$ is called transversal if for any subset $I$ of $[n]$, the codimension of $\bigcap_{i \in I} \mathcal{S}_i$ is the minimum between $D$ and the sum of the codimensions of all $\mathcal{S}_i$, $i \in I$, i.e.,

$$\operatorname{codim}\left( \bigcap_{i \in I} \mathcal{S}_i \right) = \min\Big\{ D, \sum_{i \in I} c_i \Big\}, \qquad (4.70)$$

where $c_i = \operatorname{codim} \mathcal{S}_i$.

Transversality is a geometric condition on the subspaces $\mathcal{S}_1, \ldots, \mathcal{S}_n$ that requires all possible intersections among the subspaces to be as small as possible, as allowed by the dimensions of the subspaces. To see this, let $I$ be a subset of $[n]$, which without loss of


generality can be taken to be $I = \{1, 2, \ldots, \ell\} = [\ell]$, where $\ell \leq n$. For every $i \in I$ let $B_i$ be a $D \times c_i$ matrix whose columns form a basis for $\mathcal{S}_i^\perp$, where $c_i = \operatorname{codim} \mathcal{S}_i := D - \dim \mathcal{S}_i$, and let $B = [B_1 \ \ldots \ B_\ell]$. Then the intersection $\bigcap_{i \in I} \mathcal{S}_i$ can be described algebraically as

$$x \in \bigcap_{i \in I} \mathcal{S}_i \;\Leftrightarrow\; B^\top x = 0. \qquad (4.71)$$

From (4.71) it is clear that the dimension of $\bigcap_{i \in I} \mathcal{S}_i$ is equal to the dimension of the right nullspace of $B$, or equivalently

$$\operatorname{codim}\left( \bigcap_{i \in I} \mathcal{S}_i \right) = \operatorname{Rank}(B). \qquad (4.72)$$

Now, $B$ is a $D \times \big(\sum_{i \in I} c_i\big)$ matrix and so its rank will satisfy

$$\operatorname{Rank}(B) \leq \min\Big\{ D, \sum_{i \in I} c_i \Big\}, \qquad (4.73)$$

which in conjunction with (4.72) justifies the geometric interpretation of Definition 4.4.

In fact, if $\mathcal{A}$ is not transversal, then there exists some subset $I \subset [n]$ for which $B$ is rank-deficient, which shows that certain algebraic relations must be satisfied among the parametrizations $B_1, \ldots, B_n$ of the subspaces $\mathcal{S}_1, \ldots, \mathcal{S}_n$. This is essentially the argument behind the proof of the next proposition, which shows that transversality is not a strong condition; rather, it will be satisfied almost surely.

Proposition 4.67. Let $\mathcal{A}$ be a subspace arrangement consisting of $n$ linear subspaces of $\mathbb{R}^D$ of dimensions $d_1, \ldots, d_n$. If $\mathcal{A}$ is chosen uniformly at random, then $\mathcal{A}$ will be transversal with probability $1$.


Example 4.68. An arrangement $\mathcal{A} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \mathcal{S}_3 \subset \mathbb{R}^D$ such that $\mathcal{S}_1 \subset \mathcal{S}_2$ is non-transversal, since $\operatorname{codim}(\mathcal{S}_1 \cap \mathcal{S}_2) = \operatorname{codim} \mathcal{S}_1 = c_1 < \min\{D, c_1 + c_2\}$. Note that when choosing $\mathcal{S}_1, \mathcal{S}_2, \mathcal{S}_3$ uniformly at random, the event $\mathcal{S}_1 \subset \mathcal{S}_2$ has probability zero.

Example 4.69. An arrangement of three planes $\mathcal{A} = \mathcal{H}_1 \cup \mathcal{H}_2 \cup \mathcal{H}_3$ of $\mathbb{R}^3$ that intersect on a line is non-transversal, because $\operatorname{codim}(\mathcal{H}_1 \cap \mathcal{H}_2 \cap \mathcal{H}_3) = 2 < \min\{3, 1+1+1\}$. When $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3$ are chosen uniformly at random, which is equivalent to choosing their normal vectors $b_1, b_2, b_3$ uniformly at random, the three planes intersect on a line only if $b_1, b_2, b_3$ are linearly dependent, which is a probability zero event.
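The rank characterization (4.70)-(4.72) also gives a direct numerical test of transversality. The following sketch (names and the random construction are assumptions of this sketch, using numpy) checks Definition 4.66 for a random arrangement, which is transversal almost surely as in Proposition 4.67, and for a degenerate arrangement with an inclusion as in Example 4.68.

```python
# Numerically checking transversality via codim(intersection) = Rank(B), cf. (4.72).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
D = 4
codims = [1, 2, 2]                                     # c_i = codim S_i
Bs = [np.linalg.qr(rng.standard_normal((D, c)))[0] for c in codims]  # bases of S_i^perp

def is_transversal(Bs, D):
    # Definition 4.66: for every index set I, Rank([B_i]_{i in I}) = min{D, sum c_i}.
    for r in range(1, len(Bs) + 1):
        for I in combinations(range(len(Bs)), r):
            B = np.hstack([Bs[i] for i in I])
            expected = min(D, sum(Bs[i].shape[1] for i in I))
            if np.linalg.matrix_rank(B) != expected:
                return False
    return True

print(is_transversal(Bs, D))                           # True almost surely
# A degenerate arrangement: the second subspace shares a normal direction with
# the first (so one subspace contains the other, as in Example 4.68).
Bs_bad = [Bs[0], np.hstack([Bs[0], Bs[1][:, :1]]), Bs[2]]
print(is_transversal(Bs_bad, D))                       # False
```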

Another notion of subspace arrangements in general position that is closely related to transversal arrangements is that of linearly general subspaces.

Definition 4.70 (Linearly general subspace arrangement [16]). A subspace arrangement

$\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$ is called linearly general if for every subset $I \subset [n]$ we have

$$\dim\left( \sum_{i \in I} \mathcal{S}_i \right) = \min\Big\{ D, \sum_{i \in I} d_i \Big\}, \qquad (4.74)$$

where $d_i = \dim \mathcal{S}_i$.

As the reader may suspect, the notions of transversal and linearly general are dual to each other in the following sense.

Proposition 4.71. A subspace arrangement $\bigcup_{i=1}^n \mathcal{S}_i$ is transversal if and only if the subspace arrangement $\bigcup_{i=1}^n \mathcal{S}_i^\perp$ is linearly general.


Proof. This follows by noting that, with reference to the matrix $B$ constructed below Definition 4.4, we have

$$\operatorname{codim}\left( \bigcap_{i \in I} \mathcal{S}_i \right) = \operatorname{Rank}(B) = \dim\left( \sum_{i \in I} \mathcal{S}_i^\perp \right), \qquad (4.75)$$

and that $\operatorname{codim} \mathcal{S}_i = \dim \mathcal{S}_i^\perp$.

In order to understand some important properties of subspace arrangements, it is necessary to examine the algebraic-geometric properties of a single subspace $\mathcal{S}$ of $\mathbb{R}^D$ of dimension $d$. Let $b_1, \ldots, b_c$ be a basis for the orthogonal complement of $\mathcal{S}$, where $c = D - d$, and define the polynomials $p_i(x) = b_i^\top x$, $i = 1, \ldots, c$. Notice that $p_i(x)$ is homogeneous

of degree $1$ and is thus also referred to as a linear form. If a point $x$ belongs to $\mathcal{S}$, then $p_i(x) = 0$, $\forall i$. Conversely, if a point $x \in \mathbb{R}^D$ satisfies $p_i(x) = 0$, $\forall i$, then $x \in \mathcal{S}$. This shows that $\mathcal{S} = \mathcal{Z}(p_1, \ldots, p_c)$, i.e., $\mathcal{S}$ is an algebraic variety. Notice that the set of linear forms that vanish on $\mathcal{S}$ is a vector space and the polynomials $p_i$, $i = 1, \ldots, c$, form a basis. The proposition that follows asserts that the vanishing ideal of $\mathcal{S}$, i.e., the set of all polynomials that vanish at every point of $\mathcal{S}$, is in fact generated by the polynomials $p_i(x)$, $i = 1, \ldots, c$.

Proposition 4.72 (Vanishing Ideal of a Subspace). Let $\mathcal{S} = \operatorname{Span}(b_1, \ldots, b_c)^\perp$ be a subspace of $\mathbb{R}^D$ defined as the orthogonal complement of the space spanned by $\{b_1, \ldots, b_c\}$

over $\mathbb{R}$. Then $\mathcal{I}_{\mathcal{S}}$ is generated over $\mathbb{R}[x]$ by the linear forms $b_1^\top x, \ldots, b_c^\top x$.

Proof. Let $\{b_1, \ldots, b_c\}$ be a basis for the orthogonal complement of $\mathcal{S}$ and augment it to a basis $\{b_1, \ldots, b_c, h_1, \ldots, h_{D-c}\}$ of $\mathbb{R}^D$, where $h_1, \ldots, h_{D-c}$ is a basis for $\mathcal{S}$. Now define a transformation $\phi : \mathbb{R}^D \rightarrow \mathbb{R}^D$, which maps the basis $\{b_1, \ldots, b_c, h_1, \ldots, h_{D-c}\}$ to the canonical basis $\{e_1, \ldots, e_D\}$ of $\mathbb{R}^D$, where $e_i$ is the $i$-th column of the $D \times D$ identity matrix. Notice that $b_i$ is mapped to $e_i$ and as a consequence $\mathcal{S}$ is mapped to the orthogonal complement of the vectors $e_1, \ldots, e_c$. Since $\phi$ is a vector space isomorphism, we do not lose generality if we assume from the beginning that

$\mathcal{S} = \operatorname{Span}(e_1, \ldots, e_c)^\perp = \operatorname{Span}(e_{c+1}, \ldots, e_D)$ and the vector space of linear forms that vanish on $\mathcal{S}$ is generated by $x_1, \ldots, x_c$. Notice that $x \in \mathcal{S}$ if and only if the first $c$ coordinates of $x$ are zero.

Now let $g \in \mathcal{I}_{\mathcal{S}}$. We can write $g(x) = \bar{g}(x_{c+1}, \ldots, x_D) + \sum_{i=1}^c x_i g_i(x)$. By hypothesis we have $g(0, \ldots, 0, a_{c+1}, \ldots, a_D) = 0$ for any real numbers $a_{c+1}, \ldots, a_D$, which implies

that $\bar{g}(a_{c+1}, \ldots, a_D) = 0$, $\forall a_{c+1}, \ldots, a_D \in \mathbb{R}$. This in turn implies that $\bar{g}$ is the zero polynomial.$^{19}$ Hence $g(x) = \sum_{i=1}^c x_i g_i(x)$, which shows that $g$ is inside the ideal generated by the linear forms that vanish on $\mathcal{S}$.

In algebraic-geometric notation, the above proposition can be concisely stated as

$$\mathcal{I}_{\mathcal{Z}(b_1^\top x, \ldots, b_c^\top x)} = \langle b_1^\top x, \ldots, b_c^\top x \rangle. \qquad (4.76)$$

Interestingly, the vanishing ideal of a subspace is a prime ideal:

Proposition 4.73. Let $\mathcal{S}$ be a subspace of $\mathbb{R}^D$. Then $\mathcal{S}$ is irreducible in the Zariski topology of $\mathbb{R}^D$, or equivalently, $\mathcal{I}_{\mathcal{S}}$ is a prime ideal of $\mathbb{R}[x]$.

$^{19}$We can prove by induction on $d$ that if $F$ is an infinite field and $g(x_1, \ldots, x_d) = 0$, $\forall x_1, \ldots, x_d \in F$, then $g = 0$.


Proof. As in the proof of Proposition 4.72 we can assume that $(x_1, \ldots, x_c)$ is a basis for the linear forms of $\mathbb{R}[x]$ that vanish on $\mathcal{S}$. Then $\mathcal{I}_{\mathcal{S}} = \langle x_1, \ldots, x_c \rangle$ and our task is to show that $\mathcal{I}_{\mathcal{S}}$ is prime. So let $f, g$ be homogeneous polynomials such that $fg \in \mathcal{I}_{\mathcal{S}}$ and suppose that $f \notin \mathcal{I}_{\mathcal{S}}$. We will show that $g \in \mathcal{I}_{\mathcal{S}}$. We can write $f = f_1 + f_2$, where $f_1, f_2$ are polynomials such that $f_1 \in \mathcal{I}_{\mathcal{S}}$ and $f_2 \in \mathbb{R}[x_{c+1}, \ldots, x_D]$. Similarly $g = g_1 + g_2$, with $g_1 \in \mathcal{I}_{\mathcal{S}}$ and $g_2 \in \mathbb{R}[x_{c+1}, \ldots, x_D]$. Since by hypothesis $f \notin \mathcal{I}_{\mathcal{S}}$, it must be the case that $f_2 \neq 0$. To show that $g \in \mathcal{I}_{\mathcal{S}}$, it is enough to show that $g_2 = 0$.

Towards that end, notice that $fg = (f g_1 + f_1 g_2) + f_2 g_2$, where $f g_1 + f_1 g_2 \in \mathcal{I}_{\mathcal{S}}$. Since by hypothesis $fg \in \mathcal{I}_{\mathcal{S}}$, we also have that $f_2 g_2 \in \mathcal{I}_{\mathcal{S}}$. This means that there exist polynomials $h_1, \ldots, h_c \in \mathbb{R}[x_1, \ldots, x_D]$, such that $f_2 g_2 = x_1 h_1 + \cdots + x_c h_c$. However, none of the variables $x_1, \ldots, x_c$ appear on the left hand side of this equation, and so this equation is true only when both sides are equal to zero. Since by hypothesis $f_2 \neq 0$, this implies that $g_2 = 0$, and so $g \in \mathcal{I}_{\mathcal{S}}$.

Alternative proof: A more direct proof exists if we assume familiarity of the reader with quotient rings. In particular, it is known that an ideal $\mathcal{I}$ of a commutative ring $R$ is prime if and only if the quotient ring $R/\mathcal{I}$ has no zero-divisors [1]. By noticing that $\mathbb{R}[x_1, \ldots, x_D] / \langle x_1, \ldots, x_c \rangle \cong \mathbb{R}[x_{c+1}, \ldots, x_D]$ we immediately see that $\langle x_1, \ldots, x_c \rangle$ is prime.

Returning to subspace arrangements, we see that a subspace arrangement $\mathcal{A} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_n$ is the union of irreducible algebraic varieties $\mathcal{S}_1, \ldots, \mathcal{S}_n$. This immediately suggests that the subspace arrangement itself is an algebraic variety. This was established


in [72] via an alternative argument. Additionally, in view of Theorem 4.60, the irreducible

components of $\mathcal{A}$ are precisely its constituent subspaces $\mathcal{S}_1, \ldots, \mathcal{S}_n$, which also proves that a subspace arrangement can be uniquely written as the union of subspaces among which

there are no inclusions. We summarize these observations in the following theorem:

Theorem 4.74. Let $\mathcal{S}_1, \ldots, \mathcal{S}_n$ be subspaces of $\mathbb{R}^D$ such that no inclusions exist between any two subspaces. Then the arrangement $\mathcal{A} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_n$ is an algebraic variety and its irreducible components are $\mathcal{S}_1, \ldots, \mathcal{S}_n$.

The vanishing ideal of a subspace arrangement $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$ is readily seen to relate to the vanishing ideals of its irreducible components via the formula

$$\mathcal{I}_{\mathcal{A}} = \mathcal{I}_{\mathcal{S}_1} \cap \cdots \cap \mathcal{I}_{\mathcal{S}_n}. \qquad (4.77)$$

Since $\mathcal{I}_{\mathcal{S}_i}$ is a prime ideal, Theorem 4.54 implies that $\mathcal{I}_{\mathcal{A}}$ is radical and that $\mathcal{A}$ uniquely determines the ideals $\mathcal{I}_{\mathcal{S}_1}, \ldots, \mathcal{I}_{\mathcal{S}_n}$, assuming that there are no inclusions between the subspaces. Hence, retrieving the irreducible components of a subspace arrangement is equivalent to computing the prime factors of its vanishing ideal $\mathcal{I}_{\mathcal{A}}$.

Since the ideal of a single subspace $\mathcal{S}_1$ is generated by linear forms, i.e., it is generated in degree $1$, one may be tempted to conjecture that the ideal $\mathcal{I}_{\mathcal{A}}$ of a union of $n$ subspaces is generated in degree less than or equal to $n$. In fact, this is true:

Proposition 4.75. Let $\mathcal{A}$ be an arrangement of $n$ linear subspaces of $\mathbb{R}^D$. Then its vanishing ideal $\mathcal{I}_{\mathcal{A}}$ is generated in degree $\leq n$.


Proof. By [21] the Castelnuovo-Mumford regularity$^{20}$ of $\mathcal{I}_{\mathcal{A}}$ is bounded above by $n$. But by construction, the CM-regularity of an ideal bounds from above the maximal degree of a

generator of the ideal.

A crucial property of a subspace arrangement $\mathcal{A}$ in relation to the theory of Algebraic Subspace Clustering is that for any non-zero vanishing polynomial $p$ on $\mathcal{A}$, the orthogonal complement of the space spanned by the gradient of $p$ at some point $x \in \mathcal{A}$ contains the subspace to which $x$ belongs.

Proposition 4.76. Let $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$ be a subspace arrangement of $\mathbb{R}^D$, $p \in \mathcal{I}_{\mathcal{A}}$ and $x \in \mathcal{A}$, say $x \in \mathcal{S}_i$ for some $i \in [n]$. Then $\nabla p|_x \perp \mathcal{S}_i$.

Proof. Take $p \in \mathcal{I}_{\mathcal{A}}$. From $\mathcal{I}_{\mathcal{A}} = \mathcal{I}_{\mathcal{S}_1} \cap \cdots \cap \mathcal{I}_{\mathcal{S}_n}$ we have that $\mathcal{I}_{\mathcal{A}} \subset \mathcal{I}_{\mathcal{S}_i}$. Hence $p \in \mathcal{I}_{\mathcal{S}_i}$. Now, from Proposition 4.72 we know that $\mathcal{I}_{\mathcal{S}_i}$ is generated by a basis among all linear forms that vanish on $\mathcal{S}_i$, i.e., by a basis of $\mathcal{I}_{\mathcal{S}_i,1}$. If $(b_{i1}, \ldots, b_{i c_i})$ is an $\mathbb{R}$-basis for $\mathcal{S}_i^\perp$, then $(b_{i1}^\top x, \ldots, b_{i c_i}^\top x)$ is an $\mathbb{R}$-basis for $\mathcal{I}_{\mathcal{S}_i,1}$ and a set of generators for $\mathcal{I}_{\mathcal{S}_i}$ over $\mathbb{R}[x]$. Hence we can write $p(x) = \sum_{j=1}^{c_i} (b_{ij}^\top x)\, g_j(x)$, where $g_j(x) \in \mathbb{R}[x]$. Taking the gradient of both sides of the above equation we get $\nabla p = \sum_{j=1}^{c_i} g_j(x)\, b_{ij} + \sum_{j=1}^{c_i} (b_{ij}^\top x)\, \nabla g_j$. Now let $x$ be any point of $\mathcal{S}_i$. Evaluating both sides at $x$ we have $\nabla p|_x = \sum_{j=1}^{c_i} g_j(x)\, b_{ij} + \sum_{j=1}^{c_i} (b_{ij}^\top x)\, \nabla g_j|_x$. By hypothesis we have $b_{ij}^\top x = 0$, $\forall j$, and so we obtain $\nabla p|_x = \sum_{j=1}^{c_i} g_j(x)\, b_{ij} \in \mathcal{S}_i^\perp$.

One may wonder when it is the case that the gradient of a vanishing polynomial on a subspace arrangement $\mathcal{A}$ is zero at every point of $\mathcal{A}$. This is answered by

$^{20}$Please see [25], [16], [21] or [20] for the definition of Castelnuovo-Mumford regularity.


Proposition 4.77. Let $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$ be a subspace arrangement of $\mathbb{R}^D$ and let $p \in \mathcal{I}_{\mathcal{A}}$. Then $\nabla p|_x = 0$, $\forall x \in \mathcal{A}$, if and only if $p \in \bigcap_{i=1}^n \mathcal{I}_{\mathcal{S}_i}^2$.

Proof. ($\Rightarrow$) Suppose that $p \in \mathcal{I}_{\mathcal{A}}$ is such that $\nabla p|_x = 0$, $\forall x \in \mathcal{A}$. Since $\mathcal{I}_{\mathcal{A}} \subset \mathcal{I}_{\mathcal{S}_i}$, $\forall i \in [n]$, by Proposition 4.72, $p(x)$ can be written as

$$p(x) = \sum_{j=1}^{c_i} g_{i,j}(x)\,(b_{i,j}^\top x), \qquad (4.78)$$

where $c_i$ is the codimension of $\mathcal{S}_i$, $(b_{i,1}, \ldots, b_{i,c_i})$ is a basis for $\mathcal{S}_i^\perp$ and the $g_{i,j}(x)$ are polynomials. Now the hypothesis $\nabla p|_x = 0$, $\forall x \in \mathcal{A}$, implies that $\partial p / \partial x_k |_x = 0$, $\forall x \in \mathcal{A}$, $\forall k \in [D]$. Thus $\partial p / \partial x_k \in \mathcal{I}_{\mathcal{A}}$ and so $\partial p / \partial x_k \in \mathcal{I}_{\mathcal{S}_i}$. Hence, again by Proposition 4.72, $\partial p / \partial x_k$ can be written as

$$\partial p / \partial x_k = \sum_{j=1}^{c_i} h_{i,j,k}(x)\,(b_{i,j}^\top x). \qquad (4.79)$$

Differentiating equation (4.78) with respect to $x_k$ gives

$$\partial p / \partial x_k = \sum_{j=1}^{c_i} (\partial g_{i,j} / \partial x_k)\,(b_{i,j}^\top x) + \sum_{j=1}^{c_i} g_{i,j}(x)\, b_{i,j}(k). \qquad (4.80)$$

From equations (4.79), (4.80) we obtain

$$\sum_{j=1}^{c_i} g_{i,j}(x)\, b_{i,j}(k) = \sum_{j=1}^{c_i} \big( h_{i,j,k}(x) - \partial g_{i,j} / \partial x_k \big)\,(b_{i,j}^\top x), \qquad (4.81)$$

which can equivalently be written as

$$\sum_{j=1}^{c_i} b_{i,j}(k)\, g_{i,j}(x) = \sum_{j=1}^{c_i} q_{i,j,k}(x)\,(b_{i,j}^\top x), \qquad (4.82)$$


where $q_{i,j,k}(x) := h_{i,j,k}(x) - \partial g_{i,j} / \partial x_k$. Note that equation (4.82) is true for every $k \in [D]$. We can write these $D$ equations in matrix form

$$\big[ b_{i,1} \;\; b_{i,2} \;\; \cdots \;\; b_{i,c_i} \big] \begin{bmatrix} g_{i,1}(x) \\ g_{i,2}(x) \\ \vdots \\ g_{i,c_i}(x) \end{bmatrix} = Q(x) \begin{bmatrix} b_{i,1}^\top x \\ b_{i,2}^\top x \\ \vdots \\ b_{i,c_i}^\top x \end{bmatrix}, \qquad (4.83)$$

where $Q(x)$ is a $D \times c_i$ polynomial matrix with entries in $\mathbb{R}[x]$. We can view equation (4.83) as a linear system of equations over the field $\mathbb{R}(x)$. Define $B_i := [\, b_{i,1} \;\; b_{i,2} \;\; \cdots \;\; b_{i,c_i} \,]$. The columns of $B_i$ form a basis of $\mathcal{S}_i^\perp$, and so they will be linearly independent over $\mathbb{R}$.

Consequently, the matrix $B_i^\top B_i$ will be invertible over $\mathbb{R}$ and its inverse will also be the inverse of $B_i^\top B_i$ over the larger field $\mathbb{R}(x)$. Multiplying both sides of equation (4.83)

from the left with $(B_i^\top B_i)^{-1} B_i^\top$, we obtain

$$\begin{bmatrix} g_{i,1}(x) \\ g_{i,2}(x) \\ \vdots \\ g_{i,c_i}(x) \end{bmatrix} = (B_i^\top B_i)^{-1} B_i^\top Q(x) \begin{bmatrix} b_{i,1}^\top x \\ b_{i,2}^\top x \\ \vdots \\ b_{i,c_i}^\top x \end{bmatrix}. \qquad (4.84)$$

Note that $(B_i^\top B_i)^{-1} B_i^\top Q(x) \in (\mathbb{R}[x])^{c_i \times c_i}$, and so equation (4.84) gives that $g_{i,j}(x) \in \mathcal{I}_{\mathcal{S}_i}$, $\forall j \in [c_i]$. Returning back to equation (4.78), we readily see that $p \in \mathcal{I}_{\mathcal{S}_i}^2$, $\forall i \in [n]$, which implies that $p \in \bigcap_{i=1}^n \mathcal{I}_{\mathcal{S}_i}^2$.

($\Leftarrow$) Suppose that $p \in \bigcap_{i=1}^n \mathcal{I}_{\mathcal{S}_i}^2$. Since $\bigcap_{i=1}^n \mathcal{I}_{\mathcal{S}_i}^2 \subset \bigcap_{i=1}^n \mathcal{I}_{\mathcal{S}_i} = \mathcal{I}_{\mathcal{A}}$, we see that $p$ must be a vanishing polynomial. Since $p \in \mathcal{I}_{\mathcal{S}_i}^2$, by Proposition 4.72 we can write $p(x) =$

$\sum_{j,j'=1}^{c_i} g_{j,j'}(x)\,(b_{i,j}^\top x)(b_{i,j'}^\top x)$, from which it follows that $\nabla p|_{x_i} = 0$, $\forall x_i \in \mathcal{S}_i$. Since this holds for any $i \in [n]$, we get that $\nabla p|_x = 0$, $\forall x \in \mathcal{A}$.

We conclude with a theorem lying at the heart of Algebraic Subspace Clustering.

Theorem 4.78. Let $\mathcal{A} = \bigcup_{i=1}^n \mathcal{S}_i$ be a transversal subspace arrangement of $\mathbb{R}^D$ with vanishing ideal $\mathcal{I}_{\mathcal{A}}$. Let $\mathcal{J}_{\mathcal{A}}$ be the product ideal $\mathcal{J}_{\mathcal{A}} = \mathcal{I}_{\mathcal{S}_1} \cdots \mathcal{I}_{\mathcal{S}_n}$. Then the two ideals are equal at degrees $\ell \geq n$, i.e., $\mathcal{I}_{\mathcal{A},\ell} = \mathcal{J}_{\mathcal{A},\ell}$, $\forall \ell \geq n$.

Theorem 4.78 implies that every polynomial of degree $n$ that vanishes on a transversal subspace arrangement $\mathcal{A}$ of $n$ subspaces is a linear combination of products of linear forms vanishing on $\mathcal{A}$, a fundamental fact that is used repeatedly in the main text of this chapter. Theorem 4.78 was first proved in Proposition 3.4 of [16], in the context of the Castelnuovo-

Mumford regularity of products of ideals generated by linear forms. It was later reproved in

[20] using a Hilbert series argument and the result from [21] on the Castelnuovo-Mumford regularity of a subspace arrangement.
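Theorem 4.78 can be checked numerically in a small transversal case. The sketch below (an assumed construction with numpy, not thesis code) samples two random hyperplanes of $\mathbb{R}^3$, computes the degree-$2$ vanishing polynomials of the sampled union via a Veronese null space, and verifies that this space is one-dimensional and spanned by the product of the two defining linear forms, i.e., that $\mathcal{I}_{\mathcal{A},2} = \mathcal{J}_{\mathcal{A},2}$ in this instance.

```python
# Numerical check of Theorem 4.78 for two random hyperplanes in R^3.
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(4)
b1, b2 = rng.standard_normal(3), rng.standard_normal(3)    # normals of S_1, S_2

def sample_plane(b, m=40):
    # Sample m points on the hyperplane {b^T x = 0}.
    Q = np.linalg.qr(np.column_stack([b, rng.standard_normal((3, 2))]))[0][:, 1:]
    return Q @ rng.standard_normal((2, m))

X = np.hstack([sample_plane(b1), sample_plane(b2)])        # points on S_1 U S_2

# Degree-2 Veronese map; its left null space represents I_{A,2}.
monos = list(combinations_with_replacement(range(3), 2))
V = np.stack([np.prod(X[list(m), :], axis=0) for m in monos])
U, s, _ = np.linalg.svd(V)
print("dim I_{A,2} =", int(np.sum(s < 1e-8 * s[0])))       # 1

# Coefficients of the product (b1^T x)(b2^T x) in the same monomial basis.
prod_coeffs = np.array([b1[i] * b2[j] + (b1[j] * b2[i] if i != j else 0)
                        for (i, j) in monos])
c = U[:, -1]                                               # the null direction
cosang = abs(c @ prod_coeffs) / (np.linalg.norm(c) * np.linalg.norm(prod_coeffs))
print(np.isclose(cosang, 1.0))                             # True: the two spans agree
```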

Chapter 5

Conclusions

This thesis attempted to advance the state-of-the-art of subspace learning methods, in the

regime where the subspaces have high relative-dimension, i.e., when their dimension is

comparable to that of the ambient space. This is a regime where most state-of-the-art

subspace learning methods are theoretically inapplicable.

Towards that end, the first part of this thesis introduced a new subspace learning method,

called Dual Principal Component Pursuit (DPCP), which is based on non-convex optimization, and is naturally suited for subspaces of high relative-dimension. The thesis gave an extensive theoretical analysis of DPCP for the case of single subspace learning in the presence of outliers, as well as the case of learning an arrangement of hyperplanes. Extensive experiments demonstrated that on real data DPCP is competitive with state-of-the-art methods, while on synthetic data DPCP is significantly superior.


The second part of the thesis improved the already existing method of Algebraic Subspace Clustering (ASC), one of the theoretically most suitable methods for learning subspaces of high relative-dimension. In particular, the thesis introduced Filtrated Algebraic

Subspace Clustering (FASC) and its numerical implementation Filtrated Spectral Algebraic

Subspace Clustering (FSASC), which resolved the lack of robustness to noise of traditional

ASC. The thesis also gave a theoretical justification for the correctness of algebraically clustering affine subspaces through homogeneous coordinates.

There are many open directions for future research. These include solving the computational complexity bottleneck of FSASC (inherited from that of ASC), and it is an interesting question to what extent the ideas surrounding DPCP can prove helpful. Regarding

DPCP itself, future research will deal with probabilistic analysis, proving convergence of

the Iteratively-Reweighted-Least-Squares algorithm implementing DPCP, designing scalable

DPCP algorithms suitable for high-dimensional data, and exploring applications, particularly in connection with deep networks.

Bibliography

[1] M.E. Atiyah and I.G. MacDonald. Introduction to Commutative Algebra. Westview

Press, 1994.

[2] L. Bako. Identification of switched linear systems via sparse optimization. Automat-

ica, 47(4):668–677, 2011.

[3] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces

from highly incomplete information. In Communication, Control, and Computing

(Allerton), 2010 48th Annual Allerton Conference on, pages 704–711. IEEE, 2010.

[4] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE

transactions on pattern analysis and machine intelligence, 25(2):218–233, 2003.

[5] J. Beck. Sums of distances between points on a sphere—an application of the theory

of irregularities of distribution to discrete geometry. Mathematika, 31(01):33–41,

1984.


[6] T.E. Boult and L.G. Brown. Factorization-based segmentation of motions. In IEEE

Workshop on Motion Understanding, pages 179–186, 1991.

[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and

statistical learning via the alternating direction method of multipliers. Foundations

and Trends in Machine Learning, 3(1):1–122, 2010.

[8] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via

graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence,

23(11):1222–1239, 2001.

[9] P. S. Bradley and O. L. Mangasarian. k-plane clustering. Journal of Global Opti-

mization, 16(1):23–32, 2000.

[10] J. S. Brauchart and P. J. Grabner. Distributing many points on spheres: minimal

energy and designs. Journal of Complexity, 31(3):293–326, 2015.

[11] J.P. Brooks, J.H. Dula,´ and E.L. Boone. A pure l1-norm principal component anal-

ysis. Computational statistics & data analysis, 61:83–98, 2013.

[12] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.

[13] E. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.


[14] R. Chartrand and W. Yin. Iteratively reweighted algorithms for compressive sensing.

In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing,

pages 3869–3872. IEEE, 2008.

[15] G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal

of Computer Vision, 81(3):317–330, 2009.

[16] A. Conca and J. Herzog. Castelnuovo-mumford regularity of products of ideals.

Collectanea Mathematica, 54(2):137–152, 2003.

[17] J. Costeira and T. Kanade. A multibody factorization method for independently

moving objects. International Journal of Computer Vision, 29(3):159–179, 1998.

[18] D.A. Cox, J. Little, and D. O’Shea. Ideals, Varieties, and Algorithms. Springer,

2007.

[19] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2010.

[20] H. Derksen. Hilbert series of subspace arrangements. Journal of Pure and Applied

Algebra, 209(1):91–98, 2007.

[21] H. Derksen and J. Sidman. A sharp bound for the castelnuovo-mumford regularity

of subspace arrangements. Advances in Mathematics, 172:151–157, 2002.


[22] J. Dick. Applications of geometric discrepancy in numerical analysis and statistics.

Applied Algebra and Number Theory, 2014.

[23] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-pca: rotational in-

variant l1-norm principal component analysis for robust subspace factorization. In

Proceedings of the 23rd international conference on Machine learning, pages 281–

288. ACM, 2006.

[24] B. Draper, M. Kirby, J. Marks, T. Marrinan, and C. Peterson. A flag representation

for finite collections of subspaces of mixed dimensions. and its

Applications, 451:15–32, 2014.

[25] D. Eisenbud. Commutative Algebra with a View Toward Algebraic Geometry.

Springer, 2004.

[26] L. Elden. Solving quadratically constrained least squares problems using a

differential-geometric approach. BIT Numerical Mathematics, 42(2):323–335, 2002.

[27] E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE Conference on

Computer Vision and Pattern Recognition, pages 2790–2797, 2009.

[28] E. Elhamifar and R. Vidal. Clustering disjoint subspaces via sparse representation. In

IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010.

[29] E. Elhamifar and R. Vidal. Robust classification using structured sparse representa-

tion. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.


[30] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and

applications. IEEE Transactions on Pattern Analysis and Machine Intelligence,

35(11):2765–2781, 2013.

[31] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace

estimation and clustering. In IEEE Conference on Computer Vision and Pattern

Recognition, pages 1801 –1807, 2011.

[32] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few

training examples: An incremental bayesian approach tested on 101 object cate-

gories. Comput. Vis. Image Underst., 106(1):59–70, April 2007.

[33] J. Feng, H. Xu, and S. Yan. Online robust pca via stochastic optimization. In Ad-

vances in Neural Information Processing Systems, pages 404–412, 2013.

[34] M. A. Fischler and R. C. Bolles. RANSAC random sample consensus: A paradigm

for model fitting with applications to image analysis and automated cartography.

Communications of the ACM, 26:381–395, 1981.

[35] W. Gander. Least squares with a quadratic constraint. Numerische Mathematik,

36(3):291–307, 1980.

[36] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illu-

mination cone models for face recognition under variable lighting and pose. IEEE

Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.


[37] K. Ghalieh and M. Hajja. The fermat point of a spherical triangle. The Mathematical

Gazette, 80(489):561–564, 1996.

[38] G. H Golub and U. Von Matt. Quadratically constrained least squares and quadratic

problems. Numerische Mathematik, 59(1):561–580, 1991.

[39] G.H. Golub and c. F. Van Loan. Matrix computations, volume 3. Johns Hopkins

Univ Pr, 1996.

[40] P. J. Grabner, B. Klinger, and R.F. Tichy. Discrepancies of point sequences on the

sphere and numerical integration. Mathematical Research, 101:95–112, 1997.

[41] P. J. Grabner and R.F. Tichy. Spherical designs, discrepancy and numerical integra-

tion. Math. Comp., 60(201):327–336, 1993.

[42] A. Gruber and Y. Weiss. Multibody factorization with uncertainty and missing data

using the EM algorithm. In IEEE Conference on Computer Vision and Pattern

Recognition, volume I, pages 707–714, 2004.

[43] Inc. Gurobi Optimization. Gurobi optimizer reference manual, 2015.

[44] G. Harman. Variations on the koksma-hlawka inequality. Uniform Distribution

Theory, 5(1):65–78, 2010.

[45] J. Harris. Algebraic Geometry: A First Course. Springer-Verlag, 1992.

[46] R. Hartley and R. Vidal. The multibody trifocal tensor: Motion segmentation from 3 perspective views. In IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 769–775, 2004.

[47] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge, 2nd edition, 2004.

[48] R. Hartshorne. Algebraic Geometry. Springer, 1977.

[49] E. Hlawka. Discrepancy and Riemann integration. Studies in Pure Mathematics, pages 121–129, 1971.

[50] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933.

[51] K. Huang, Y. Ma, and R. Vidal. Minimum effective dimension for mixtures of subspaces: A robust GPCA algorithm and its applications. In IEEE Conference on Computer Vision and Pattern Recognition, volume II, pages 631–638, 2004.

[52] P. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.

[53] S. Javed, S. H. Oh, T. Bouwmans, and S. K. Jung. Handbook of robust low-rank and sparse matrix decomposition: Applications in image and video processing. Chapman and Hall/CRC, pages 457–480, 2016.

[54] B. Jin, D. Lorenz, and S. Schiffler. Elastic-net regularization: Error estimates and active set methods. Inverse Problems, 25(11), 2009.

[55] I. Jolliffe. Principal Component Analysis. Springer-Verlag, 2nd edition, 2002.

[56] K. Kanatani. Motion segmentation by subspace separation and model selection. In IEEE International Conference on Computer Vision, volume 2, pages 586–591, 2001.

[57] W. Ku, R. H. Storer, and C. Georgakis. Disturbance detection and isolation by dynamic principal component analysis. Chemometrics and Intelligent Laboratory Systems, 30:179–196, 1995.

[58] L. Kuipers and H. Niederreiter. Uniform distribution of sequences. Courier Corporation, 2012.

[59] S. Lang. Algebra. Springer, 2005.

[60] G. Lerman and T. Zhang. Robust recovery of multiple subspaces by geometric `p minimization. Annals of Statistics, 39(5):2686–2715, 2011.

[61] G. Lerman and T. Zhang. `p-recovery of the most significant subspace among multiple subspaces with outliers. Constructive Approximation, 40(3):329–385, 2014.

[62] G. Lerman, M. B. McCoy, J. A. Tropp, and T. Zhang. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.

[63] Q. Liang and Q. Ye. Computing singular values of large matrices with an inverse-free preconditioned Krylov subspace method. Electronic Transactions on Numerical Analysis, 42:197–221, 2014.

[64] G. Liu, Z. Lin, S. Yan, J. Sun, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, January 2013.

[65] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[66] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, pages 663–670, 2010.

[67] R. Livni, D. Lehavi, S. Schein, H. Nachliely, S. Shalev-Shwartz, and A. Globerson. Vanishing component analysis. In International Conference on Machine Learning, pages 597–605, 2013.

[68] S. Lloyd, M. Mohseni, and P. Rebentrost. Quantum principal component analysis. Nature Physics, (10):631–633, 2014.

[69] C-Y. Lu, H. Min, Z-Q. Zhao, L. Zhu, D-S. Huang, and S. Yan. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, 2012.

[70] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9):1546–1562, 2007.

[71] Y. Ma and R. Vidal. Identification of deterministic switched ARX systems via identification of algebraic varieties. In Hybrid Systems: Computation and Control, pages 449–465. Springer-Verlag, 2005.

[72] Y. Ma, A. Y. Yang, H. Derksen, and R. Fossum. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3):413–458, 2008.

[73] H. Matsumura. Commutative Ring Theory. Cambridge Studies in Advanced Mathematics, 2006.

[74] B. C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, 26(1):17–32, 1981.

[75] S. Nam, M. E. Davies, M. Elad, and R. Gribonval. The cosparse analysis model and algorithms. Applied and Computational Harmonic Analysis, 34(1):30–56, 2013.

[76] B. Nasihatkon and R. Hartley. Graph connectivity in sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[77] Y. Panagakis and C. Kotropoulos. Elastic net subspace clustering applied to pop/rock music structure analysis. Pattern Recognition Letters, 38:46–53, 2014.

[78] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 2:559–572, 1901.

[79] A. Price, N. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, (38):904–909, 2006.

[80] Q. Qu, J. Sun, and J. Wright. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems, pages 3401–3409, 2014.

[81] A. Sampath and J. Shan. Segmentation and reconstruction of polyhedral building roofs from aerial lidar point clouds. IEEE Transactions on Geoscience and Remote Sensing, 48(3):1554–1567, 2010.

[82] H. Schwetlick and U. Schnabel. Iterative computation of the smallest singular value and the corresponding singular vectors of a matrix. Linear Algebra and its Applications, 371:1–30, 2003.

[83] N. Silberman, P. Kohli, D. Hoiem, and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, 2012.

[84] M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4):2195–2238, 2012.

[85] H. Späth and G. A. Watson. On orthogonal linear `1 approximation. Numerische Mathematik, 51(5):531–543, 1987.

[86] D. A. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. In Proceedings of the 23rd international joint conference on Artificial Intelligence, pages 3087–3090. AAAI Press, 2013.

[87] M. Steinbach, G. Karypis, V. Kumar, et al. A comparison of document clustering techniques. KDD workshop on text mining, 400(1):525–526, 2000.

[88] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere. In International Conference on Sampling Theory and Applications (SampTA), pages 407–410. IEEE, 2015.

[89] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. arXiv preprint arXiv:1511.03607, 2015.

[90] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. arXiv preprint arXiv:1511.04777, 2015.

[91] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery using nonconvex optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2351–2360, 2015.

[92] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, volume 2. MIT Press, 2006.

[93] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.

[94] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.

[95] M. C. Tsakiris and R. Vidal. Abstract algebraic-geometric subspace clustering. In Asilomar Conference on Signals, Systems and Computers, 2014.

[96] M. C. Tsakiris and R. Vidal. Dual principal component pursuit. In ICCV Workshop on Robust Subspace Learning and Computer Vision, pages 10–18, 2015.

[97] M. C. Tsakiris and R. Vidal. Filtrated spectral algebraic subspace clustering. In ICCV Workshop on Robust Subspace Learning and Computer Vision, pages 28–36, 2015.

[98] M. C. Tsakiris and R. Vidal. Dual principal component pursuit. (preprint), 2016.

[99] M. C. Tsakiris and R. Vidal. Hyperplane clustering via dual principal component pursuit. (preprint), 2016.

[100] M. C. Tsakiris and R. Vidal. Algebraic clustering of affine subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017 (to appear).

[101] M. C. Tsakiris and R. Vidal. Filtrated algebraic subspace clustering. SIAM Journal on Imaging Sciences, 2017 (to appear).

[102] P. Tseng. Nearest q-flat to m points. Journal of Optimization Theory and Applications, 105(1):249–252, 2000.

[103] D. Ucar, Q. Hu, and K. Tan. Combinatorial chromatin modification patterns in the human genome revealed by subspace clustering. Nucleic Acids Research, page gkr016, 2011.

[104] R. Vidal. Generalized Principal Component Analysis (GPCA): an Algebraic Geometric Approach to Subspace Clustering and Motion Segmentation. PhD thesis, University of California, Berkeley, August 2003.

[105] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(3):52–68, March 2011.

[106] R. Vidal and P. Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.

[107] R. Vidal and R. Hartley. Motion segmentation with missing data by PowerFactorization and Generalized PCA. In IEEE Conference on Computer Vision and Pattern Recognition, volume II, pages 310–316, 2004.

[108] R. Vidal and R. Hartley. Three-view multibody structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):214–227, February 2008.

[109] R. Vidal, Y. Ma, and J. Piazzi. A new GPCA algorithm for clustering subspaces by fitting, differentiating and dividing polynomials. In IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 510–517, 2004.

[110] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis (GPCA). In IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 621–628, 2003.

[111] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1–15, 2005.

[112] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis. Springer-Verlag, 2016.

[113] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion. International Journal of Computer Vision, 68(1):7–25, 2006.

[114] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79(1):85–105, 2008.

[115] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17, 2007.

[116] S. Vyas and L. Kumaranayake. Constructing socio-economic status indices: how to use principal components analysis. Health Policy and Planning, 21:459–468, 2006.

[117] L. Wu and A. Stathopoulos. A preconditioned hybrid SVD method for accurately computing singular triplets of large matrices. SIAM Journal on Scientific Computing, 37(5):S365–S388, 2015.

[118] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.

[119] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In European Conference on Computer Vision, pages 94–106, 2006.

[120] C. You, C.-G. Li, D. Robinson, and R. Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[121] T. Zhang, A. Szlam, and G. Lerman. Median k-flats for hybrid linear modeling with many outliers. In Workshop on Subspace Methods, pages 234–241, 2009.

[122] T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, 100(3):217–240, 2012.

Vita

Manolis C. Tsakiris received his undergraduate degree in Electrical and Computer Engineering from the National Technical University of Athens in 2007. He received his M.S. degree in Communications and Signal Processing from Imperial College London in 2009, where he was ranked first in the program and received the best M.S. thesis award. In 2009-2010 he was a research student at the University of São Paulo, where he was awarded a second M.S. degree. His scientific interests lie in the intersection of machine learning and signal processing with algebra, geometry and optimization. In 2014 he won the best student paper award at the Asilomar Conference on Signals, Systems and Computers for his work "Abstract algebraic-geometric subspace clustering."
