
FOUNDATIONS TOWARDS AN INTEGRATED THEORY OF INTELLIGENCE

By

MOHAMMAD TARIFI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2012

© 2012 Mohammad Tarifi

This work is dedicated to future generations.

ACKNOWLEDGMENTS

First, I would like to thank my advisor, Dr. Meera Sitharam. She is a source of constant growth and inspiration. I would like to thank all my friends who supported me through the journey. Bassam Aoun has been a consistent support. Finally, I would like to thank my family: my mother Dallal, my father Hassan, and my sister Reem. They mean the world to me.

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ...... 4

LIST OF TABLES ...... 8

LIST OF FIGURES ...... 9

ABSTRACT ...... 10

CHAPTER

1 INTRODUCTION ...... 12

  1.1 Background Models ...... 12
    1.1.1 Hierarchical Models ...... 13
  1.2 Organization ...... 14

2 DRDL ELEMENT ...... 16
  2.1 Technical Background ...... 16
    2.1.1 Historical Context ...... 16
    2.1.2 Sparse Approximation ...... 18
    2.1.3 Illustrating The Model With Simple Examples ...... 18
    2.1.4 Pursuit Algorithms ...... 20
    2.1.5 Dimension Reduction ...... 22
    2.1.6 Dictionary Learning ...... 24
  2.2 Our Contribution ...... 26
    2.2.1 DRDL Circuit Element ...... 26
      2.2.1.1 Relation between DR and DL ...... 26
      2.2.1.2 Discussion of trade-offs in DRDL ...... 27
    2.2.2 Our Hierarchical Sparse Representation ...... 27
      2.2.2.1 Assumptions of the generative model ...... 28
      2.2.2.2 Learning algorithm ...... 29
      2.2.2.3 Representation inference ...... 29
      2.2.2.4 Mapping between HSR and current models ...... 30
      2.2.2.5 What does HSR offer beyond current models? ...... 31
      2.2.2.6 Discussion of trade-offs in HSR ...... 32
    2.2.3 Incorporating Additional Model Prior ...... 32
    2.2.4 Experiments ...... 34
      2.2.4.1 MNIST Results ...... 34
      2.2.4.2 COIL Results ...... 35
  2.3 Discussion ...... 36

3 GEOMETRIC DICTIONARY LEARNING ...... 37
  3.1 Preliminaries ...... 37
    3.1.1 Statistical Approaches to Dictionary Learning ...... 37
    3.1.2 Contribution and Organization ...... 38
  3.2 The Setup ...... 38
    3.2.1 Assumptions of The Generative Model ...... 38
    3.2.2 Problem Definitions ...... 39
    3.2.3 Problem Relationships ...... 41
  3.3 Cluster and Intersect Algorithm ...... 43
    3.3.1 Learning Subspace Arrangements ...... 43
      3.3.1.1 Random sample consensus ...... 44
      3.3.1.2 Generalized principal component analysis ...... 45
      3.3.1.3 Combinatorics of subspace clustering ...... 45
    3.3.2 s-Independent Smallest Spanning Set ...... 46
  3.4 Learning an Orthogonal Basis ...... 47
  3.5 Summary ...... 48

4 SAMPLING COMPLEXITY FOR GEOMETRIC DICTIONARY LEARNING ... 49

  4.1 Algebraic Representation ...... 51
  4.2 Combinatorial Rigidity ...... 53
  4.3 Required Graph Properties ...... 56
  4.4 Rigidity Theorem in d = 3, s = 2 ...... 57
    4.4.1 Consequences ...... 64
  4.5 Summary ...... 65

5 FUTURE WORK AND CONCLUSIONS ...... 66

  5.1 Minor Results ...... 66
    5.1.1 Representation Scheme on The Cube, Moulton Mapping ...... 66
    5.1.2 Cautiously Greedy Pursuit ...... 67
  5.2 Future Work From Chapter 2 ...... 67
    5.2.1 DRDL Trade-offs ...... 67
    5.2.2 The Hierarchy Question ...... 68
  5.3 Future Work From Chapter 3 ...... 69
    5.3.1 Cluster and Intersect Extensions ...... 69
    5.3.2 Temporal Coherence ...... 70
  5.4 Future Work From Chapter 4 ...... 71
    5.4.1 Uniqueness and Number of Solutions ...... 71
    5.4.2 Higher Dimensions ...... 72
    5.4.3 Genericity ...... 73
    5.4.4 Computing The Realization ...... 73
  5.5 Future Extension For The Model ...... 73
  5.6 Conclusion ...... 74

REFERENCES ...... 75

BIOGRAPHICAL SKETCH ...... 88

LIST OF TABLES

Table page

2-1 A comparison between DRDL, HSR, and standard techniques on a classification task on the MNIST dataset ...... 35
2-2 A comparison between DRDL and standard techniques on a classification task on the COIL dataset ...... 35

LIST OF FIGURES

Figure page

2-1 Vectors distributed as s = 1 sparse combinations of the 3 shown vectors .... 19

2-2 Normalized vectors distributed as s = 2 non-negative sparse combinations of the 4 shown vectors ...... 19

2-3 Dictionary of the 1st layer of MNIST (left); COIL-100 (right)...... 20

2-4 A dimension reduction for the 3 blue points from d = 3 to d = 2 that minimizes pairwise distances distortion...... 24

2-5 A simple 3 layer Hierarchy with no cycles ...... 28

2-6 An example of factorizing HSR by incorporating additional prior. In step 1, we factor a single layer into a 3 layer hierarchy. In step 2, we factor into 3 dictionary learning elements and 2 dimension reduction elements. In step 3, the dimension reduction at level 2 is factored to fan out into two separate DRDL elements .. 33

3-1 Two possible solutions to the smallest spanning set problem of the line arrangement shown ...... 41

3-2 An arrangement of points that is sufficient to determine the generators using an s = 3 decomposition strategy but not s = 2 ...... 42
4-1 Two simple arrangements. The larger (blue) dots are the points P and the small (grey) dots are the given pins X ...... 51

4-2 A simple arrangement of 6 points...... 52

4-3 The velocities of a pair of points pi and pj constrained by a pin xk ...... 55
5-1 A k-5 configuration with two distinct solutions ...... 71

5-2 An arrangement of 6 points in d = 3 that is sufficient to determine their minimum generating dictionary of size 4...... 72

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

FOUNDATIONS TOWARDS AN INTEGRATED THEORY OF INTELLIGENCE

By

Mohammad Tarifi

December 2012

Chair: Meera Sitharam
Major: Computer Engineering

This work outlines the beginning of a new attempt at an Integrated Theory of Intelligence. A complete understanding of intelligence requires a multitude of interdisciplinary ideas. Instead of pursuing that breadth, we abstract out many specifics in favor of purely computational building blocks. This enables us to focus on the algorithmic and mathematical aspects of understanding intelligence.

A novel universal circuit element for bottom-up learning and inference is proposed.

The circuit element is a concatenation of Dictionary Learning and Dimension Reduction. This fundamental building block is used to learn hierarchies of sparse representations.

The model is applied to standard datasets where the numerical experiments show promising performance.

The Dictionary Learning problem is then examined more closely from a geometric point of view. We identify related problems and draw formal relationships among them. This leads to a view of dictionary learning as finding a minimum generating set of a subspace arrangement. An exhaustive algorithm follows, which applies subspace clustering techniques and then an intersection algorithm to learn the dictionary. Notable special instances are discussed. Geometric Dictionary Learning is then investigated theoretically with the help of a surrogate problem, in which the combinatorics of the subspace supports are specified.

The problem is then approached using machinery from algebraic and combinatorial geometry. Specifically, a rigidity-type theorem is obtained that characterizes, via a purely combinatorial property of the supports, the sample arrangements that recover a finite number of dictionaries.

Finally, we discuss some minor results, open questions, and future work.

CHAPTER 1
INTRODUCTION

Working towards a Computational Theory of Intelligence, we develop a computational framework inspired by ideas from Neuroscience. Specifically, we integrate notions of columnar organization, hierarchical structure, sparse distributed representations, and sparse coding.

1.1 Background Models

An integrated view of Intelligence has been proposed by Karl Friston based on free energy [37–41, 45]. In this framework, Intelligence is viewed as a surrogate minimization of the entropy of the sensorium. This work is intuitively inspired by that view, aiming to provide a computational foundation for a theory of intelligence from the perspective of theoretical computer science. By building foundations for a principled approach, the computational essence of problems can be isolated and formalized, their relationship to fundamental problems in mathematics and theoretical computer science can be illuminated, and the full power of available mathematical techniques can be brought to bear. Speculation on a common cortical micro-circuit element dates back to Mountcastle's observation that a cortical column may serve as an algorithmic building block of the neocortex [82]. Later work by Lee and Mumford [94] and Hawkins and George [51] attempted further investigation of this process.

The bottom-up organization of the neocortex is generally assumed to be a heterarchical topology of columns. This can be modeled as a directed acyclic graph, but is usually simplified to a hierarchical tree. Work by Poggio, Serre, et al. [110, 115–117] and Dean [25, 26] discussed a hierarchical topology. Smale et al. attempt to develop a theory accounting for the importance of the hierarchical structure [12, 136]. Work on modeling early stages of sensory processing by Olshausen [108, 135], using sparse coding, produced results that account for the observed receptive fields in early visual processing. This is usually done by learning an overcomplete dictionary. However, it remained unclear how to extend this approach to higher layers. Our work can be partially viewed as progress in this direction.

Computational Learning Theory [44] is the formal study of learning algorithms. Probably Approximately Correct (PAC) learning defines a natural setting for analyzing such algorithms [43]. However, with a few notable exceptions (boosting, the inspiration for Support Vector Machines [42], etc.), the resulting guarantees are divorced from practice. Without tight guarantees, Machine Learning is studied using experimental results on standard benchmarks, which is problematic. We aim at closing the gap between theory and practice by providing stronger assumptions on the structures and forms considered by the theory, through constraints inspired by biology and complex systems.

1.1.1 Hierarchical Models

Several hierarchical models have been introduced in the literature. H-Max is based on the Simple-Complex cell hierarchy of Hubel and Wiesel. It is essentially a hierarchical succession of template matching and max operations, corresponding to simple and complex cells respectively [110].

Hierarchical Temporal Memory (HTM) is a learning model composed of a hierarchy of spatial coincidence detection and temporal pooling [50, 51, 55]. Coincidence detection involves finding a spatial clustering of the input, while temporal pooling is about finding variable order Markov chains [47] describing temporal sequences in the data. H-Max can be mapped into HTM in a straightforward manner. In HTM, the transformations under which the data remains invariant are learned in the temporal pooling step [55]. H-Max explicitly hard-codes translational transformations through the max operation. This gives H-Max better sample complexity for specific problems where translational invariance is present.

Bouvrie et al. [11, 12] introduced a generalization of hierarchical architectures centered around a foundational element involving two steps, Filtering and Pooling. Filtering is described through a reproducing kernel K(x, y), such as the standard inner product K(x, y) = <x, y>, or a Gaussian kernel K(x, y) = exp(-γ||x - y||^2). Pooling then remaps the result to a single value. Examples of pooling functions include the max, the mean, and the l_p norm (such as l_1 or l_infinity). H-Max, Convolutional Neural Nets [53], and Deep Feedforward Neural Networks [52] all belong to this category of hierarchical architectures, corresponding to different choices of the Kernel and Pooling functions. As we show in Section 2.2.2.4, our model does not fall within Bouvrie's framework, and can be viewed as a generalization of hierarchical models of which both HTM and Bouvrie's framework are special cases.

Friston proposed Hierarchical Dynamic Models (HDM), which are similar to the above-mentioned architectures but framed in a control-theoretic framework operating in continuous time [37]. Developing a computational formalism of this approach is thus prohibitively difficult.

A computational approach focuses on developing tractable algorithms and exploring the complexity limits of Intelligence, thereby improving the quality of available guarantees for evaluating the performance of models, improving comparisons among models, and moving towards provable guarantees on quantities such as sample size, time complexity, and generalization error. In addition, prior assumptions about the environment are made explicit. This furnishes a solid theoretical foundation which may be used, among other things, as a basis for building Artificial Intelligence.

1.2 Organization

The next chapter introduces an elemental building block that combines Dictionary Learning and Dimension Reduction (DRDL). We show how this foundational element can be used to iteratively construct a Hierarchical Sparse Representation (HSR) of a sensory stream. We compare our approach to existing models, showing the generality of our simple prescription. We then perform preliminary experiments using this framework, illustrating with the example of an object recognition task on standard datasets.

In Chapter 3, the Dictionary Learning problem is examined more closely from a geometric point of view. We identify related problems and draw formal relationships among them. This leads to a view of dictionary learning as finding a minimum generating set of a subspace arrangement. An exhaustive algorithm follows, which applies subspace clustering techniques and then an intersection algorithm to learn the dictionary. Notable special instances are discussed.

In Chapter 4, Geometric Dictionary Learning is investigated theoretically with the help of a surrogate problem, in which the combinatorics of the subspace supports are specified. The problem is approached using machinery from algebraic and combinatorial geometry. Specifically, a rigidity-type theorem is obtained that characterizes, via a purely combinatorial property of the supports, the sample arrangements that recover a finite number of dictionaries. The thesis is concluded in Chapter 5 with some minor results, open questions, and discussion of future work.

CHAPTER 2
DRDL ELEMENT

This chapter introduces an elemental building block that combines Dictionary Learning [3] (to be formally introduced and discussed in Section 2.1.6) and Dimension Reduction [49] (to be formally introduced and discussed in Section 2.1.5). The element is abbreviated as DRDL. We show how this foundational element can be used to iteratively construct a Hierarchical Sparse Representation (HSR, to be formally introduced and discussed in Section 2.2.2) of a sensory stream. We compare our approach to existing models, showing the generality of our simple prescription. We then perform preliminary experiments using this framework, illustrating with the example of an object recognition task on standard datasets.

Next, we introduce some of the relevant mathematical and conceptual topics and techniques used in our work.

2.1 Technical Background

A Dictionary is a d-by-m matrix in R^{d x m} whose columns are normalized to unit magnitude. A dictionary is called overcomplete, or redundant, if m > d. Two dictionaries D1 and D2 are equivalent if there exists a signed permutation matrix Γ such that D1 = D2Γ. We begin by placing our work in its historical context.

2.1.1 Historical Context

Nonlinear Approximation Theory, as a part of functional analysis, can be traced back to Kolmogorov, who formulated the notion of n-widths [157]. This field seeks the best approximation to a single input function drawn from a fixed space (according to various types of fixed norms/distributions), by choosing from a given, possibly overcomplete/dependent (i.e., many more functions than the dimension) family of functions (called a Dictionary) drawn from a vector space, and combining the chosen functions in a fixed, allowed set of non-linear ways such as compositions and superpositions. In addition, one may restrict the number of functions used.

In general, the problem of finding a sparse approximation is nonlinear due to

overcompleteness. Even if the family is a basis, the usual projection type methods do

not generally work, due to nonorthogonality of the basis family. Nor do duality based methods work in general. A simpler problem is obtained when restricting the allowed set

of combinations to sparse linear combinations from a dictionary D.

For orthogonal and independent D, well-studied instances include classical harmonic approximation (such as Fourier bases) and orthogonal polynomial bases. When D is allowed to be non-orthogonal, but remains independent, classic examples include polynomial bases and splines.

Overcomplete and dependent families that have been studied include approximation by rational functions and free-knot spline spaces, where the location of the knots is free but their number is fixed. Another example is approximation from Wavelets and multi-resolution Sobolev and Besov spaces, aiming at harmonic analysis with finite support (as with splines). These are overcomplete families, but they are nicely layered into independent, even orthogonal, families (the splines are orthogonalized) that moreover have certain types of orthogonality relationships between the layers.

In Functional Analysis, these approximation schemes are applied to infinite-dimensional vector spaces such as Hilbert and even Banach spaces [67].

In Numerical Approximation Theory, which is appropriate to seeking computational understanding, the domain is discretized, and both the domain refinement as well as other parameters related to the dimension of the vector spaces (such as degree of the approximating functions) are important measures of complexity. In approximation theory, asymptotic complexity and quality guarantees are captured by proving bounds and trade-offs between the dimension of approximating space, degree, rates of convergence of iterative methods, and approximation order.

These problems are therefore set in finite-dimensional linear algebra, with an asymptotic thrust (for complexity analysis). Not surprisingly, they are related to classical problems and methods in numerical linear algebra and statistics, such as inverse problems and least squares, and have been attacked by various Pursuit Algorithms such as Basis Pursuit and variants such as LARS/LASSO [130, 134, 139, 150, 151].

2.1.2 Sparse Approximation

Sparse approximation, also known as model or vector selection, is the problem of representing input data in a known Dictionary D. The problem can be stated as:

Definition 1. Given a dictionary D ∈ R^{d x m} (possibly overcomplete) and an input vector y ∈ R^d, the Sparse Representation problem asks for the x ∈ R^m such that

min ||x||_0 : y = Dx.

That is, x is the sparsest vector that represents y as a linear combination of the columns of D. We say that a dictionary s-spans a vector y if the sparse representation problem can be solved with ||x||_0 ≤ s.

2.1.3 Illustrating The Model With Simple Examples

Example 1. Figure 2-1 illustrates a particular distribution that is s = 1-sparse in the 3 drawn vectors. The Dictionary D1 is

    [ 1  0  1 ]
    [ 0  1  1 ]

The columns of the dictionary correspond to the drawn vectors, and the data is expressed simply as a vector with one coefficient corresponding to the inner product with the closest vector and zero otherwise. This produces an s = 1 sparse vector in dimension 3.
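The following minimal numpy sketch (our own illustration, not code from the thesis) carries out this s = 1 coding against D1: the sample keeps the coefficient of its best-matching normalized column and zeros elsewhere.

import numpy as np

# Dictionary from Example 1; columns are normalized to unit length.
D1 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]])
D1 = D1 / np.linalg.norm(D1, axis=0)

def sparse_code_s1(y, D):
    """Return the s = 1 representation of y: the coefficient of the
    best-matching column, zero everywhere else."""
    correlations = D.T @ y               # inner product with every column
    j = np.argmax(np.abs(correlations))
    x = np.zeros(D.shape[1])
    x[j] = correlations[j]
    return x

y = np.array([0.9, 1.1])                 # a sample close to the direction (1, 1)
print(sparse_code_s1(y, D1))             # only the third entry is nonzero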

A slightly more complicated example is s = 2 in the Dictionary D2 shown in Figure 2-2.

Figure 2-1. Vectors distributed as s = 1 sparse combinations of the 3 shown vectors

Example 2. Figure 2-2 illustrates the distribution drawn from the dictionary below for non-negative coefficients. The data was normalized for convenience.

    [ 1  0  0  1 ]
    [ 0  1  0  1 ]
    [ 0  0  1  1 ]

Figure 2-2. Normalized vectors distributed as s = 2 non-negative sparse combinations of the 4 shown vectors

Next we give a typical example that is encountered in practice.

Example 3. The MNIST dataset is a database of handwritten digits [175]. The COIL-100 dataset consists of color images of 100 objects taken at pose intervals of 5 degrees, for a total of 72 poses per object [48]. Figure 2-3 shows dictionaries trained on the MNIST and COIL-100 datasets.

Each image patch is a column of the dictionary. The dictionary was trained using the SPArse Modeling Software (SPAMS) open source library [102, 176].
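The experiments reported here used SPAMS; as an illustrative stand-in, the sketch below uses scikit-learn's MiniBatchDictionaryLearning, which exposes a comparable patch-based interface. The random input images, patch size, and dictionary size are placeholder choices, not the settings used for Figure 2-3.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

# Stand-in data: random "images"; in the experiments these would be MNIST digits.
rng = np.random.RandomState(0)
images = rng.rand(20, 28, 28)

# Collect 8x8 patches and flatten each patch into a sample vector.
patches = np.concatenate([extract_patches_2d(img, (8, 8), max_patches=50,
                                             random_state=0) for img in images])
X = patches.reshape(len(patches), -1)
X = X - X.mean(axis=1, keepdims=True)    # remove each patch's DC component

# Learn an overcomplete dictionary (100 atoms for 64-dimensional patches).
dl = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
dl.fit(X)
D = dl.components_.T                     # columns of D are the learned atoms
print(D.shape)                           # (64, 100)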

Figure 2-3. Dictionary of the 1st layer of MNIST (left); COIL-100 (right).

2.1.4 Pursuit Algorithms

Several algorithms exist for vector selection, reflecting various approaches to the problem. It can be shown that the general Vector Selection problem for an arbitrary dictionary is NP-hard, by a reduction from the Exact Cover by 3-sets problem [24]. This shows that the general selection problem is difficult for an arbitrary dictionary D. However, efficient optimal algorithms may exist for special D. For instance, if D is a complete basis, then the problem can be efficiently solved using Principal Component Analysis (PCA) [61] and then selecting the top components as needed.

In some cases, we are only interested in solutions with ||x||_0 ≤ s, where s ∈ N is given. One approach to vector selection is a full search over all (m choose s) sub-dictionaries, taking the Moore-Penrose pseudo-inverse A^+ = (A*A)^{-1}A* of each induced sub-dictionary A to obtain its least-squares fit to y. This least-squares criterion is related to, but distinct from, the one used in PCA: the first principal component of a set of points is the line that most closely approaches the data points, as measured by the squared distance of closest approach (i.e., perpendicular to the line), whereas linear least squares minimizes the distance in the y direction only. Thus, although the two use a similar error metric, linear least squares treats one dimension of the data preferentially, while PCA treats all dimensions equally.
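A minimal sketch of this exhaustive strategy (our illustration; the (m choose s) search is only feasible for small m and s):

import itertools
import numpy as np

def exhaustive_vector_selection(D, y, s):
    """Brute-force s-sparse selection: try every size-s sub-dictionary and
    keep the least-squares solution (via the pseudo-inverse) with the
    smallest residual."""
    d, m = D.shape
    best_x, best_err = None, np.inf
    for support in itertools.combinations(range(m), s):
        A = D[:, support]
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # applies A^+ to y
        err = np.linalg.norm(y - A @ coeffs)
        if err < best_err:
            best_err = err
            best_x = np.zeros(m)
            best_x[list(support)] = coeffs
    return best_x

# Example with the 3-column dictionary of Example 1 and sparsity s = 1.
D = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0])
print(exhaustive_vector_selection(D, y, 1))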

Matching Pursuit algorithms, such as Orthogonal Matching Pursuit (OMP) [134], are a family of greedy approaches to sparse approximation. Similar approaches in the statistics literature are referred to as Forward Selection [139], Stepwise Forward Selection [139], and Least Angle Regression (LARS) [130].
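For reference, a bare-bones OMP sketch (our illustration; it assumes unit-norm columns, and library implementations of the cited algorithms include many refinements):

import numpy as np

def omp(D, y, s):
    """Orthogonal Matching Pursuit: greedily add the column most correlated
    with the residual, then re-fit the coefficients on the chosen support."""
    d, m = D.shape
    support, residual = [], y.copy()
    for _ in range(s):
        correlations = np.abs(D.T @ residual)
        correlations[support] = -np.inf          # do not pick a column twice
        support.append(int(np.argmax(correlations)))
        A = D[:, support]
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coeffs
    x = np.zeros(m)
    x[support] = coeffs
    return x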

If we take an L1 relaxation of the problem, we have

min ||x||_1 : y = Dx,

which is optimized via the convex Lagrangian expression

min ||y - Dx||_2^2 + λ||x||_1,

where λ > 0 is the Lagrange multiplier. This problem is called the Least Absolute Shrinkage and Selection Operator (LASSO) [151] and can be solved using Basis Pursuit [150], in which it is re-expressed as a Linear Programming (LP) problem [133].
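The LP re-expression can be sketched as follows (our illustration using scipy's linprog; x is split into its positive and negative parts so that the l1 objective becomes linear):

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, y):
    """Basis Pursuit as a linear program: write x = u - v with u, v >= 0,
    and minimize sum(u) + sum(v) subject to D(u - v) = y."""
    d, m = D.shape
    c = np.ones(2 * m)                        # objective: the l1 norm of x
    A_eq = np.hstack([D, -D])                 # D u - D v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * m),
                  method="highs")
    u, v = res.x[:m], res.x[m:]
    return u - v

D = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]) / np.array([1.0, 1.0, np.sqrt(2)])
y = D[:, 2]                                   # a vector that is 1-sparse in D
print(np.round(basis_pursuit(D, y), 3))       # recovers the sparse code (0, 0, 1)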

How good is LASSO at finding a solution? This has been discussed by Zhao and Yu [101] and Donoho [31]. Both approaches use conditions on the dictionary D to evaluate the efficacy of LASSO. Zhao and Yu define the Irrepresentable Sign Condition

|C_{21}^n (C_{11}^n)^{-1} sign(β^n)| ≤ 1 - η  (elementwise),

where C_{11}^n is the correlation between the dictionary's active elements, and C_{21}^n is the correlation between the active and inactive elements. Active elements are the elements that correspond to nonzero entries in x. This Irrepresentable Sign Condition is shown to be equivalent to the Sign Consistency property, defined as the limit of obtaining a correct decomposition with proper sign coefficients as d → ∞. This result shows that LASSO is consistent if and only if the maximum correlation is bounded:

max_{i,j} |C_{ij}^n| ≤ 1 / s_max,

where s_max is the maximum sparsity allowed. Although this result guarantees the success of LASSO for weakly correlated D, it is a negative result, since most dictionaries of interest are significantly correlated.

Yet another well-known technique, the FOCal Underdetermined System Solver (FOCUSS) [140], minimizes the objective min ||y - Dx||_2^2 + λ||x||_p with a p-norm, p < 1. As p approaches zero, this term approaches the sparsity-measuring l_0 norm.

2.1.5 Dimension Reduction

Dimension Reduction involves mapping data points from dimension d1 to d2, with

d2 < d1, while maintaining certain desired properties of the data. Decreasing the number of variables can make certain previously intractable computational problems

tractable when d2 = o(d1), where o(.) is the little-o notation. Particularly, this may

generate exponential size savings d2 = O(log d1). With a proper choice of reduction, these techniques have the important additional advantage of introducing higher noise

tolerance, due to the bias-variance trade off, and thus may avoid over-fitting. If the

desired property is a pairwise metric, then dimension reduction can also be thought of

as a Metric Embedding problem [62].

A classic example is Principal Component Analysis (PCA) [61]. This can be thought of as a rotation to a basis whose components are ordered in decreasing L2 variance. A dimension reduction to dimension d2 that maintains the L2 variance error can then be obtained by picking the top d2 components.

Another well-known result in Dimension Reduction is a lemma by Johnson and Lindenstrauss, which shows that, using random orthogonal projections, one can embed n points living in l_2^{d1} into d2 = O(log(n)/ε^2) dimensions with probability at least 1 - O(1/n^2) [23]. Here ε is the pairwise metric distortion, defined as the minimum constant bounding the ratio of pairwise distances of the points in dimension d2 to the corresponding pairwise distances of the points in dimension d1. The probability of success can then be made arbitrarily close to 1 in polynomial time.
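A quick empirical illustration of the lemma (our sketch; the constant 8 in the target dimension is an arbitrary choice, not a tight bound):

import numpy as np

rng = np.random.RandomState(0)
n, d1 = 200, 1000
X = rng.randn(n, d1)                       # n points in dimension d1

eps = 0.3
d2 = int(np.ceil(8 * np.log(n) / eps**2))  # target dimension of order log(n)/eps^2
A = rng.randn(d2, d1) / np.sqrt(d2)        # random Gaussian projection

Y = X @ A.T                                # projected points in dimension d2

# Empirically check the pairwise-distance distortion on random pairs.
pairs = rng.randint(0, n, size=(500, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]  # discard degenerate pairs
orig = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
proj = np.linalg.norm(Y[pairs[:, 0]] - Y[pairs[:, 1]], axis=1)
ratios = proj / orig
print(ratios.min(), ratios.max())          # typically well within [1 - eps, 1 + eps]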

In this work, we use a particular type of dimension reduction known as Compressed Sensing [30], which is suited for data that is sparse in a fixed known basis B. We can obtain a dimension reduction by applying a linear operator satisfying the Frame Property (also known as the Restricted Isometry Property, or RIP, in Compressed Sensing theory).

Definition 2 (Frame Property). A matrix A is a frame for s-sparse vectors if for all x such that ||x||_0 ≤ s,

(1 - δ_s) ≤ ||Ax||_2^2 / ||x||_2^2 ≤ (1 + δ_s),  (2–1)

where δ_s > 0 is the minimum possible value such that the two bounds apply.

For constant sparsity s, Compressed Sensing with a frame achieves exponential dimension reduction when the maintained property is approximate mutual distances. This can be seen by considering two s-sparse vectors x1 and x2; then

(1 - δ_2s) ≤ ||Ax1 - Ax2||_2^2 / ||x1 - x2||_2^2 ≤ (1 + δ_2s).  (2–2)

Given an s-sparse vector of dimension n, a frame reduces the dimension to O(s log(n)). Furthermore, the frame property guarantees exact recoverability of x from the compressed vector y = Ax by using L1 minimization.

Observation 1. If A is a frame, then solving the convex optimization problem min ||y - Ax||_2^2 + λ||x||_1 gives the sparsest solution min ||x||_0 : y = Ax, when x is sparse.

For a proof of Observation 1, the reader is referred to [30].

Example 4. We can then apply a dimension reduction to the sparse representation obtained in Example 1 that preserves distances between representations. The representations correspond to the standard basis in d = 3. The best dimension reduction to d = 2 is then simply the projection of the representations onto the plane perpendicular to (1, 1, 1), whereby the points of the unit basis project to the vertices of a triangle, as illustrated in Figure 2-4.

Figure 2-4. A dimension reduction for the 3 blue points from d = 3 to d = 2 that minimizes pairwise distances distortion.

Since we are using a frame, efficient decompression is guaranteed using L1 minimization: the data can be recovered exactly using L1 minimization algorithms such as Basis Pursuit. Frames can be obtained probabilistically from matrices with random Gaussian entries. Alternatively, frames can be obtained using sparse random matrices [56]. In this thesis we follow the latter approach. The question of deterministically constructing frames with similar bounds is still open.
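The following sketch (our illustration) compresses an s-sparse vector with a sparse random matrix and recovers it by the L1 minimization discussed above; the sparsity level of the matrix and the constant in the compressed dimension are arbitrary choices:

import numpy as np
from scipy.optimize import linprog

rng = np.random.RandomState(1)
n, s = 400, 5                               # ambient dimension and sparsity
d = 4 * s * int(np.log(n))                  # compressed dimension, roughly O(s log n)

# Sparse random measurement matrix: most entries zero, the rest +-1/sqrt(d).
A = rng.choice([0.0, 1.0, -1.0], size=(d, n), p=[0.9, 0.05, 0.05]) / np.sqrt(d)

# An s-sparse signal and its compressed measurement.
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.randn(s)
y = A @ x

# L1 recovery (Basis Pursuit as a linear program, as in Section 2.1.4).
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * n), method="highs")
x_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(x_hat - x)))            # typically (near) zero: exact recovery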

Next we turn our attention to the problem of finding the dictionary D, when it is unknown, from the data.

2.1.6 Dictionary Learning

Dictionary Learning obtains a sparse representation by learning vectors in terms of which the data x_i can be written as sparse linear combinations.

Definition 3. Given an input set X = [x_1 ... x_m], where x_i ∈ R^d, Dictionary Learning finds D = [v_1 ... v_n] and θ = [θ_1 ... θ_m], where θ_i ∈ R^n, such that x_i = Dθ_i and ||θ_i||_0 ≤ s,

where ||.||_0 is the L0 norm, or sparsity. If all entries of θ_i are restricted to be non-negative, we obtain Sparse Non-negative Matrix Factorization (SNMF) [63, 64]. An optimization version of Dictionary Learning can be written as:

min_{D ∈ R^{d x n}} max_{x_i} min ||θ_i||_0 : x_i = Dθ_i.

In practice, the Dictionary Learning problem is often relaxed to the Lagrangian

min ||X - Dθ||_2^2 + λ||θ||_1,

where X = [x_1 ... x_m] and θ = [θ_1 ... θ_m]. Several dictionary learning algorithms work by iterating the following two steps.

Step 1: solve the vector selection problem for all vectors in X. This can be done using your favorite vector selection algorithm, such as Basis Pursuit.

Step 2: given the representations, the optimization problem is now convex in D. Use your favorite method to find D.

Using a maximum likelihood formalism, the Method of Optimal Directions (MOD) [60] uses the pseudo-inverse to compute D:

D^{(i+1)} = X θ^{(i)T} (θ^{(i)} θ^{(i)T})^{-1},

where D^{(i)} and θ^{(i)} are the ith iteration candidates for D and θ respectively. The MOD can be extended to a Maximum A-Posteriori probability setting with different priors to take into account preferences in the recovered dictionary. Similarly, k-SVD uses a two-step iterative process, with a Truncated Singular Value Decomposition [154] to update D.

This is done by taking every vector in D and applying SVD to X and θ restricted to only

the columns that have a contribution from that vector.
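A minimal MOD-style alternating-minimization sketch (our illustration; it uses OMP rather than Basis Pursuit for the vector selection step and omits the refinements of the published MOD and k-SVD algorithms):

import numpy as np
from sklearn.linear_model import orthogonal_mp

def mod_dictionary_learning(X, n_atoms, s, n_iter=20, seed=0):
    """Alternate between sparse coding and the closed-form dictionary
    update D <- X Theta^T (Theta Theta^T)^{-1}; columns are renormalized
    after each update.  X holds one sample per column."""
    rng = np.random.RandomState(seed)
    d, m = X.shape
    D = rng.randn(d, n_atoms)
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # Step 1: vector selection for every sample (columns of Theta).
        Theta = orthogonal_mp(D, X, n_nonzero_coefs=s)
        # Step 2: dictionary update (pseudo-inverse used for robustness).
        D = X @ Theta.T @ np.linalg.pinv(Theta @ Theta.T)
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D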

When D is restricted to be of the form D = [B1,B2 ... BL] where Bi’s are orthonormal matrices, then a more efficient pursuit algorithm is obtained for the sparse coding stage

using a block coordinate relaxation.

We will investigate the problem of Dictionary Learning in detail in Chapters 3 and 4. We are now ready to introduce our DRDL element.

2.2 Our Contribution

2.2.1 DRDL Circuit Element

Our circuit element is a simple concatenation of a Dictionary Learning (DL) step followed by a Dimension Reduction (DR) step; DRDL is used as a shorthand.

The DL step learns a representation θ_i of the input that lives in a high dimension n. Let θ_{i_1}, ..., θ_{i_s}, with 1 ≤ i_1 < ... < i_s ≤ n, be the nonzero entries of θ_i. Then it is possible to obtain a dimension reduction by embedding these entries and their indices into θ_low = (θ_{i_1}, i_1, θ_{i_2}, i_2, ..., θ_{i_s}, i_s). The number of bits needed to represent an index i_j is at most log n, therefore the embedded space has dimension O(s log(n)).

The problem with this simple dimension reduction is that the metric distortion is high. Compressed sensing with a frame ensures approximate metric preservation while embedding the vectors into a dimension of order O(s log(n)). This additional property enables us to use metric-based algorithms to distinguish between different concept classes (from the output of the DRDL element). This property is also key to applying the DRDL circuit element recursively (as in Section 2.2.2), especially in the presence of noise.

We further assume that the DL step learns a frame. This condition will be useful for learning the dictionary (although we will loosen this assumption in Chapter 3). It also enables Observation 2 below.
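Putting the two steps together, a DRDL element can be sketched as follows (our illustration; scikit-learn stands in for the dictionary learning step, a random Gaussian matrix stands in for the frame, and samples are stored as rows rather than columns):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

class DRDL:
    """Minimal DRDL element: a Dictionary Learning step followed by a
    Dimension Reduction step."""

    def __init__(self, n_atoms, s, reduced_dim, seed=0):
        self.n_atoms, self.s, self.reduced_dim = n_atoms, s, reduced_dim
        self.rng = np.random.RandomState(seed)

    def fit(self, X):
        # DL step: learn a dictionary in which the samples are s-sparse.
        self.dl = MiniBatchDictionaryLearning(n_components=self.n_atoms,
                                              transform_algorithm="omp",
                                              transform_n_nonzero_coefs=self.s,
                                              random_state=0).fit(X)
        # DR step: a random projection of the sparse codes.
        self.A = self.rng.randn(self.reduced_dim, self.n_atoms) / np.sqrt(self.reduced_dim)
        return self

    def transform(self, X):
        theta = self.dl.transform(X)      # sparse codes, one row per sample
        return theta @ self.A.T           # reduced representation u = A theta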

2.2.1.1 Relation between DR and DL

The DR and DL steps are intimately related. To show their relationship clearly,

we rewrite the two problems with the same variable names. These variables are only

relevant for this section. The two problems can be stated as:

1. DL asks for D and {x1 ... xm}, given {y1 ... ym}, for Dxi = yi, such that the sparsity kxik0 is minimized for a fixed dimension of yi.

26 2. DR asks for D and {y1 ... ym}, given {x1 ... xm}, for Dxi = yi, such that the dimension of yi’s is minimized for a fixed sparsity kxik0.

In practice, both problems use L1 approximation as a proxy for L0 optimization. This leads to the following observation.

Observation 2. The inverse of a DRDL is a DRDL.

This means that the space of mappings/functions of our model is the same as that of its inverse. This property is useful for incorporating feedback.

2.2.1.2 Discussion of trade-offs in DRDL

DRDL can be thought of as a memory system (’memory pocket’) or a dimension

reduction technique for data that can be expressed sparsely in a dictionary. One

parameter trade-off is between n (the number of columns in D) and s (the sparsity of the representation).

On one hand, we note that the DR step puts the data in O(s log(n)) dimensions. Therefore, if we desire to maximize the reduction in dimension, increasing n by raising it to a constant power k is comparable to multiplying s by k. This means that we would much rather increase the number of columns in the dictionary than the sparsity. On the other hand, increasing the number of columns in D forces the columns to be highly correlated, which becomes problematic for Basis Pursuit vector selection.

This trade-off highlights the importance of investigating approaches to dictionary learning and vector selection that can go beyond current results into highly coherent dictionaries.

2.2.2 Our Hierarchical Sparse Representation

If we assume a hierarchical architecture modeling the topographic organization of the visual cortex, a single DRDL element can be factorized and expressed as a tree of simpler DRDL elements. With this architecture we can learn a Hierarchical Sparse Representation by iterating DRDL elements.

27 2.2.2.1 Assumptions of the generative model

Our model assumes that the data is generated by a hierarchy of spatiotemporal invariants. At any given level i, each node in the generative model is assumed to be composed of a small number s_i of vectors. Generation proceeds by recursively decompressing the pattern from parent nodes and then producing patterns for child nodes. This input is fed to the learning algorithm below. In this chapter, we assume that both the topology of the generative model and the spatial and temporal extent of each node are known. Discussion of algorithms for learning the topology and internal dimensions is left for future work.

Consider a simple data stream consisting of spatiotemporal sequences from a generative model as defined above. Figure 2-5 shows a potential learning hierarchy. For simple vision problems, we can consider all dictionaries within a layer to be the same. In this chapter, processing proceeds bottom-up through the hierarchy only.

Figure 2-5. A simple 3 layer Hierarchy with no cycles

2.2.2.2 Learning algorithm

The overall picture of the learning algorithm is relatively straightforward. Recursively divide the spatiotemporal signal x_i to obtain a tree representing the known topographic hierarchy of spatiotemporal blocks. Let x_{i,j}^0 be the jth block at level 0. We denote by x_{i,j}^k the jth block (in a given topographic order) at level k, and by D_j^k the dictionary in the jth position (in the same topographic order) at level k. Then, starting at the bottom of the tree, do:

1. Learn a Dictionary D_j^k in which the spatiotemporal data x_{i,j}^k can be represented sparsely. This produces a vector of weights θ_{i,j}^k.

2. Apply dimension reduction to the sparse representation to obtain u_{i,j}^k = A θ_{i,j}^k.

3. Generate x_{i,j}^{k+1} by concatenating the vectors u_{i,l}^k for all l that are children of j at level k in the tree. Replace k with k + 1, so that j now ranges over the elements of level k. If k is still less than the depth of the tree, go to Step 1.

Note that in domains such as computer vision, it is reasonable to assume that all dictionaries at level k are the same, D_j^k = D^k. This algorithm attempts to mirror the generative model. It outputs an inference algorithm that induces a hierarchy of sparse representations for a given data point. This can be used to abstract invariant feature vectors from new data. One can then use a supervised learning algorithm on top of the invariant feature vectors to solve classification problems.
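The loop can be sketched as follows (our illustration; it assumes a shared dictionary per level and a binary tree in which sibling blocks are concatenated pairwise, which are simplifying choices rather than requirements of the model):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_hsr(X_blocks, levels, n_atoms, s, reduced_dim, seed=0):
    """X_blocks has shape (n_samples, n_blocks, block_dim), holding the
    level-0 blocks x^0_{i,j}.  At each level: learn one dictionary,
    sparse-code every block, project the codes with a random matrix A,
    and concatenate sibling outputs pairwise to form the next level."""
    rng = np.random.RandomState(seed)
    layers = []
    for k in range(levels):
        n_samples, n_blocks, block_dim = X_blocks.shape
        flat = X_blocks.reshape(-1, block_dim)          # all blocks at level k
        dl = MiniBatchDictionaryLearning(n_components=n_atoms,
                                         transform_algorithm="omp",
                                         transform_n_nonzero_coefs=s,
                                         random_state=0).fit(flat)
        theta = dl.transform(flat)                      # sparse codes theta^k
        A = rng.randn(reduced_dim, n_atoms) / np.sqrt(reduced_dim)
        u = (theta @ A.T).reshape(n_samples, n_blocks, reduced_dim)
        layers.append((dl, A))
        if n_blocks % 2:                                # drop an odd trailing block
            u = u[:, :-1, :]
        X_blocks = u.reshape(n_samples, n_blocks // 2, 2 * reduced_dim)
    return layers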

2.2.2.3 Representation inference

For new data points, the representation is obtained, in analogy to the learning algorithm, by recursively dividing the spatiotemporal signal to obtain a tree representing the known topographic hierarchy of spatiotemporal blocks. The representation is then inferred naturally by iteratively applying Vector Selection and Compressed Sensing. For Vector Selection, we employ a common variational technique called Basis Pursuit De-Noising (BPDN) [54], which minimizes ||Dθ_i - x_i||_2^2 + λ||θ_i||_1. This technique produces optimal results when the sparsity satisfies

||θ||_0 < 1/2 + 1/(2C),  (2–3)

where C is the coherence of the dictionary D,

C = max_{k≠l} |(D^T D)_{k,l}|.  (2–4)

This is a limitation in practice, since it is desirable to have highly coherent dictionaries. Inference proceeds by iteratively applying vector selection and dimension reduction.
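The coherence and the resulting BPDN guarantee are easy to compute; the sketch below (our illustration) evaluates them for the dictionary D1 of Example 1, where the bound only certifies s = 1:

import numpy as np

def coherence(D):
    """Mutual coherence of a dictionary with unit-norm columns: the largest
    absolute inner product between two distinct columns."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()

D = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
D = D / np.linalg.norm(D, axis=0)
C = coherence(D)
print(C)                    # 1/sqrt(2) for this dictionary
print(0.5 + 0.5 / C)        # BPDN guarantee: ||theta||_0 must stay below this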

2.2.2.4 Mapping between HSR and current models

This simple, yet powerful, toy model is used as a basis for investigation in our future work. We abstracted this model out of a need for conceptual simplicity and generality. Several models inspired by the neocortex have been proposed, many sharing similar characteristics. We present a few here and compare them to our model.

H-Max is a hierarchy composed of a template matching step followed by a Max operation. Template Matching simply picks the highest inner product between the input and every column of the dictionary, which can be thought of as sparse representation with s = 1. H-Max is mapped to HSR by replacing Basis Pursuit for Sparse Approximation with Template Matching. H-Max uses templates (columns of the dictionary) that are sampled randomly from the input data [71, 110, 115–117], whereas HSR uses Dictionary Learning. The 'max' operation in H-Max can be understood in terms of the operation of sparse coding, which produces local competition among feature vectors representing slight variations in the data. Alternatively, the 'max' operation can be viewed as a limited form of dimension reduction.

HTM [50, 51, 55] can be mapped to HSR by considering time as an additional dimension of space. This way, bounded variable order Markov chains [47] can be written as Sparse Approximation of spatiotemporal vectors representing sequences of spatial vectors in time. HTM constructs a set of spatial coincidences that is automatically shared across time. HSR may do the same when fed a moving window of spatial sequences. Alternatively, HSR can simulate an HTM by alternating between time-only and space-only spatiotemporally extended blocks. In this view a single HTM node is mapped to two layers of HSR nodes, one representing time and the other representing space. Unlike HTM, treating time and space on an equal footing has the added advantage that the same algorithms can be used for both. HTM with a "winner-take-all" policy can then be mapped to our model by assuming sparsity s = 1. HTM with a distribution on belief states can be mapped to our model with Template Matching instead of Sparse Approximation in the inference step. Finally, HTM does not leverage the RIP dimension reduction step. HTM uses feedback connections for prediction, which is restricted to predictions forward in time. Extending HSR with feedback connections, which account for dependencies between nodes that are not connected directly, would enable feedback to affect all of space-time.

2.2.2.5 What does HSR offer beyond current models?

One advantage of our approach is in providing invertibility without a dimensionality blow-up for hierarchically sparse data. Models [52, 53] falling under Bouvrie et al.'s framework [11, 12] (introduced in Chapter 1, Section 1.1.1) lose information as one proceeds up the hierarchy, due to the Pooling operation. This becomes problematic when extending the models to incorporate feedback. Moreover, this information loss forces an algorithm designer to hardcode which invariances a particular model must select for (such as translational invariance in H-Max). On the other hand, invertible models such as HTM suffer from a dimensionality blow-up when the number of vectors learned at a given level is greater than the input dimension to that level, as is usually the case.

Dimensionality reduction achieves both a savings in computational resources as well as better noise resilience by avoiding over-fitting.

Dictionary learning represents the data by sparse combinations of dictionary columns. This can be viewed as an L0 or L1 regularization, which provides noise tolerance and better generalization. This type of regularization is intuitive and is well motivated by neuroscience [108, 135] and the organization of complex systems. Our approach departs from current models that use simple template matching, and leverages the expressive power of Sparse Approximation. This provides a disciplined prescription for learning vectors at every level. Finally, HSR is a conceptually simple model. This elegance lends itself well to analysis.

2.2.2.6 Discussion of trade-offs in HSR

There are several design decisions to make when constructing an HSR. Informally, the hierarchy is useful for reducing the sample complexity and dimensionality of the problem. For instance, consider the simplified case of binary {0, 1} coefficients and translation invariance (as in vision). An HSR generative model of two layers will produce (m_2 choose s_2) patterns. Learning this with a single-layer HSR would involve learning a dictionary of m_2 columns and sparsity s_2 using |X| samples in dimension d. Using two layers, the first layer learns a dictionary of size m_1 and sparsity s_1 using k|X| samples in dimension d/k; the second layer learns a dictionary of m_2 columns and sparsity s_2 with |X| samples in dimension k·O(s_1 log m_1) < d. In effect, by adding a layer, we have divided the problem into two simpler problems in terms of dimensionality and sample complexity. A more formal and complete discussion will be presented in future work.

2.2.3 Incorporating Additional Model Prior

Additional assumptions reflecting our knowledge of the environment can be incorporated as model prior and explicitly considered within the DRDL and HSR framework. Additional prior information can be separately encoded at different levels in the hierarchical topology and in the Dictionary Learning and Dimension Reduction steps. Figure 2-6 illustrates a particular factorization obtained from incorporating prior assumptions on the model. One type of knowledge is through invariances specified on the data. Invariances can be thought of as transformations that map our data to itself. Specifically, a set of points X is invariant under a transformation I if and only if I(X) = X. These assumptions can be encoded into:

Figure 2-6. An example of factorizing HSR by incorporating additional prior. In step 1, we factor a single layer into a 3 layer hierarchy. In step 2, we factor into 3 dictionary learning elements and 2 dimension reduction elements. In step 3, the dimension reduction at level 2 is factored to fan out into two separate DRDL elements

The Structure of the Model. The assumption that our generative model can be

factored into a hierarchy is a structural prior. Further factorizations can reflect different structural organizations, such as the topology of the cortex. Other

kinds of assumptions may be imposed as well. For example, in Computer

Vision, a convenient assumption would be that nodes within the same level of

hierarchy share the same dictionary. We follow this assumption in our numerical

experiments. This is clearly invalid when dealing with multi-modal sensory data in lower levels. Figure 2-6 shows an example of progressive factorizations of the

model according to prior assumptions.

The DL Step. Adding invariance to the Dictionary Learning steps improves the sampling complexity. For instance, time and space share the property of being shift-invariant. One can model the same spatiotemporal block with a single dictionary or with a three-level hierarchy of shift-invariant DRDL reflecting two dimensions of space and one of time. Shift-invariant dictionaries have been studied in the context of learning audio sequences, yielding improved performance empirically [128].

The DR Step. Imposing invariance selectivity in the dimension reduction step lowers the embedding dimension at the expense of invertibility. A more general approach would be to trade off selectivity for invariances against invertibility, whereby the model incorporates dimension reductions that select for some property of interest. For example, in Computer Vision one can impose selectivity for small shifts, rotations, and scaling invariances through a dimension reduction matrix that pools over such transformations. This is similar to the approach taken by H-Max [71].

2.2.4 Experiments

In this section we elaborate on preliminary numerical experiments on classification tasks performed with DRDL and HSR on standard Machine Learning datasets. We applied our model to the MNIST and COIL datasets and subsequently used the representation as a feature extraction step for a classification algorithm such as Support Vector Machines (SVM) or k-Nearest Neighbors (kNN). In practice, additional prior assumptions can be included in our model as discussed in Section 2.2.3.

2.2.4.1 MNIST Results

We applied our model to all pairs of the MNIST data set. For the RIP step, we tried random matrices and sparse random matrices, and tested the efficacy of the approach by reconstructing the training data with basis pursuit. We used only two layers. After the feature vectors are learned, we applied a k-NN classifier with k = 3. We refrained from tweaking the initial parameters, since we expect our model to work off the shelf. For one layer of DRDL we obtained an error rate of 1.24% with a standard deviation of 0.011. Using two layers we obtained an error of 2.01% and a standard deviation of 0.016. Table 2-1 presents a comparison between DRDL, HSR, and standard techniques [175] on the MNIST dataset.

Table 2-1. A comparison between DRDL, HSR, and standard techniques on a classification task on the MNIST dataset.
  Method                               Error %
  Reconstructive Dictionary Learning   4.33
  Supervised Dictionary Learning       1.05
  k-NN, l2                             5.0
  SVM-Gauss                            1.4
  One layer of DRDL                    1.24
  Two layers of HSR                    2.01

2.2.4.2 COIL Results

We applied our model to all pairs of COIL-30 (a subset of 30 objects out of the 100 objects in the COIL-100 dataset). The data set consists of 72 images for every class. We used only 4 labeled images per class for training, taken at equally spaced angles (0, 90, 180, 270 degrees). We used the same procedure as in the MNIST experiment for obtaining and checking the RIP matrix. We applied a single layer of DRDL, then trained a k-NN classifier with k = 3. We again refrained from tweaking the initial parameters. We obtained a mean error of 12.2%. Table 2-2 presents a comparison between DRDL and standard techniques [48, 70] on the COIL dataset.

Table 2-2. A comparison between DRDL and standard techniques on a classification task on the COIL dataset.
  Method              Classification %
  One layer of DRDL   87.8
  SVM                 84.9
  Nearest Neighbor    81.8
  VTU                 89.9
  CNN                 84.8

2.3 Discussion

We introduced a novel formulation of an elemental building block that could serve as the bottom-up piece of the common cortical algorithm. As we shall see in the rest of the thesis, this model leads to several interesting theoretical questions. To help guide experiments, we also illustrated how additional prior assumptions on the generative model can be expressed within our integrated framework. Furthermore, as discussed in Chapter 5, this framework can also be extended to address feedback, attention, action, complementary learning, and the role of time. In the next chapter, we focus on understanding the Dictionary Learning component.

CHAPTER 3
GEOMETRIC DICTIONARY LEARNING

This chapter investigates the Dictionary Learning problem from a geometric and algebraic point of view. We identify related problems and draw formal relationships among them. This leads to a view of dictionary learning as finding a minimum generating set of a subspace arrangement. This introduces an exhaustive method that learns a subspace arrangement using subspace clustering techniques and then applies an intersection algorithm to recover the dictionary. We also discuss the special case of learning an orthogonal basis.

3.1 Preliminaries

Recall from Chapter 2, Section 2.1.6, our formal definition of Dictionary Learning. Before introducing our own approach, we also recall some of the known (statistical)

approaches to attacking the problem.

3.1.1 Statistical Approaches to Dictionary Learning

An optimization version of Dictionary Learning can be written as:

min_{D ∈ R^{d x n}} max_{x_i} min ||y_i||_0 : x_i = Dy_i.

In practice, the Dictionary Learning problem is often relaxed to the Lagrangian

min Σ_i (||x_i - Dy_i||_2^2 + λ||y_i||_1).

Traditional approaches rely on heuristic methods such as EM. Several dictionary learning algorithms work by iterating the following two steps [108, 120, 143, 167]:

1. Solve the sparse representation problem for all vectors X. This can be done using your favorite vector selection algorithm, such as basis pursuit [150].

2. Given the representations, the optimization problem is now convex in D. Use your favorite method to find D.

Let X = [x_1 ... x_m] and Y = [y_1 ... y_m]. Using a maximum likelihood formalism, the Method of Optimal Directions (MOD) [60] uses the pseudo-inverse to compute D:

D^{(i+1)} = X Y^{(i)T} (Y^{(i)} Y^{(i)T})^{-1}.

The MOD can be extended to a Maximum A-Posteriori probability setting with different priors to take into account preferences in the recovered dictionary. Similarly, k-SVD uses a two-step iterative process, with a Truncated Singular Value Decomposition to update D. This is done by taking every atom in D and applying SVD to X and Y restricted to only the columns that have a contribution from that atom. When D is restricted to be of the form D = [B_1, B_2 ... B_L], where the B_i's are orthonormal matrices, a more efficient pursuit algorithm is obtained for the sparse coding stage using a block coordinate relaxation.

3.1.2 Contribution and Organization

This chapter investigates the Dictionary Learning problem from a geometric point of view. In Section 3.2, we begin by introducing a generative model for dictionary datasets and a variety of related problems. In Section 3.3, we observe an exhaustive solution by framing dictionary learning as the minimum generating set of a subspace arrangement. The method learns a subspace arrangement using subspace clustering techniques and then applies an intersection algorithm to recover the dictionary. Section 3.4 discusses the case of an orthogonal dictionary. Further generalizations and open questions are left for Chapter 5 of the thesis.

3.2 The Setup

We aim towards formulating and studying the dictionary learning problem from a direct geometric point of view, seeking a concrete solution whose performance can be formally understood. A formal approach relies on modeling the data from a generative model and analyzing the complexity of learning.

3.2.1 Assumptions of The Generative Model

There are a few choices of generative models that produce data readily modeled by a dictionary. In its most general form, we are asked to determine an unknown dictionary D and a set of unknown points Y = {y_1 ... y_n} picked from a distribution P_Y, given a set of sample points X = {x_1 ... x_n} such that x_i = Dy_i, where ||y_i||_0 ≤ s. A further complication arises in the form of noise, x_i = Dy_i + ε_i, where the l_2 norm of ε_i is bounded.

We say that a set of vectors V s-spans a point or subspace if and only if the point or subspace can be written as a linear combination of at most s elements of V. A common property often imposed on dictionaries is s-regularity, defined below:

Definition 4 (s-regularity). A dictionary D is s-regular if for all nonzero y such that ||y||_0 ≤ s, it holds that Dy ≠ 0.

For an s-regular dictionary, the general vector selection problem is ill defined. For instance, D can be overcomplete, leading to multiple solutions for y_i. Overcoming this by framing the problem as a minimization problem is exceedingly difficult: indeed, under generic assumptions, even determining the minimum l_0 norm y_i when D and x_i are known is NP-hard. We can instead make the vector selection problem well defined by enforcing the 2s-regularity property on D.

Definition 5 (s-independence). A dictionary D is s-independent if for all y_1, y_2 such that ||y_1||_0 ≤ s and ||y_2||_0 ≤ s, it holds that Dy_1 = Dy_2 if and only if y_1 = y_2.

s-independence is a minimal requirement for unique invertibility. Notice that the definition given for s-independence is indeed equivalent to 2s-regularity.
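Both properties can be verified by brute force for small dictionaries; the sketch below (our illustration) checks s-regularity by testing that every set of at most s columns is linearly independent, and uses the equivalence above for s-independence:

import itertools
import numpy as np

def is_s_regular(D, s, tol=1e-10):
    """D is s-regular iff every set of at most s columns is linearly
    independent (no nonzero y with ||y||_0 <= s satisfies Dy = 0).
    The cost grows as (m choose s)."""
    m = D.shape[1]
    for size in range(1, s + 1):
        for cols in itertools.combinations(range(m), size):
            if np.linalg.matrix_rank(D[:, cols], tol=tol) < size:
                return False
    return True

def is_s_independent(D, s):
    """s-independence is equivalent to 2s-regularity."""
    return is_s_regular(D, 2 * s)

D = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])   # any 2 columns are independent
print(is_s_regular(D, 2), is_s_independent(D, 1))   # True, True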

We can further strengthen the constraints on D by assuming that D is a frame (defined in Chapter 2). This ensures that basic tasks, such as vector selection, are tractable and noise tolerant. In the following sections, further constraints are imposed on D, P_Y, and ε that further specialize the framework. Unless mentioned otherwise, we set ε = 0.

3.2.2 Problem Definitions

A number of independently interesting but intimately related problems arise from this setup. The first problem is the full-blown Dictionary Learning question under the assumptions above.

Definition 6 (Geometric Dictionary Learning). Let X be a given set of vectors in R^n that are known to be generated from a frame dictionary D, with |D| at most m, and x_i = Dy_i, where ||y_i||_0 ≤ s. Find any frame dictionary D', such that m' = |D'| ≤ m, and for all x_i ∈ X there exists y'_i ∈ R^{m'} with x_i = D'y'_i.

That is, D and D' are such that each vector x in X can be represented by an s-sparse combination of their columns. One approach to the Dictionary Learning problem is to decompose it into useful subproblems. In turn, these problems are revealed to be of independent interest. We define these problems below and then show their relationship to the original question. We say that a set of vectors X lies on a set S of s-dimensional subspaces if and only if for all x_i ∈ X there exists S_i ∈ S such that x_i ∈ S_i.

Definition 7 (Subspace Arrangement Learning). Let X be a given set of vectors that are known to lie on a set S of s-dimensional subspaces of R^n, where |S| is at most k. Further assume that the subspaces in S have bases whose union is a frame. Find any subspace arrangement S_X such that |S_X| ≤ k, X lies on S_X, and the union of the bases of the S_i ∈ S_X is a frame.

Subspace Arrangement Learning is of independent interest to the machine learning and computer vision communities. In Section 3.3.1, we summarize known results on the problem.

The second of our key problems is an optimization problem for representing a union of subspaces. We say a set of vectors D s-spans a set of vectors X if and only if for all x_i ∈ X there exists y_i ∈ R^{|D|} such that x_i = Dy_i and ||y_i||_0 ≤ s.

Definition 8 (Smallest Spanning Set for Subspace Arrangement). Find a minimum cardinality set of vectors that s-spans all the subspaces in an input subspace arrangement S, specified by giving bases for each subspace.

In general, the smallest spanning set is not necessarily unique, even for s-regular subspaces, as illustrated in the example below.

Example 5. This example, illustrated in Figure 3-1, shows two possible solutions to the smallest spanning set problem for the given subspace arrangement.

Figure 3-1. Two possible solutions to the smallest spanning set problem of the line arrangement shown

The subspaces are of dimension s = 2 and live in d = 3. They are viewed projectively in the affine plane. Notice that the union of the minimum generating set is

2-regular, i.e. no 3 of the shown points lie on the same line.

As is apparent from Figure 3-1, the intersection of the subspaces is key. This motivates the next of our problems.

Definition 9 (Intersection of Subspace Arrangement). Let S be a given set of s-dimensional subspaces of R^n (specified by giving their bases, whose union is a frame). It is promised that there is a set I of vectors, with |I| at most m, that s-spans the union of their intersections. Find any set of vectors I' that satisfies the above conditions.

3.2.3 Problem Relationships

We can now sketch intuitively how the above problems are related. The first observation relates the Geometric Dictionary Learning problem with the Subspace

Arrangement Learning and Smallest Spanning Set for s-regular dictionaries.

Observation 3 (Decomposition of Dictionary Learning). The following two-step procedure solves the s-regular Geometric Dictionary Learning problem:

• Learn a Subspace Arrangement S for X (instance of Definition 7).

• Recover D by finding the Smallest Spanning Set of S (instance of Definition 8).

Note that the decomposition strategy need not always be applied with the same sparsity s as the constant in the generative model (discussed in Section 3.3.1).

The following example illustrates this.

Example 6. Consider the arrangement shown in Figure 3-2, with s = 2 living in the projective hyperplane of dimension 3:

Figure 3-2. An arrangement of points that is sufficient to determine the generators using s = 3 decomposition strategy but not s = 2.

There are not enough sample points to apply the decomposition strategy with s = 2.

Instead, if we use s = 3, all the planes of the simplex are determined by 4 of the given points. Therefore, we can learn a subspace arrangement (the union of the planes) and from that recover the dictionary as the vertices of the simplex.

Unless otherwise specified, the algorithms in this chapter start out with the minimum given value of s and are reapplied with iteratively higher s if a solution has not been obtained. We deploy this strategy because most subspace arrangement learning algorithms (discussed in the next section) suffer exponentially degraded performance in s, with the notable exception of the special instances discussed in the final section of this chapter.

Furthermore, we observe that, for s-independent sets, the Smallest Spanning Set can be obtained via intersection of the Subspace Arrangements.

Observation 4 (Smallest Spanning Set via Intersection). Under the condition that the Subspace Arrangement is coming from an s-independent dictionary, the Smallest

Spanning Set is the union of:

• The smallest spanning set I of the pairwise intersection of all the subspaces in S.

• Any points outside the pairwise intersections that, together with I, completely s-span the subspaces in S.

In the following section we obtain an algorithm for solving the s-independent Smallest Spanning Set problem, by applying the above observation recursively.

3.3 Cluster and Intersect Algorithm

In this section we illustrate our geometric strategy. Recall, from our definition of

the generative model in Section 3.2.1, that P_Y is the distribution from which the y_i's are

generated. The Support of yi is the index set I of coordinates of yi that are non-zero. In this section PY is as follows:

• A set of k supports is picked uniformly from the set of 2^m possibilities.

• The values of the non-zero entries of y_i are picked uniformly from R^s.

Given m column vectors in D, the number of possible supports for y is \sum_{i=1}^{s} \binom{m}{i}. This allows us to quantify the approach in terms of the number of subspaces k, which

could vary between the various settings in which an instance of the Dictionary Learning problem is encountered. For instance, D is often used to separate causes in an environment,

and naturally not all possible combinations of causes are realized.

We begin by discussing the case where all subspaces are of equal dimensions s,

that is, dim(S_i) = s for all i ∈ {1, ..., k}, and then generalize to arbitrary dimensions of support. If the supports are picked uniformly, then using Θ(k) subspaces every column of D is represented except with probability O(m e^{−ks/m}).

To see this, apply the trivial union bound to the probability that no chosen subspace contains a particular column d_i, which comes to (1 − s/m)^{qm} → e^{−qs} when k = qm. We first discuss methods of determining the subspace arrangement S from X, then show how to recover D.

3.3.1 Learning Subspace Arrangements

There are several known algorithms for learning subspace arrangements. For a survey the reader is referred to [98]. Since the subspaces are all of the same dimension

s, we project on a generic subspace of dimension s + 1. The projected subspaces preserve their dimensions and distinctiveness with probability 1. This enables us to work in R^{s+1}. Therefore, without loss of generality, we assume that the subspaces are hyperplanes for the remainder of this section.

3.3.1.1 Random sample consensus

Random Sample Consensus (RANSAC) [98] is an approach to learning subspace arrangements that isolates one subspace at a time via random sampling. When dealing with an arrangement of k s-dimensional subspaces, for instance, the method samples s + 1 points, which is the minimum number of points required to fit an s-dimensional subspace. The procedure then finds and discards inliers by computing the residual of each data point relative to the subspace and selecting the points whose residual is below a certain threshold. The process is iterated until we have k subspaces or all points are fitted. RANSAC is robust to models corrupted with outliers. The following algorithm illustrates randomized and deterministic RANSAC used for the present problem.

Algorithm Find Subspaces (randomized [deterministic])

Input: X, s, k
Output: S

1. S = ∅
2. while |S| < k [or until all (s + 1)-subsets have been enumerated]:
3.     pick {x′_1, ..., x′_{s+1}} ⊆ X at random [by enumeration]
4.     let M be the matrix with columns {x′_1, ..., x′_{s+1}}
5.     find rank(M) using a method such as Gaussian elimination
6.     if rank(M) < s + 1:
7.         add span(M) to S
8.         remove all x_i ∈ X with x_i ∈ span(M)
9. repeat
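To make the randomized variant concrete, the following Python sketch (ours, not from the dissertation) implements the loop above for noiseless data; the rank tolerance, the maximum iteration count, and the helper name find_subspaces are our own choices.

import numpy as np

def find_subspaces(X, s, k, max_iters=10000, tol=1e-9, seed=0):
    # X: list of d-dimensional sample vectors; returns a list of orthonormal bases,
    # one per recovered s-dimensional subspace (noiseless setting assumed).
    rng = np.random.default_rng(seed)
    points = [np.asarray(x, dtype=float) for x in X]
    subspaces = []
    for _ in range(max_iters):
        if len(subspaces) >= k or len(points) <= s:
            break
        sample = rng.choice(len(points), size=s + 1, replace=False)
        M = np.column_stack([points[i] for i in sample])
        if np.linalg.matrix_rank(M, tol=tol) < s + 1:   # the s+1 points lie on an s-dim subspace
            basis = np.linalg.svd(M)[0][:, :s]          # orthonormal basis of span(M)
            subspaces.append(basis)
            # discard the inliers: points lying (numerically) on the recovered subspace
            points = [x for x in points
                      if np.linalg.norm(x - basis @ (basis.T @ x)) > tol * max(1.0, np.linalg.norm(x))]
    return subspaces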

3.3.1.2 Generalized principal component analysis

Generalized PCA (GPCA) is a method for subspace clustering using techniques from algebraic geometry [98]. GPCA fits a union of k subspaces using a set of polynomials P of order k. To understand this, observe that every hyperplane Si can be parametrized by a vector bi normal to it. Since xi is drawn from the union of k subspaces

in R^d, it can be represented by a polynomial of the form P(x) = ⟨b_1, x⟩ ⋯ ⟨b_k, x⟩ = 0. The procedure begins by fitting a homogeneous polynomial of degree k to the points {x_1, ..., x_n}.

P can be written as c^T ν_k(x), where c is a vector of coefficients and ν_k(x) is the vector of all \binom{n+k-1}{k} monomials of degree k in x. Therefore, to fit the polynomial, we can solve c^T ν_k(x_1) = ... = c^T ν_k(x_n) = 0 for c. In case of noisy data, c can be fitted using least-squares.

The set of vectors {b_1, ..., b_k} is obtained by taking the derivative of P evaluated at a single point in each subspace. We note that GPCA can also determine k if it is unknown.
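A minimal sketch of the fitting step (ours, not the author's implementation) for k lines through the origin in R^2 is shown below; the Veronese-map helper and the use of the smallest right singular vector as the least-squares null vector are standard choices, but the function names are hypothetical.

import numpy as np
from itertools import combinations_with_replacement

def veronese(x, k):
    # all degree-k monomials of the entries of x
    return np.array([np.prod([x[i] for i in idx])
                     for idx in combinations_with_replacement(range(len(x)), k)])

def fit_gpca_polynomial(X, k):
    # X: n-by-d array of samples drawn from a union of k hyperplanes through 0
    V = np.array([veronese(x, k) for x in X])   # embedded data matrix
    return np.linalg.svd(V)[2][-1]              # least-squares null vector of V

# two lines in R^2 with normals (1, 0) and (1, -1)
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.vstack([np.column_stack([np.zeros(100), t[:100]]),   # line x = 0
               np.column_stack([t[100:], t[100:]])])        # line x = y
print(np.round(fit_gpca_polynomial(X, k=2), 3))  # coefficients of x^2, xy, y^2 (up to scale)

The recovered coefficient vector is proportional to the coefficients of x(x − y) = x^2 − xy, matching the two generating normals.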

3.3.1.3 Combinatorics of subspace clustering

We can evaluate the combinatorial performance of the subspace clustering algorithms presented by viewing the problem as a balls and bins framework, which in general studies the distribution of a probabilistic experiment where a number of balls are independently and randomly thrown into bins [68]. Here, the balls are the sample points and the unknown subspaces are the bins.

Theorem 3.1. The expected number of subspaces that can be recovered from n samples is E(n, k, s) = \frac{1}{k^{n-1}} \sum_{j=s+1}^{n} \binom{n}{j+1} (k − 1)^{n−j−1}.

Proof. Observe that, in general position, the combinatorial structure of this problem is equivalent to the classic framework of balls and bins. Let E(n, j) denote the expected number of bins containing exactly j balls after n throws. Observe that E(n, j) satisfies the recurrence relation E(n, j) = (1 − \frac{1}{k}) E(n−1, j) + \frac{1}{k} E(n−1, j−1), which solves to E(n, j) = k \binom{n}{j} (1 − \frac{1}{k})^{n−j} \frac{1}{k^{j+1}}. Then sum over all j > s.
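As a sanity check on the balls-and-bins view (ours, purely illustrative), one can estimate by simulation the expected number of bins that receive more than s of n uniformly thrown balls:

import numpy as np

def expected_recoverable(n, k, s, trials=20000, seed=0):
    # Monte Carlo estimate of the expected number of bins (subspaces)
    # receiving more than s of the n uniformly thrown balls (samples).
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(trials):
        counts = np.bincount(rng.integers(0, k, size=n), minlength=k)
        total += np.count_nonzero(counts > s)
    return total / trials

print(expected_recoverable(n=50, k=10, s=2))  # empirical estimate of E(n, k, s)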

In the next chapter, we give a tight approach to analyzing dictionary learning that relies on the algebraic and combinatorial geometry of the problem.

3.3.2 s-Independent Smallest Spanning Set

Given a subspace arrangement S coming from an s-independent dictionary, the smallest spanning set can be written recursively in terms of the union of: (a) the spanning set of the arrangement obtained by taking the union of the pairwise intersections of all the subspaces in S, together with (b) points outside the pairwise intersections that are necessary and sufficient to completely s-span the subspaces in S. This directly leads to a recursive algorithm for the smallest spanning set problem, as

follows:

Algorithm Smallest Spanning Set

Input: S
Output: I

1. S* = Pairwise Intersection(S)
2. if S* ≠ ∅:
3.     return Smallest Spanning Set(S* ∪ S)
4. else:
5.     SortByIncreasingDimension(S)
6.     I = ∅
7.     for S_i ∈ S:
8.         find S_i ∩ s-span(I)
9.         find S_i/I such that S_i/I ∪ (S_i ∩ s-span(I)) is a basis for S_i
10.        I = I ∪ S_i/I
11. return I
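The pairwise-intersection primitive used above can be implemented numerically; the following sketch (ours, assuming noiseless subspaces given by basis matrices) recovers span(A) ∩ span(B) from the null space of [A, −B].

import numpy as np

def intersect_subspaces(A, B, tol=1e-10):
    # A, B: d-by-rA and d-by-rB matrices whose columns span two subspaces.
    # Returns an orthonormal basis of span(A) ∩ span(B) (possibly empty).
    M = np.hstack([A, -B])                      # solutions of A u = B v give intersection points
    _, sv, Vh = np.linalg.svd(M)
    sv = np.concatenate([sv, np.zeros(M.shape[1] - sv.size)])   # pad to number of columns
    null_vectors = Vh[sv < tol]                 # rows of Vh spanning the null space of M
    if null_vectors.size == 0:
        return np.zeros((A.shape[0], 0))
    points = A @ null_vectors[:, :A.shape[1]].T # A u for each null vector (u, v)
    q, _ = np.linalg.qr(points)
    return q[:, :np.linalg.matrix_rank(points, tol=1e-8)]

# two planes in R^3 intersecting in the first coordinate axis
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # span{e1, e2}
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])   # span{e1, e3}
print(intersect_subspaces(A, B))                      # approximately ±e1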

The dictionary is now obtained by picking m atoms from the intersections of the

subspaces and the remaining spanning sets.

3.4 Learning an Orthogonal Basis

We turn our attention to an interesting special case. In applications such as

Compressed Sensing, discussed in Chapter 2, a signal is known to be sparse in some basis B. This is utilized to obtain significant sampling complexity gains beyond the

Shannon-Nyquist bound. Here we address the problem of efficiently discovering B when

it is unknown.

Definition 10 (Basis Learning). Let X be a given set of vectors in R^d that are known to be generated from a frame orthogonal basis B, with x_i = B y_i, where ||y_i||_0 ≤ s. Find any orthogonal basis B′ such that for all x_i ∈ X there exists y′_i ∈ R^d with x_i = B′ y′_i.

For the special case of a basis with non-negative coefficients y_i and s = o(√d), we obtain a relatively simple algorithm for Basis Learning. This algorithm is based on Lemma 1 below, which asserts that, in this setting, there is a significant chance for random samples to be orthogonal.

Lemma 1. Two random sparse samples x_i and x_j from a basis B with s = o(√d) are orthogonal with constant probability as d → ∞.

Proof. The probability generically depends on whether the supports are mutually exclusive. Pr(⟨x_i, x_j⟩ = 0) = \prod_{i=0}^{s} \frac{d−s−i}{d−i} > \prod_{i=0}^{s} \frac{d−2s}{d−s} > e^{−\frac{2s^2}{d−s}}.

Given a point x_i and a set of points X_i^⊥ = {x_{i_1}, ..., x_{i_l}} that are all orthogonal to x_i, such that dim(span(X_i^⊥)) = d − s, we can partition B into B_i^⊥ and B_i such that B_i^⊥ = span(X_i^⊥) and B_i is the null space of B_i^⊥. This suggests a learning algorithm that recursively subdivides B into subspaces:

Algorithm Learn Basis

Input: X, s
Output: B

1. B = ∅
2. for x_i ∈ X:
3.     let X_i^⊥ = { x_j | x_j ∈ X, ⟨x_i, x_j⟩ = 0 }
4.     if rank(X_i^⊥) ≥ d − s:
5.         B_1 = Learn Basis(Project X onto span(X_i^⊥), s)
6.         B_2 = Learn Basis(Project X onto null(X_i^⊥), s)
7.         B = B_1 ∪ B_2
8. repeat

3.5 Summary

In this chapter, we illustrated how Dictionary Learning can be approached from a geometric point of view. A formal connection is established to other problems of independent interest. This disciplined way of looking at the problem leads to specific algorithms, with improvements for simplified instances that are encountered in practice, as in Section 3.4 (and as we shall see in Chapter 5). In the next chapter, we study the

complexity of Dictionary Learning. This is done with the help of a surrogate problem in which the combinatorics of the subspace supports are specified. The problem is then

approached using machinery from algebraic and combinatorial geometry. Specifically,

a rigidity-type theorem is obtained that characterizes the sample arrangements that recover a finite number of dictionaries, using a purely combinatorial property on the

supports.

CHAPTER 4
SAMPLING COMPLEXITY FOR GEOMETRIC DICTIONARY LEARNING

For motivations discussed earlier, we are interested in understanding the problem of geometric dictionary learning. Recall the definition of Geometric Dictionary Learning from Chapter 3. As discussed in Chapter 2, even the problem of recovering Y given X where D is known has been shown to be NP-hard by reduction from the Exact Cover by 3-sets problem [24]. One is then tempted to conclude as a corollary that Geometric Dictionary Learning is also NP-hard. However, this cannot be directly deduced, in general. The flaw in this reasoning is that, even though adding a witness D turns the problem into an NP-hard problem (vector selection), it is possible that the Geometric Dictionary Learning problem is solved by producing a different dictionary D*.

Now we introduce a surrogate problem where the combinatorics are specified. In this problem we are also given the supports of the input set in the unknown dictionary. This new problem is called the Restricted Dictionary Learning problem, properly defined in what follows.

Definition 11 (Restricted Dictionary Learning). Let X be a given set of vectors in R^d.

We are also given, for each x_i, an index set S_i of s columns of an unknown s-regular dictionary D = [d_1, ..., d_m] such that x_i ∈ span(d_j, j ∈ S_i). Find any frame dictionary D′, such that m′ = |D′| ≤ m and, for all x_i ∈ X, x_i ∈ span(d′_j, j ∈ S_i).

This simplification enables us to analyze the problem using machinery from

algebraic and combinatorial geometry. For the time being, we restrict ourselves to

s = 2, d = 3. We can project the system onto the affine plane so that each d_i maps to p_i, a 2D point. For notational convenience, we refer to the projection of x_k simply as x_k where the meaning is clear from the context. This defines a corresponding problem in the

projective plane:

Definition 12 (Pinned Line-Incidence Problem). Let X be a given set of points (pins) in P^2(R). We are also given index sets S_i of the lines passing through an unknown set of points P = {p_1, ..., p_m}, such that no 3 points lie on the same line. Find any such set of points P[X] that satisfies the given line incidences S_i on P[X] and X.

The Pinned Line-Incidence problem is related to classic problems in the literature such as

Direction Networks [155, 168, 169], Direction Networks with pins on sliders [172],

the Molecular conjecture in 2D [160], and pin-collinear body-pin [171]. Pinned Line-Incidence can be viewed as:

• Direction Networks, except that we are given translations (the xi’s) instead of directions.

• A Body-Pin framework where each body is a line and each pin is on at most 2 bodies. We add that each pin is constrained to a slider.

• A Point-Line coincidence framework with at most 2 points on a line, where each line is pinned by a globally fixed (given) pin.

• A Point-Line coincidence framework with at most 2 lines on a point, where each point is on a globally fixed (given) slider.

However, while these previously studied problems are linear, we shall see, in Section 4.1, that our problem is further complicated by being non-linear (the constraints

are quadratic). The notion of Affine Rigidity [164] is also related to our problem when

viewed in a coordinate-free manner (i.e. only relative positions are relevant). Affine

rigidity asks, for a given set of points in Rn, when do measurements of the relative affine

positions of some subsets of the points determine the positions of all the points, up to an overall affine transformation.

Example 7. Figure 4-1 depicts two examples of the Pinned Line-Incidence Problem.

Our problem is similar in flavor to body-pin frameworks, as in the 2D molecular theorem, where each body is a line and each pin is on at most 2 bodies; we then add the constraint that each pin lies on a slider. It is also similar to point-line coincidence frameworks with at most 2 points on a line, where each line is pinned by a globally fixed (given) pin. Looking at the dual space, we may also view the problem as point-line coincidences with at most 2 lines on a point, where each point is on a globally fixed (given) slider.

Figure 4-1. Two simple arrangements. The larger (blue) dots are the points P and the small (grey) dots are the given pins X.

We prove combinatorial conditions that characterize the inputs that recover a finite number of solutions for P. Note that 2-regularity, i.e. the requirement that no 3 points in P are collinear, simplifies the problem by avoiding badly behaved cases such as

Pappus’s Theorem [165].

4.1 Algebraic Representation

We can derive algebraic systems of equations representing our problem. Consider a point x_k on the line connecting p_i and p_j. Working in the projective plane and using homogeneous coordinates, the point can be written in terms of p_i and p_j as follows:

(1 − β)pi,1 + βpj,1 = xk,1

(1 − β)pi,2 + βpj,2 = xk,2.

Solving to remove β, we obtain:

p_{i,1} p_{j,2} − p_{j,1} p_{i,2} − x_{k,2}(p_{i,1} − p_{j,1}) + x_{k,1}(p_{i,2} − p_{j,2}) = 0.

Note that this is in fact the same statement as (d_i − x_k) × (d_j − x_k) = 0 in homogeneous space. The problem now reduces to solving a system of equations, each of the form 4–1. The system of equations sets a multivariate function F(P, X) to 0:

F(P, X) = [ ⋯ , p_{i,1} p_{j,2} − p_{j,1} p_{i,2} − x_{k,2}(p_{i,1} − p_{j,1}) + x_{k,1}(p_{i,2} − p_{j,2}) , ⋯ ]^T = 0    (4–1)

where F(P, X) is a vector-valued function from R^{|P|+|X|} to R^{|X|}. Henceforth, when we view X as a fixed parameter, we let F_X(P) = F(P, X), a function from R^{|P|} to R^{|X|} parametrized by X.

Example 8. For the example in Figure 4-2, the polynomial system FX(P) = 0 can now be written as:

Figure 4-2. A simple arrangement of 6 points.

p_{1,1} p_{2,2} − p_{2,1} p_{1,2} − x_{1,2}(p_{1,1} − p_{2,1}) + x_{1,1}(p_{1,2} − p_{2,2}) = 0
p_{1,1} p_{2,2} − p_{2,1} p_{1,2} − x_{2,2}(p_{1,1} − p_{2,1}) + x_{2,1}(p_{1,2} − p_{2,2}) = 0
p_{1,1} p_{3,2} − p_{3,1} p_{1,2} − x_{3,2}(p_{1,1} − p_{3,1}) + x_{3,1}(p_{1,2} − p_{3,2}) = 0
p_{1,1} p_{3,2} − p_{3,1} p_{1,2} − x_{4,2}(p_{1,1} − p_{3,1}) + x_{4,1}(p_{1,2} − p_{3,2}) = 0
p_{2,1} p_{3,2} − p_{3,1} p_{2,2} − x_{5,2}(p_{2,1} − p_{3,1}) + x_{5,1}(p_{2,2} − p_{3,2}) = 0
p_{2,1} p_{3,2} − p_{3,1} p_{2,2} − x_{6,2}(p_{2,1} − p_{3,1}) + x_{6,1}(p_{2,2} − p_{3,2}) = 0

which can be solved using a polynomial system solver such as the Sage open-source mathematics software (based on Python).
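Alternatively, the system can be attacked numerically. The sketch below (ours, not the author's code) generates a random instance with the incidence structure of Figure 4-2 — two pins on each of the three lines — and solves F_X(P) = 0 with a generic root finder; the support list and the solver choice are our assumptions.

import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
P_true = rng.normal(size=(3, 2))                   # ground-truth points p_1, p_2, p_3
lines = [(0, 1), (0, 1), (0, 2), (0, 2), (1, 2), (1, 2)]   # pin k lies on the line (i, j)
betas = rng.uniform(0.2, 0.8, size=6)
X = np.array([(1 - b) * P_true[i] + b * P_true[j] for (i, j), b in zip(lines, betas)])

def F(p_flat):
    # one equation of the form (4-1) per pin
    p = p_flat.reshape(3, 2)
    return [p[i, 0] * p[j, 1] - p[j, 0] * p[i, 1]
            - X[k, 1] * (p[i, 0] - p[j, 0]) + X[k, 0] * (p[i, 1] - p[j, 1])
            for k, (i, j) in enumerate(lines)]

sol = fsolve(F, rng.normal(size=6)).reshape(3, 2)   # attempts to recover one of the finitely many solutions
print(np.max(np.abs(F(sol.ravel()))))               # residual, close to 0 at a solution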

Notice that every pin is a constraint that can, potentially, remove a single degree of freedom from the problem. Without any pins, the arrangement has a total of 2m degrees of freedom.

4.2 Combinatorial Rigidity

As shown in the previous section, the problem can be viewed as finding the

common solutions of a system of polynomial equations (real algebraic variety). The

combinatorics of the problem can be viewed as a multi-graph G = (V, E), where each edge represents a pin constraint x_i. We describe the approach taken by the tradition of rigidity theory [155, 156, 159,

160, 168–172], and give some of the definitions [173].

Generally, a solution exists if the polynomials are independent. However, checking the independence of the polynomial system is equivalent to checking whether one of the

polynomials of the system is in the ideal generated by the others. In general, checking

independence relative to the ideal generated by the variety is computationally hard and

the best known algorithms, such as computing Gröbner bases, are exponential in time and

space [58]. However, the algebraic system is usually linearized at regular (locally flat, i.e. non-singular) points and then independence can be checked for generic configurations.

A Pinned Line-Incidence Framework is a triple (G, X, P), where G = (V, E) is a graph; X: {x_1, ..., x_m} ⊆ R^2 → E is an assignment of a given set of points x_i to edges X(x_i) = (v_{i_1}, v_{i_2}) ∈ E; and P: V → R^2 is an embedding of each vertex v_j into a point p_j = P(v_j) ∈ R^2 such that for each x_i, the three points x_i, p_{i_1}, p_{i_2} are collinear. Note: when the context is clear, we use X to denote both the set of points {x_1, ..., x_m} and the above assignment of these points to edges of G.

Two frameworks (G_1, X_1, P_1) and (G_2, X_2, P_2) are equivalent if G_1 = G_2 and X_1 = X_2, i.e., they satisfy the same algebraic equations for the same labeled graph and ordered set of pins. They are congruent if they are equivalent and P_1 = P_2.

Independence of the algebraic system is defined as none of the algebraic constraints being in the ideal generated by the others. Independence implies the existence of a solution.

of a solution.

Rigidity is the existence of at most finitely many solutions to the algebraic system

4–1. Minimal Rigidity is the existence of a solution and at most finitely many solutions. Global Rigidity is the existence of at most 1 solution.

Under appropriate conditions of Lemma 3, rigidity and independence (based on nonlinear polynomials) can be captured by linear conditions in an infinitesimal setting.

Consider the infinitesimal motion vector (infinitesimal flex) of a pair of points pi

and p_j constrained by a pin x_k, as in Figure 4-3; their velocities v_i and v_j satisfy ⟨v_i, n_l⟩ r_{i,k} + ⟨v_j, n_l⟩ (d_{i,j} − r_{i,k}) = 0, where n_l is the normal to the line, d_{i,j} is the distance between p_i and p_j, and r_{i,k} is the distance between the pin x_k and p_i.

Now consider n_l, the normal vector to the line l_{i,j}; it can be written as [cos α_{i,j}, sin α_{i,j}]. If v_P is a column vector whose entries are the velocities of all the v_i, then the constraint ⟨v_i, n_l⟩ r_{i,k} + ⟨v_j, n_l⟩ (d_{i,j} − r_{i,k}) = 0 translates to a row of the form:

[0...0, r_{i,k} cos α_{i,j}, r_{i,k} sin α_{i,j}, 0...0, (d_{i,j} − r_{i,k}) cos α_{i,j}, (d_{i,j} − r_{i,k}) sin α_{i,j}, 0...0] v_P = 0.

Figure 4-3. The velocities of a pair of points p_i and p_j constrained by a pin x_k.

If the pi’s are not coincident (i.e. di,j =6 0), we can divide the row by di,j. Moreover,

since the number of lines is finite, a coordinate system can be selected such that cos αi,j is not zero. Therefore, the row pattern can be simplified to:

[0...0, ak, akbk, 0...0, (1 − ak), (1 − ak)bk, 0...0]. (4–2)

A Rigidity Matrix is a matrix whose kernel is the space of infinitesimal motions (flexes). Infinitesimal Independence is defined as independence of the rows of the rigidity matrix, i.e., the number of rows of the rigidity matrix equals its rank. Infinitesimal Rigidity is the full rank of the rigidity matrix. Infinitesimal Minimal Rigidity is when there are exactly enough independent rows of the rigidity matrix to determine the variables.

In algebraic geometry, a property being generic intuitively means that the property holds on the open dense complement of a (real) algebraic variety. Formally,

Definition 13. A framework G(p) is generic w.r.t. a property Q if and only if there exists a neighborhood δ(p) such that for all q ∈ δ(p), p satisfies the property Q if and only if q satisfies Q.

Furthermore we can define generic properties of the graph,

Definition 14. A property Q of frameworks is generic (i.e., becomes a property of the graph alone) if for all graphs G, either all generic (w.r.t. Q) frameworks of G satisfy Q, or all generic (w.r.t. Q) frameworks of G do not satisfy Q.

A framework is generic for Q if an algebraic variety V_Q specific to Q is avoided by the given framework. Often, for convenience in relating Q to other properties, a more restrictive notion of genericity is used than stipulated by Definitions 13 or 14 above, as in Lemma 3. I.e., for convenience, another variety V′_Q is chosen so that V_Q ⊆ V′_Q. Ideally, the variety V′_Q corresponding to the chosen notion of genericity should be as tight as possible for the property Q (necessary and sufficient for Definitions 13 and 14), and should be explicitly defined, or at least easily testable for a given framework. Once an appropriate notion of genericity is defined for which a property Q is generic, we can treat Q generically as a property of a graph.

The primary activity of the area of combinatorial rigidity is to additionally give purely graph-theoretic characterizations of such generic properties Q. In the process of drawing such combinatorial characterizations, the notion of genericity may have to be further restricted, i.e., the variety V′_Q is further expanded by so-called pure conditions that are needed for the combinatorial characterization to go through. (We will see this below in Theorem 4.1.)

Note that the generic rank of a generic matrix M is at least as large as the rank of any specific realization M(p).

4.3 Required Graph Properties

The following gives a pure graph property that will be useful for our purposes.

Definition 15. The (2, 0)-tightness condition suitable for our problem is defined on a graph G as follows:

• |E| = 2|V|.

• For any V′ ⊂ V, the augmented induced subgraph G′ = (V′, E′) satisfies |E′| ≤ 2|V′|. Here G′ is the vertex-induced subgraph augmented with a self-loop at v_i whenever there are two edges from the same vertex v_j ∈ V − V′ to the same vertex v_i ∈ V′.

A graph is called (2, 0)-sparse if it satisfies the second condition of Definition 15.

Example 9. Figure 4-4 gives an example of a configuration whose support multi-graph is (2, 0)-tight.

Figure 4-4
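For small support multi-graphs, the (2, 0)-tightness condition of Definition 15 can be checked by brute force; a rough sketch (ours, with the augmented self-loop rule implemented as we read it) follows.

from itertools import combinations
from collections import Counter

def is_20_tight(num_vertices, edges):
    # edges: list of (u, v) pairs over vertices 0..num_vertices-1 (multigraph, repeats allowed)
    if len(edges) != 2 * num_vertices:
        return False
    for r in range(1, num_vertices + 1):
        for subset in combinations(range(num_vertices), r):
            inside = set(subset)
            induced = sum(1 for u, v in edges if u in inside and v in inside)
            # pairs of parallel edges from one outside vertex to one inside vertex add a self-loop
            crossing = Counter((v, u) if u in inside else (u, v)
                               for u, v in edges if (u in inside) != (v in inside))
            loops = sum(c // 2 for c in crossing.values())
            if induced + loops > 2 * len(inside):
                return False
    return True

# the doubled triangle of Figure 4-2: 3 points, two pins on each of the 3 lines
print(is_20_tight(3, [(0, 1), (0, 1), (0, 2), (0, 2), (1, 2), (1, 2)]))  # True

The example call checks the doubled-triangle multi-graph underlying Figure 4-2, which is (2, 0)-tight.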

This is a special case of the (k, l)-sparsity condition studied in [156, 158, 168]. A relevant concept from graph matroids is that of a circuit, which we define as follows.

Definition 16. A circuit is a (multi)graph G = (V, E), such that E = T ∪ e, where T is a minimal spanning tree of V and e is an arbitrary edge.

The following lemma gives a useful characterization of (2, 0)-tight graphs in terms of circuits.

Lemma 2. A graph G = (V, E) is composed of the union of 2 edge-disjoint circuits, G_1 = (V, E_1) and G_2 = (V, E_2), if and only if G is (2, 0)-tight.

Proof. A theorem of Tutte and Nash-Williams [167] shows that a graph can be covered by 2 edge-disjoint spanning trees if and only if the graph is (2,2)-sparse. Since circuits are made of trees plus arbitrary edges, it is easy to see that any (multi)graph can be covered by 2 edge-disjoint circuits if and only if the graph is (2,0)-tight.

4.4 Rigidity Theorem in d = 3, s = 2

We state the main result here and then work our way to the proof.

Theorem 4.1. In d = 3, the restricted dictionary learning problem is generically minimally rigid (i.e., admits at least one and at most finitely many solutions) if and only if the supports' multi-graph is (2, 0)-tight.

The proof follows the tradition of rigidity theory [155, 156, 159, 160, 168–172]. In

particular we adopt an approach by White and Whiteley [155, 169], in proving rigidity of k-frames. The proof outline is as follows:

• We show that, for our system, infinitesimal rigidity is equivalent to rigidity at a regular point (Lemma 3).

• We obtain a simple form for the Rigidity matrix (Lemma 4) and show that this matrix is equivalent to the Jacobian of the algebraic function in 4–1.

• We show that, for a specific form of the rows of a matrix defined on a circuit graph, the determinant is not identically zero (Lemma 5).

• We apply a Laplace decomposition to the (2, 0)-tight graph, as a sum of two circuits, to show that the determinant of the Rigidity matrix is not identically zero (Proof of Main Theorem).

• The resulting polynomial is called the pure condition: the relationship that the system has to satisfy in order for the combinatorial characterization to hold.

It is shown in [153] that, at a regular point, infinitesimal rigidity is equivalent to

generic rigidity. We adapt this connection for our problem as follows.

Lemma 3. If P and X are regular points, then generic infinitesimal rigidity is equivalent to generic rigidity.

Proof Sketch. First we show that if a framework is regular, infinitesimal rigidity implies rigidity: Consider the polynomial system of equations 4–1, F(X, P). The Implicit Function

Theorem states that there exists a g(x), such that P = g(X) on some open interval, if

and only if the Jacobian JX(P) of F(X, P) w.r.t. P has full rank. Therefore, if the system is infinitesimally rigid, then the solutions to the algebraic system are isolated points

(otherwise g(x) could not be explicit). Since the algebraic system contains finitely many components, there are only finitely many such solutions and each solution is a 0-dimensional point. This implies that the total number of solutions is finite, which is the definition of rigidity.

To show that generic rigidity implies generic infinitesimal rigidity, we take the

contrapositive: if the system is not infinitesimally rigid, we show that there is a finite

flex. If (G, P, X) is not infinitesimally rigid, then the rank r of the Jacobian J_X(P) is less than 2m. Let E* be a set of edges in G such that |E*| = r and the corresponding rows in J_X(P) are all independent. There are r independent columns as well. Let P_{E*} be the components of P corresponding to those r columns and P_{E*⊥} be the remaining components. The r-by-r submatrix, made up of the corresponding independent rows and columns, is invertible. Then, by the Implicit Function Theorem, in a neighborhood of P there exists a continuous and differentiable function g such that P_{E*} = g(P_{E*⊥}). This identifies P*, whose components are P_{E*} and the level set of g corresponding to P_{E*}, such that F_X(P*) = 0. The level set defines the finite flexing of the system. Therefore the system is not rigid.

Next we construct a simple Rigidity Matrix which is the Jacobian of Function 4–1.

Lemma 4. The rigidity matrix M whose rows are of the form [0...0, a_k, a_k b_k, 0...0, (1 − a_k), (1 − a_k) b_k, 0...0] is the Jacobian of Function 4–1.

Proof. The Jacobian JX(P) can be written by taking the derivatives of Function 4–1 w.r.t.

the points pi’s. The rows of the Jacobian are of the form:

[..., p_{j,2} − x_{k,2}, −p_{j,1} + x_{k,1}, ..., −p_{i,2} + x_{k,2}, p_{i,1} − x_{k,1}, ...].    (4–3)

This form can be readily seen as equivalent to Equation 4–2 above, by noticing that the row entries correspond to the vectors (p_j − x_k) and (p_i − x_k) projected on the coordinate system.

The infinitesimal motions are therefore given by the solutions to M v_P = 0, where M is the rigidity matrix of the constraints: columns represent the two degrees of freedom of each p_i, and rows represent the constraints imposed by the pins on the velocities of the p_i.

Example 10. Consider the arrangement given in Example 8. If the pins are given by

X = [ −2   −4/3   −1    1    4/3   2/3
       1     0    −1   −1     0     1  ]

then the unknown dictionary is

D = [ 0   −2    2
      2   −1   −1 ]

and the Rigidity Matrix is

M = [  2/3   −1     0     0    8/3   2
       4/3   −2     0     0   10/3   1
      −2/3   −1   −4/3   −2     0    0
      −4/3   −2   −2/3   −1     0    0
        0     0     1     0     3    0
        0     0     3     0     1    0 ],

which is full rank.
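The full-rank claim can be verified numerically (illustrative only; the matrix is entered as reconstructed above):

import numpy as np

M = np.array([[ 2/3, -1,    0,    0,  8/3, 2],
              [ 4/3, -2,    0,    0, 10/3, 1],
              [-2/3, -1, -4/3,   -2,    0, 0],
              [-4/3, -2, -2/3,   -1,    0, 0],
              [   0,  0,    1,    0,    3, 0],
              [   0,  0,    3,    0,    1, 0]])
print(np.linalg.matrix_rank(M))   # 6, i.e. full rank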

Note that there are several correct ways to write the rigidity matrix of the problem, depending on what you consider as the primary components of the columns (points, lines, or both), and whether one chooses to work in primal or dual space. We pick points for columns and work in primal space for the simplicity of the row pattern.

We develop one more lemma, on the generic rank of particular matrices defined on circuit graphs, that will be helpful in analyzing the rank of M.

Lemma 5. A matrix N defined on a directed circuit G = (V, E), such that columns are indexed by the vertices and rows by the edges, where the row for e_{i,j} ∈ E has non-zero entries only at the indices corresponding to v_i and v_j, following the pattern:

[0...0, ak, 0...0, (1 − ak), 0...0] is generically full rank.

Proof. Any circuit G can be written as C ∪ {T_1, ..., T_s}, where C is a single cycle of core vertices and the T_i are disjoint trees (a forest) such that each T_i has its root vertex v_{T_i} ∈ C.

The sub-matrix for the cycle, N[V_C, E_C], can be shown to be generically full rank by writing the columns in their order in the cycle and redirecting the edges such that they all point in the same direction on the cycle. The determinant of the redirected cycle

N_R[V_C, E_C] is:

det(N_R[V_C, E_C]) = \prod_i a_{v_i} + \prod_i (1 − a_{v_i}),    (4–4)

which can be simplified to the form

\sum_j sign(j) \, a_1^{j_1} ⋯ a_{|v_c|}^{j_{|v_c|}},    (4–5)

where j ranges from 0 to 2^{|v_c|} − 1 and j_i is the ith base-2 digit of j. Redirecting amounts to a change of variables of the form a′_k = 1 − a_k. After the change of variables, Equations (4–4) and (4–5) represent the condition under which N[V_C, E_C] is full rank, i.e., the pure condition for genericity.

For a given tree T_i and G_k ⊇ C, we can show by induction on each level j of the tree that the matrix N[V_{G_k} ∪ V_{T_i}, E_{G_k} ∪ E_{T_i}] has full rank.

• Base case: since the root of T_i is in C, it is in G_k.

61 • If a given level i is full rank, then by Gaussian elimination level i+1 is full rank since level i and level i + 1 are connected.

Applying the above argument inductively to each T_i, with the base case G_1 = C, completes the proof that circuits have full rank.

To calculate the determinant, apply a Laplace expansion splitting the cycle C and the forest ∪T_i − C, then observe that the determinant of a forest has a single term (either a_k or 1 − a_k) for each vertex. Then:

det(N) = det(N_C) \prod_k a_k^{d(k)} (1 − a_k)^{1−d(k)},

where the product is taken over all edges e_k in ∪T_i − C, and d(k) = 1 if and only if e_k is directed towards C, otherwise d(k) = 0.
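A quick numerical illustration of Lemma 5 (ours, not from the dissertation): build the edge-vertex matrix of a circuit with random generic a_k and confirm it has full rank.

import numpy as np

rng = np.random.default_rng(0)
n = 6
# a circuit on n vertices: a spanning path 0-1-...-(n-1) plus one extra edge (0, n-1)
edges = [(i, i + 1) for i in range(n - 1)] + [(0, n - 1)]

a = rng.uniform(size=len(edges))
N = np.zeros((len(edges), n))
for k, (i, j) in enumerate(edges):
    N[k, i] = a[k]            # row pattern [0...0, a_k, 0...0, (1 - a_k), 0...0]
    N[k, j] = 1 - a[k]

print(np.linalg.matrix_rank(N))   # expected: n (= 6) for generic a_k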

Now we are ready to prove the main theorem.

Proof of Main Theorem. The characterization follows from proving that the polynomial from the determinant of M is not identically zero. Since the number of columns is 2m, it is trivial that 2m pins are necessary. It is also trivial to see that (2, 0)-tightness is necessary, since a subgraph G′ = (V′, E′) exceeding its count, |E′| > 2|V′|, implies that its vertex complement G* = (V*, E*), with V* = V − V′ and E* = E(V*), is under-determined, i.e., |E*| < 2|V*|.

Next we show that the converse is true: that 2m pins arranged generically in a

(2, 0)-tight pattern imply infinitesimal rigidity. By grouping and factoring out the columns to separate groups of coordinates (the

a’s and b’s), a simpler matrix is obtained. This can be done by applying a Laplace

expansion to rewrite the determinate of the rigidity matrix, as a sum of products of

determinants (brackets) representing each of the coordinates taken separately: P  det(M) = ∀X,Y det(M[A, X])det(M[B, Y])

where the sum is taken over all complementary sets of m rows X and Y. Observe that

M[A, X]’s now have rows of the form:

[0...0, ak, 0...0, (1 − ak), 0...0].

Moreover, since the determinant of a matrix is multi-linear in the rows of that matrix,

det(M[B, Y]) = (b_{c_1,l} ⋯ b_{c_m,l}) det(M′[B, Y]),

where {c_1, ..., c_m} = c(l) indexes the Laplace partitions, and the rows of M′[B, Y] are of the form:

[0...0, ak, 0...0, (1 − ak), 0...0].

Since both M[A, X] and M′[B, Y] have the same form as N from Lemma 5, their determinants are generically non-zero if the induced graphs are circuits. We conclude that

det(M) = \sum_{∀c(l)} (b_{c_1,l} ⋯ b_{c_m,l}) det(M[A, Y]) det(M′[B, Y]),

where the c(l)’s enumerate all the Laplace partitions. Observe that each element of the

sum has a unique multi-linear monomial (bc1,l ... bcm,l) that generically do not cancel with any of the others since det(M[A, Y])det(M[B,´ Y]) are independent of the b’s. This implies that the generic rank of M is 2m if the induced graphs are circuits.

Since, from Lemma 2, (2, 0)-tight graphs can be partitioned into edge disjoint union of

two circuits, this completes the proof. Moreover, substituting the values of det(M[A, Y])

and det(M[B,´ Y]) from Lemma 4 gives the pure condition for genericity.

The Theorem gives a pure condition that characterizes the badly behaved cases

(i.e., the conditions of non-genericity that break the combinatorial characterization of

the infinitesimal rigidity). The pure condition is a function of the a’s and b’s which can be calculated from the particular realization (framework) using Lemma 5 and the main

theorem. Whether it is possible to efficiently test for genericity from the problem's input (the graph and the x_i's) is an open problem. The Theorem requires the following genericities:

• The pure condition, which is a function of a given framework.

• Generic infinitesimal rigidity, which concerns the generic rank of the matrix (i.e., the requirement that the rank of the rigidity matrix of a generic realization be at least as large as the rank of any other realization). The relationship between the two notions of genericity is an open question.

Whether one implies the other is an area of future development. However, each of the above conditions is open and dense. Therefore the notion of genericity for the entire theorem, satisfying all of the above conditions, is also an open, dense subset of R^{6m}. In other words, the theorem applies on the complement of a closed, nowhere dense set.

Example 11. The pure condition for the polynomial system presented in Example 8 can be calculated from the above. The term corresponding to partitioning at 1, 2, 3 is

(b_4 b_5 b_6)(−a_1 a_3 a_4 a_5 + a_1 a_3 a_4 a_6 + a_2 a_3 a_4 a_5 − a_2 a_3 a_4 a_6 + a_1 a_4 a_5 − a_1 a_4 a_6 − a_2 a_4 a_5 + a_2 a_4 a_6),

which is the only term that contains b_4 b_5 b_6 as a factor. The entire expression for the pure condition contains 20 such terms.

The next observation introduces a more succinct way of writing the pure condition via Grassmann–Plücker coordinates [174]:

Observation 5. The pure condition for M can be written as det(M) = ⟨ϕ(M[A]), ϕ(M[B])⟩, where:

• ϕ(M[·]) is the \binom{m}{m/2}-dimensional vector of Grassmann–Plücker coordinates.

• M[A] and M[B] are the m-by-2m submatrices of M consisting of the first and second group of coordinates, respectively.

Furthermore, ϕ(M[B]) = ϕ(M[A])* · P(b), where

• ϕ(M[A])* is ϕ(M[A]) with its coordinates swapped to the complementary indices.

• P(b) is the vector of the \binom{m}{m/2} combinations of {b_1, ..., b_m} in reverse order.

• · is the element-wise product.

4.4.1 Consequences

Now we relate the restricted dictionary learning problem to the general geometric dictionary learning problem. The following is a useful corollary to the main theorem.

Corollary 1. Given a set X = {x_1, ..., x_n} of n points in R^3, generically there is a dictionary D of size m such that D y_i = x_i and ||y_i||_0 ≤ s only if m ≥ n/2. Conversely, if m = n/2 and the supports of the x_i (the nonzero entries of the y_i) are known to form a (2, 0)-tight graph G, then, generically, there is at least one and at most finitely many such dictionaries.

Proof. One direction holds because for m < n/2, the system is generically overconstrained.

This implies that the system is generically rigid but not minimally rigid. The converse

is implied from our theorem since at m = n/2 we are guaranteed both generic

independence (the existence of a solution) and generic rigidity (at most finitely many

solutions).

4.5 Summary

In this chapter, Geometric Dictionary Learning is investigated theoretically with the help of a surrogate problem in which the combinatorics of the subspace supports are specified. The problem is then approached using machinery from algebraic and combinatorial geometry. Specifically, a rigidity-type theorem is obtained that characterizes the sample arrangements that recover a finite number of dictionaries, using a purely combinatorial property on the supports. In the next chapter, minor and partial results that need further development are presented, and new open questions and ideas relevant to our work are discussed.

CHAPTER 5
FUTURE WORK AND CONCLUSIONS

In this chapter we discuss new questions and ideas relevant to our work. We start by discussing minor results that need further development.

5.1 Minor Results

In this section we discuss some potentially interesting minor results that need

further development. We begin with a natural discretization of sparse approximation where the dictionary is picked from the hypercube {−1, 1}^n.

5.1.1 Representation Scheme on The Cube, Moulton Mapping

Suppose we want to restrict ourselves to dictionary atoms picked from the

n-dimensional hypercube {−1, 1}^n. Then a natural question arises: given s, what is the minimum approximation error that can be guaranteed for all x ∈ R^n? With a simple linear transformation, L(x) = 2x − 1, one can re-frame the question by asking for a vector x, on the unit ball in R^n, that is far away from every s-dimensional subspace spanned by vectors in {0, 1}^n. When s = n, there is no such vector x.

When s < n, this question becomes the same as asking for a set of n numbers

{x1, ... , xn} for which there is no set T with |T| = s such that xi is the sum of some subset

Ti of T. This is an old number theory question by Moulton [152]. He asked whether the first n powers of 2 can be obtained from a set T of cardinality less than n (as subset sums).

First, we can ask, for a fixed T, what is the resulting approximation error on the x_i. It is easy to see that the best possible T is T* = {2^0, 2^1, 2^2, ..., 2^{s−1}}; therefore the maximum possible error on an x_i is 1/2^s, and so the total ℓ_2-squared error is bounded by n/2^{2s}. However, we know that every T ⊂ [2^s − 1] = {0, 1, 2, 3, ..., 2^s − 1} can be generated from T* with at most s sums.

Therefore, one can approximate any vector x using an s-sparse subset of the cube, with the constraint that the coefficients are selected from the lattice {0, 1/2^s, 2/2^s, ..., 1}^s, with an error of at most n/2^{2s}.

the generalizations of the in-centers at the surfaces of the cube. This may obtain a tight lower bound on the error.

5.1.2 Cautiously Greedy Pursuit

Consider a Sparse Approximation problem with s < √d, where the entries of y_i are all non-negative, D is an s-regular dictionary, and the error ε > 1/√s. In this setting, a greedy solution is possible. A sketch of the solution idea is as follows. Given x and D = {d_1, ..., d_m}, the algorithm finds y by iteratively following the greedy direction on D (i.e., picking the column d_i with the largest inner product with the residual of y) for a step proportional to ||y||_2/√s. The ℓ_2 error converges as O((1 − 1/s)^n) = O(e^{−n/s}) in n steps. Extending this approach may be a topic of future development.
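A rough sketch of this idea in Python (our reading, not the author's implementation): we take the cautious step proportional to the residual norm divided by √s — one plausible interpretation of the step rule above — and fix an arbitrary iteration count.

import numpy as np

def cautiously_greedy_pursuit(D, x, s, n_steps=200):
    # D: d-by-m dictionary with unit-norm columns; x: target vector; s: sparsity parameter.
    m = D.shape[1]
    y = np.zeros(m)
    for _ in range(n_steps):
        r = x - D @ y                            # current residual
        i = int(np.argmax(D.T @ r))              # greedy direction (non-negative setting)
        y[i] += np.linalg.norm(r) / np.sqrt(s)   # cautious step (our choice of step size)
    return y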

5.2 Future Work From Chapter 2

This section proposes some directions for future development of our work in

Chapter 2. We discuss DRDL trade-offs and the Hierarchy question.

5.2.1 DRDL Trade-offs

One interesting direction is to investigate the effect of capping the dimension

reduction in various ways (e.g. pooling functions in Smale’s framework). Imposing

invariance selectivity in the dimension reduction step lowers the embedding dimension

at the expense of invertibility, similar to H-max. A more general approach would be to incorporate partial selectivity for invariances, whereby one learns dimension reductions

that are lossy for invariances but maintain some variance for a measure of interest. For

example, in vision one can attempt to impose partial selectivity for rotational invariance

and scale invariances by a dimension reduction that pools over small rotations or small

scale shifts. Partial-invariance selectivity can be modeled as the addition of randomness to our generative model. We will explore this direction in future work.

Adding invariance to the Dictionary Learning steps improves the sampling

complexity. For instance, time and space share the property of being shift-invariant.

One can model the same spatiotemporal block with a single dictionary or with a three level hierarchy of shift-invariant DRDL reflecting two dimensions of space and one of

time. Shift-invariant dictionaries have been studied in the context of learning audio

sequences yielding improved performance empirically [128].

5.2.2 The Hierarchy Question

Towards formally understanding the implications of using a hierarchy, we propose a

list of questions. [148] develops a theoretical framework with a similar goal based upon

the H-Max class of models, which we have shown to be qualitatively different from ours. The question of evaluating the (in)significance of hierarchies can be divided into two

parts:

1. Capacity of Model: characterize the concept classes expressible relative to the number of levels.

2. Complexity of Learning: characterize the differences in sampling complexity, computational complexity, etc., as a function of the mismatch between the generative and learning models.

When is the hierarchical generative model a good assumption? There are

arguments from complex system theory and evolutionary theory on why our environment is modular [13]. The basic premise is informal and is usually along the lines that natural selection requires stable/robust modules before it can build more complicated ones in a sustainable manner. We seek a more comprehensive answer. To obtain a formal computational understanding of this question, we ask what potential classes of functions our generative models encode. If we restrict ourselves to loss-less DR, then a single layer of DRDL in generative mode can be thought of as a generator of functions given its input as a seed, where a vector corresponds to the value of a single function over its entire domain. The generative model of HSR then corresponds to a composition of these functions. This is directly connected to a classic goal of Approximation Theory,

understanding what happens when applying compositions and superpositions. An effort to understand this question gave rise to Hilbert's 13th problem: whether the solution of the seventh-degree equation x^7 + a x^3 + b x^2 + c x + 1 = 0, viewed as a function of the three coefficients a, b, c, can be written as a composition of a finite number of two-variable functions. A general answer to Hilbert's 13th problem was given by Arnol'd, showing that every continuous function of three variables can be expressed as a composition of finitely many continuous functions of two variables [147].

Given a hierarchical generative model, what are the complexity trade-offs of using a different depth for the learning hierarchy? For instance we might want to understand the complexity cost of learning a hierarchical generative model with a single large layer. We touched upon this issue in Section 2.2.2.6. A more comprehensive analysis involving

any mismatch, such as different depth hierarchies and different complexity within each

layer, is a potential area of future development.

5.3 Future Work From Chapter 3

This section touches upon future extensions and applications of our work in Chapter

3. We discuss potential generalization and application of the Cluster and Intersect Algorithm.

5.3.1 Cluster and Intersect Extensions

The Cluster and Intersect algorithm can be extended to different dimensions of

support in a relatively straightforward way. Both RANSAC and GPCA can be extended to

work correctly for a mixture of subspace dimensions. The intersection algorithm can be

modified as well.

The geometric approach can be extended to non-zero noise as well. Both the RANSAC and GPCA algorithms extend to noisy subspaces. However, their performance

is limited by the amount of noise. For a comprehensive survey, we refer the reader to

[98]. The intersection of two subspaces can now be cast as finding whether a system of linear inequalities has a feasible solution. Given two subspaces S_1 and S_2 and a bound ε on the error, the problem of intersecting the subspaces can then be cast as:

min z, s.t.
S_1^⊥ z < ε        (5–1)
S_2^⊥ z < ε

where S_1^⊥ and S_2^⊥ are matrices spanning the orthogonal nullspaces, of dimension d − s, of S_1 and S_2, respectively. In the case of hyperplanes they reduce to the defining vectors b. The algorithm then proceeds as above, except that the intersection is done by Linear

Programming [133] to find a feasible solution. A comprehensive analysis of the algorithm in the noisy case, possibly shedding light on quality degradation as a function of noise, is an interesting direction for future work.
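One possible concrete instantiation of the feasibility check in (5–1) for two noisy hyperplanes (our reading, not the author's code; the normalization constraint used to exclude z = 0 is our own device) is:

import numpy as np
from scipy.optimize import linprog

def intersect_hyperplanes(b1, b2, eps):
    # find z with |<b1, z>| <= eps and |<b2, z>| <= eps via an LP feasibility problem
    d = len(b1)
    rng = np.random.default_rng(0)
    c_norm = rng.normal(size=d)                    # generic normalization direction
    A_ub = np.vstack([b1, -b1, b2, -b2])
    b_ub = np.full(4, eps)
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                  A_eq=c_norm[None, :], b_eq=[1.0], bounds=[(None, None)] * d)
    return res.x if res.success else None

b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 0.0])
print(intersect_hyperplanes(b1, b2, eps=1e-6))     # approximately along the third coordinate axis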

5.3.2 Temporal Coherence

Temporal coherence is often encountered in practical instances of the Dictionary

Learning problem, such as object recognition or feature learning from a recorded video

sequence or the visual sensory stream input to the brain. Temporal coherence can be

modeled through P_Y, which is no longer independently and identically distributed (i.i.d.) but depends on a process.

For instance, temporal coherence can be modeled as a random walk over the space

of supports which can be a union of s-dimensional subspace arrangements. In this

case, the samples follow a random walk over the corresponding subspace arrangement.

Observe that the subspace clustering step can be performed efficiently by modifying the RANSAC algorithm to sample the inputs in sequence of s + 1 consecutive pairs, then all

s + 1 subsets of s + 2 consecutive pairs, and so on. Furthermore, the subspace clustering algorithm can attempt to learn subspaces of dimensions greater than s then intersect

those. In this case, increasing s is advantageous to the algorithm.

The analysis of the algorithm in the temporal coherence setting depends on the details of the random process. An area of future work would be to apply this algorithm to commonly encountered processes and model its performance both empirically and theoretically (based on an empirically motivated idealization of the generative process).

5.4 Future Work From Chapter 4

This section discusses some directions for future extensions and generalizations of our main result in Chapter 4.

5.4.1 Uniqueness and Number of Solutions

The rigidity theorem of Chapter 4 shows that (2, 0)-tightness generically guarantees a finite number of solutions. In similar problems studied in the literature [155, 160, 168, 169, 171, 172], the resulting algebraic system is linear; therefore generic rigidity implies global rigidity, i.e., a generically rigid system has a single solution. However, this is not the case for our problem, due to non-linearity. Conditions that characterize global rigidity remain an open question.

Example 12. Consider the pin configuration in Figure 5-1 with two distinct solutions:

Figure 5-1. A K_5 configuration with two distinct solutions.

Take any K_5 (the complete graph with |V| = 5) and rotate a copy of it as illustrated above. The pins are the intersection points between the two copies of K_5.

71 Using Bezout’s theorem [65], we can obtain a weak upper bound on the number of solutions. This follows from the fact that the number of intersections of algebraic surfaces in projective space, counting multiplicities, is simply the product of the degrees of the equations of the surfaces.

5.4.2 Higher Dimensions

Generalizing the theorem to higher dimensions takes the problem to the domain of hypergraphs. A corresponding generalization of Direction Networks [155, 169] to higher dimensions may provide the mathematical tools and background for undertaking this task.

One caveat is that the sparsity to be used in the analysis may depend on the arrangement graph. For example, consider Figure 5-2: if we use s = 2, then there are not enough points to determine the dictionary.

Figure 5-2. An arrangement of 6 points in d = 3 that is sufficient to determine their minimum generating dictionary of size 4.

However, if we work with s = 3, and notice that each point lies on two planes (the faces of the simplex), then it is easy to see that all the faces of the simplex are fixed. This implies that the realization is rigid. Therefore, one approach is to view the problem's combinatorics as a series of hypergraphs for increasing s < d. The given pins are counted in multiple hyper-edges to account for the resulting degeneracy.

5.4.3 Genericity

The relationship between the notions of genericity introduced here is an open question.

In particular, whether infinitesimal rigidity and the pure condition are related is an area of future development.

Yet another open direction is to investigate whether it is possible to efficiently test for genericity from the problem's input (the graph and the x_i's).

5.4.4 Computing The Realization

Another open question is computing the realization from the input. Chapter 3 discusses algorithms for solving the Geometric Dictionary Learning problem, but these do not translate to optimal solutions of the Restricted Dictionary Learning problem. An algorithm that meets the theoretical bound of Theorem 4.1 is an interesting direction of research.

5.5 Future Extension For The Model

This section discusses directions for future development of our model. The proposed model can naturally be extended to incorporate feedback. In neuroscience, the role of feedback is debated. We view feedback as a Bayesian process by which a generative model predicts incoming sensory stream. In this view, both attention [18] and action are naturally manifest. Action can be interpreted as active inference [37], i.e. sampling the sensorium to minimize free-energy.

For attention and action, we are interested in feedback forward in time. A dynamic model can be obtained from DRDL/HSR. The first step is to modify the vector selection algorithms for use in prediction in time by inferring only on the portion of the input representing the present and past.

We may extend the model with the ability to learn its own topology. We start with a simple hierarchical topology, reflecting the topographic mapping (in biology this is learned on the evolutionary scale) of the sensory cortex. Over time, connections are modified and new connections are created via experience. A possible on-line approach

to this can be inspired by the Complementary Learning model of the Hippocampus [19]. This views the cortex as a slow learner, with the Hippocampus as a fast learner on top. The Hippocampus remembers associations across the cortex and later, such as during sleep, consolidates its information by modifying the connections within the cortex.

5.6 Conclusion

We introduced a novel formulation of an elemental building block that could serve as the bottom-up piece in the common cortical algorithm. This model leads to several interesting theoretical questions. We illustrated how additional prior assumptions on the generative model can be expressed within our integrated framework. Furthermore, this framework can also be extended to address feedback, attention, action, complementary learning, and the role of time.

Dictionary Learning has been approached from a geometric point of view. A connection was established to other problems of independent interest. This disciplined way of looking at Dictionary Learning introduced specific algorithms, with improvements for simplified instances, such as temporal coherence, which are commonly encountered in practice.

We investigated Dictionary Learning theoretically using machinery from algebraic and combinatorial geometry. Specifically, a rigidity-type theorem was obtained that characterizes the sample arrangements that recover a finite number of dictionaries, using a purely combinatorial property on the supports.

Finally, we discussed some minor results and questions, and outlined the next steps for this work.

REFERENCES

[1] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.

[2] Hisham E Atallah, Michael J Frank, and Randall C O’Reilly. Hippocampus, cortex, and basal ganglia: insights from computational models of complementary learning systems. Neurobiology of learning and memory, 82(3):253–67, November 2004.

[3] K. Kreutz-Delgado, Joseph F. Murray, Bhaskar D. Rao, Kjersti Engan, Te-Won Lee, and Terrence J. Sejnowski. Dictionary Learning Algorithms for Sparse Representation. Neural Computation, 2002.

[4] Francis Bach. Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning. Distribution, (2).

[5] Francis Bach, Inria Willow Project-team, and Guillermo Sapiro. Online Learning for Matrix Factorization and Sparse Coding. pages 1–45.

[6] Lucy A Bates, Phyllis C Lee, Norah Njiraini, Joyce H Poole, Katito Sayialel, Soila Sayialel, Cynthia J Moss, and Richard W Byrne. Do Elephants Show Empathy? (10):204–225, 2008.

[7] A. J. Bell. Levels and loops: the future of artificial intelligence and neuroscience. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 354(1392):2013–20, December 1999.

[8] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–59, November 1995.

[9] Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms towards AI. Large-Scale Kernel Machines, MIT Press, 2007.

[10] Gary G. Blasdel. Orientation selectivity, preference, and continuity in monkey striate cortex. Journal of Neuroscience, 12(8), August 1992.

[11] Jake Bouvrie, Tomaso Poggio, Lorenzo Rosasco, Steve Smale, and Andre Wibisono. Generalization and Properties of the Neural Response. Computer Science and Artificial Intelligence Laboratory Technical Report, 2010.

[12] Jake Bouvrie, Lorenzo Rosasco, and Tomaso Poggio. On Invariance in Hierarchical Models. Advances in Neural Information Processing Systems, 2009.

[13] H. A. Simon. The Architecture of Complexity. Proceedings of the American Philosophical Society, 1962.

[14] Emmanuel J. Candès, Yonina C. Eldar, Deanna Needell, and Paige Randall. Compressed Sensing with Coherent and Redundant Dictionaries. Communications, pages 1–21, 2010.

[15] E. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, May 2008.

[16] Gunnar Carlsson, Tigran Ishkhanov, Vin Silva, and Afra Zomorodian. On the Local Behavior of Spaces of Natural Images. International Journal of Computer Vision, 76(1):1–12, June 2007.

[17] Damon M Chandler and David J Field. Estimates of the information content and dimensionality of natural scenes from proximity distributions. Journal of the Optical Society of America. A, Optics, image science, and vision, 24(4):922–41, April 2007. [18] Sharat S. Chikkerur, Thomas Serre, and Tomaso Poggio. A Bayesian inference theory of attention : neuroscience and algorithms. October, 2009.

[19] McClelland, OReilly, McNaughton. Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights From the Success and Failures of Connectionist Model of Learning and Memory. Psychological Review, 1995.

[20] Sharat S. Chikkerur, Thomas Serre, Cheston Tan, and Tomaso Poggio. What and where: A Bayesian inference theory of attention. Vision Research, May 2010.

[21] Sharat S. Chikkerur, Cheston Tan, Thomas Serre, and Tomaso Poggio. An integrated model of visual attention using shape-based features. 2009.

[22] Y Dan, J J Atick, and R C Reid. Efficient coding of natural scenes in the lateral geniculate nucleus: experimental test of a computational theory. The Journal of neuroscience : the official journal of the Society for Neuroscience, 16(10):3351–62, May 1996. [23] Sanjoy Dasguta and Anupam Gupta. An Elementary Proof of the Johnson-Lindernstrauss Lemma, 1999.

[24] Geo Davis. Adaptive Nonlinear Approximations, PhD Thesis, , 1994.

[25] Thomas Dean. Scalable Inference in Hierarchical Generative Models. Pro- ceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.

[26] Thomas Dean, Glenn Carroll, and Richard Washington. On the Prospects for Building a Working Model of the Visual Cortex. Science, 1999.

76 [27] Thomas Dean, Rich Washington, and Greg Corrado. Sparse Spatiotemporal Coding for Activity Recognition. Science, (March), 2010.

[28] R Devore. Deterministic constructions of compressed sensing matrices. Journal of Complexity, 23(4-6):918–925, August 2007.

[29] Alexander G Dimitrov, Aurel a Lazar, and Jonathan D Victor. Information theory in neuroscience. Journal of computational neuroscience, 30(1):1–5, February 2011.

[30] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006.

[31] D.L. Donoho, M. Elad, and V.N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, January 2006.

[32] Richard Durbin and Mitchison Graeme. A dimension reduction framework for understanding cortical maps. Group, 1990. [33] M. Elad and Alfred Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Transactions on Information Theory, 48(9):2558–2567, September 2002.

[34] Harriet Feldman and Karl J Friston. Attention, uncertainty, and free-energy. Frontiers in human neuroscience, 4(December):215, January 2010.

[35] Alyson K Fletcher and Sundeep Rangan. Orthogonal Matching Pursuit from Noisy Measurements : A New Analysis . Electrical Engineering, pages 1–9.

[36] Karl J Friston. Embodied Inference or I think therefore I am if I am what I think. Optimization, pages 89–125. [37] Karl J Friston. Hierarchical models in the brain. PLoS computational biology, 4(11):e1000211, November 2008.

[38] Karl J Friston, Jean Daunizeau, and Stefan J Kiebel. Reinforcement learning or active inference? PloS one, 4(7):e6421, January 2009.

[39] Karl J Friston, Jean Daunizeau, James Kilner, and Stefan J Kiebel. Action and behavior: a free-energy formulation. Biological cybernetics, 102(3):227–60, March 2010.

[40] Karl J Friston, James Kilner, and Lee Harrison. A free energy principle for the brain. Journal of physiology, Paris, 100(1-3):70–87. [41] Karl J Friston, Jer´ emie´ Mattout, and James Kilner. Action understanding and active inference. Biological cybernetics, pages 137–160, February 2011.

[42] Nello Cristianini, John Shawe-Taylor An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. 2000.

77 [43] David Haussler Overview of the Probably Approximately Correct (PAC) Learning Framework, 1995.

[44] Michael J. Kearns, Umesh Vazirani. An Introduction to Computational Learning Theory, MIT Press, 1994.

[45] Karl J Friston and Klaas E Stephan. Free-energy and the brain. Synthese, 159(3):417–458, December 2007.

[46] Jiri Najemnik. Eye movement statistics in humans are consistent with an optimal search strategy. Journal of Vision, 8:1–14, 2008.

[47] Peter Buhlmann, Abraham Wyner Variable Length Markov Chains. Annals of Statistics, Volume 27, Number 2, 480-513, 1999. [48] S. A. Nene, S. K. Nayar and H. Murase Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, February 1996.

[49] Imola K. Fodor A survey of dimension reduction techniques.. LLNL Technical Report, 2002.

[50] Jeff Hawkins, Sandra Blakeslee On Intelligence. Times Books, 2005.

[51] Dileep George and Jeff Hawkins. Towards a mathematical theory of cortical micro-circuits. PLoS computational biology, 5(10):e1000532, October 2009.

[52] Larochelle, Bengio, Louradour, Lamblin. Exploring Strategies for Training Deep Neural Networks. Journal of Machine Learning Research, 2009.

[53] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. The Handbook of Brain Theory and Neural Networks. MIT Press, 1995

[54] SS Chen, DL Donoho Atomic decomposition by basis pursuit. SIAM review, 2001. [55] Dileep George. How the Brain Might Work: A Hiearchical and Temporal Model for Learning and Recognition PhD Thesis, Stanford, 2008.

[56] Anna Gilbert and Piotr Indyk. Sparse Recovery Using Sparse Matrices. Proceed- ings of the IEEE, 98(6):937–947, June 2010.

[57] Joshua Gluckman. Higher order whitening of natural images. Computer Vision and Pattern Recognition, 2005. [58] Johannes Mittmann Grobner Bases: Computational Algebraic Geometry and its Complexity, 2007.

[59] Ben Goertzel, Itamar Arel, and Matthias Scheutz. Toward a Roadmap for Human-Level Artificial General Intelligence : Embedding HLAI Systems in Broad,

78 Approachable, Physical or Virtual Contexts Preliminary Draft. Intelligence, pages 1–6, 2009.

[60] K. Engan, S. O. Aase, and J. H. Husy, Method of optimal directions for frame design. Proc. ICASSP, Vol. 5, pp. 2443 - 2446, 1999.

[61] I.T. Jolliffe Principal Component Analysis, Springer, 2nd edition, 2002.

[62] Jiri Matousek Lectures on Discrete Geometry, Chapter 15, 2002.

[63] Jingu Kim and Haesun Park Sparse Nonnegative Matrix Factorization for Clustering, 2008.

[64] Patrik O. Hoyer Sparse Nonnegative Matrix Factorization with Sparseness Constraints Journal of Machine Learning Research, 2004.

[65] Igor V. Dolgachev. Introduction to Algebraic Geometry. 2010. [66] Noah D Goodman, Tomer D Ullman, and Joshua B Tenenbaum. Learning a theory of causality. Psychological review, 118(1):110–9, January 2011.

[67] M. J. D. Powell Approximation Theory and Methods. Cambridge University Press, 1981.

[68] Devdatt P. Dubhashi, Alessandro Panconesi. Concentration of Measure for the Analysis of Randomised Algorithms. Cambridge University Press, 2005. [69] I.F. Gorodnitsky and B.D. Rao. Introduction to Approximation Theory. IEEE Transactions on Signal Processing, 45(3):600–616, March 1997.

[70] Hossein Mobahi, Ronan Collobert, Jason Weston Deep Learning from Temporal Coherence in Video. International Conference on Machine Learning, 2009.

[71] T. Serre. Learning a dictionary of shape-components in visual cortex: Comparison with neurons, humans and machines. PhD Thesis, Massachusetts Institute of Technology, Cambridge, MA, April, 2006

[72] Karol Gregor and Yann Lecun. Learning Fast Approximations of Sparse Coding. International Conference on Machine Learning, 2010. [73] Geoffrey E Hinton. Learning multiple layers of representation. Trends in cognitive sciences, 11(10):428–34, October 2007.

[74] J. J. Hopfield. Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, April 1982.

[75] J. J. Hopfield. Neurons with Graded Response Have Collective Computational Properties like Those of Two-State Neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, May 1984.

79 [76] Jonathan C Horton and Daniel L Adams. The cortical column: a structure without a function. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 360(1456):837–62, April 2005. [77] Patrik O Hoyer. Non-negative sparse coding. Neural Networks for Signal Processing XII (Proc. IEEE Workshop on Neural Networks for Signal Processing, 2002. [78] Patrik O Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

[79] a Hyvarinen,¨ P O Hoyer, and M Inki. Topographic independent component analysis. Neural computation, 13(7):1527–58, July 2001.

[80] E M Izhikevich. Simple model of spiking neurons. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 14(6):1569–72, January 2003.

[81] Eugene M Izhikevich. Which model to use for cortical spiking neurons? IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 15(5):1063–70, September 2004.

[82] Vernon B. Mountcastle Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press 1 edition, 1998. [83] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, April 2004. [84] Rodolphe Jenatton, Inria Fr, and Francis Bach. Proximal Methods for Sparse Hierarchical Dictionary Learning. Proceedings of the International Conference on Machine Learning, 2010 [85] Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Structured Sparse Principal Component Analysis. Journal of Machine Learning Research, 2010.

[86] Anatoli Juditsky. On Verifiable Sufficient Conditions for Sparse Signal Recovery via L1 Minimization, 2008.

[87] Yan Karklin and Michael S Lewicki. Learning higher-order structures in natural images. Network (Bristol, England), 14(3):483–99, August 2003. [88] Thomas P Karnowski, Itamar Arel, and Derek Rose. Deep Spatiotemporal Feature Learning with Application to Image Classification. Electrical Engineering.

[89] Kenneth Kreutz-Delgado, Joseph F Murray, Bhaskar D Rao, Kjersti Engan, Te-Won Lee, and Terrence J Sejnowski. Dictionary learning algorithms for sparse representation. Neural computation, 15(2):349–96, February 2003.

80 [90] M F Land and R D Fernald. The evolution of eyes. Annual review of neuroscience, 15(1990):1–29, January 1992.

[91] Ann B. Lee. Treelets A Tool for Dimensionality Reduction and Multi-Scale Analysis of Unstructured Data. Journal of Machine Learning Research, 2007.

[92] Ann B. Lee, Boaz Nadler, and Larry Wasserman. TreeletsAn adaptive multi-scale basis for sparse unordered data. Annals of Applied Statistics, 2(2):435–471, June 2008.

[93] Honglak Lee and Andrew Y Ng. Efficient sparse coding algorithms. Neural Information Processing Systems, 2006.

[94] Tai Sing Lee and . Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America. A, Optics, image science, and vision, 20(7):1434–48, July 2003.

[95] Jing Lei, Nicolai Meinshausen, David Purdy, and Vince Vu. The Composite Absolute Penalties Family For Grouped Hierarchical Variable Selection. Annals of Statistics, 2009. [96] D a Leopold, a J O’Toole, T Vetter, and V Blanz. Prototype-referenced shape encoding revealed by high-level aftereffects. Nature neuroscience, 4(1):89–94, January 2001. [97] M S Lewicki and T J Sejnowski. Learning overcomplete representations. Neural computation, 12(2):337–65, February 2000.

[98] R. Vidal A Tutorial On Subspace Clustering. In Press. 2011.

[99] Tianyang Lv, Shaobin Huang, Xizhe Zhang, and Zheng-xuan Wang. A Robust Hierarchical Clustering Algorithm and its Application in 3D Model Retrieval. First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS’06), pages 560–567, June 2006.

[100] K Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. The Journal of neuroscience, 16(6):2112–26, March 1996.

[101] Peng Zhao and Bin Yu. On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006. [102] Julien Mairal, Francis Bach, Inria Willow Project-team, and Guillermo Sapiro. Online Learning for Matrix Factorization and Sparse Coding. Journal of Machine Learning Research, 11:19–60, 2010. [103] E. H. Mckinney. Generalized Birthday Problem. The American Mathematical Monthly, 73(4):385, April 1966.

81 [104] Martin P Meyer and Stephen J Smith. Evidence from in vivo imaging that synaptogenesis guides the growth and branching of axonal arbors by two distinct mechanisms. The Journal of neuroscience : the official journal of the Society for Neuroscience, 26(13):3604–14, March 2006.

[105] Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. Proceedings of the 26th Annual International Conference on Machine Learning, pages 1–8, 2009.

[106] Cristopher M Niell, Martin P Meyer, and Stephen J Smith. In vivo imaging of synapse formation on a growing dendritic arbor. Nature neuroscience, 7(3):254–60, March 2004.

[107] Bruno A Olshausen, C H Anderson, and D C Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700–19, November 1993.

[108] Bruno A Olshausen and D J Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision research, 37(23):3311–25, December 1997.

[109] Tomaso Poggio and D Marr. Cooperative Computation of Stereo Disparity. Advancement Of Science, 194(4262):283–287, 2008.

[110] M Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019–25, November 1999.

[111] S T Roweis and L K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science (New York, N.Y.), 290(5500):2323–6, December 2000. [112] Christopher J Rozell, Don H Johnson, Richard G Baraniuk, and Bruno a Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural computation, 20(10):2526–63, October 2008. [113] Sylvain Sardy, Andrew G. Bruce, and Paul Tseng. Block Coordinate Relaxation Methods for Nonparametric Wavelet Denoising. Journal of Computational and Graphical Statistics, 9(2):361, June 2000.

[114] Terrence J Sejnowski and Zachary Mainen. Reliability of Spike Timing in Neocortical Neurons. Advancement Of Science, 268(5216):1503–1506, 2008.

[115] Thomas Serre, Gabriel Kreiman, Minjoon Kouh, Charles Cadieu, Ulf Knoblich, and Tomaso Poggio. A quantitative theory of immediate visual recognition. Brain, 165:33–56, 2007.

[116] Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences of the United States of America, 104(15):6424–9, April 2007.

82 [117] Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust object recognition with cortex-like mechanisms. IEEE transactions on pattern analysis and machine intelligence, 29(3):411–26, March 2007. [118] M N Shadlen and W T Newsome. Noise, neural codes and cortical organization. Current opinion in neurobiology, 4(4):569–79, August 1994.

[119] W R Softky and C Koch. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. The Journal of Neuroscience, 13(1):334–50, January 1993.

[120] Pablo Sprechmann and Guillermo Sapiro. Dictionary Learning and Sparse Coding for Unspervised Clustering. IMA, 2009.

[121] Downing Street and United Kingdom. Biological Cybernetics. Nature, 237(5349):55–56, May 1972.

[122] Dmitri B Strukov, Gregory S Snider, Duncan R Stewart, and R Stanley Williams. The missing memristor found. Nature, 453(7191):80–3, May 2008.

[123] J B Tenenbaum, V de Silva, and J C Langford. A global geometric framework for nonlinear dimensionality reduction. Science (New York, N.Y.), 290(5500):2319–23, December 2000.

[124] Misha Tsodyks. Attractor neural networks and spatial maps in hippocampus. Neuron, 48(2):168–9, October 2005.

[125] Richard Turner. A theory of cortical responses , Karl Friston , 2005. Neuroscience, 2005.

[126] Michael A Webster, Daniel Kaping, Yoko Mizokami, and Paul Duhamel. Adaptation to natural facial categories. Nature, 428(April):357–360, 2004.

[127] Heiko Wersing and Edgar Korner.¨ Learning optimized features for hierarchical models of invariant object recognition. Neural computation, 15(7):1559–88, July 2003.

[128] Boris Mailhe,´ Sylvain Lesage, Remi´ Gribonval, Fred´ eric´ Bimbot, and Pierre Vandergheynst. Shift-Invariant dictionary Learning For Sparse Representations : Extending k-SVD. Proc. EUSIPCO, 2008.

[129] Julien Mairal, Francis Bach, Andrew Zisserman, and Guillermo Sapiro. Supervised Dictionary Learning Neural Information Processing, 2008.

[130] , Trevor Hastie, Iain Johnstone and Robert Tibshirani Least Angle Regression. Statistics Department, ., 2003.

83 [131] Timothee Masquelier, Thomas Serre, and Tomaso Poggio. Learning complex cell invariance from natural videos : A plausibility proof CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, USA, 2007. [132] Kenneth Miller, , and Michael Stryker. Ocular Dominance Column Development: Analysis and Simulation. Science, 1989.

[133] Alan Sultan. Linear Programming: An Introduction With Applications CreateS- pace, 2011.

[134] Pati, Y.C., CARezaiifar, R. , Krishnaprasad, P.S. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition, Signals, Systems and Computers 1993.

[135] Bruno a Olshausen and David J Field. Emergence of simple cell receptive field properties by learning a sparse code for natural images. Nature, 1996.

[136] Steve Smale, Tomaso Poggio, Andrea Caponnetto, and Jake Bouvrie. Derived Distance : towards a mathematical theory of visual cortex . Artificial Intelligence, 2007. [137] Evan C Smith and Michael S Lewicki. Efficient auditory coding. Nature, 439(7079):978–82, February 2006.

[138] W R Softky. Simple codes versus efficient codes. Current opinion in neurobiology, 5(2):239–47, April 1995.

[139] Shelley Derksen, H. J. Keselman. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables., British Journal of Mathematical and Statistical Psychology, Volume 45, Issue 2, pages 265282, November, 1992.

[140] Irina F. Gorodnitsky, Bhaskar D. Rao Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Re-weighted Minimum Norm Algorithm. IEEE Transac- tions on Signal Processing, 1997.

[141] Mark D Plumbley. Dictionary Learning for L1-Exact Sparse Coding. Proceedings of the 7th international conference on Independent component analysis and signal separation, Pages 406-413, 2007.

[142] Tomaso Poggio and Steve Smale. The Mathematics of Learning : Dealing with Data 1 Introduction. Notices of the American Mathematical Society, 2003.

[143] Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and Clustering via Dictionary Learning with Structured Incoherence and Shared Features. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

84 [144] Bhaskar D Rao. Signal Processing with the Sparseness Constraint. IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.

[145] Herbert Simon. The Organization of Complex Systems Hierarchy Theory - The Challenge of Complex Systems, Goerge Braziller, New York, pages: 1-27, 1973.

[146] Eero P Simoncelli and Bruno A Olshausen. Natural Image Statistics and Neural Representation. Annual Review of Neuroscience, vol. 24, pp. 1193-1216, May 2001.

[147] Ziqin Feng Hilberts 13th Problem. PhD Thesis, University of Pittsburgh, 2010.

[148] Steve Smale and Felipe Cucker. On the Mathematical Foundation of Learning. Bull. Amer. Math. Soc., 39, 1-49, 2002. [149] Philipp Robbel and Deb Roy. Exploiting Feature Dynamics for Active Object Recognition. Change.

[150] Scott Shaobing Chen, , Michael SAunders. Atomic Decomposition By Basis Pursuit. Society for Industrial and , Volume 43 Issue 1, Pages 129 - 159, 2001.

[151] Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol 58, No 1, pages 267-288, 1996.

[152] David Petrie Moulton Number Theory and Groups. PhD Thesis, University of , Berkeley, 1995. [153] L. Asimow, B. Roth The Rigidity of Graphs. Transactions of the AMS 245, 279289, 1978.

[154] Pet Christian Hansen The truncated SVD as a method for regularization. BIT Computer Science and Numerical Mathematics, Volume 27 Issue 4, Pages 534 - 553,Oct. 1987

[155] White and Whiteley The algebraic geometry of stresses in frameworks. SIAM. J. on Algebraic and Discrete Methods, 4(4), 481511. 1987.

[156] Whiteley The union of matroids and the rigidity of frameworks. SIAM Journal on Discrete Mathematics, Volume 1 Issue 2, Pages 237 - 255, May 1988. [157] Ming Li and Paul Vitanyi An Introduction to Kolmogorov Complexity and Its Applications, 2nd Edition Springer, 1997.

[158] Lee, Streinu and Theran Graded Sparse Graphs and Matroids Journal of Universal Computer Science, Vol 13, Issue 11, Pages 1671 -1679, Nov 2007.

[159] Streinu and Theran Slider-Pinning Rigidity: a MaxwellLaman-Type Theorem. Discrete and Computational Geometry, Volume 44, Issue 4, pages 812-837, December 2010.

85 [160] Servatius Molecular conjecture in 2D 16th Fall Workshop on Computational and Combinatorial Geometry, 2006.

[161] Laurenz Wiskott and Terrence Senjowski. Slow Feature Analysis: Unsupervised Learning of Invariances. Neural Computation Vol. 14, No. 4, Pages 715-770, April 2002.

[162] David H Wolpert, Nna D, Harry Road, San Jose, and William G Macready. No Free Lunch Theorems for Optimization 1 Introduction. 1996.

[163] Jeremy M Wolfe. Guided Search 4.0 Current Progress With a Model of Visual Search. Integrated Models of Cognitive Systems, 2006.

[164] Steven J. Gortler, Craig Gotsman, Ligang Liu, and Dylan P. Thurston On Affine Rigidity. arXiv:1011.5553, 2010.

[165] H.S.M. Coxeter Projective Geometry Springer. 2nd Edition. 2003. In Press.

[166] Honghao Shan, Lingyun Zhang, and Garrison W Cottrell. Recursive ICA. Ad- vances in Neural Information Processing Systems 19, pages 1273-1280, 2007.

[167] Michal Aharon, Michael Elad, and Alfred Bruckstein. K -SVD : An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, Nov 2006.

[168] Whiteley, W. Some matroids from discrete applied geometry. Matroid Theory. Contemporary Mathematics, American Mathematical Society, 171-311, 1996. [169] White and Whiteley The Algebraic Geometry of Motions of Bar-and-Body Frameworks. SIAM Journal on Algebraic and Discrete Methods, 1987.

[170] K. Haller, A. Lee, M. Sitharam, I. Streinu, N. White ACM-SAC Geometric constraints and Reasoning, 2009 and FwCG 2008. Computational Geometry Theory and Applications. CoRR abs/1006.1126: (2010) Body-and-Cad Constraint Systems. ACM-SAC Geometric constraints and Reasoning, 2009. Computational Geometry Theory and Applications, CoRR abs/1006.1126:, 2010.

[171] B. Jackson and T. Jordan Pin-collinear Body-and-Pin Frameworksand the Molecular Conjecture. Technical Report, 2006.

[172] Theran, Louis. Problems in generic combinatorial rigidity: sparsity, sliders, and emergence of components PhD Thesis, University of Massachusetts, Amherst, 2010.

[173] Ruijin Wu, CIS 6930, Geometric complexity, University of Florida, Lecture notes, 7-12. 2011

[174] Dhruv Ranganathan. A Gentle Introduction to Grassmanianns. In Press. 2010. [175] URL: http://yann.lecun.com/exdb/mnist/

86 [176] URL: http://spams-devel.gforge.inria.fr/

BIOGRAPHICAL SKETCH

Mohammad Tarifi completed his Bachelor of Engineering in computers and communications, together with minors in mathematics and physics, at the American University of Beirut. He went on to obtain a master's degree in computer engineering, researching quantum computing and theoretical computer science, at the University of Florida. Concurrently with his Doctor of Philosophy studies, Mohammad worked full-time for several years in industry.
