Distance Geometry in Data Science

Leo Liberti, CNRS LIX Ecole Polytechnique [email protected]

CNMAC 2017

1 / 160 Line of reasoning for this talk

1. Graphs and weighted graphs necessary to model data
2. Computers can "reason by analogy" (clustering)
3. Clustering on vectors allows more flexibility
4. Need to embed (weighted) graphs into Euclidean spaces
5. High dimensions make clustering expensive/unstable
6. Use random projections to reduce dimensions

2 / 160 Outline

Reasoning
  Relations, graphs, distances
Clustering
  Clustering in graphs
  Clustering in Euclidean spaces
Metric embeddings
  Fréchet embeddings in ℓ∞
  Embeddings in ℓ2 (Classic MDS, PCA)
Distance Geometry
  DGP applications
  Complexity of the DGP
  Number of solutions
  Solution methods (Direct methods, Semidefinite Programming, Diagonal Dominance, Barvinok's naive algorithm, Isomap for the DGP)
Distance resolution limit
  When to start worrying
Random projections
  More efficient clustering
  Random projections in LP (Projecting feasibility, Projecting optimality, Solution retrieval, Quantile regression)
The end

3 / 160 Reasoning The philosophical motivation to distance geometry

4 / 160 Modes of rational thought

[diagram: triangle linking hypothesis, observation and prediction; deduction, induction and abduction are the three ways of inferring one vertex from the other two]

[Arist. 24a, Peirce CP, Putnam 79, Eco 83]

5 / 160 Modes of rational thought

All humans are mortal (hypothesis), Socrates is human (prediction), hence Socrates is mortal (observation)

I deduction (hypothesis + prediction → observation): TRUTH; logician, mathematician
I induction (observation + prediction → hypothesis): CAUSALITY; physicist, chemist, biologist
I abduction (hypothesis + observation → prediction): PLAUSIBILITY; everyone else

6 / 160 Abduction might infer falsehoods

I Peirce:
  1. All beans in this bag are white
  2. There is a white bean next to this bag
  3. The bean was in the bag
  4. but what if the bean wasn't in the bag?

I Sherlock Holmes wannabe:
  1. People who walk in the park have their shoes full of dirt
  2. John's shoes are dirty
  3. John walked in the park
  4. but what if John did not walk in the park?

Only deduction infers truth

[Desclés, Jackiewicz 2006]

7 / 160 Statistician’s abduction

hypothesis1 → prediction1

hypothesis2 → prediction2

observation hypothesis3 → prediction3

hypothesis4 → prediction4

hypothesis5 → prediction5

I Evaluate P(observation | hypothesis_i → prediction_i) ∀i
I Choose the inference i with the largest probability

8 / 160 Example

[diagram: the observation "white bean beside bag" linked, each with its own probability, to five hypothesis → prediction pairs: bag of white beans → bean was in bag; white bean field close by → bean came from field; farmer market yesterday → bean came from market; kid was playing with beans → kid lost a bean; UFOs fueled with beans → bean clearly a UFO sign]

I Repeat experiences, collect data ⇒ frequency
I Probability distribution ⇐ personal conviction

9 / 160 Compare diferent observations

[diagram: the two observations "white bean beside bag" and "red bean beside bag", each linked with its own probability to the same five hypothesis → prediction pairs as above]

I Repeat experiences, collect data ⇒ frequency
I Probability distribution ⇐ personal conviction

10 / 160 Subsection 1

Relations, graphs, distances

11 / 160 Modelling a good prediction

Observation graph
I set V of observations
I set I of inferences (hypotheses ∧ predictions)
I ∀v ∈ V get a probability distribution P^v on I
I relation E if u, v ∈ V have similar distributions on I
I F = (V, E): observation graph
I relation ∼ if h, k ∈ I are not contradictory

I Densest subgraphs U with every h_u ∼ k_u (for u ∈ U): richest observation sets with non-contradictory inferences

Think of Sherlock Holmes: set of clues compatible with most likely consistent explanations

12 / 160 Example

I V = {u, v} where u = white bean, v = red bean
  (lots of beans, both red and white, found next to an all-white bean bag)
I largest combined probabilities:
  1. farmer market: 0.59
  2. kid playing: 0.34
  3. UFO fuel: 0.4

I UFO hovering above market square → farmer market disbanded
  ⇒ 1 ∼ 2 ∧ 2 ∼ 3 but ¬(1 ∼ 3)
I Observation graph:

I P(u ∪ v | 1 ∨ 2) = 0.93 > 0.74 = P(u ∪ v | 2 ∨ 3)
I ⇒ U = V = {u, v}, E = {{u, v}}
I with scaled edge weight 0.93/(0.93 + 0.74) = 0.55

13 / 160 Where we are and where we are going

I Relations on observations encoding most likely compatible predictions ⇒ graphs

I Similarity probability / magnitude / intensity ⇒ weighted graphs

I "Machine intelligence" by analogy: clustering in graphs
I More refined clustering techniques?

  I pull in tools from linear algebra
  I work with vectors rather than graphs
  I Euclidean embeddings of weighted graphs
  I Distances lose "resolution" in high dimensions
  I Project into lower-dimensional spaces

14 / 160 Outline


15 / 160 Clustering “Machine intelligence”: analogy based on proximity

16 / 160 Subsection 1

Clustering in graphs

17 / 160 Example graph

I Goal: find partition in densest subgraphs

18 / 160 Modularity clustering “Modularity is the fraction of the edges that fall within a cluster minus the expected fraction if edges were distributed at random.” I “at random” = random graphs over same degree sequence

I degree sequence = (k_1, ..., k_n) where k_i = |N(i)|
I "expected" = all possible "half-edge" recombinations

I expected edges between u, v: k_u k_v/(2m), where m = |E|

I mod(u, v) = A_uv − k_u k_v/(2m)
I mod(G) = ∑_{{u,v}∈E} mod(u, v) x_uv, where x_uv = 1 if u, v are in the same cluster and 0 otherwise
I "Natural extension" to weighted graphs: k_u = ∑_v A_uv, m = ∑_{uv} A_uv

[Girvan & Newman 2002]
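A minimal numpy sketch (an assumption, not the talk's code) of computing the modularity of a given clustering from the adjacency matrix; it uses Newman's normalized form Q, which differs from the slide's mod(G) only by the 1/(2m) factor. The function name `modularity` and the toy graph are hypothetical.

    import numpy as np

    def modularity(A, labels):
        """A: symmetric (weighted) adjacency matrix; labels[i] = cluster of node i."""
        k = A.sum(axis=1)                        # (weighted) degree sequence
        two_m = A.sum()                          # 2m: each edge counted twice
        same = np.equal.outer(labels, labels)    # x_uv = 1 iff u, v in the same cluster
        B = A - np.outer(k, k) / two_m           # A_uv - k_u k_v / (2m)
        return (B * same).sum() / two_m          # normalized modularity Q

    # toy example: two triangles joined by a single edge
    A = np.zeros((6, 6))
    for u, v in [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(2,3)]:
        A[u, v] = A[v, u] = 1
    print(modularity(A, np.array([0,0,0,1,1,1])))   # about 0.357 for the natural 2-clustering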

19 / 160 Use modularity to define clustering

I What is the “best clustering”?

I Maximize discrepancy between actual and expected: "as far away as possible from average"
    max ∑_{{u,v}∈E} mod(u, v) x_uv
    s.t. ∀u, v ∈ V   x_uv ∈ {0, 1}

I Issue: trivial solution x = 1 “one big cluster”

I Idea: treat clusters as cliques (even if zero weight) then clique partitioning constraints for transitivity

∀i < j < k xij + xjk − xik ≤ 1

∀i < j < k xij − xjk + xik ≤ 1

∀i < j < k − xij + xjk + xik ≤ 1

∀{i, j} ∉ E   x_ij = 0

if i, j ∈ C and j, k ∈ C then i, k ∈ C [Aloise et al. 2010] 20 / 160 Maximizing the modularity of a graph

I Formulation above is a Mathematical Program (MP)

I MP is a formal language for describing optimization problems
I each MP consists of:
  I parameters (input)
  I decision variables (output)
  I objective function(s)
  I explicit and implicit constraints

I broad MP classification: LP, SDP, cNLP, NLP, MILP, cMINLP, MINLP
I Modularity Maximization MP is a MILP

I MILP is NP-hard but ∃ technologically advanced solvers
I Otherwise, use (fast) heuristics
I This method decides the number of clusters

[Cafieri et al. 2014]

21 / 160 The resulting clustering

22 / 160 Subsection 2

Clustering in Euclidean spaces

23 / 160 Minimum sum-of-squares clustering

I MSSC, a.k.a. the k-means problem
I Given points p_1, ..., p_n ∈ R^m, find clusters C_1, ..., C_d:
    min ∑_{j≤d} ∑_{i∈C_j} ||p_i − centroid(C_j)||²₂
  where centroid(C_j) = (1/|C_j|) ∑_{i∈C_j} p_i
I k-means alg.: given an initial clustering C_1, ..., C_d:

  1: ∀j ≤ d compute y_j = centroid(C_j)
  2: ∀i ≤ n, j ≤ d if y_j is the closest centroid to p_i let x_ij = 1, else 0
  3: ∀j ≤ d update C_j ← {p_i | x_ij = 1 ∧ i ≤ n}
  4: repeat until stability
In "k-means", "k" is the number of clusters, here denoted by d; note that d is given

[MacQueen 1967, Aloise et al. 2012]
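A minimal numpy sketch of the k-means iteration just described (an assumption: numpy, random initial centroids, no empty-cluster handling); the function name `k_means` is hypothetical.

    import numpy as np

    def k_means(p, d, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        y = p[rng.choice(len(p), d, replace=False)]          # initial centroids
        for _ in range(iters):
            # step 2: assign each point to its closest centroid
            labels = np.argmin(((p[:, None, :] - y[None, :, :])**2).sum(-1), axis=1)
            # steps 1/3: recompute centroids of the updated clusters
            new_y = np.array([p[labels == j].mean(axis=0) for j in range(d)])
            if np.allclose(new_y, y):                        # step 4: stop at stability
                break
            y = new_y
        return labels, y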

24 / 160 MP formulation

P P 2  min kpi − yjk2 xij  x,y,s i≤n j≤d  1 P  ∀j ≤ d pixij = yj  sj  i≤n  P  ∀i ≤ n xij = 1  j≤d P (MSSC) ∀j ≤ d xij = sj  i≤n  m  ∀j ≤ dy j ∈ R  nd  x ∈ {0, 1}  d  s ∈ N 

MINLP: nonconvex terms; continuous, binary and integer variables

25 / 160 Reformulations
The (MSSC) formulation has the same optima as:
    min_{x,y,P}  ∑_{i≤n} ∑_{j≤d} P_ij x_ij
    s.t.  ∀i ≤ n, j ≤ d   ||p_i − y_j||²₂ ≤ P_ij
          ∀j ≤ d   ∑_{i≤n} p_i x_ij = ∑_{i≤n} y_j x_ij
          ∀i ≤ n   ∑_{j≤d} x_ij = 1
          ∀j ≤ d   y_j ∈ [min_{i≤n} p_ia, max_{i≤n} p_ia]   (componentwise, a ≤ m)
          x ∈ {0,1}^{nd}
          P ∈ [0, P^U]^{nd}

I Only nonconvexities: products of bounded continuous variables by binary variables

I Caveat: cannot have empty clusters

26 / 160 Products of binary and continuous vars.

I Suppose term xy appears in a formulation

I Assume x ∈ {0, 1} and y ∈ [0, 1] is bounded

I Replace xy by a new variable z (means "either z = 0 or z = y")
I Adjoin the following constraints:

    z ∈ [0, 1]
    y − (1 − x) ≤ z ≤ y + (1 − x)
    −x ≤ z ≤ x

I ⇒ Everything’s linear now!

[Fortet 1959]
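A quick numeric check (an illustration, not from the talk) that, at the binary values of x and for y ∈ [0, 1], Fortet's three constraints leave exactly one feasible value z = x·y.

    import numpy as np

    for x in (0, 1):
        for y in np.linspace(0, 1, 5):
            # feasible z: 0 <= z <= 1, y-(1-x) <= z <= y+(1-x), -x <= z <= x
            lo = max(0.0, y - (1 - x), -x)
            hi = min(1.0, y + (1 - x), x)
            assert abs(lo - x * y) < 1e-9 and abs(hi - x * y) < 1e-9   # interval collapses to {x*y}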

27 / 160 Products of binary and continuous vars.

I Suppose the term xy appears in a formulation
I Assume x ∈ {0, 1} and y ∈ [y^L, y^U] is bounded

I Replace xy by a new variable z (means "either z = 0 or z = y")
I Adjoin the following constraints:

    z ∈ [min(y^L, 0), max(y^U, 0)]
    y − (1 − x) max(|y^L|, |y^U|) ≤ z ≤ y + (1 − x) max(|y^L|, |y^U|)
    −x max(|y^L|, |y^U|) ≤ z ≤ x max(|y^L|, |y^U|)

I ⇒ Everything’s linear now!

[L. et al. 2009]

28 / 160 MSSC is a convex MINLP

X X min χij x,y,P,χ,ξ i≤n j≤d

∀i ≤ n, j ≤ d 0 ≤ χij ≤ Pij U U ∀i ≤ n, j ≤ qquadPij − (1 − xij )P ≤ χij ≤ xij P 2 ∀i ≤ n, j ≤ d kpi − yj k2 ≤ Pij ⇐ convex X X ∀j ≤ d pixij = ξij i≤n i≤n L U L U ∀i ≤ n, j ≤ dy j − (1 − xij ) max(|y |, |y |) ≤ ξij ≤ yj + (1 − xij ) max(|y |, |y |) L U L U ∀i ≤ n, j ≤ d − xij max(|y |, |y |) ≤ ξij ≤ xij max(|y |, |y |) X ∀i ≤ n xij = 1 j≤d L U ∀j ≤ dy j ∈ [y , y ] x ∈ {0, 1}nd P ∈ [0,P U ]nd χ ∈ [0,P U ]nd L U ∀i ≤ n, j ≤ dξ ij ∈ [min(y , 0), max(y , 0)]

L U m yj, ξij, y , y are vectors in R 29 / 160 Solving the MSSC

I k-means

  I heuristic (optimum not guaranteed)
  I fast, well-known, lots of analyses
  I scales reasonably well
  I implemented in practically all languages
I convex MINLP

  I exact (guaranteed global optima)
  I reasonably fast only for small sizes
  I scales exponentially
  I solvers: KNITRO (commercial), Bonmin (free); need an MP language interpreter (AMPL)

30 / 160 Outline


31 / 160 Metric embeddings Mapping metric spaces into each other

32 / 160 Graphs with shortest path metric E.g. mathematical genealogy skeleton

33 / 160 The distance matrix (only for a subgraph)

Pairwise shortest-path distances between Kästner, Euler, Thibaut, Pfaff, Lagrange, Laplace, Möbius, Gudermann, Dirksen and Gauss (in this order; the upper triangle of the matrix D below):

 0 10 1 1 9 8 2 2 2 2   10 0 11 9 1 3 10 12 12 8     1 11 0 2 10 10 3 1 1 3     1 9 2 0 8 8 1 3 3 1     9 1 10 8 0 2 9 11 11 7  D =    8 3 10 8 2 0 9 11 11 7     2 10 3 1 9 9 0 4 4 2     2 12 1 3 11 11 4 0 2 4   2 12 1 3 11 11 4 2 0 4  2 8 3 1 7 7 2 4 4 0

34 / 160 Subsection 1

Fréchet embeddings in ℓ∞

35 / 160 ℓ∞ embeddings

I Given a metric space (X, d) with distance matrix D = (d_ij)

I Consider the i-th row δ_i = (d_i1, ..., d_in) of D

I Embed i ∈ X by the vector δ_i ∈ R^n

I Define f(X) = {δ_1, ..., δ_n}, f(d(i, j)) = ||f(i) − f(j)||_∞

I Thm.: (f(X), ℓ∞) is a metric space with distance matrix D

I If (X, d) is not a metric space, ∃i, j ∈ X (d(i, j) ≠ ||δ_i − δ_j||_∞)

I Issue: the embedding is high-dimensional (R^n)

[Kuratowski 1935]
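A minimal numpy sketch (an assumption, with a hypothetical toy metric) of the Fréchet embedding: each point is mapped to its row of D, and the pairwise ℓ∞ distances reproduce D exactly.

    import numpy as np

    D = np.array([[0., 2., 3.],
                  [2., 0., 4.],
                  [3., 4., 0.]])        # any metric distance matrix
    f = D.copy()                        # f(i) = i-th row delta_i of D
    recovered = np.max(np.abs(f[:, None, :] - f[None, :, :]), axis=2)   # pairwise l_inf distances
    assert np.allclose(recovered, D)    # (f(X), l_inf) reproduces D exactly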

36 / 160 Proof

I Consider i, j ∈ X with distance d(i, j) = dij I Then

    f(d(i, j)) = ||δ_i − δ_j||_∞ = max_{k≤n} |d_ik − d_jk| ≤ d_ij
  since |d_ik − d_jk| ≤ d_ij for every k (triangle inequality)

I The max of |d_ik − d_jk| over k ≤ n is achieved when k ∈ {i, j}, where it equals d_ij

  ⇒ f(d(i, j)) = d_ij

37 / 160 Genealogical similarity example Fréchet embedding in the 3 most significant (rotated) dimensions

38 / 160 Subsection 2

Approximate embeddings in ℓ2

39 / 160 Lower dimensional (approximate) embeddings

I Given a distance matrix, find an approximate Euclidean embedding
I Application: visualize a metric space, e.g. embed the genealogy tree in R^3 (some errors allowed)
I For visualization purposes, K ∈ {1, 2, 3}; for other purposes, K < n

Classical methods
I Multi-Dimensional Scaling (MDS)
I Principal Component Analysis (PCA)

40 / 160 Classic Multidimensional Scaling

41 / 160 Definition and history

I [I. Schoenberg, Remarks to Maurice Fréchet’s article “Sur la définition axiomatique d’une classe d’espaces distanciés vectoriellement applicable sur l’espace de Hilbert”, Ann. Math., 1935]

I Question: Given an n × n symmetric matrix D, what are necessary and sufficient conditions s.t. D is an EDM¹ corresponding to n points x_1, ..., x_n ∈ R^K with minimum K?

I Main theorem:
  Thm. D = (d_ij) is an EDM iff ½ (d²_{1i} + d²_{1j} − d²_{ij} | 2 ≤ i, j ≤ n) is PSD of rank K

I PSD: positive semidefinite, all eigenvalues ≥ 0

1Euclidean Distance Matrix 42 / 160 Gram in function of EDM

I x = (x_1, ..., x_n) ⊆ R^K, written as an n × K matrix
I the matrix G = x x^⊤ = (x_i · x_j) is the Gram matrix of x
I Schoenberg's theorem: relation between EDMs and Gram matrices
    G = −½ J D² J      (§)

I D² = (d²_ij), J = I_n − (1/n) 1 1^⊤

43 / 160 Multidimensional scaling (MDS)

I Often get approximate EDMs D̃ from raw data (dissimilarities, discrepancies, differences)
I G̃ = −½ J D̃² J is an approximate Gram matrix
I Approximate Gram ⇒ spectral decomposition P Λ̃ P^⊤ has Λ̃ ≱ 0

I Let Λ be the closest PSD diagonal matrix to Λ̃: zero the negative components of Λ̃
I x = P √Λ is an "approximate realization" of D̃
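A minimal numpy sketch of the construction just described (imports and the helper name `classic_mds` are assumptions): form G = −½ J D² J, take the spectral decomposition, zero negative eigenvalues, and return x = P √Λ.

    import numpy as np

    def classic_mds(D, K=None):
        n = len(D)
        J = np.eye(n) - np.ones((n, n)) / n
        G = -0.5 * J @ (D**2) @ J                 # (approximate) Gram matrix
        lam, P = np.linalg.eigh(G)                # eigenvalues in ascending order
        lam = np.clip(lam, 0, None)               # closest PSD diagonal matrix to Lambda~
        x = P @ np.diag(np.sqrt(lam))             # approximate realization
        order = np.argsort(-lam)                  # most significant components first
        return x[:, order[:K]] if K else x[:, order]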

44 / 160 Classic MDS: Main result

1. Prove Schoenberg's theorem: G = −½ J D² J
2. Prove that a matrix is Gram iff it is PSD

45 / 160 Classic MDS: Proof 1/3

I Assume zero centroid WLOG (can translate x as needed)
I Expand: d²_ij = ||x_i − x_j||²₂ = (x_i − x_j)·(x_i − x_j) = x_i·x_i + x_j·x_j − 2 x_i·x_j   (∗)
I Aim at "inverting" (∗) to express x_i·x_j as a function of d²_ij
I Sum (∗) over i: ∑_i d²_ij = ∑_i x_i·x_i + n x_j·x_j − 2 x_j·∑_i x_i, where ∑_i x_i = 0 by the zero centroid
I Similarly for j; divide by n, get:

    (1/n) ∑_{i≤n} d²_ij = (1/n) ∑_{i≤n} x_i·x_i + x_j·x_j    (†)

    (1/n) ∑_{j≤n} d²_ij = x_i·x_i + (1/n) ∑_{j≤n} x_j·x_j    (‡)

I Sum (†) over j, get:

    (1/n) ∑_{i,j} d²_ij = n (1/n) ∑_i x_i·x_i + ∑_j x_j·x_j = 2 ∑_i x_i·x_i

I Divide by n, get:
    (1/n²) ∑_{i,j} d²_ij = (2/n) ∑_i x_i·x_i    (∗∗)
[Borg 2010]

46 / 160 Classic MDS: Proof 2/3

I Rearrange (∗), (†), (‡) as follows:

    2 x_i·x_j = x_i·x_i + x_j·x_j − d²_ij                      (1)

    x_i·x_i = (1/n) ∑_j d²_ij − (1/n) ∑_j x_j·x_j              (2)

    x_j·x_j = (1/n) ∑_i d²_ij − (1/n) ∑_i x_i·x_i              (3)

I Replace the LHS of Eq. (2)-(3) in Eq. (1), get

    2 x_i·x_j = (1/n) ∑_k d²_ik + (1/n) ∑_k d²_kj − d²_ij − (2/n) ∑_k x_k·x_k

I By (∗∗) replace (2/n) ∑_k x_k·x_k with (1/n²) ∑_{h,k} d²_hk, get

    2 x_i·x_j = (1/n) ∑_k (d²_ik + d²_kj) − d²_ij − (1/n²) ∑_{h,k} d²_hk    (§)

which expresses x_i·x_j as a function of D

47 / 160 Classic MDS: Proof 3/3

I Gram ⊆ PSD

  I x is an n × K real matrix
  I G = x x^⊤ is its Gram matrix
  I For each y ∈ R^n we have

      y G y^⊤ = y (x x^⊤) y^⊤ = (yx)(x^⊤ y^⊤) = (yx)(yx)^⊤ = ||yx||²₂ ≥ 0

  I ⇒ G ⪰ 0

I PSD ⊆ Gram

  I Let G ⪰ 0 be n × n
  I Spectral decomposition: G = P Λ P^⊤ (P orthogonal, Λ ≥ 0 diagonal)
  I Λ ≥ 0 ⇒ √Λ exists
  I G = P Λ P^⊤ = (P √Λ)(√Λ^⊤ P^⊤) = (P √Λ)(P √Λ)^⊤
  I Let x = P √Λ; then G is the Gram matrix of x

48 / 160 Principal Component Analysis

49 / 160 Principal Component Analysis (PCA)

I MDS with fixed K
I Motivation: "draw" x = P √Λ in 2D or 3D, but rank(Λ) = K > 3
I Only keep the 2 or 3 largest components of Λ, zero the rest
I Get a realization in the desired space

[Pearson 1901]

50 / 160 Mathematicians genealogy in 2D/3D

In 2D In 3D

51 / 160 Outline


52 / 160 Distance Geometry Embedding weighted graphs in Euclidean spaces

53 / 160 Distance Geometry Problem (DGP)

Given K ∈ N and G = (V, E, d) with d : E → R₊, find x : V → R^K s.t.

    ∀{i, j} ∈ E   ||x_i − x_j||²₂ = d²_ij

Given a weighted graph, draw it so that edges are drawn as segments

with lengths equal to the weights

[Cayley 1841, Menger 1928, Schoenberg 1935, Yemini 1978]

54 / 160 Subsection 1

DGP applications

55 / 160 Some applications

I clock synchronization (K = 1) I sensor network localization (K = 2) I molecular structure from distance data (K = 3) I autonomous underwater vehicles (K = 3) I distance matrix completion (whatever K)

56 / 160 Clock synchronization

From [Singer, Appl. Comput. Harmon. Anal. 2011]

Determine a set of unknown timestamps from partial measurements of their time differences

I K = 1

I V: timestamps
I {u, v} ∈ E if the time difference between u, v is known
I d: values of the time differences

Used in time synchronization of distributed networks

57 / 160 Clock synchronization

58 / 160 Sensor network localization

From [Yemini, Proc. CDSN, 1978]

The positioning problem arises when it is necessary to locate a set of geographically distributed objects using measurements of the distances between some object pairs

I K = 2

I V: (mobile) sensors
I {u, v} ∈ E if the distance between u, v is measured
I d: distance values

Used whenever GPS is not viable (e.g. underwater);
d_uv ∝ battery consumption in P2P communication betw. u, v

59 / 160 Sensor network localization

60 / 160 Molecular structure from distance data From [L. et al., SIAM Rev., 2014]

I K = 3

I V : atoms I {u, v} ∈ E if distance between u, v is known I d: distance values

Used whenever X-ray crystallography does not apply (e.g. liquid)
Covalent bond lengths and angles are known precisely
Distances ≲ 5.5Å measured approximately by NMR

61 / 160 Subsection 2

Complexity of the DGP

62 / 160 Complexity class

I DGP₁ with d : E → Q₊ is in NP
  I if the instance is YES, ∃ realization x ∈ R^{n×1}
  I if some component x_i ∉ Q, translate x so that x_i ∈ Q
  I consider some other x_j
  I let ℓ = (length of a shortest path p : i → j) = ∑_{{u,v}∈p} d_uv ∈ Q
  I then x_j = x_i ± ℓ → x_j ∈ Q
  I ⇒ verification of

    ∀{i, j} ∈ E   |x_i − x_j| = d_ij

  in polytime

I DGP_K may not be in NP for K > 1: don't know how to verify ||x_i − x_j||₂ = d_ij for x ∉ Q^{nK}

63 / 160 Hardness

I Want to show DGP₁ is NP-hard by reduction from Partition:
  Given a = (a_1, ..., a_n) ∈ N^n, ∃ I ⊆ {1, ..., n} s.t. ∑_{i∈I} a_i = ∑_{i∉I} a_i ?

I a ⟶ cycle C with V(C) = {1, ..., n}, E(C) = {{1, 2}, ..., {n, 1}}

I For i < n let d_{i,i+1} = a_i, and d_{n,n+1} = d_{n1} = a_n

I E.g. for a = (1, 4, 1, 3, 3), get cycle graph:

[Saxe, 1979]

64 / 160 Partition is YES ⇒ DGP1 is YES (1/2)

I Given: I ⊂ {1, ..., n} s.t. ∑_{i∈I} a_i = ∑_{i∉I} a_i
I Construct: realization x of C in R
  1. x_1 = 0  // start
  2. induction step: suppose x_i is known;
     if i ∈ I let x_{i+1} = x_i + d_{i,i+1}  // go right
     else let x_{i+1} = x_i − d_{i,i+1}      // go left
I Correctness proof: by the same induction, but careful when i = n: have to show x_{n+1} = x_1

65 / 160 Partition is YES ⇒ DGP1 is YES (2/2)

(1) = ∑_{i∈I} (x_{i+1} − x_i) = ∑_{i∈I} d_{i,i+1} = ∑_{i∈I} a_i = ∑_{i∉I} a_i = ∑_{i∉I} d_{i,i+1} = ∑_{i∉I} (x_i − x_{i+1}) = (2)

(1) = (2) ⇒ ∑_{i∈I} (x_{i+1} − x_i) = ∑_{i∉I} (x_i − x_{i+1}) ⇒ ∑_{i≤n} (x_{i+1} − x_i) = 0

⇒ (x_{n+1} − x_n) + (x_n − x_{n−1}) + ··· + (x_3 − x_2) + (x_2 − x_1) = 0

⇒ x_{n+1} = x_1

66 / 160 Partition is NO ⇒ DGP1 is NO

I By contradiction: suppose DGP₁ is YES, x a realization of C

I F = {{u, v} ∈ E(C) | x_u ≤ x_v},  E(C) ∖ F = {{u, v} ∈ E(C) | x_u > x_v}
I Trace x_1, ..., x_n: follow edges in F (→) and in E(C) ∖ F (←)

    ∑_{{u,v}∈F} (x_v − x_u) = ∑_{{u,v}∉F} (x_u − x_v)
    ∑_{{u,v}∈F} |x_u − x_v| = ∑_{{u,v}∉F} |x_u − x_v|
    ∑_{{u,v}∈F} d_uv = ∑_{{u,v}∉F} d_uv

I Let J = {i < n | {i, i + 1} ∈ F} ∪ {n | {n, 1} ∈ F}
    ⇒ ∑_{i∈J} a_i = ∑_{i∉J} a_i

I So J solves the Partition instance, contradiction

I ⇒ DGP is NP-hard; since DGP₁ ∈ NP, DGP₁ is also NP-complete

67 / 160 Subsection 3

Number of solutions

68 / 160 Number of solutions

I (G, K): DGP instance
I X̃ ⊆ R^{Kn}: set of solutions
I Congruence: composition of translations, rotations, reflections
I C = set of congruences in R^K
I x ∼ y means ∃ρ ∈ C (y = ρx): distances in x are preserved in y through ρ
I ⇒ if |X̃| > 0, |X̃| = 2^{ℵ₀}

69 / 160 Number of solutions modulo congruences

I Congruence is an equivalence relation ∼ on X̃ (reflexive, symmetric, transitive)

I It partitions X̃ into equivalence classes

I X = X̃/∼: set of representatives of the equivalence classes

I Focus on |X| rather than |X˜|

70 / 160 Examples

71 / 160 Rigidity, flexibility and |X|

I infeasible ⇔ |X| = 0

I rigid graph ⇔ |X| < ℵ0

I globally rigid graph ⇔ |X| = 1

I flexible graph ⇔ |X| = 2^{ℵ₀}

I DMDGP graphs ⇔ |X| a power of 2

I |X| = ℵ0: impossible by Milnor’s theorem

[Milnor 1964, L. et al. 2013]

72 / 160 Milnor's theorem implies |X| ≠ ℵ₀

I System S of polynomial equations of degree 2

∀i ≤ m pi(x1, . . . , xnK ) = 0

I Let X be the set of x ∈ R^{nK} satisfying S
I The number of connected components of X is O(3^{nK}) [Milnor 1964]

I If |X| is countably ∞ then G cannot be flexible
  ⇒ incongruent elements of X are separate connected components
  ⇒ by Milnor's theorem, there are finitely many of them

73 / 160 Subsection 4

MP based solution methods

74 / 160 Direct methods

75 / 160 System of quadratic equations

    ∀{u, v} ∈ E   ||x_u − x_v||² = d²_uv      (4)

Computationally: useless (less than 10 vertices with K = 3 using Octave)

76 / 160 Unconstrained Global Optimization

    min_x  ∑_{{u,v}∈E} ( ||x_u − x_v||² − d²_uv )²      (5)

Globally optimal obj. fun. value of (5) is 0 if x solves (4)
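A minimal sketch of solving formulation (5) with an off-the-shelf local NLP routine from a random start (scipy and the toy unit-square instance are assumptions, not the solvers or data of [L. et al., 2006]).

    import numpy as np
    from scipy.optimize import minimize

    def dgp_penalty(x_flat, edges, d, n, K):
        x = x_flat.reshape(n, K)
        return sum((np.sum((x[u] - x[v])**2) - d[(u, v)]**2)**2 for u, v in edges)

    # hypothetical toy instance: unit square with one diagonal, K = 2
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    d = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1, (0, 2): 2**0.5}
    n, K = 4, 2
    res = minimize(dgp_penalty, np.random.default_rng(0).standard_normal(n * K),
                   args=(edges, d, n, K))
    print(res.fun)   # ~0 on a YES instance if the local solver reaches a global optimum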

Computational experiments in [L. et al., 2006]:

I GO solvers from 10 years ago

I randomly generated protein data: ≤ 50 atoms

I cubic crystallographic grids: ≤ 64 atoms

77 / 160 Constrained global optimization

I min_x ∑_{{u,v}∈E} | ||x_u − x_v||² − d²_uv |  exactly reformulates (4)

I Relax objective f to concave part, remove constant term, rewrite min −f as max f

I Reformulate convex part of obj. fun. to convex constraints

I Exact reformulation

    max_x  ∑_{{u,v}∈E} ||x_u − x_v||²
    s.t.   ∀{u, v} ∈ E   ||x_u − x_v||² ≤ d²_uv      (6)

Theorem (Activity) At a glob. opt. x∗ of a YES instance, all constraints of (6) are active

[Mencarelli et al. 2017]

78 / 160 Semidefinite Programming

79 / 160 Linearization

    ∀{i, j} ∈ E   ||x_i − x_j||²₂ = d²_ij

⇒  ∀{i, j} ∈ E   ||x_i||²₂ + ||x_j||²₂ − 2 x_i · x_j = d²_ij

⇒  { ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij
     X = x x^⊤ }

X = x x^⊤  ⇔  ∀i, j   X_ij = x_i x_j

80 / 160 Relaxation

X = x x^⊤  ⇒  X − x x^⊤ = 0

(relax)  ⇒  X − x x^⊤ ⪰ 0

⇔  Schur(X, x) = [ I_K   x^⊤
                   x     X  ] ⪰ 0

If x does not appear elsewhere ⇒ get rid of it (e.g. choose x = 0):

replace Schur(X, x) ⪰ 0 by X ⪰ 0

Reason for this weird relaxation: there are efficient solvers for Semidefinite Programming (SDP)

81 / 160 SDP relaxation

    min   F • X
    s.t.  ∀{i, j} ∈ E   X_ii + X_jj − 2X_ij = d²_ij
          X ⪰ 0

How do we choose F ?

82 / 160 Some possible objective functions

I For protein conformation:
    max  ∑_{{i,j}∈E} (X_ii + X_jj − 2X_ij)

with = changed to ≤ in constraints (or min and ≥) [Dias & L. 2016] “push-and-pull” the realization

I [Ye, 2003], application to wireless sensors localization

min tr(X)

improve covariance estimator accuracy

I How about “just random”? [Barvinok 1995]
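A hedged sketch of the "push-and-pull" SDP relaxation in its "min and ≥" form, assuming cvxpy with an SDP-capable solver is installed (the talk's experiments use Mosek); the function name `pushpull_sdp` is hypothetical.

    import cvxpy as cp

    def pushpull_sdp(n, edges, d):
        X = cp.Variable((n, n), PSD=True)
        cons = [X[i, i] + X[j, j] - 2 * X[i, j] >= d[(i, j)]**2 for i, j in edges]
        obj = cp.Minimize(sum(X[i, i] + X[j, j] - 2 * X[i, j] for i, j in edges))
        cp.Problem(obj, cons).solve()
        return X.value          # approximate Gram matrix; retrieve x via MDS/PCA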

83 / 160 Computational evaluation

I Download protein files from the Protein Data Bank (PDB); they contain atom realizations

I Mimick a Nuclear Magnetic Resonance experiment Keep only pairwise distances < 5.5

I Try and reconstruct the protein shape from those weighted graphs I Quality evaluation of results:

I LDE(x) = max_{{i,j}∈E} | ||x_i − x_j|| − d_ij |
I MDE(x) = (1/|E|) ∑_{{i,j}∈E} | ||x_i − x_j|| − d_ij |

84 / 160 Objective function tests

SDP solved with Mosek

SDP + PCA
Instance                      LDE                  MDE                  CPU
Name           |V|   |E|   PP     Ye    Rnd     PP    Ye    Rnd     PP    Ye    Rnd
C0700odd.1      15    39   3.31   4.57  4.44   1.92  2.52  2.50    0.13  0.07  0.08
C0700odd.C      36   242  10.61   4.85  4.85   3.02  3.02  3.02    0.69  0.43  0.44
C0700.odd.G     36   308   4.57   4.77  4.77   2.41  2.84  2.84    0.86  0.54  0.54
C0150alter.1    37   335   4.66   4.88  4.86   2.52  3.00  3.00    0.97  0.59  0.58
C0080create.1   60   681   7.17   4.86  4.86   3.08  3.19  3.19    2.48  1.46  1.46
tiny            37   335   4.66   4.88  4.88   2.52  3.00  3.00    0.97  0.60  0.60
1guu-1         150   959  10.20   4.93  4.93   3.43  3.43  3.43    9.23  5.68  5.70

SDP + PCA + NLP
Instance                      LDE                  MDE                  CPU
Name           |V|   |E|   PP     Ye    Rnd     PP    Ye    Rnd     PP     Ye     Rnd
1b03            89   456   0.00   0.00  0.00   0.00  0.00  0.00    8.69   6.28   9.91
1crn           138   846   0.81   0.81  0.81   0.07  0.07  0.07   33.33  31.32  44.48
1guu-1         150   959   0.97   4.93  0.92   0.10  3.43  0.08   56.45   7.89  65.33

[Dias & L., 2016]

85 / 160 Empirical considerations

I Ye: very fast but often imprecise
I Random: good but nondeterministic
I Push-and-Pull: relaxes X_ii + X_jj − 2X_ij = d²_ij to X_ii + X_jj − 2X_ij ≥ d²_ij, easier to satisfy feasibility;
  ...will be useful in DDP later on

Focus on Push-and-Pull objective

86 / 160 Diagonally Dominant Programming

87 / 160 When SDP solvers hit their size limit

I SDP solver: technological bottleneck
I How can we best use an LP solver?
I Diagonally Dominant (DD) matrices are PSD
I Not vice versa: inner approximation of the PSD cone Y ⪰ 0
I Idea by A.A. Ahmadi and co-authors

[Ahmadi & Majumdar 2014, Ahmadi & Hall 2015]

88 / 160 Diagonally dominant matrices

An n × n matrix X is DD if
    ∀i ≤ n   X_ii ≥ ∑_{j≠i} |X_ij|.

E.g.
     1      0.1   −0.2    0     0.04    0
     0.1    1     −0.05   0.1   0       0
    −0.2   −0.05   1      0.1   0.01    0
     0      0.1    0.1    1     0.2     0.3
     0.04   0      0.01   0.2   1      −0.3
     0      0      0      0.3  −0.3     1

89 / 160 DD Linearization

    ∀i ≤ n   X_ii ≥ ∑_{j≠i} |X_ij|      (∗)

I introduce a "sandwiching" variable T
I write |X| as T
I add constraints −T ≤ X ≤ T
I by the ≥ constraint sense, write (∗) as
    ∀i ≤ n   X_ii ≥ ∑_{j≠i} T_ij
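A small numpy check (an illustration, not from the talk) of the fact used here: a symmetric matrix satisfying the DD condition above is PSD; the function name `is_dd` and the 3×3 example are hypothetical.

    import numpy as np

    def is_dd(X, tol=1e-9):
        off = np.abs(X).sum(axis=1) - np.abs(np.diag(X))   # sum_{j != i} |X_ij|
        return bool(np.all(np.diag(X) >= off - tol))

    X = np.array([[ 1.0 ,  0.1 , -0.2 ],
                  [ 0.1 ,  1.0 , -0.05],
                  [-0.2 , -0.05,  1.0 ]])
    print(is_dd(X), np.all(np.linalg.eigvalsh(X) >= 0))    # True True: DD implies PSD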

90 / 160 DD Programming (DDP)

    { ∀{i, j} ∈ E   X_ii + X_jj − 2X_ij = d²_ij
      X is DD }

⇒  { ∀{i, j} ∈ E   X_ii + X_jj − 2X_ij = d²_ij
      ∀i ≤ n   ∑_{j≠i} T_ij ≤ X_ii
      −T ≤ X ≤ T }

91 / 160 DDP formulation for the DGP

    min   ∑_{{i,j}∈E} (X_ii + X_jj − 2X_ij)
    s.t.  ∀{i, j} ∈ E   X_ii + X_jj − 2X_ij ≥ d²_ij
          ∀i ≤ n   ∑_{j≠i} T_ij ≤ X_ii
          −T ≤ X ≤ T
          T ≥ 0

[Dias & L., 2016]

92 / 160 SDP vs. DDP: tests

Using “push-and-pull” objective in SDP SDP solved with Mosek, DDP with CPLEX

SDP/DDP + PCA
                        SDP                            DDP
Instance        LDE   MDE   CPU modl/soln      LDE   MDE   CPU modl/soln
C0700odd.1      0.79  0.34  0.06/0.12          0.38  0.30  0.15/0.15
C0700.odd.G     2.38  0.89  0.57/1.16          1.86  0.58  1.11/0.95
C0150alter.1    1.48  0.45  0.73/1.33          1.54  0.55  1.23/1.04
C0080create.1   2.49  0.82  1.63/7.86          0.98  0.67  3.39/4.07
1guu-1          0.50  0.15  6.67/684.89        1.00  0.85  37.74/153.17

93 / 160 Subsection 5

Barvinok’s naive algorithm

94 / 160 Concentration of measure

From [Barvinok, 1997] The value of a “well behaved” function at a random point of a “big” probability space X is “very close” to the mean value of the function. and In a sense, measure concentration can be considered as an extension of the law of large numbers.

95 / 160 Concentration of measure

Given a Lipschitz function f : X → R s.t.

    ∀x, y ∈ X   |f(x) − f(y)| ≤ L ||x − y||₂

for some L ≥ 0, there is concentration of measure if ∃ constants c, C s.t.
    ∀ε > 0   P_x( |f(x) − E(f)| > ε ) ≤ c e^{−Cε²/L²}

≡ “discrepancy from mean is unlikely”

96 / 160 Barvinok’s theorem

Consider:
I for each k ≤ m, the manifolds X_k = {x ∈ R^n | x^⊤ Q^k x = a_k}
I a feasibility problem x ∈ ∩_{k≤m} X_k

I its SDP relaxation ∀k ≤ m (Q^k • X = a_k) with soln. X̄

Let T = factor(X̄), y ∼ N^n(0, 1) and x0 = T y

Then ∃ c and n₀ ∈ N s.t. if n ≥ n₀,
    Prob( ∀k ≤ m   dist(x0, X_k) ≤ c √(||X̄||₂ ln n) ) ≥ 0.9.

IDEA: since x0 is "close" to each X_k, try local Nonlinear Programming (NLP)

97 / 160 Application to the DGP

I ∀{i, j} ∈ E   X_ij = {x ∈ R^{nK} | ||x_i − x_j||²₂ = d²_ij}
I The DGP can be written as ∩_{{i,j}∈E} X_ij

I SDP relaxation: X_ii + X_jj − 2X_ij = d²_ij ∧ X ⪰ 0, with soln. X̄

I Difference with Barvinok: x ∈ R^{Kn}, rk(X̄) ≤ K
I IDEA: sample y ∼ N^{nK}(0, 1/√K)

I Thm. Barvinok's theorem works in rank K [L. & Vu, unpublished]

98 / 160 The heuristic

1. Solve the SDP relaxation of the DGP, get soln. X̄
   (use DDP+LP if SDP+IPM is too slow)
2. a. T = factor(X̄)
   b. y ∼ N^{nK}(0, 1/√K)
   c. x0 = T y
3. Use x0 as starting point for a local NLP solver on the formulation

    min_x  ∑_{{i,j}∈E} ( ||x_i − x_j||² − d²_ij )²

and return the improved solution x

[Dias & L., 2016]
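A minimal sketch of the heuristic's sampling step (assumed glue code): factor the SDP/DDP solution, sample x0 = T y with y ∼ N(0, 1/√K), then refine x0 locally; it reuses the hypothetical `pushpull_sdp` and `dgp_penalty` helpers sketched earlier.

    import numpy as np
    from scipy.optimize import minimize

    def barvinok_start(Xbar, K, seed=0):
        lam, P = np.linalg.eigh(Xbar)                     # Xbar is PSD up to numerics
        T = P @ np.diag(np.sqrt(np.clip(lam, 0, None)))   # T = factor(Xbar), n x n
        y = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(K), size=(T.shape[1], K))
        return T @ y                                      # n x K starting realization

    # x0 = barvinok_start(pushpull_sdp(n, edges, d), K)
    # x  = minimize(dgp_penalty, x0.ravel(), args=(edges, d, n, K)).x.reshape(n, K)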

99 / 160 SDP+Barvinok vs. DDP+Barvinok

                        SDP                      DDP
Instance        LDE   MDE     CPU        LDE   MDE     CPU
C0700odd.1      0.00  0.00      0.63     0.00  0.00      1.49
C0700.odd.G     0.00  0.00     21.67     0.42  0.01     30.51
C0150alter.1    0.00  0.00     29.30     0.00  0.00     34.13
C0080create.1   0.00  0.00    139.52     0.00  0.00    141.49
1b03            0.18  0.01    132.16     0.38  0.05    101.04
1crn            0.78  0.02    800.67     0.76  0.04    522.60
1guu-1          0.79  0.01   1900.48     0.90  0.04    667.03

Most of the CPU time taken by local NLP solver

100 / 160 Subsection 6

Isomap for the DGP

101 / 160 Isomap for DG

1. Let D0 be the (square) weighted adjacency matrix of G
2. Complete D0 to an approximate EDM D̃
3. MDS/PCA on D̃ ⇒ obtain an embedding x ∈ R^K for the given K

Vary Step 2 to generate Isomap heuristics

[Tenenbaum et al. 2000, L. & D’Ambrosio 2017]
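A minimal sketch of variant A below (classic Isomap): complete the partial distance matrix with shortest paths, then run classic MDS. scipy's `shortest_path` and the hypothetical `classic_mds` helper sketched earlier are assumptions.

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    def isomap_embedding(W, K):
        """W: n x n weighted adjacency matrix of G (0 where there is no edge)."""
        D_tilde = shortest_path(W, method='FW', directed=False)   # Floyd-Warshall completion
        return classic_mds(D_tilde, K)                            # Step 3: MDS/PCA in R^K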

102 / 160 Variants for Step 2

A. Floyd-Warshall all-shortest-paths algorithm on G (classic Isomap)
B. Find a spanning tree (SPT) of G, compute any embedding x̄ ∈ R^K for the SPT, use its EDM
C. Solve a push-and-pull SDP relaxation, get soln. x̄ ∈ R^n, use its EDM
D. Solve an SDP relaxation with "Barvinok objective", find x̄ ∈ R^r (with r ≤ ⌊(√(8|E| + 1) − 1)/2⌋), use its EDM
   (haven't really talked about this, sorry)

Post-processing: x̃ as starting point for NLP descent in the GO formulation

[L. & D’Ambrosio 2017]

103 / 160 Results Comparison with dgsol [Moré, Wu 1997]

104 / 160 Large instances

Instance                  mde                lde                 CPU
Name     |V|    |E|    IsoNLP  dgsol     IsoNLP  dgsol      IsoNLP    dgsol
water     648  11939   0.005   0.15      0.557   0.81        26.98    15.16
3al1      678  17417   0.036   0.007     0.884   0.810      170.91   210.25
1hpv     1629  18512   0.074   0.078     0.936   0.932      374.01    60.28
il2      2084  45251   0.012   0.035     0.910   0.932      465.10   139.77
1tii     5684  69800   0.078   0.077     0.950   0.897     7400.48   454.375

105 / 160 Outline


106 / 160 Distance resolution limit Clustering in high dimensions is unstable

107 / 160 Nearest Neighbours k-Nearest Neighbours (k-NN). Given:

I k ∈ N
I a distance function d : R^n × R^n → R₊
I a set X ⊂ R^n
I a point z ∈ R^n ∖ X,
find the subset Y ⊂ X such that:
(a) |Y| = k
(b) ∀y ∈ Y, x ∈ X ∖ Y   (d(z, y) ≤ d(z, x))

I basic problem in data science
I pattern recognition, computational geometry, machine learning, data compression, robotics, recommender systems, information retrieval, natural language processing and more
I Example: used in Step 2 of k-means: assign points to the closest centroid

[Cover & Hart 1967]

108 / 160 With random variables

I Consider 1-NN
I Let ℓ = |X|
I Distance function family {d^m : R^n × R^n → R₊}_m
I For each m:
  I a random variable Z^m with some distribution over R^n
  I for i ≤ ℓ, a random variable X^m_i with some distribution over R^n
  I X^m_i iid w.r.t. i, Z^m independent of all X^m_i
  I D^m_min = min_{i≤ℓ} d^m(Z^m, X^m_i)
  I D^m_max = max_{i≤ℓ} d^m(Z^m, X^m_i)

109 / 160 Distance Instability Theorem

I Let p > 0 be a constant
I If

    ∃i ≤ ℓ   (d^m(Z^m, X^m_i))^p converges as m → ∞

then, for any ε > 0,

    closest and furthest point are at about the same distance

Note: "∃i" suffices since ∀m the X^m_i are iid w.r.t. i

[Beyer et al. 1999]

110 / 160 Distance Instability Theorem

I Let p > 0 be a constant
I If

    ∃i ≤ ℓ   lim_{m→∞} Var( (d^m(Z^m, X^m_i))^p ) = 0

then, for any ε > 0,

    lim_{m→∞} P( D^m_max ≤ (1 + ε) D^m_min ) = 1

Note: "∃i" suffices since ∀m the X^m_i are iid w.r.t. i

[Beyer et al. 1999]
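A small empirical illustration (an assumption, not from the talk): for iid uniform data and query points, the ratio D^m_max / D^m_min of farthest to nearest neighbour distance approaches 1 as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for m in (2, 10, 100, 1000):
        X = rng.random((1000, m))          # data points
        z = rng.random(m)                  # query point
        dist = np.linalg.norm(X - z, axis=1)
        print(m, dist.max() / dist.min())  # ratio shrinks towards 1 as m grows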

111 / 160 Preliminary results

I Lemma. {B^m}_m seq. of rnd. vars with finite variance, lim_{m→∞} E(B^m) = b and lim_{m→∞} Var(B^m) = 0; then
    ∀ε > 0   lim_{m→∞} P( ||B^m − b|| ≤ ε ) = 1

  denoted B^m →_P b

I Slutsky's theorem. {B^m}_m seq. of rnd. vars and g a continuous function; if B^m →_P b and g(b) exists, then g(B^m) →_P g(b)

I Corollary. If {A^m}_m, {B^m}_m are seq. of rnd. vars s.t. A^m →_P a and B^m →_P b ≠ 0, then {A^m/B^m}_m →_P a/b

112 / 160 Proof

1. μ_m = E((d^m(Z^m, X^m_i))^p) is independent of i (since all X^m_i are iid)
2. V_m = (d^m(Z^m, X^m_i))^p / μ_m →_P 1:

   I E(V_m) = 1 (rnd. var. over its mean) ⇒ lim_m E(V_m) = 1
   I Hypothesis of thm. ⇒ lim_m Var(V_m) = 0
   I Lemma ⇒ V_m →_P 1
3. D^m = ((d^m(Z^m, X^m_i))^p | i ≤ ℓ) →_P 1 (X^m_i iid)
4. Slutsky's thm. ⇒ min(D^m) →_P min(1) = 1, similarly for max
5. Corollary ⇒ max(D^m)/min(D^m) →_P 1

6. D^m_max / D^m_min = (μ_m max(D^m)) / (μ_m min(D^m)) →_P 1
7. The result follows (defn. of →_P and D^m_max ≥ D^m_min)

113 / 160 Subsection 1

When to start worrying

114 / 160 When it applies

I iid random variables from any distribution

I Particular forms of correlation, e.g. U_i ∼ Uniform(0, √i), X_1 = U_1, X_i = U_i + X_{i−1}/2 for i > 1

I Variance tending to zero, e.g. X_i ∼ N(0, 1/i)

I Discrete uniform distribution on the m-dimensional hypercube, for both data and query
I Computational experiments: instability already with n > 15

115 / 160 Example of k-means in R100

116 / 160 ...and when it doesn’t

I Complete linear dependence on all distributions: can be reduced to NN in 1D

I Exact and approximate matching: query point = (or ≈) data point
I Query point in a well-separated cluster in the data
I Implicitly low dimensionality: project; but NN must be stable in the lower dimension

117 / 160 Outline


118 / 160 Random projections The mathematics of big data

119 / 160 The magic of random projections

I “Mathematics of big data” I In a nutshell

I Clustering on A′ rather than A yields approx. the same results with arbitrarily high probability (wahp)

[Johnson & Lindenstrauss, 1984]

120 / 160 The magic of random projections

I "Mathematics of big data"
I In a nutshell:
  1. Given points A_1, ..., A_n ∈ R^m with m large and ε ∈ (0, 1)
  2. Pick an "appropriate" k ≈ O((1/ε²) ln n)
  3. Sample a k × m matrix T (each comp. i.i.d. N(0, 1/√k))
  4. Consider the projected points A′_i = T A_i ∈ R^k for i ≤ n
  5. With prob → 1 exponentially fast as k → ∞:

    ∀i, j ≤ n   (1−ε)||A_i − A_j||₂ ≤ ||A′_i − A′_j||₂ ≤ (1+ε)||A_i − A_j||₂

[Johnson & Lindenstrauss, 1984]

121 / 160 The shape of a set of points

I Lose dimensions but not too much accuracy:
  Given A_1, ..., A_n ∈ R^m, find k ≪ m and points A′_1, ..., A′_n ∈ R^k s.t. A and A′ "have almost the same shape"
I What is the shape of a set of points?

  congruent sets have the same shape
I Approximate congruence ⇔ distortion: A, A′ have almost the same shape if
    ∀i < j ≤ n   (1 − ε)||A_i − A_j|| ≤ ||A′_i − A′_j|| ≤ (1 + ε)||A_i − A_j||
  for some small ε > 0

Assume norms are all Euclidean

122 / 160 Losing dimensions = “projection”

In the plane, hopeless

In 3D: no better

123 / 160 Johnson-Lindenstrauss Lemma

Thm.
Given A ⊆ R^m with |A| = n and ε > 0 there is k ∼ O((1/ε²) ln n) and a k × m matrix T s.t.

    ∀x, y ∈ A   (1 − ε)||x − y|| ≤ ||Tx − Ty|| ≤ (1 + ε)||x − y||

If the k × m matrix T is sampled componentwise from N(0, 1/√k), then A and TA have almost the same shape

[Johnson & Lindenstrauss, 1984]
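A minimal sketch of such a random projection (numpy/scipy and the toy point set are assumptions): sample T componentwise from N(0, 1/√k) and check the worst-case pairwise distortion.

    import numpy as np
    from scipy.spatial.distance import pdist

    def jl_project(A, eps, seed=0):
        n, m = A.shape
        k = int(np.ceil(np.log(n) / eps**2))                  # k ~ O(log n / eps^2)
        T = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(k), size=(k, m))
        return A @ T.T                                        # projected points T A_i

    A = np.random.default_rng(1).random((100, 10000))
    Ap = jl_project(A, eps=0.2)
    ratio = pdist(Ap) / pdist(A)
    print(ratio.min(), ratio.max())                           # typically close to (1-eps, 1+eps)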

125 / 160 Sketch of a JLL proof by pictures
Thm. Let T be a k × m rectangular matrix with each component sampled from N(0, 1/√k), and u ∈ R^m s.t. ||u|| = 1. Then E(||Tu||²) = 1

125 / 160 Sampling to desired accuracy

I Distortion has low probability:
    ∀x, y ∈ A   P( ||Tx − Ty|| ≤ (1 − ε)||x − y|| ) ≤ 1/n²
    ∀x, y ∈ A   P( ||Tx − Ty|| ≥ (1 + ε)||x − y|| ) ≤ 1/n²

I Probability that ∃ a distorting pair x, y ∈ A: union bound over the (n choose 2) pairs
    P( ¬(A and TA have almost the same shape) ) ≤ 2 (n choose 2) / n² = 1 − 1/n
    P( A and TA have almost the same shape ) ≥ 1/n
  ⇒ re-sampling T gives JLL with arbitrarily high probability

[Dasgupta & Gupta, 2002]

126 / 160 In practice

I Empirically, sample T very few times (e.g. once will do!): on average ||Tx − Ty|| ≈ ||x − y||, and distortion decreases exponentially with n

We only need a logarithmic number of dimensions in function of the number of points

Surprising fact: k is independent of the original number of dimensions m

127 / 160 Subsection 1

More efficient clustering

128 / 160 Clustering Google images

[L. & Lavor, in press]

129 / 160 Recall k-means

I Input: X = set of images (as vectors in R^{|pixels|})
I Output: a k-partition of X

1. pick k random centroid candidates 2. assign points to closest centroid 3. recompute correct centroid for assigned points 4. repeat from Step 2 until stability

130 / 160 Without random projections

VHimg = Map[Flatten[ImageData[#]] &, Himg];

VHcl = Timing[ClusteringComponents[VHimg, 3, 1]] Out[29]= {0.405908, {1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3}}

Too slow!

131 / 160 With random projections

Get["Projection.m"]; VKimg = JohnsonLindenstrauss[VHimg, 0.1]; VKcl = Timing[ClusteringComponents[VKimg, 3, 1]] Out[34]= {0.002232, {1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3}}

From 0.405 CPU time to 0.00232 Same clustering

132 / 160 Works on the MSSC MP formulation too!

P P 2  min kTp i − Ty jk2 xij  x,y,s i≤n j≤d  1 P  ∀j ≤ d Tp ixij = Ty j  sj  i≤n  P  ∀i ≤ n xij = 1  j≤d P ∀j ≤ d xij = sj  i≤n  m  ∀j ≤ dy j ∈ R  nd  x ∈ {0, 1}  d  s ∈ N 

where T is a k × m random projector

133 / 160 Works on the MSSC MP formulation too!

P P 0 2  min kTp i − y k xij 0 j 2  x,y ,s i≤n j≤d  1 P 0  ∀j ≤ d Tp ixij = yj  sj  i≤n  P  ∀i ≤ n xij = 1  j≤d 0 P (MSSC ) ∀j ≤ d xij = sj  i≤n  0 k  ∀j ≤ dy j ∈ R  nd  x ∈ {0, 1}  d  s ∈ N 

1 I where k = O( ε2 ln n) 0 I less data, |y | < |y| ⇒ get solutions faster

I Yields smaller cMINLP

134 / 160 Subsection 2

Random projections in Linear Programming

135 / 160 The gist

I Let A, b be very large, consider LP

    min{ c^⊤x | Ax = b ∧ x ≥ 0 }

I T short & fat normally sampled I Then

Ax = b ∧ x ≥ 0 ⇔ T Ax = T b ∧ x ≥ 0

wahp

[Vu & al. to appear]
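A minimal sketch (assumed toy data, scipy's linprog rather than the solvers used in the experiments): the projected system TAx = Tb keeps the original feasible solutions and is much smaller, and the projected optimum relaxes the original one.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, n, k = 500, 1000, 60
    A = rng.random((m, n))
    b = A @ rng.random(n)                     # feasible by construction
    c = rng.random(n)
    T = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, m))

    full = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))
    proj = linprog(c, A_eq=T @ A, b_eq=T @ b, bounds=(0, None))
    print(full.fun, proj.fun)                 # v(TP) <= v(P), close for suitable k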

136 / 160 Losing dimensions

Restricted Linear Membership (RLM_X)
Given A_1, ..., A_n, b ∈ R^m and X ⊆ R^n, ∃? x ∈ X s.t.
    b = ∑_{i≤n} x_i A_i

Given X ⊆ R^n and b, A_1, ..., A_n ∈ R^m, find k ≪ m and b′, A′_1, ..., A′_n ∈ R^k such that:

    ∃x ∈ X   b = ∑_{i≤n} x_i A_i      (high dimensional)
  if
    ∃x ∈ X   b′ = ∑_{i≤n} x_i A′_i     (low dimensional)

with high probability

137 / 160 Projecting feasibility

138 / 160 Projecting infeasibility (easy cases)
Thm.
T : R^m → R^k a JLL random projection, b, A_1, ..., A_n ∈ R^m an RLM_X instance. For any given vector x ∈ X, we have:

(i) If b = ∑_{i=1}^n x_i A_i then Tb = ∑_{i=1}^n x_i T A_i

(ii) If b ≠ ∑_{i=1}^n x_i A_i then P( Tb ≠ ∑_{i=1}^n x_i T A_i ) ≥ 1 − 2e^{−Ck}

(iii) If b ≠ ∑_{i=1}^n y_i A_i for all y ∈ X ⊆ R^n, where |X| is finite, then
    P( ∀y ∈ X   Tb ≠ ∑_{i=1}^n y_i T A_i ) ≥ 1 − 2|X| e^{−Ck}

for some constant C > 0 (independent of n, k). [Vu & al. to appear, arXiv:1507.00990v1/math.OC]

139 / 160 Separating hyperplanes

When |X| is large, project separating hyperplanes instead

I Convex C ⊆ R^m, x ∉ C: then ∃ a hyperplane c separating x and C
I In particular, true if C = cone(A_1, ..., A_n) for A ⊆ R^m

I We can show x ∈ C ⇔ Tx ∈ TC with high probability

I As above, if x ∈ C then Tx ∈ TC by the linearity of T; the difficult part is proving the converse

We can also project point-to-cone distances

140 / 160 Projecting the separation
Thm.
Given b, A_1, ..., A_n ∈ R^m of unit norm s.t. b ∉ cone{A_1, ..., A_n} pointed, ε > 0, c ∈ R^m s.t. c^⊤b < −ε, c^⊤A_i ≥ ε (i ≤ n), and T a random projector:
    P( Tb ∉ cone{TA_1, ..., TA_n} ) ≥ 1 − 4(n + 1) e^{−C(ε²−ε³)k}
for some constant C.
Proof
Let A be the event that T approximately preserves ||c − χ||² and ||c + χ||² for all χ ∈ {b, A_1, ..., A_n}. Since A consists of 2(n + 1) events, by the JLL Corollary (squared version) and the union bound, we get
    P(A) ≥ 1 − 4(n + 1) e^{−C(ε²−ε³)k}
Now consider χ = b:

    ⟨Tc, Tb⟩ = ¼ ( ||T(c + b)||² − ||T(c − b)||² )
             ≤ ¼ ( ||c + b||² − ||c − b||² ) + (ε/4) ( ||c + b||² + ||c − b||² )    [by JLL]
             = c^⊤b + ε < 0

and similarly ⟨Tc, TA_i⟩ ≥ 0.   [Vu et al. to appear, arXiv:1507.00990v1/math.OC]

141 / 160 The feasibility projection theorem

Thm.
Given δ > 0, ∃ a sufficiently large m ≤ n such that: for any LFP input A, b, where A is m × n, we can sample a random k × m matrix T with k ≪ m and

    P( orig. LFP feasible ⇐⇒ proj. LFP feasible ) ≥ 1 − δ

142 / 160 Projecting optimality

143 / 160 Notation

I P ≡ min{cx | Ax = b ∧ x ≥ 0} (original problem)

I TP ≡ min{cx | T Ax = T b ∧ x ≥ 0} (projected problem)

I v(P ) = optimal objective function value of P

I v(TP ) = optimal objective function value of TP

144 / 160 The optimality projection theorem

I Assume feas(P) is bounded
I Assume all optima of P satisfy ∑_j x_j ≤ θ for some given θ > 0 (prevents cones from being "too flat")
Thm. Given δ > 0,

    v(P) − δ ≤ v(TP) ≤ v(P)      (∗)

holds wahp; in fact (∗) holds with prob. 1 − 4n e^{−C(ε²−ε³)k}, where ε = δ/(2(θ + 1)η) and η = O(||y||₂), with y a dual optimal solution of P having minimum norm

[Vu & al. to appear]

145 / 160 The easy part

Show v(TP ) ≤ v(P ): I Constraints of P : Ax = b ∧ x ≥ 0

I Constraints of TP : T Ax = T b ∧ x ≥ 0

I ⇒ constraints of TP are lin. comb. of constraints of P

I ⇒ any solution of P is feasible in TP (btw, the converse holds almost never)

I P and TP have the same objective function

I ⇒ TP is a relaxation of P ⇒ v(TP ) ≤ v(P )

146 / 160 The hard part (sketch)

I Eq. (7) is equivalent to P for δ = 0:
    cx = v(P) − δ
    Ax = b              (7)
    x ≥ 0
  Note: for δ > 0, Eq. (7) is infeasible

I By the feasibility projection theorem,
    cx = v(P) − δ
    TAx = Tb
    x ≥ 0
  is infeasible wahp for δ > 0
I Hence cx < v(P) − δ ∧ TAx = Tb ∧ x ≥ 0 is infeasible wahp
I ⇒ cx ≥ v(P) − δ holds wahp for x ∈ feas(TP)

I ⇒ v(P) − δ ≤ v(TP)

147 / 160 Solution retrieval

148 / 160 Projected solutions are infeasible in P

I Ax = b ⇒ T Ax = T b by linearity

I However, Thm. For x ≥ 0 s.t. T Ax = T b, Ax = b with probability zero

I Can’t get solution for original LFP using projected LFP!

149 / 160 Solution retrieval from optimal basis

I Primal min{c^⊤x | Ax = b ∧ x ≥ 0} ⇒ dual max{b^⊤y | A^⊤y ≤ c}

I Let x′ = sol(TP) and y′ = sol(dual(TP))

I ⇒ (TA)^⊤ y′ = (A^⊤T^⊤) y′ = A^⊤(T^⊤y′) ≤ c

I ⇒ T^⊤y′ is a solution of dual(P)

I ⇒ we can compute an optimal basis J for P

I Solve A_J x_J = b, get x_J, obtain a solution x∗ of P
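A minimal sketch (assumed, continuing the toy data and imports of the previous projected-LP sketch): solve the dual of TP, map its solution u to y = T^⊤u, which is feasible for dual(P) and thus certifies a lower bound b^⊤y ≤ v(P); identifying the optimal basis from the tight constraints A_j^⊤y = c_j would be the next step.

    # dual of TP: max (Tb)^T u  s.t.  (TA)^T u <= c   (solved as a min by negating)
    dual_TP = linprog(-(T @ b), A_ub=(T @ A).T, b_ub=c, bounds=(None, None))
    u = dual_TP.x
    y = T.T @ u                              # A^T y = (TA)^T u <= c: dual feasible for P
    print(b @ y, full.fun)                   # lower bound on v(P) vs. v(P)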

150 / 160 Solving large quantile regression LPs

151 / 160 Regression

I multivariate random var. X, function y = f(X), sample {(a_i, b_i) ∈ R^p × R | i ≤ m}

I sample mean:

    μ̂ = arg min_{μ∈R} ∑_{i≤m} (b_i − μ)²

I sample mean conditional to X = A = (a_ij):

    ν̂ = arg min_{ν∈R^p} ∑_{i≤m} (b_i − ν a_i)²

152 / 160 Quantile regression

I sample median:
    ξ̂ = arg min_{ξ∈R} ∑_{i≤m} |b_i − ξ|
      = arg min_{ξ∈R} ∑_{i≤m} ( ½ max(b_i − ξ, 0) − ½ min(b_i − ξ, 0) )

I sample τ-quantile:
    ξ̂ = arg min_{ξ∈R} ∑_{i≤m} ( τ max(b_i − ξ, 0) − (1 − τ) min(b_i − ξ, 0) )

I sample τ-quantile conditional to X = A = (a_ij):
    β̂ = arg min_{β∈R^p} ∑_{i≤m} ( τ max(b_i − β a_i, 0) − (1 − τ) min(b_i − β a_i, 0) )

153 / 160 Usefulness

A much better visibility of one’s own poverty!

154 / 160 Linear Programming formulation

    min   τ u⁺ + (1 − τ) u⁻
    s.t.  A(β⁺ − β⁻) + u⁺ − u⁻ = b
          β, u ≥ 0

I parameters: A is m × p, b ∈ R^m, τ ∈ R
I decision variables: β⁺, β⁻ ∈ R^p, u⁺, u⁻ ∈ R^m
I the LP constraint matrix is m × (2p + 2m), with density p/(p + m), which can be high
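A minimal sketch of building and solving this quantile-regression LP on a small dense instance (scipy's linprog and the synthetic data are assumptions; the talk's experiments use CPLEX).

    import numpy as np
    from scipy.optimize import linprog

    def quantile_regression(A, b, tau):
        m, p = A.shape
        # variable order: [beta+ (p), beta- (p), u+ (m), u- (m)], all >= 0
        c = np.concatenate([np.zeros(2 * p), tau * np.ones(m), (1 - tau) * np.ones(m)])
        A_eq = np.hstack([A, -A, np.eye(m), -np.eye(m)])
        res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
        return res.x[:p] - res.x[p:2 * p]                    # beta = beta+ - beta-

    rng = np.random.default_rng(0)
    A = rng.random((200, 5))
    b = A @ np.arange(1, 6) + rng.normal(0, 0.1, 200)
    print(quantile_regression(A, b, 0.5))                    # close to (1, ..., 5) at the median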

155 / 160 Large datasets

I Russia Longitudinal Monitoring Survey, household data (hh1995f)

I m = 3783, p = 855
I A = hh1995f, b = log avg(A)
I 18.5% dense
I poorly scaled data: CPLEX yields infeasible (!!!) after around 70s CPU
I quantreg in R fails

I 14596 RGB photos on my HD, scaled to 90 × 90 pixels

I m = 14596, p = 24300
I each row of A is an image vector, b = ∑ A
I 62.4% dense
I CPLEX killed by the OS after ≈30min (presumably for lack of RAM) on 16GB

156 / 160 Results on large datasets

                                      Projection                  Original
Instance    τ        m      p    k    opt      CPU     feas       opt       CPU    qnt err
hh1995f    0.25    3783    856  411   0.00     8.53   0.038%     71.34     17.05     0.16
           0.50                       0.00     8.44   0.035%     89.17     15.25     0.05
           0.75                       0.00     8.46   0.041%     65.37     31.67     3.91
jpegs      0.25   14596  24300  506   0.00   231.83   0.51%       0.00   3.69E+5     0.04
           0.50                       0.00   227.54   0.51%       0.00   3.67E+5     0.05
           0.75                       0.00   228.57   0.51%       0.00   3.68E+5     0.05
random     0.25    1500    100  363   0.25     2.38   0.01%       1.06      6.00     0.00
           0.50                       0.40     2.51   0.01%       1.34      6.01     0.00
           0.75                       0.25     2.57   0.01%       1.05      5.64     0.00
           0.25    2000    200  377   0.35     4.29   0.01%       2.37     21.40     0.00
           0.50                       0.55     4.37   0.01%       3.10     23.02     0.00
           0.75                       0.35     4.24   0.01%       2.42     21.99     0.00

feas = 100 · ||Ax − b||₂ / (||b||₁/m)
qnt err = ||qnt − proj. qnt||₂ / #cols
IPM with no simplex crossover: solution w/o optimality guarantee, cannot trust results; the simplex method won't work due to ill-scaling and size

157 / 160 Outline


158 / 160 Summary

1. Graphs and weighted graphs necessary to model data: abduction, observation graphs, consistency relations
2. Computers can "reason by analogy" (clustering): modularity clustering
3. Clustering on vectors allows more flexibility: k-means, MSSC
4. Need to embed (weighted) graphs into Euclidean spaces: metric embeddings, Distance Geometry
5. High dimensions make clustering expensive/unstable: distance resolution limit
6. Use random projections to reduce dimensions: Johnson-Lindenstrauss lemma

159 / 160 Some of my references
1. Vu, Poirion, L., Random Projections for Linear Programming, Math. Oper. Res., to appear
2. L., Lavor, Euclidean Distance Geometry: an Introduction, Springer, Zürich, in press
3. Mencarelli, Sahraoui, L., A multiplicative weights update algorithm for MINLP, Eur. J. Comp. Opt., 5:31-86, 2017
4. L., D'Ambrosio, The Isomap algorithm in distance geometry, in Iliopoulos et al. (eds.), Proceedings of SEA, LIPIcs 75(5):1-13, Dagstuhl Publishing, 2017
5. Dias, L., Diagonally Dominant Programming in Distance Geometry, in Cerulli et al. (eds.), Proceedings of ISCO, LNCS 9849:225-246, Springer, Zürich, 2016
6. L., Lavor, Maculan, Mucherino, Euclidean distance geometry and applications, SIAM Review, 56(1):3-69, 2014
7. Cafieri, Hansen, L., Improving heuristics for network modularity maximization using an exact algorithm, Discr. Appl. Math., 163:65-72, 2014
8. L., Lavor, Mucherino, The DMDGP seems easier on proteins, in Mucherino et al. (eds.), Distance Geometry: Theory, Methods, and Applications, Springer, New York, 2013
9. Aloise, Hansen, L., An improved column generation algorithm for minimum sum-of-squares clustering, Math. Prog. A, 131:195-220, 2012
10. Aloise, Cafieri, Caporossi, Hansen, Perron, L., Column generation algorithms for exact modularity maximization in networks, Phys. Rev. E, 82(4):046112, 2010
11. L., Cafieri, Tarissan, Reformulations in Mathematical Programming: A Computational Approach, in Abraham et al. (eds.), Foundations of Computational Intelligence vol. 3, SCI 203:153-234, Springer, Heidelberg, 2009
12. Lavor, L., Maculan, Computational Experience with the Molecular Distance Geometry Problem, in Pintér (ed.), Global Optimization: Scientific and Engineering Case Studies, Springer, Berlin, 2006
160 / 160