Distance Geometry and Data Science
Leo Liberti, CNRS & Ecole Polytechnique [email protected]
Winter School in GCS @ Fields Institute, 2021-01-11 to 15
Most material from [L., TOP 28:271-339, 2020] http://www.lix.polytechnique.fr/~liberti/dgds.pdf
DG’s best known result

Outline
1. DG’s best known result
   - Proof of Heron’s theorem
2. Metric spaces representability
   - Missing distances
   - Noisy distances
   - Principal Component Analysis
   - Summary: Isomap
3. Distance geometry problem
   - Applications
   - Complexity
   - Number of solutions
4. MP formulations
   - Formulations in position variables
   - Formulations in matrix variables
   - Diagonal dominance
   - Dual DD
   - Dimensional reduction
   - Barvinok’s Naive Algorithm
5. High-dimensional weirdness
   - Random projections
   - Distance instability
   - Are these results related?
   - Graph embeddings for ANN
   - A clustering task

A gem in Distance Geometry
▶ Heron’s theorem
▶ Heron lived around year 0
▶ Hung out at Alexandria’s library

  A = √(s(s − a)(s − b)(s − c))

▶ A = area of the triangle with side lengths a, b, c
▶ s = ½(a + b + c)

Useful to measure areas of agricultural land
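A one-line sanity check of the formula, as a small Python sketch (ours, not from the slides): the 3-4-5 right triangle has s = 6 and area 6.

import math

def heron(a, b, c):
    """Area of a triangle with side lengths a, b, c via Heron's formula."""
    s = (a + b + c) / 2  # semiperimeter
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

print(heron(3, 4, 5))  # 6.0, matching (3*4)/2 for the right triangle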
Subsection 1
Proof of Heron’s theorem
Heron’s theorem: Proof [M. Edwards, high school student, 2007]

(Setup from the figure: r is the inradius; x, y, z are the tangent lengths from the three vertices to the incircle, so a = y + z, b = x + z, c = x + y; u, v, w and α, β, γ are the moduli and arguments of the complex numbers below.)

A. 2α + 2β + 2γ = 2π ⇒ α + β + γ = π

   r + ix = u e^{iα},   r + iy = v e^{iβ},   r + iz = w e^{iγ}

   ⇒ (r + ix)(r + iy)(r + iz) = (uvw) e^{i(α+β+γ)} = uvw e^{iπ} = −uvw ∈ ℝ
   ⇒ Im((r + ix)(r + iy)(r + iz)) = 0
   ⇒ r²(x + y + z) = xyz  ⇒  r = √(xyz / (x + y + z))

B. s = ½(a + b + c) = x + y + z
   s − a = x + y + z − y − z = x
   s − b = x + y + z − x − z = y
   s − c = x + y + z − x − y = z

   A = ½(ra + rb + rc) = r · (a + b + c)/2 = rs = √(s(s − a)(s − b)(s − c))

Metric spaces representability
Representing metric spaces in ℝⁿ
▶ Given a metric space (X, d) with distance matrix D = (d_ij), embed X in a Euclidean space with the same distance matrix
▶ Consider the i-th row x_i = (d_i1, ..., d_in) of D
▶ Embed i ∈ X by the vector U^D(i) = x_i ∈ ℝⁿ,
  i.e. define U^D : {1, ..., n} → ℝⁿ s.t. U^D(i) = x_i
▶ Thm.: (U^D, ℓ_∞) is a metric space with distance matrix D,
  i.e. ∀i, j ≤ n  ‖x_i − x_j‖_∞ = d_ij
▶ U^D is called the Universal Isometric Embedding (UIE),
  also known as Fréchet’s embedding [Kuratowski 1935]
▶ Practical issue: the embedding is high-dimensional (ℝⁿ)
Proof

▶ Consider i, j ∈ X with distance d(i, j) = d_ij
▶ Then

  ‖x_i − x_j‖_∞ = max_{k≤n} |d_ik − d_jk| ≤ max_{k≤n} d_ij = d_ij

▶ The inequality ≤ above follows from the triangle inequalities in the metric space:

  d_ik ≤ d_ij + d_jk  ∧  d_jk ≤ d_ij + d_ik
  ⇒ d_ik − d_jk ≤ d_ij  ∧  d_jk − d_ik ≤ d_ij
  ⇒ |d_ik − d_jk| ≤ d_ij

  If valid ∀k, then valid for the max
▶ max_{k≤n} |d_ik − d_jk| is achieved when k ∈ {i, j}: e.g. for k = j it equals |d_ij − d_jj| = d_ij

⇒ ‖x_i − x_j‖_∞ = d_ij
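To make the construction concrete, a small numpy sketch (ours): build the Fréchet/Kuratowski embedding from a distance matrix and check that the ℓ_∞ distances between rows reproduce D exactly.

import numpy as np

# shortest-path metric of the 4-cycle with unit edge weights
D = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)

U = D.copy()   # UIE: point i is embedded as the i-th row of D
n = len(D)
Linf = np.array([[np.abs(U[i] - U[j]).max() for j in range(n)] for i in range(n)])
assert np.allclose(Linf, D)   # l_inf distances between rows recover D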
Subsection 1
Missing distances
UIE from incomplete metrics

▶ If your metric space is missing some distances
▶ Get an incomplete distance matrix D
▶ Cannot define the vectors U^D(i) in the UIE
▶ Note: D defines a graph (vertices 1, 2, 3, 4; missing entries = non-edges)

        ⎛ 0   1   √2   1 ⎞
        ⎜ 1   0   1    ? ⎟
  D  =  ⎜ √2  1   0    1 ⎟
        ⎝ 1   ?   1    0 ⎠

▶ Complete this graph with shortest paths: d_24 = d_21 + d_14 = 2
Floyd-Warshall algorithm 1/2

▶ Given an n × n partial matrix D, computes all shortest-path lengths
▶ For each triplet u, v, z of vertices in the graph, test: when going u → v, is it convenient to pass through z?
▶ If so, then update the path length
Floyd-Warshall algorithm 2/2

# initialization
for u ≤ n, v ≤ n do
  if d_uv = ? then d_uv ← ∞ end if
end for
# main loop
for z ≤ n do
  for u ≤ n do
    for v ≤ n do
      if d_uv > d_uz + d_zv then d_uv ← d_uz + d_zv end if
    end for
  end for
end for
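A runnable Python transcription of the pseudocode above (ours; the missing distances “?” are encoded as np.inf):

import numpy as np

def floyd_warshall(D):
    """All shortest-path lengths from a partial distance matrix.

    Missing entries (the '?' of the slides) must be np.inf."""
    d = D.copy()
    n = len(d)
    for z in range(n):              # candidate intermediate vertex
        for u in range(n):
            for v in range(n):
                if d[u, v] > d[u, z] + d[z, v]:
                    d[u, v] = d[u, z] + d[z, v]
    return d

# the incomplete 4-vertex example: d_24 gets completed to 2
inf = np.inf
D = np.array([[0, 1, 2**0.5, 1],
              [1, 0, 1, inf],
              [2**0.5, 1, 0, 1],
              [1, inf, 1, 0]])
print(floyd_warshall(D)[1, 3])      # 2.0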
Subsection 2

Noisy distances
Schoenberg’s theorem

▶ [I. Schoenberg, Remarks to Maurice Fréchet’s article “Sur la définition axiomatique d’une classe d’espaces distanciés vectoriellement applicable sur l’espace de Hilbert”, Ann. Math., 1935]
▶ Question: when is a given matrix a Euclidean Distance Matrix (EDM)?

Thm.
D = (d_ij) is an EDM iff the matrix (½(d²_1i + d²_1j − d²_ij) | 2 ≤ i, j ≤ n) is PSD, in which case its rank K is the embedding dimension
Gram in function of EDM

▶ x = (x_1, ..., x_n) ⊆ ℝ^K, written as an n × K matrix
▶ the matrix G = x x^⊤ = (x_i · x_j) is the Gram matrix of x
▶ Lemma: G ⪰ 0, and each M ⪰ 0 is the Gram matrix of some x
▶ Useful variant of Schoenberg’s theorem, relating EDMs and Gram matrices:

    G = −½ J D² J        (⋆)

▶ where D² = (d²_ij) and J = I − (1/n) 1 1^⊤, i.e.

      ⎛ 1 − 1/n   −1/n    ···   −1/n   ⎞
  J = ⎜  −1/n    1 − 1/n  ···   −1/n   ⎟
      ⎜   ⋮         ⋮      ⋱      ⋮    ⎟
      ⎝  −1/n     −1/n    ···  1 − 1/n ⎠
Multidimensional scaling (MDS)

▶ Often get approximate EDMs D̃ from raw data (dissimilarities, discrepancies, differences)
▶ G̃ = −½ J D̃² J is an approximate Gram matrix
▶ Approximate Gram ⇒ its spectral decomposition P Λ̃ P^⊤ has Λ̃ ≱ 0
▶ Let Λ be the closest PSD diagonal matrix to Λ̃: zero the negative components of Λ̃
▶ x = P √Λ is an “approximate embedding” of D̃
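The whole classic-MDS pipeline is a few lines of numpy; a sketch (classic_mds is our name), with D̃ the approximate EDM and one output row per point:

import numpy as np

def classic_mds(Dtil):
    """Approximate embedding of an approximate EDM via G = -1/2 J D~^2 J."""
    n = len(Dtil)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    G = -0.5 * J @ (Dtil ** 2) @ J           # approximate Gram matrix
    lam, P = np.linalg.eigh(G)               # spectral decomposition
    lam = np.maximum(lam, 0)                 # zero the negative eigenvalues
    return P @ np.diag(np.sqrt(lam))         # x = P sqrt(Lambda)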
Classic MDS: Main result

1. Prove the lemma: a matrix is Gram iff it is PSD
2. Prove that G = −½ J D² J
Proof of lemma

▶ Gram ⊆ PSD
  ▶ x is an n × K real matrix, G = x x^⊤ its Gram matrix
  ▶ For each y ∈ ℝⁿ we have

    y G y^⊤ = y (x x^⊤) y^⊤ = (yx)(x^⊤ y^⊤) = (yx)(yx)^⊤ = ‖yx‖²₂ ≥ 0

  ▶ ⇒ G ⪰ 0
▶ PSD ⊆ Gram
  ▶ Let G ⪰ 0 be n × n
  ▶ Spectral decomposition: G = P Λ P^⊤ (P orthogonal, Λ ≥ 0 diagonal)
  ▶ Λ ≥ 0 ⇒ √Λ exists
  ▶ G = P Λ P^⊤ = (P √Λ)(√Λ^⊤ P^⊤) = (P √Λ)(P √Λ)^⊤
  ▶ Let x = P √Λ; then G is the Gram matrix of x
Schoenberg’s theorem proof (1/2)

▶ Assume zero centroid WLOG (can translate x as needed)
▶ Expand: d²_ij = ‖x_i − x_j‖²₂ = (x_i − x_j)·(x_i − x_j) = x_i·x_i + x_j·x_j − 2 x_i·x_j   (∗)
▶ Aim at “inverting” (∗) to express x_i·x_j in function of d²_ij
▶ Sum (∗) over i: Σ_i d²_ij = Σ_i x_i·x_i + n x_j·x_j − 2 x_j · Σ_i x_i, where Σ_i x_i = 0 by the zero centroid assumption
▶ Similarly for j; divide both by n, get:

  (1/n) Σ_{i≤n} d²_ij = (1/n) Σ_{i≤n} x_i·x_i + x_j·x_j   (†)
  (1/n) Σ_{j≤n} d²_ij = (1/n) Σ_{j≤n} x_j·x_j + x_i·x_i   (‡)

▶ Sum (†) over j, get:

  (1/n) Σ_{i,j} d²_ij = Σ_i x_i·x_i + Σ_j x_j·x_j = 2 Σ_i x_i·x_i

▶ Divide by n, get:

  (1/n²) Σ_{i,j} d²_ij = (2/n) Σ_i x_i·x_i   (∗∗)
Schoenberg’s theorem proof (2/2)

▶ Rearrange (∗), (†), (‡) as follows:

  2 x_i·x_j = x_i·x_i + x_j·x_j − d²_ij                     (1)
  x_i·x_i = (1/n) Σ_j d²_ij − (1/n) Σ_j x_j·x_j             (2)
  x_j·x_j = (1/n) Σ_i d²_ij − (1/n) Σ_i x_i·x_i             (3)

▶ Replace the LHS of Eq. (2)-(3) in the RHS of Eq. (1), get

  2 x_i·x_j = (1/n) Σ_k d²_ik + (1/n) Σ_k d²_kj − d²_ij − (2/n) Σ_k x_k·x_k

▶ By (∗∗) replace (2/n) Σ_k x_k·x_k with (1/n²) Σ_{h,k} d²_hk, get

  2 x_i·x_j = (1/n) Σ_k (d²_ik + d²_kj) − d²_ij − (1/n²) Σ_{h,k} d²_hk   (⋆)

expressing x_i·x_j in function of D² (showing (⋆) ≡ (2G = −J D² J) is messy but easy)
Subsection 3
Principal Component Analysis
Principal Component Analysis (PCA)

▶ Given an approximate distance matrix D, find x = MDS(D)
▶ However, you want x = P √Λ in K dimensions, but rank(Λ) > K
▶ Only keep the K largest components of Λ, zero the rest
▶ Get an embedding in the desired (lower) dimension
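The truncation step as a numpy sketch (pca_embed is our name; the same pipeline as classic_mds above, keeping only the K largest eigenvalues):

import numpy as np

def pca_embed(Dtil, K):
    """MDS embedding truncated to the K principal components."""
    n = len(Dtil)
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (Dtil ** 2) @ J
    lam, P = np.linalg.eigh(G)     # eigenvalues in ascending order
    lam = np.maximum(lam, 0)
    lam[:-K] = 0                   # keep the K largest, zero the rest
    x = P @ np.diag(np.sqrt(lam))
    return x[:, -K:]               # the K kept coordinate columns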
Example 1/3: Mathematical genealogy skeleton
[figure omitted]
Example 2/3: A partial view

            Euler  Thibaut  Pfaff  Lagrange  Laplace  Möbius  Gudermann  Dirksen  Gauss
Kästner       10      1       1       9         8        2        2          2       2
Euler                11       9       1         3       10       12         12       8
Thibaut                       2      10        10        3        1          1       3
Pfaff                                 8         8        1        3          3       1
Lagrange                                        2        9       11         11       7
Laplace                                                  9       11         11       7
Möbius                                                            4          4       2
Gudermann                                                                    2       4
Dirksen                                                                              4

        ⎛ 0 10  1  1  9  8  2  2  2  2 ⎞
        ⎜10  0 11  9  1  3 10 12 12  8 ⎟
        ⎜ 1 11  0  2 10 10  3  1  1  3 ⎟
        ⎜ 1  9  2  0  8  8  1  3  3  1 ⎟
  D  =  ⎜ 9  1 10  8  0  2  9 11 11  7 ⎟
        ⎜ 8  3 10  8  2  0  9 11 11  7 ⎟
        ⎜ 2 10  3  1  9  9  0  4  4  2 ⎟
        ⎜ 2 12  1  3 11 11  4  0  2  4 ⎟
        ⎜ 2 12  1  3 11 11  4  2  0  4 ⎟
        ⎝ 2  8  3  1  7  7  2  4  4  0 ⎠

(rows/columns ordered: Kästner, Euler, Thibaut, Pfaff, Lagrange, Laplace, Möbius, Gudermann, Dirksen, Gauss)
Example 3/3

[figures omitted: the embeddings in 2D and in 3D]
Subsection 4
Summary: Isomap
Isomap for DG

1. Let D′ be the (square) weighted adjacency matrix of G
2. Complete D′ to an approximate EDM D̃
3. Perform PCA on D̃ given K dimensions:
   (a) Let B̃ = −(1/2) J D̃ J, where J = I − (1/n) 1 1^⊤
   (b) Find eigenvalues/vectors Λ, P so that B̃ = P^⊤ Λ P
   (c) Keep the ≤ K largest nonnegative eigenvalues of Λ to get Λ̃
   (d) Let x̃ = P^⊤ √Λ̃
Vary Step 2 to generate Isomap heuristics

Variants for Step 2
A. Floyd-Warshall all-shortest-paths algorithm on G (classic Isomap)
B. Find a spanning tree (SPT) of G, compute a random realization x̄ ∈ ℝ^K of it, and use its sqEDM
C. and more...
[Liberti et al., SEA 17]
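Putting the pieces together, classic Isomap (variant A) is just the composition of two earlier sketches (floyd_warshall and pca_embed are our names, assumed in scope):

def isomap(adj, K):
    """Classic Isomap: shortest-path completion, then PCA in K dimensions.

    adj: weighted adjacency matrix with np.inf on non-edges."""
    Dtil = floyd_warshall(adj)   # Step 2, variant A
    return pca_embed(Dtil, K)    # Step 3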
Distance geometry problem
The Distance Geometry Problem (DGP)

Given K ∈ ℕ and G = (V, E, d) with d : E → ℝ₊, find x : V → ℝ^K s.t.

  ∀{i, j} ∈ E   ‖x_i − x_j‖²₂ = d²_ij

Defn.: x is a realization of G in ℝ^K

Given a weighted graph, draw it so that edges are drawn as segments with lengths = weights
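Verifying a candidate realization is straightforward; a small numpy sketch (ours):

import numpy as np

def is_realization(x, edges, tol=1e-9):
    """Check that points x (one row per vertex) satisfy all edge distances.

    edges maps a vertex pair (i, j) to its distance d_ij."""
    return all(abs(np.linalg.norm(x[i] - x[j]) - d) <= tol
               for (i, j), d in edges.items())

# the unit square realizes the 4-cycle with unit weights in R^2
x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(is_realization(x, {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}))  # True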
Subsection 1
Applications
Some applications

▶ clock synchronization (K = 1)
▶ sensor network localization (K = 2)
▶ molecular structure from distance data (K = 3)
▶ autonomous underwater vehicles (K = 3)
▶ distance matrix completion (whatever K)
▶ finding graph embeddings
Clock synchronization

From [Singer, Appl. Comput. Harmon. Anal. 2011]:
“Determine a set of unknown timestamps from partial measurements of their time differences”

▶ K = 1
▶ V: timestamps
▶ {u, v} ∈ E if the time difference between u, v is known
▶ d: values of the time differences

Used in time synchronization of distributed networks
Clock synchronization
[figure omitted]
Sensor network localization

From [Yemini, Proc. CDSN, 1978]:
“The positioning problem arises when it is necessary to locate a set of geographically distributed objects using measurements of the distances between some object pairs”

▶ K = 2
▶ V: (mobile) sensors
▶ {u, v} ∈ E if the distance between u, v is measured
▶ d: distance values

Used whenever GPS is not viable (e.g. underwater);
d_uv is roughly proportional to battery consumption in P2P communication between u, v
Sensor network localization
[figure omitted]
Molecular structure from distance data

From [Liberti et al., SIAM Rev., 2014]

▶ K = 3
▶ V: atoms
▶ {u, v} ∈ E if the distance between u, v is known
▶ d: distance values

Used whenever X-ray crystallography does not apply (e.g. liquid);
covalent bond lengths and angles are known precisely;
distances ≲ 5.5 are measured approximately by NMR
Graph embeddings

▶ Relational knowledge is best represented by graphs
▶ We have fast algorithms for clustering vectors
▶ Task: represent a graph in ℝⁿ
▶ “Graph embeddings” and “distance geometry”: almost synonyms
▶ Used in Natural Language Processing (NLP) to obtain “word vectors” & “concept vectors”
Subsection 2
Complexity
Complexity
▶ DGP₁ with d : E → ℚ₊ is in NP
  ▶ if the instance is YES, ∃ realization x ∈ ℝ^{n×1}
  ▶ if some component x_i ∉ ℚ, translate x so that x_i ∈ ℚ
  ▶ consider some other x_j
  ▶ let ℓ = |shortest path p : i → j| = Σ_{{u,v}∈p} (−1)^{s_uv} d_uv ∈ ℚ for some s_uv ∈ {0, 1}
  ▶ then x_j = x_i ± ℓ → x_j ∈ ℚ
  ▶ ⇒ verification of

      ∀{i, j} ∈ E   |x_i − x_j| = d_ij

    in polytime
▶ DGP_K may not be in NP for K > 1:
  we don’t know how to check ‖x_i − x_j‖₂ = d_ij in polytime for x ∉ ℚ^{nK}
Hardness

Partition is NP-hard:
given a = (a_1, ..., a_n) ∈ ℕⁿ, is there I ⊆ {1, ..., n} s.t. Σ_{i∈I} a_i = Σ_{i∉I} a_i ?

▶ Reduce Partition to DGP₁
▶ a ⟶ cycle C with V(C) = {1, ..., n} and E(C) = {{1, 2}, ..., {n, 1}}
▶ For i < n let d_{i,i+1} = a_i, and d_{n,n+1} = d_{n1} = a_n
▶ E.g. for a = (1, 4, 1, 3, 3), get a cycle graph with these edge weights [figure omitted]
[Saxe, 1979]

Partition is YES ⇒ DGP₁ is YES
▶ Given: I ⊆ {1, ..., n} s.t. Σ_{i∈I} a_i = Σ_{i∉I} a_i
▶ Construct: a realization x of C in ℝ
  1. x_1 = 0                                   // start
  2. induction step: suppose x_i is known
     if i ∈ I let x_{i+1} = x_i + d_{i,i+1}    // go right
     else let x_{i+1} = x_i − d_{i,i+1}        // go left
▶ Correctness proof: by the same induction, but careful when i = n: have to show x_{n+1} = x_1
Partition is YES ⇒ DGP₁ is YES
(1) = Σ_{i∈I} (x_{i+1} − x_i) = Σ_{i∈I} d_{i,i+1} = Σ_{i∈I} a_i
    = Σ_{i∉I} a_i = Σ_{i∉I} d_{i,i+1} = Σ_{i∉I} (x_i − x_{i+1}) = (2)

(1) = (2) ⇒ Σ_{i∈I} (x_{i+1} − x_i) = Σ_{i∉I} (x_i − x_{i+1}) ⇒ Σ_{i≤n} (x_{i+1} − x_i) = 0

⇒ (x_{n+1} − x_n) + (x_n − x_{n−1}) + ··· + (x_3 − x_2) + (x_2 − x_1) = 0

⇒ x_{n+1} = x_1
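The YES direction is constructive and easy to run; a Python sketch (ours), on the slide’s example a = (1, 4, 1, 3, 3) with the 0-indexed partition I = {0, 1, 2} (a_1 + a_2 + a_3 = 6 = a_4 + a_5):

def realize_cycle(a, I):
    """Realize the weighted cycle on a line: step right on I, left otherwise."""
    x = [0.0]                                        # x_1 = 0
    for i, ai in enumerate(a):
        x.append(x[-1] + (ai if i in I else -ai))    # d_{i,i+1} = a_i
    return x

x = realize_cycle((1, 4, 1, 3, 3), I={0, 1, 2})
assert x[-1] == x[0]                                 # cycle closes: x_{n+1} = x_1
print(x)                                             # [0.0, 1.0, 5.0, 6.0, 3.0, 0.0]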
Partition is NO ⇒ DGP₁ is NO

▶ By contradiction: suppose DGP₁ is YES, x a realization of C
▶ F = {{u, v} ∈ E(C) | x_u ≤ x_v},  E(C) ∖ F = {{u, v} ∈ E(C) | x_u > x_v}
▶ Trace x_1, ..., x_n: follow edges in F (→) and in E(C) ∖ F (←); the total displacement around the cycle is zero, so

  Σ_{{u,v}∈F} (x_v − x_u) = Σ_{{u,v}∉F} (x_u − x_v)
  Σ_{{u,v}∈F} |x_u − x_v| = Σ_{{u,v}∉F} |x_u − x_v|
  Σ_{{u,v}∈F} d_uv = Σ_{{u,v}∉F} d_uv

▶ Let J = {i < n | {i, i+1} ∈ F} ∪ {n | {n, 1} ∈ F}

  ⇒ Σ_{i∈J} a_i = Σ_{i∉J} a_i

▶ So J solves the Partition instance: contradiction
▶ ⇒ DGP is NP-hard, DGP₁ is NP-complete
Subsection 3
Number of solutions
Number of solutions
▶ (G, K): DGP instance
▶ X̃ ⊆ ℝ^{Kn}: set of solutions
▶ Congruence: composition of translations, rotations, reflections
▶ C = set of congruences in ℝ^K
▶ x ∼ y means ∃ρ ∈ C (y = ρx): distances in x are preserved in y through ρ
▶ ⇒ if |X̃| > 0, then |X̃| = 2^{ℵ₀}
Number of solutions modulo congruences

▶ Congruence is an equivalence relation ∼ on X̃ (reflexive, symmetric, transitive)
▶ It partitions X̃ into equivalence classes
▶ X = X̃/∼: set of representatives of the equivalence classes
▶ Focus on |X| rather than |X̃|
Rigidity, flexibility and |X|

▶ infeasible ⇔ |X| = 0
▶ rigid graph ⇔ |X| < ℵ₀
▶ globally rigid graph ⇔ |X| = 1
▶ flexible graph ⇔ |X| = 2^{ℵ₀}
▶ |X| = ℵ₀: impossible by Milnor’s theorem
Milnor’s theorem implies |X| ≠ ℵ₀

▶ System S of polynomial equations of degree 2:

  ∀i ≤ m   p_i(x_1, ..., x_{nK}) = 0

▶ Let X be the set of x ∈ ℝ^{nK} satisfying S
▶ The number of connected components of X is O(3^{nK}) [Milnor 1964]
▶ Assume |X| is countable; then G cannot be flexible
  ⇒ each incongruent realization is in a separate component
  ⇒ by Milnor’s theorem, there are finitely many of them
Examples
[figures omitted]
Mathematical Programming formulations
Short introduction to MP 1/3

▶ Formal language to describe optimization problems
▶ Sentences are called formulations:

  [P]   min f(x)
        ∀i ≤ m   g_i(x) ≤ 0
        ∀j ∈ Z   x_j ∈ ℤ

▶ f, g: represented by an expression language
▶ Formulation entities:

  set: Z   param: m   dec. var: x   obj: f(x)   constr: g_i(x) ≤ 0, x_j ∈ ℤ
Short introduction to MP 2/3

▶ Syntax highlights:
  ▶ no explicit open sets: use ≤, ≥, =, not <, >, ≠
  ▶ no decision variables in quantifiers (e.g. no Σ_{i∈I : x_i=1} y_i)
▶ Semantics: assignment of values to decision variables
  ▶ carried out by a solver: a solution algorithm for an MP subclass defined by given mathematical properties
  ▶ requires a formulation taxonomy w.r.t. solvers:
    LP, SOCP, SDP, QP, QCP, QCQP, NLP, MILP, MIQP, MIQCP, MIQCQP, MINLP,
    c(MI)QP, c(MI)QCP, c(MI)QCQP, c(MI)NLP, BLP, BQP, MOP, BLevP, SIP, ...
▶ MP shifts the focus from algorithmics to modelling
Short introduction to MP 3/3

▶ Solvers can be: global or local; exact, approximate or heuristic; (worst-case) polytime or exponential; fast or slow
▶ Examples:
  ▶ LP: simplex alg., interior point method (IPM); CPLEX, GuRoBi, XPressMP, GLPK, CLP; global, exact, fast, IPM is polytime
  ▶ SOCP, SDP: IPM; Mosek, SCS; global, ε-approx., fast, polytime
  ▶ QP, QCP, QCQP, NLP: Succ. Quadr. Prog. (SQP), IPM; Snopt, IPOPT; local, heuristic (ε-approx. if cvx), fast
  ▶ BLP, MILP: Cutting plane (CP), Branch-and-Bound (BB); CPLEX, XPressMP, GLPK; global, exact, slow, exponential
  ▶ cMIQP, cMIQCP, cMIQCQP, cMINLP: BB, Outer Approx. (OA), CP; GuRoBi, BonMin, alphaECP; global, ε-approx., slow, exponential
  ▶ MIQP, MIQCP, MIQCQP, MINLP: spatial BB (sBB); Baron, GuRoBi, Couenne, Antigone; global, ε-approx., slow, exponential
Subsection 1
Formulations in position variables
QCP and QCQP
The original DGP system is a QCP:

  ∀{u, v} ∈ E   ‖x_u − x_v‖² = d²_uv      (4)

Computationally: useless

Reformulate using slack variables, get a QCQP:

  min_{x,s} Σ_{{u,v}∈E} s²_uv
  ∀{u, v} ∈ E   ‖x_u − x_v‖² = d²_uv + s_uv      (5)
Unconstrained NLP

  min_{x∈ℝ^{nK}} Σ_{{u,v}∈E} (‖x_u − x_v‖² − d²_uv)²      (6)

A sum of squares, but nonconvex in x (a quartic polynomial)

The globally optimal obj. fun. value of (6) is 0 iff x solves (4)

Computational experiments in [Liberti et al., 2006]:
▶ solvers from 15 years ago
▶ randomly generated protein data: ≤ 50 atoms
▶ cubic crystallographic grids: ≤ 64 atoms
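Formulation (6) can be handed to any local NLP solver; a scipy sketch (ours, not the solvers of the 2006 experiments):

import numpy as np
from scipy.optimize import minimize

def dgp_local(edges, n, K, seed=0):
    """Local minimization of formulation (6) from a random starting point."""
    rng = np.random.default_rng(seed)

    def stress(xflat):
        x = xflat.reshape(n, K)
        return sum((np.sum((x[u] - x[v]) ** 2) - d ** 2) ** 2
                   for (u, v), d in edges.items())

    return minimize(stress, rng.standard_normal(n * K)).x.reshape(n, K)

# unit 4-cycle plus one diagonal, realizable in the plane
edges = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1, (0, 2): 2 ** 0.5}
x = dgp_local(edges, n=4, K=2)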
“Push-and-pull” QCQP

▶ min_{x∈ℝ^{nK}} Σ_{{u,v}∈E} | ‖x_u − x_v‖² − d²_uv | exactly reformulates (4)
▶ Relax the objective f to its concave part, remove the constant term, rewrite min −f as max f
▶ Reformulate the convex part of the obj. fun. to convex constraints
▶ Exact reformulation:

  max_x Σ_{{u,v}∈E} ‖x_u − x_v‖²
  ∀{u, v} ∈ E   ‖x_u − x_v‖²₂ ≤ d²_uv      (7)

The objective pulls points apart, the constraints push them together
Thm. At a global optimum x* of a YES instance, all constraints of (7) are active
Multiplicative Weights Update (MWU) alg.

  [QCQP]   max_x Σ_{{u,v}∈E} ‖x_u − x_v‖²
           ∀{u, v} ∈ E   ‖x_u − x_v‖²₂ ≤ d²_uv

1. Write ‖x_u − x_v‖²₂ = ⟨(x_u − x_v), (x_u − x_v)⟩ and replace the first factor by weights w_uv ∈ ℝ^K, get the cNLP

     min { Σ_{{u,v}∈E} ⟨w_uv, (x_u − x_v)⟩  |  ∀{u, v} ∈ E  ‖x_u − x_v‖²₂ ≤ d²_uv }

   solve efficiently, get x′ with EDM D′
2. ∀{u, v} ∈ E define ψ_uv ≜ |D′_uv − d_uv| / max(D′_uv, d_uv), the relative error between D′ and d
3. ∀{u, v} ∈ E update w_uv ← w_uv (1 − η ψ_uv), where η is a constant in (0, 1)
4. Repeat for T iterations, record the best solution
MWU relative guarantee

▶ Index quantities by iteration t: (x′)^t, w^t, ψ^t
▶ Get the distribution ρ^t_uv = w^t_uv / ⟨1, w^t⟩
▶ Σ_{{u,v}∈E} ψ^t_uv ρ^t_uv = ⟨ψ^t, ρ^t⟩: the weighted relative error between D′ and d at iteration t
▶ Can prove that

  min_{t≤T} ⟨ψ^t, ρ^t⟩ ≤ (1/T) ( 2 ln m + (3/2) min_{{u,v}∈E} Σ_{t≤T} ψ^t_uv )

▶ i.e. a relative upper bound to the error of the MWU output in terms of the best cumulative distance error
[D’Ambrosio et al., DCG 17, Mencarelli et al. EJCO 17]
Subsection 2
Formulations in matrix variables
Linearization
Replace nonlinear terms with additional variables.
Back to the original system: ∀{i, j} ∈ E  ‖x_i − x_j‖²₂ = d²_ij

  ⇒ ∀{i, j} ∈ E   ‖x_i‖²₂ + ‖x_j‖²₂ − 2 x_i · x_j = d²_ij

  ⇒ ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij,  where X = x x^⊤

Note: X = x x^⊤ ⇔ [∀i, j ≤ n  X_ij = x_i x_j], i.e.

  ⎛X_11 X_12 ··· X_1n⎞   ⎛x_1²    x_1x_2 ··· x_1x_n⎞
  ⎜X_12 X_22 ··· X_2n⎟ = ⎜x_1x_2  x_2²   ··· x_2x_n⎟
  ⎜ ⋮    ⋮    ⋱    ⋮ ⎟   ⎜ ⋮       ⋮     ⋱     ⋮  ⎟
  ⎝X_1n X_2n ··· X_nn⎠   ⎝x_nx_1  x_nx_2 ··· x_n²  ⎠

which implies that X must have rank ≤ 1 (here for a single column x; for an n × K matrix x, rank(X) ≤ K)
Relaxation

Replace the nonconvex set with a convex relaxation:

  X = x x^⊤
  ⇒ X − x x^⊤ = 0              (all eigs. = 0)
  ⇒ (relax) X − x x^⊤ ⪰ 0      (all eigs. ≥ 0)

  ⇒ Schur(X, x) ≜ ⎛ I_K  x^⊤ ⎞ ⪰ 0
                  ⎝ x    X   ⎠

▶ If x does not appear elsewhere in the formulation, it can be eliminated (e.g. choose x = 0)
▶ ⇒ Replace Schur(X, x) ⪰ 0 by X ⪰ 0
SDP relaxation

  min F • X
  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij
  X ⪰ 0

How do we choose F?  (recall F • X = tr(F^⊤ X))
Some possible objective functions

▶ For protein conformation:

    min Σ_{{i,j}∈E} (X_ii + X_jj − 2 X_ij)

  with = changed to ≥ in the constraints (or max and ≤):
  the SDP relaxation of the “push-and-pull” QCQP
▶ [Ye, 2003], application to wireless sensor localization:

    min tr(X)

  tr(X) = tr(P Λ P⁻¹) = tr(P P⁻¹ Λ) = tr(Λ) = Σ_i λ_i ⇒ hope to minimize the rank
▶ How about “just random”? [Barvinok, DCG 1995]
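For concreteness, the SDP relaxation with the push-and-pull objective (and its relaxed ≥ constraints) is a few lines in cvxpy; a sketch assuming cvxpy with an SDP-capable solver (e.g. SCS) is installed, dgp_sdp being our name:

import cvxpy as cp

def dgp_sdp(edges, n):
    """SDP relaxation of the DGP, push-and-pull objective, >= constraints."""
    X = cp.Variable((n, n), PSD=True)
    cons = [X[i, i] + X[j, j] - 2 * X[i, j] >= d ** 2
            for (i, j), d in edges.items()]
    obj = cp.Minimize(sum(X[i, i] + X[j, j] - 2 * X[i, j] for (i, j) in edges))
    cp.Problem(obj, cons).solve()
    return X.value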
How do you choose?

For want of a better criterion... TEST!

▶ Download protein files from the Protein Data Bank (PDB): they contain atom realizations
▶ Mimick a Nuclear Magnetic Resonance experiment: keep only pairwise distances < 5.5
▶ Try to reconstruct the protein shape from those weighted graphs
▶ Quality evaluation of the results:
  ▶ LDE(x) = max_{{i,j}∈E} | ‖x_i − x_j‖ − d_ij |
  ▶ MDE(x) = (1/|E|) Σ_{{i,j}∈E} | ‖x_i − x_j‖ − d_ij |
Empirical choice

▶ Ye: very fast but often imprecise
▶ Random: good but nondeterministic
▶ Push-and-pull: can relax X_ii + X_jj − 2 X_ij = d²_ij to X_ii + X_jj − 2 X_ij ≥ d²_ij:
  easier to satisfy feasibility, useful later on
▶ Heuristic: add +η tr(X) to the objective, with η ≪ 1: might help minimize the solution rank

  min Σ_{{i,j}∈E} (X_ii + X_jj − 2 X_ij) + η tr(X)
Subsection 3
Diagonal dominance
When SDP solvers hit their size limit
▶ SDP solver: a technological bottleneck
▶ Can we use an LP solver instead?
▶ Diagonally Dominant (DD) matrices are PSD
▶ Not vice versa: DD matrices inner-approximate the PSD cone
▶ Idea by A.A. Ahmadi [Ahmadi & Hall 2015]
Diagonally dominant matrices

An n × n symmetric matrix X is DD if

  ∀i ≤ n   X_ii ≥ Σ_{j≠i} |X_ij|.

E.g.
  ⎛ 1     0.1  −0.2   0     0.04  0   ⎞
  ⎜ 0.1   1    −0.05  0.1   0     0   ⎟
  ⎜−0.2  −0.05  1     0.1   0.01  0   ⎟
  ⎜ 0     0.1   0.1   1     0.2   0.3 ⎟
  ⎜ 0.04  0     0.01  0.2   1    −0.3 ⎟
  ⎝ 0     0     0     0.3  −0.3   1   ⎠
Gershgorin’s circle theorem

▶ Let A be symmetric n × n
▶ ∀i ≤ n let R_i = Σ_{j≠i} |A_ij| and I_i = [A_ii − R_i, A_ii + R_i]
▶ Then for every eigenvalue λ of A there is i ≤ n s.t. λ ∈ I_i

Proof
▶ Let λ be an eigenvalue of A with eigenvector x
▶ Normalize x s.t. ∃i ≤ n with x_i = 1 and ∀j ≠ i |x_j| ≤ 1
  (let i = argmax_j |x_j|, divide x by sgn(x_i)|x_i|)
▶ Ax = λx ⇒ Σ_{j≠i} A_ij x_j + A_ii x_i = Σ_{j≠i} A_ij x_j + A_ii = λ x_i = λ
▶ Hence Σ_{j≠i} A_ij x_j = λ − A_ii
▶ Triangle inequality and |x_j| ≤ 1 for all j ≠ i ⇒

  |λ − A_ii| = |Σ_{j≠i} A_ij x_j| ≤ Σ_{j≠i} |A_ij| |x_j| ≤ Σ_{j≠i} |A_ij| = R_i,

  hence λ ∈ I_i
DD ⇒ PSD

▶ Assume A is DD, λ an eigenvalue of A
▶ ⇒ ∀i ≤ n   A_ii ≥ Σ_{j≠i} |A_ij| = R_i
▶ ⇒ ∀i ≤ n   A_ii − R_i ≥ 0
▶ By Gershgorin’s circle theorem, λ ≥ 0
▶ ⇒ A is PSD
DD Linearization

  ∀i ≤ n   X_ii ≥ Σ_{j≠i} |X_ij|      (∗)

▶ Linearize |·| by an additional matrix variable T ⇒ write |X| as T
▶ ⇒ (∗) becomes

  X_ii ≥ Σ_{j≠i} T_ij

▶ Add “sandwich” constraints −T ≤ X ≤ T
▶ Can easily prove (∗) holds in case X ≥ 0 or X ≤ 0:

  X_ii ≥ Σ_{j≠i} T_ij ≥ Σ_{j≠i} X_ij
  X_ii ≥ Σ_{j≠i} T_ij ≥ Σ_{j≠i} −X_ij
DD Programming (DDP)

  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij
  X is DD

  ⇒

  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij
  ∀i ≤ n   Σ_{j≤n, j≠i} T_ij ≤ X_ii
  −T ≤ X ≤ T
The issue with inner approximations
DDP could be infeasible!
Exploit push-and-pull

▶ Enlarge the feasible region
▶ From

  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij

▶ Relax to “pull” constraints

  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij ≥ d²_ij

▶ Then use the “push” objective

  min Σ_{{i,j}∈E} (X_ii + X_jj − 2 X_ij)
Hope to achieve LP feasibility
[figure omitted]
DDP formulation for the DGP

  min Σ_{{i,j}∈E} (X_ii + X_jj − 2 X_ij)
  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij ≥ d²_ij
  ∀i ≤ n   Σ_{j≤n, j≠i} T_ij ≤ X_ii
  −T ≤ X ≤ T
  T ≥ 0
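This is a pure LP; a cvxpy sketch (ours; any LP solver works, which is the point):

import cvxpy as cp

def dgp_ddp(edges, n):
    """LP (DDP) inner approximation of the SDP relaxation of the DGP."""
    X = cp.Variable((n, n), symmetric=True)
    T = cp.Variable((n, n), symmetric=True)
    cons = [X[i, i] + X[j, j] - 2 * X[i, j] >= d ** 2
            for (i, j), d in edges.items()]
    cons += [cp.sum(T[i, :]) - T[i, i] <= X[i, i] for i in range(n)]  # DD rows
    cons += [-T <= X, X <= T, T >= 0]                                 # |X| <= T
    obj = cp.Minimize(sum(X[i, i] + X[j, j] - 2 * X[i, j] for (i, j) in edges))
    cp.Problem(obj, cons).solve()
    return X.value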
Subsection 4
Dual DD
Cones

▶ A set C is a cone if: ∀A, B ∈ C, α, β ≥ 0   αA + βB ∈ C
▶ If C is a cone, the dual cone is

  C* = {y | ∀x ∈ C  ⟨x, y⟩ ≥ 0}

  (vectors making acute angles with all elements of C)
▶ If C ⊂ Sⁿ (the set of n × n symmetric matrices),

  C* = {Y | ∀X ∈ C  (Y • X ≥ 0)}

▶ An n × n matrix cone C is finitely generated by X ⊂ ℝⁿ if

  X = {x_1, ..., x_p}  ∧  ∀X ∈ C  ∃δ ∈ ℝ^p₊   X = Σ_{ℓ≤p} δ_ℓ x_ℓ x_ℓ^⊤
Representations of DD

▶ Consider E_ii, E⁺_ij, E⁻_ij in Sⁿ:
  define E₀ = {E_ii | i ≤ n}, E₁ = {E±_ij | i < j}, E = E₀ ∪ E₁
▶ E_ii = diag(0, ..., 0, 1_i, 0, ..., 0)
▶ E⁺_ij has the 2 × 2 minor ( 1_ii  1_ij ; 1_ji  1_jj ), 0 elsewhere
▶ E⁻_ij has the 2 × 2 minor ( 1_ii  −1_ij ; −1_ji  1_jj ), 0 elsewhere
▶ Thm. DD = cone generated by E [Barker & Carlson 1975]
  Pf. The rays in E are extreme, and all DD matrices are generated by E
▶ Cor. DD is finitely generated by X_DD = {e_i | i ≤ n} ∪ {(e_i ± e_j) | i < j ≤ n}
  Pf. Verify E_ii = e_i e_i^⊤ and E±_ij = (e_i ± e_j)(e_i ± e_j)^⊤, where e_i is the i-th standard basis element of ℝⁿ
Finitely generated dual cone representation

Thm. If C is finitely generated by X, then C* = {Y ∈ Sⁿ | ∀x ∈ X (Y • x x^⊤ ≥ 0)}
(recall C* ≜ {Y ∈ Sⁿ | ∀X ∈ C  Y • X ≥ 0})

▶ (⊇) Let Y s.t. ∀x ∈ X (Y • x x^⊤ ≥ 0)
  ▶ ∀X ∈ C, X = Σ_{x∈X} δ_x x x^⊤ (by fin. gen.)
  ▶ hence Y • X = Σ_x δ_x Y • x x^⊤ ≥ 0 (by defn. of Y)
  ▶ whence Y ∈ C* (by defn. of C*)
▶ (⊆) Suppose Z ∈ C* ∖ {Y | ∀x ∈ X (Y • x x^⊤ ≥ 0)}
  ▶ then ∃X′ ⊂ X s.t. ∀x ∈ X′ (Z • x x^⊤ < 0)
  ▶ consider any Y = Σ_{x∈X′} δ_x x x^⊤ ∈ C with δ ≥ 0
  ▶ then Z • Y = Σ_{x∈X′} δ_x Z • x x^⊤ < 0, so Z ∉ C*
  ▶ contradiction ⇒ C* = {Y | ∀x ∈ X (Y • x x^⊤ ≥ 0)}
Dual cone constraints

▶ Remark: X • v v^⊤ = v^⊤ X v
▶ Use the finitely generated dual cone theorem
▶ Decision variable matrix X
▶ Constraints: ∀v ∈ X   v^⊤ X v ≥ 0
▶ Cor. PSD ⊂ DD*
  Pf. X ∈ PSD iff ∀v ∈ ℝⁿ  v^⊤ X v ≥ 0, so certainly valid ∀v ∈ X
▶ If |X| is polysized, we get a compact formulation; otherwise use column generation
▶ |X_DD| = |E| = O(n²)
Dual cone DDP formulation for the DGP

  min Σ_{{i,j}∈E} (X_ii + X_jj − 2 X_ij)
  ∀{i, j} ∈ E   X_ii + X_jj − 2 X_ij = d²_ij
  ∀v ∈ X_DD   v^⊤ X v ≥ 0

▶ v^⊤ X v ≥ 0 for v ∈ X_DD is equivalent to:

  ∀i ≤ n        X_ii ≥ 0
  ∀{i, j} ∉ E   X_ii + X_jj − 2 X_ij ≥ 0
  ∀i < j        X_ii + X_jj + 2 X_ij ≥ 0

Note: we went back to equality “pull” constraints

The quantifier ∀{i, j} ∉ E should be ∀i < j, but we already have those constraints for ∀{i, j} ∈ E
Properties of the Dual DDP formulation

▶ The SDP relaxes the original problem
▶ DualDDP relaxes the SDP, hence also relaxes the original problem
▶ Yields extremely tight obj. fun. bounds w.r.t. the SDP
▶ Solutions may have large negative rank
Subsection 5
Dimensional reduction
Retrieving realizations in ℝ^K
▶ SDP/DDP yield an n × n PSD matrix X*
▶ We need an n × K realization matrix x*
▶ Recall PSD ⇔ Gram
▶ Apply PCA to X*, keep the K largest components, get x′
▶ This yields solutions with errors
▶ Refinement: use x′ as a starting point for a local NLP solver on the QCQP
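Both steps in one numpy/scipy sketch (ours): PCA on the PSD solution matrix, then local refinement of the quartic stress of formulation (6):

import numpy as np
from scipy.optimize import minimize

def retrieve(Xstar, K, edges):
    """n x K realization from an n x n PSD matrix, with local refinement."""
    lam, P = np.linalg.eigh(Xstar)
    lam = np.maximum(lam, 0)
    x0 = P[:, -K:] @ np.diag(np.sqrt(lam[-K:]))    # PCA starting point
    n = len(Xstar)

    def stress(xflat):
        x = xflat.reshape(n, K)
        return sum((np.sum((x[u] - x[v]) ** 2) - d ** 2) ** 2
                   for (u, v), d in edges.items())

    return minimize(stress, x0.ravel()).x.reshape(n, K)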
Subsection 6
Barvinok’s Naive Algorithm
Concentration of measure
From [Barvinok, 1997]:

  “The value of a ‘well behaved’ function at a random point of a ‘big’ probability space X is ‘very close’ to the mean value of the function.”

and

  “In a sense, measure concentration can be considered as an extension of the law of large numbers.”

Concentration of measure

Given a Lipschitz function f : X → ℝ, i.e. s.t.

  ∀x, y ∈ X   |f(x) − f(y)| ≤ L ‖x − y‖₂

for some L ≥ 0, there is concentration of measure if ∃ constants c, C s.t.

  ∀ε > 0   P_x(|f(x) − E(f)| > ε) ≤ c e^{−Cε²/L²}

where E(·) is taken w.r.t. a given Borel measure μ over X

≡ “discrepancy from the mean is unlikely”
Barvinok’s theorem

Consider:
▶ for each k ≤ m, the manifolds X_k = {x ∈ ℝⁿ | x^⊤ Q^k x = a_k}, where m ≤ poly(n)
▶ the feasibility problem F ≡ (∩_{k≤m} X_k ≠ ∅)?
▶ the SDP relaxation ∀k ≤ m (Q^k • X = a_k) ∧ X ⪰ 0, with solution X̄
▶ the algorithm: T ← factor(X̄);  y ∼ Nⁿ(0, 1);  x′ ← T y

Then ∃c > 0, n₀ ∈ ℕ such that ∀n ≥ n₀

  Prob( ∀k ≤ m  dist(x′, X_k) ≤ c √(‖X̄‖₂ ln n) ) ≥ 0.9.
Algorithmic application

▶ x′ is “close” to each X_k: try local descent from x′
▶ ⇒ a feasible QP solution from an SDP relaxation
Elements of Barvinok’s formula

  Prob( ∀k ≤ m  dist(x′, X_k) ≤ c √(‖X̄‖₂ ln n) ) ≥ 0.9.

▶ √‖X̄‖₂ arises from T (a factor of X̄)
▶ √(ln n) arises from concentration of measure
▶ 0.9 follows by adjusting parameter values in the “union bound”
Application to the DGP

▶ ∀{i, j} ∈ E   X_ij = {x | ‖x_i − x_j‖²₂ = d²_ij}
▶ The DGP can be written as (∩_{{i,j}∈E} X_ij ≠ ∅)?
▶ SDP relaxation: X_ii + X_jj − 2 X_ij = d²_ij ∧ X ⪰ 0, with solution X̄
▶ Difference with Barvinok: we want x ∈ ℝ^{Kn}, i.e. rk(X̄) ≤ K
▶ IDEA: sample y ∼ N^{nK}(0, 1/√K)
▶ Thm. Barvinok’s theorem works in rank K
Proof structure

▶ Show that, on average, ∀k ≤ m  tr((Ty)^⊤ Q^k (Ty)) = Q^k • X̄ = a_k
  ▶ compute multivariate integrals
  ▶ bilinear terms disappear because y is normally distributed
  ▶ decompose the multivariate integral into a sum of univariate integrals
▶ Exploit concentration of measure to show that errors happen rarely
  ▶ a couple of technical lemmata yielding bounds
  ▶ ⇒ bound the Gaussian measure μ of ε-neighbourhoods of
      A_i⁻ = {y ∈ ℝ^{n×K} | Q^i(Ty) ≤ Q^i • X̄}
      A_i⁺ = {y ∈ ℝ^{n×K} | Q^i(Ty) ≥ Q^i • X̄}
      A_i  = {y ∈ ℝ^{n×K} | Q^i(Ty) = Q^i • X̄}
  ▶ use the “union bound” for the measure of A_i⁻(ε) ∩ A_i⁺(ε)
  ▶ show A_i⁻(ε) ∩ A_i⁺(ε) = A_i(ε)
  ▶ use the “union bound” to measure intersections of the A_i(ε)
  ▶ appropriate values for some parameters ⇒ result
The heuristic

1. Solve the SDP relaxation of the DGP, get solution X̄
   (use DDP+LP if SDP+IPM is too slow)
2. a. T = factor(X̄)
   b. y ∼ N^{nK}(0, 1/√K)
   c. x′ = T y
3. Use x′ as a starting point for a local NLP solver on the formulation

     min_x Σ_{{i,j}∈E} (‖x_i − x_j‖² − d²_ij)²

   and return the improved solution x
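Step 2 of the heuristic as a numpy sketch (ours; T is taken as the eigendecomposition factor of X̄, and the output feeds the local refinement of the previous sketch):

import numpy as np

def barvinok_start(Xbar, K, seed=0):
    """x' = T y with y ~ N^{nK}(0, 1/sqrt(K)), T a factor of Xbar."""
    lam, P = np.linalg.eigh(Xbar)
    T = P @ np.diag(np.sqrt(np.maximum(lam, 0)))   # factor(Xbar)
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, 1.0 / np.sqrt(K), size=(len(Xbar), K))
    return T @ y                                   # n x K starting point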
MP formulations for the DGP: Summary

▶ Try nonconvex formulations in position variables
▶ Quadratic nonconvex too difficult?
▶ Solve the SDP relaxation
▶ SDP relaxation too large?
▶ Solve the DDP approximation
▶ Get an n × n matrix solution, but we need K × n
▶ Use PCA or Barvinok’s alg., and refine with a local NLP solver
High-dimensional weirdness
Subsection 1
Random projections
The gist
▶ Let A be an m × n data matrix (columns are data points in ℝ^m, m ≫ 1)
▶ T short & fat, normally sampled componentwise:
  T is k × m with k ≪ m, so TA is a small k × n matrix with the same number of columns as A
▶ Then ∀i < j   ‖A_i − A_j‖₂ ≈ ‖TA_i − TA_j‖₂  “wahp”

Some more dimensional reduction!
wahp

“wahp” = “with arbitrarily high probability”:
the probability of E_k (depending on some parameter k) approaches 1 “exponentially fast” as k increases:

  P(E_k) ≈ 1 − O(e^{−k})
Johnson-Lindenstrauss Lemma

Thm.
Given A ⊆ ℝ^m with |A| = n and ε > 0, there is k ∼ O((1/ε²) ln n) and a k × m matrix T s.t.

  ∀x, y ∈ A   (1 − ε)‖x − y‖ ≤ ‖Tx − Ty‖ ≤ (1 + ε)‖x − y‖

If the k × m matrix T is sampled componentwise from N(0, 1/√k), then
P(A and TA are approximately congruent) ≥ 1/n;
the result follows by the probabilistic method
In practice

▶ P(A and TA are approximately congruent) ≥ 1/n
▶ re-sampling sufficiently many times gives wahp
▶ Empirically, sample T very few times (e.g. once will do!):
  E_T(‖Tx − Ty‖) = ‖x − y‖, and the probability of error decreases exponentially fast

Surprising fact: k is independent of the original number of dimensions m
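A JL projection in numpy (a sketch; the constant 2 in the choice of k is one common option, not prescribed by the slides):

import numpy as np

def jl_project(A, eps, seed=0):
    """Randomly project the columns of A into k = O(ln(n)/eps^2) dimensions."""
    m, n = A.shape
    k = int(np.ceil(2 * np.log(n) / eps ** 2))
    rng = np.random.default_rng(seed)
    T = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, m))   # N(0, 1/sqrt(k)) entries
    return T @ A

# pairwise column distances are preserved up to ~eps, wahp
A = np.random.default_rng(1).standard_normal((10_000, 50))
TA = jl_project(A, eps=0.1)
print(np.linalg.norm(A[:, 0] - A[:, 1]), np.linalg.norm(TA[:, 0] - TA[:, 1]))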
Clustering Google images
[L. & Lavor, 2017]
Clustering without random projections
VHimg = Map[Flatten[ImageData[#]] &, Himg];
VHcl = Timing[ClusteringComponents[VHimg, 3, 1]]

Out[29]= {0.405908, {1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3}}
Too slow!
Clustering with random projections
Get["Projection.m"];
VKimg = JohnsonLindenstrauss[VHimg, 0.1];
VKcl = Timing[ClusteringComponents[VKimg, 3, 1]]

Out[34]= {0.002232, {1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3}}
From 0.405s CPU time to 0.002232s. Same clustering!
Approximating the identity
▶ Let n ∈ ℕ and ε ∈ ℝ₊ with n ≫ 1 and ε ∈ (0, 1/2)
▶ Let T be a k × n Random Projection (RP) matrix with k = O(ε⁻² ln n)
▶ Look at the relations between T T^⊤ and I_k, and between T^⊤ T and I_n
▶ By [Zhang et al., COLT 13], ∃C ≥ 1/4 s.t. ∀δ, n ≥ C ε⁻² (k + 1) ln(2k/δ) ...