Stochastic Optimization
Marc Schoenauer
DataScience @ LHC Workshop 2015

Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion

Stochastic Optimization
Hypotheses: search space Ω with some topological structure; objective function F, assumed weakly regular.
Hill-Climbing

    Randomly draw x0 ∈ Ω and compute F(x0)                  Initialisation
    Until(happy)
        y = Best_neighbor(xt)                               neighbor structure on Ω
        Compute F(y)
        If F(y) < F(xt) then xt+1 = y                       accept if improvement (minimization)
        else xt+1 = xt

Comments: finds the closest local optimum, as defined by the neighborhood structure. Iterate from different x0's, Until(very happy).
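The template above can be sketched in a few lines; the bit-counting toy problem and helper names below are illustrative assumptions, not from the talk:

```python
import random

def hill_climb(f, x0, neighbors, max_iters=10_000):
    """Greedy hill-climbing: move to the best neighbor while it improves f (minimization)."""
    x, fx = x0, f(x0)
    for _ in range(max_iters):
        y = min(neighbors(x), key=f)      # Best_neighbor(xt)
        fy = f(y)
        if fy < fx:                       # accept only if improvement
            x, fx = y, fy
        else:
            return x, fx                  # no improving neighbor: local optimum
    return x, fx

# Toy problem (an assumption): minimize the number of 1-bits,
# with the usual 1-bit-flip neighborhood.
def flip_neighbors(bits):
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]

random.seed(0)
x0 = tuple(random.randint(0, 1) for _ in range(20))
best, fbest = hill_climb(sum, x0, flip_neighbors)
```

Restarting from different x0's ("Until(very happy)") is then a loop around `hill_climb`.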
Stochastic (Local) Search

    Randomly draw x0 ∈ Ω and compute F(x0)                  Initialisation
    Until(happy)
        y = Random_neighbor(xt)                             neighbor structure on Ω
        Compute F(y)
        If F(y) < F(xt) then xt+1 = y                       accept if improvement
        else xt+1 = xt

Comments: finds one close local optimum, defined by the neighborhood structure. Iterate, leaving the current optimum: Iterated Local Search.
Stochastic (Local) Search – alternative viewpoint

    Randomly draw x0 ∈ Ω and compute F(x0)                  Initialisation
    Until(happy)
        y = Move(xt)                                        stochastic variation on Ω
        Compute F(y)
        If F(y) < F(xt) then xt+1 = y                       accept if improvement
        else xt+1 = xt

Comments: finds one close local optimum, defined by the Move operator. Iterate, leaving the current optimum: Iterated Local Search.
Stochastic Search aka Metaheuristics

    Randomly draw x0 ∈ Ω and compute F(x0)                  Initialisation
    Until(happy)
        y = Move(xt)                                        stochastic variation on Ω
        Compute F(y)
        xt+1 = Select(xt, y)                                stochastic selection, biased toward better points

Comments: can escape local optima, e.g., with Select = Boltzmann selection. Iterate?
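Boltzmann selection can be made concrete with a Metropolis-style acceptance rule, as in simulated annealing: a worse point is still accepted with probability exp(−ΔF/T). A minimal sketch; the 1-D multimodal test function, the move operator, and the cooling schedule are illustrative assumptions:

```python
import math
import random

def metropolis_search(f, x0, move, temperature, n_steps=20_000):
    """Stochastic search with Boltzmann (Metropolis) selection (minimization):
    a worse point y is accepted with probability exp(-(F(y) - F(x)) / T),
    which lets the search escape local optima."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for t in range(n_steps):
        y = move(x)
        fy = f(y)
        if fy < fx or random.random() < math.exp(-(fy - fx) / temperature(t)):
            x, fx = y, fy                 # Select(xt, y), biased toward better points
        if fx < fbest:
            best, fbest = x, fx           # keep the best point seen so far
    return best, fbest

# Toy function (an assumption): several local minima, negative near the optimum.
random.seed(1)
f = lambda x: x * x + 4 * math.sin(3 * x)
move = lambda x: x + random.gauss(0.0, 0.3)
best, fbest = metropolis_search(f, 2.0, move, lambda t: 1.0 / (1 + 0.01 * t))
```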
Population-based Metaheuristics

    Randomly draw x0^i ∈ Ω and compute F(x0^i), i = 1, ..., µ                Initialisation
    Until(happy)
        y^i = Move(xt^1, ..., xt^µ), i = 1, ..., λ                           stochastic variations on Ω
        Compute F(y^i), i = 1, ..., λ
        (xt+1^1, ..., xt+1^µ) = Select(xt^1, ..., xt^µ, y^1, ..., y^λ)       (µ + λ) selection
                              = Select(y^1, ..., y^λ)                        (µ, λ) selection

Evolutionary Metaphor: 'natural' selection and 'blind' variations
    y^i = Move_m(xt^i) for some i                            mutation
    y^i = Move_c(xt^i, xt^j) for some (i, j)                 crossover

Metaheuristics: an Alternative Viewpoint
Population and variation operators define a distribution on Ω −→ directly evolve a parameterized distribution P (θ)
Distribution-based Metaheuristics
Initialize distribution parameters θ0 Until(happy)
Sample distribution P(θt) → y1, ..., yλ ∈ Ω
Evaluate y1, ..., yλ on F
Update parameters θt+1 ← Fθ(θt, y1, ..., yλ, F(y1), ..., F(yλ))          includes selection
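One classic instantiation of this template is a univariate Estimation of Distribution Algorithm (UMDA/PBIL-style) on bitstrings: θ is a vector of per-bit probabilities, refitted from the selected samples each generation. A minimal sketch; the OneMax toy problem and all parameter values are illustrative assumptions:

```python
import random

def umda(f, n_bits, lam=50, mu=10, n_gens=60, seed=0):
    """Univariate EDA following the distribution-based template (maximization):
    sample lambda points from per-bit probabilities p (= theta),
    select the mu best, refit p from the selected samples."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                                   # theta_0
    for _ in range(n_gens):
        pop = [tuple(int(rng.random() < pi) for pi in p) for _ in range(lam)]
        pop.sort(key=f, reverse=True)                    # evaluate and rank
        elite = pop[:mu]                                 # selection
        p = [min(0.95, max(0.05, sum(x[i] for x in elite) / mu))
             for i in range(n_bits)]                     # update theta (clamped)
    return max(pop, key=f)

# Toy problem (an assumption): OneMax, maximize the number of 1-bits.
best = umda(sum, n_bits=20)
```

The clamping of p to [0.05, 0.95] is a common safeguard against premature fixation of a bit.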
Covers:
    Estimation of Distribution Algorithms
    Population-based Metaheuristics: Evolutionary Algorithms, Particle Swarm Optimization, Differential Evolution, ...
    Deterministic algorithms

This is (mostly) Experimental Science
Theory lagging behind practice, though rapidly catching up in recent years          not discussed here
Results need to be experimentally validated
Experiments
    Validation relies on grounded statistical validation          CPU costly
    Informed (meta)parameter setting/tuning                       CPU costly ++
    Reproducibility                                               Open Science

From Solution Space to Search Space
Choices
    Objective function
    Search space: representation and move operators / parameterized distribution
Approximation bias vs optimization error
This talk
    Non-convex, non-smooth objective: parametric continuous optimization → rank-based ML, seeded identification of influencers
    Search spaces and crossover operators: the TSP
    Non-parametric representations: exploration vs optimization

Content
1 Setting the scene
2 Continuous Optimization The Evolution Strategy The CMA-ES (Covariance Matrix Adaptation Evolution Strategy) The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion

Stochastic Optimization Templates
Population-based Metaheuristics

    Randomly draw x0^i ∈ Ω and compute F(x0^i), i = 1, ..., µ                Initialisation
    Until(happy)
        y^i = Move(xt^1, ..., xt^µ), i = 1, ..., λ                           stochastic variations on Ω
        Compute F(y^i), i = 1, ..., λ
        (xt+1^1, ..., xt+1^µ) = Select(xt^1, ..., xt^µ, y^1, ..., y^λ)       (µ + λ) selection
                              = Select(y^1, ..., y^λ)                        (µ, λ) selection

Distribution-based Metaheuristics

    Initialize distribution parameters θ0
    Until(happy)
        Sample distribution P(θt) → y1, ..., yλ ∈ Ω
        Evaluate y1, ..., yλ on F
        Update parameters θt+1 ← Fθ(θt, y1, ..., yλ, F(y1), ..., F(yλ))

The (µ, λ)-Evolution Strategy
Ω = R^n; P(θ): normal distributions. Initialize distribution parameters m, σ, C; set population size λ ∈ N.

While not terminate
    1. Sample distribution N(m, σ²C) → y1, ..., yλ ∈ R^n
           yi ∼ m + σ Ni(0, C)  for i = 1, ..., λ
    2. Evaluate y1, ..., yλ on F
           Compute F(y1), ..., F(yλ)
    3. Select the µ best samples y1, ..., yµ
    4. Update distribution parameters
           m, σ, C ← F(m, σ, C, y1, ..., yµ, F(y1), ..., F(yµ))

The (µ, λ)-Evolution Strategy (2)
Gaussian Mutations
    the mean vector m ∈ R^n is the current solution
    the so-called step-size σ ∈ R+ controls the step length
    the covariance matrix C ∈ R^(n×n) determines the shape of the distribution ellipsoid
How to update m, σ, and C?
    m = (1/µ) Σ_{i=1..µ} x_{i:λ}          "crossover" of the best µ individuals
    Adaptive σ and C                       need for adaptation
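A minimal (µ, λ)-ES with isotropic Gaussian mutation and self-adaptive step-size (log-normal σ update) illustrates the loop above on the sphere function f(x) = Σ x_i². A sketch only: C is kept fixed at the identity, unlike the full adaptive scheme of the next slides, and all parameter values are illustrative assumptions:

```python
import math
import random

def mu_lambda_es(f, n, mu=5, lam=20, n_gens=200, seed=2):
    """(mu, lambda)-ES sketch: isotropic Gaussian mutation, self-adaptive
    sigma (log-normal mutation), intermediate recombination; C = I throughout."""
    rng = random.Random(seed)
    m = [rng.uniform(-5, 5) for _ in range(n)]           # current mean
    sigma = 1.0
    tau = 1.0 / math.sqrt(n)                             # learning rate for sigma
    for _ in range(n_gens):
        offspring = []
        for _ in range(lam):
            s = sigma * math.exp(tau * rng.gauss(0, 1))  # mutate the step-size
            y = [mi + s * rng.gauss(0, 1) for mi in m]   # y ~ m + s N(0, I)
            offspring.append((f(y), s, y))
        offspring.sort(key=lambda t: t[0])               # (mu, lambda) selection
        sel = offspring[:mu]
        m = [sum(y[i] for _, _, y in sel) / mu for i in range(n)]    # "crossover"
        sigma = math.exp(sum(math.log(s) for _, s, _ in sel) / mu)   # geometric mean
    return m, f(m)

# Sphere function from the slides: f(x) = sum(x_i^2), n = 10.
sphere = lambda x: sum(xi * xi for xi in x)
m, fm = mu_lambda_es(sphere, n=10)
```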
Sphere function f(x) = Σ x_i², n = 10

Content
1 Setting the scene
2 Continuous Optimization The Evolution Strategy The CMA-ES (Covariance Matrix Adaptation Evolution Strategy) The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion

Cumulative Step-Size Adaptation (CSA)
Measure the length of the evolution path
path of the mean vector m in the generation sequence
    short path (steps cancel out)   →  decrease σ
    long path (steps correlated)    →  increase σ

Compare to the length of a random walk without selection.

Covariance Matrix Adaptation – Rank-One Update
    m ← m + σ y_w,   y_w = Σ_{i=1..µ} w_i y_{i:λ},   y_i ∼ N_i(0, C)

    initial distribution: C = I
    y_w: movement of the population mean m (disregarding σ)
    new distribution: mixture of the distribution C and the step y_w,
        C ← 0.8 × C + 0.2 × y_w y_w^T

Ruling principle: the adaptation increases the probability of the successful step y_w appearing again.
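The rank-one update itself is one line of linear algebra; a plain-Python sketch with the slide's 0.8/0.2 weights (the 2-D example values are illustrative assumptions):

```python
def rank_one_update(C, y_w, c1=0.2):
    """Rank-one covariance update from the slide:
    C <- (1 - c1) * C + c1 * y_w y_w^T  (here with c1 = 0.2).
    Plain-Python matrices (lists of lists), for illustration only."""
    n = len(y_w)
    return [[(1 - c1) * C[i][j] + c1 * y_w[i] * y_w[j] for j in range(n)]
            for i in range(n)]

# Starting from C = I, a successful step along (1, 1) stretches the
# distribution ellipsoid toward that direction.
C = [[1.0, 0.0], [0.0, 1.0]]
y_w = [1.0, 1.0]
C = rank_one_update(C, y_w)
# C is now [[1.0, 0.2], [0.2, 1.0]]: positive correlation along the step.
```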
CMA-ES in one slide
Input: m ∈ R^n, σ ∈ R+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1,
     d_σ ≈ 1 + √(µ_w/n), and w_{i=1..λ} such that µ_w = 1 / Σ_{i=1..µ} w_i² ≈ 0.3 λ

While not terminate
    x_i = m + σ y_i,  y_i ∼ N_i(0, C),  for i = 1, ..., λ                          sampling
    m ← Σ_{i=1..µ} w_i x_{i:λ} = m + σ y_w,  where y_w = Σ_{i=1..µ} w_i y_{i:λ}    update mean
    p_c ← (1 − c_c) p_c + 1I_{‖p_σ‖<1.5√n} √(1 − (1 − c_c)²) √µ_w y_w              cumulation for C
    p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w C^(−1/2) y_w                      cumulation for σ
    C ← (1 − c_1 − c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1..µ} w_i y_{i:λ} y_{i:λ}^T   update C
    σ ← σ × exp( (c_σ/d_σ) (‖p_σ‖ / E‖N(0, I)‖ − 1) )                              update of σ
Not covered on this slide: termination, restarts, active or mirrored sampling, outputs, and boundaries.

Invariances: Guarantee for Generalization
Invariance properties of CMA-ES
    Invariance to order-preserving transformations in function space          like all comparison-based algorithms
    Translation and rotation invariance: to rigid transformations of the search space

[level-set plots illustrating both invariances omitted]

→ Natural Gradient in distribution space          NES, Schmidhuber et al., 08; IGO, Ollivier et al., 11
CMA-ES is almost parameterless
    Tuning on a small set of functions                          Hansen & Ostermeier, 01
    Default values generalize to whole classes of functions
    Exception: population size for multi-modal functions        but see IPOP-CMA-ES (Auger & Hansen, 05) and BIPOP-CMA-ES (Hansen, 09)

BBOB – Black-Box Optimization Benchmarking
ACM-GECCO workshops, 2009-2015          http://coco.gforge.inria.fr/
Set of 25 benchmark functions, dimensions 2 to 40
With known difficulties (ill-conditioning, non-separability, ...)
Noisy and non-noisy versions

[ECDF plot omitted; axes: log10 of (ERT / dimension) vs proportion of functions, f1-f24]
Competitors include BFGS (Matlab version), Fletcher-Powell, NEWUOA, DFO (Derivative-Free Optimization, Powell 04), Differential Evolution, Particle Swarm Optimization, and many more.

(should be) available in ROOT6          Benazera, Hansen, Kégl, 15

Content
1 Setting the scene
2 Continuous Optimization The Evolution Strategy The CMA-ES (Covariance Matrix Adaptation Evolution Strategy) The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion

Seeded Influencer Identification          Philippe Caillou & Michèle Sebag, coll. SME Augure
Goal: find the influencers in a given social network, from public data sources (tweets, blogs, articles, ...).
Commercial goal: so as to bribe them :-)
Scientific goal: with little user input

What is an Influencer?
No clear definition
    # followers? Highly retweeted?                    but # retweets disagrees with # followers
    Sources of information cascades? # invitations to join?
    Topic-specific PageRanking? Presence on Wikipedia?
Need for topic-specific features, graph features, and user input          no generic ground truth

Seeded Influencer Ranking
Input
    A social network
    (Big) Data of user interactions: tweets, blogs, messages, ...
    Some identified influencers

The global picture
    Derive features representing the users                    using content and traces
    Optimize a scoring function giving highest scores to known influencers
    Return the best-k scoring users                           the most (known + new) influencers

Search Space and Learning Criterion
The scoring function
    Each individual is described by d features: x ∈ R^d
    Non-linear score h defined by a pair (v, a) ∈ R^d × R^d:

        h_{v,a}(x) = Σ_{i=1..d} v_i |x_i − a_i|
A rank-based criterion
    Each score h_{v,a} induces a ranking R_{v,a} on the dataset          best has rank n
    Goal: known influencers (KIs) ranked highest

Non-convex optimization:

    ArgMax_{(v,a)}  Σ_{x_i ∈ KI}  R_{v,a}(x_i)
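The score h_{v,a} and the rank-based criterion can be sketched directly; being rank-based, the criterion is invariant to monotone transformations of h. The toy data, weights, and known-influencer indices below are illustrative assumptions:

```python
import random

def score(v, a, x):
    """Score h_{v,a}(x) = sum_i v_i * |x_i - a_i| from the slides."""
    return sum(vi * abs(xi - ai) for vi, xi, ai in zip(v, x, a))

def rank_fitness(v, a, data, known_influencers):
    """Rank-based criterion: sum of the ranks of the known influencers
    under the ranking induced by h_{v,a} (best-scoring point gets rank n).
    This is the quantity a stochastic optimizer maximizes over (v, a)."""
    order = sorted(range(len(data)), key=lambda i: score(v, a, data[i]))
    rank = {idx: r + 1 for r, idx in enumerate(order)}   # lowest score -> 1, highest -> n
    return sum(rank[i] for i in known_influencers)

# Toy dataset (an assumption): 10 candidates with 3 features each.
random.seed(3)
data = [[random.random() for _ in range(3)] for _ in range(10)]
fit = rank_fitness([1.0, 0.5, 2.0], [0.2, 0.2, 0.2], data, known_influencers=[0, 1])
```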
→ Stochastic Optimizer, e.g., sep-CMA-ES

Data
Domains
    Fashion: 18/23 influencers with Twitter account
    SmartPhones: 69/206 influencers with Twitter account
    Social Influence: 39/39 influencers with Twitter account
    Wine: 28/235 influencers with Twitter account, coming from public sources (e.g., Time Magazine)
    Human Resources: 99 influencers from the 100 most influential people in HR and recruiting on Twitter
Sources
    Random 10% of all retweets of November 2014
    100M tweets, 45M unique tweets, with origin-destination pairs → 43M-node graph
    Only words in at least 0.001% and at most 10% of tweets considered → 10.5M candidates

Features
100 content-based features          (eventually, for each medium)
    For each user x, identify W(x), the N words with max tf-idf          term frequency – inverse document frequency
    50 words that appear most often across all W(influencer)
    50 words with max sum of tf-idf over ∪ W(influencer)
    Each selected word is a feature for x: 0 if not present in W(x), its tf-idf otherwise
    Many features are null for most candidates
6 network features          from the weighted graph of retweets: centrality, PageRank, ...
5 social features           from Twitter user profiles: # tweets, # followers, ...

Seeded Influencer Identification: Discussion
Results
    Better than using only social features
    Found unexpected influencers          confirmed by experts
Forthcoming application: unveil some influential xxx-sceptics          global warming, genocides, ...

Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization The Crossover Controversy The TSP Illustration
4 Non-Parametric Representations
5 Conclusion

Stochastic Optimization Templates (reminder)
Population-based Metaheuristics

    Randomly draw x0^i ∈ Ω and compute F(x0^i), i = 1, ..., µ                Initialisation
    Until(happy)
        y^i = Move(xt^1, ..., xt^µ), i = 1, ..., λ                           stochastic variations on Ω
        Compute F(y^i), i = 1, ..., λ
        (xt+1^1, ..., xt+1^µ) = Select(xt^1, ..., xt^µ, y^1, ..., y^λ)       (µ + λ) selection
                              = Select(y^1, ..., y^λ)                        (µ, λ) selection

Distribution-based Metaheuristics

    Initialize distribution parameters θ0
    Until(happy)
        Sample distribution P(θt) → y1, ..., yλ ∈ Ω
        Evaluate y1, ..., yλ on F
        Update parameters θt+1 ← Fθ(θt, y1, ..., yλ, F(y1), ..., F(yλ))

Crossover or not Crossover
Issues with crossover
    Crossover is ubiquitous in nature                 though not in all bacteria
    Shown to mainly work on dynamical problems        compared to ILS, SA; Ch. Papadimitriou et al., 15

Yes, but
    Most artificial crossover operators exchange chunks of solutions
    Nature does not exchange "organs" but chunks of the programs that build the solution

Better no crossover than a poor crossover

Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization The Crossover Controversy The TSP Illustration
4 Non-Parametric Representations
5 Conclusion

The Traveling Salesman Problem
Find the shortest Hamiltonian circuit on n cities.

Permutation representation
    Ordered list of cities: no semantic link with the objective
    Blind exchange of chunks is meaningless, and leads to infeasible circuits

Edge representation
    List of all edges of the tour (unordered): clearly related to the objective
    But blind exchange of edges doesn't work either          [instance dk11455.tsp pictured]
Challenge or opportunity?
    A strong exact solver: Concorde
    A very strong heuristic solver: LKH-2          Lin-Kernighan-Helsgaun heuristic, Helsgaun, 06

LKH-2 local search
    Deep k-opt moves, efficient implementation
    Many ad hoc heuristics

Multi-trial LKH-2: Iterated Local Search using ... a crossover operator between the best-so-far and new local optima          Iterative Partial Transcription, Möbius et al., 99-08; a particular case of GPX (next slide)
Holds most best-known results on non-exactly-solved instances, up to 10M cities.

Generalized Partition Crossover (GPX)          Whitley, Hains & Howe, 09-12
Respectful and transmitting crossover
    Finds k-partitions of cost 2, if they exist          O(n)
    Chooses the best of 2^k − 2 solutions                O(k)
    All common edges are kept; no new edge is created
Two possible hybridizations
    Evolutionary Algorithm using GPX, with iterated 2-opt as mutation operator          some best results on small instances (< 10k cities)
    Replacing IPT in the LKH-2 heuristic                                                improved some state-of-the-art results

Edge Assembly Crossover          Nagata & Kobayashi, 06-13
The EAX crossover
    1. Merge parents A and B
    2. Alternate A/B edges
    3. Local, then global choice strategy
    4. Use sub-tours to remove parent edges
    5. Local search for minimal new edges

The EAX algorithm
    For each randomly chosen parent pair          typically 300 iterations
        apply EAX N times                         typically N = 30
        keep the best of parents + N offspring    diversity-preserving selection
    Best known results on several instances...    ≤ 200k cities

Evolutionary TSP: Discussion
Meta or not Meta?
    GPX can be applied to other graph problems
    EAX is definitely a specialized TSP operator

Take-home message: efficient crossovers might exist for semantically relevant representations, but they require domain knowledge... and fine-tuning.

Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations The Velocity Identification The Planning Decomposition
5 Conclusion

Identification of Spatial Geological Models          MS, Ehinger & Braunschweig, 98
[diagram omitted: seismic source and measure points at the surface, above a layered underground]
The direct problem
    Function φ applied to model M: find the result R = φ(M)          (simulated) 3D seismogram → numerical simulation
The inverse problem
    Given experimental results R*, find the model M* such that φ(M*) = R*
    → ArgMin_M ‖φ(M) − R*‖²                                  120 × 2D seismograms

[seismogram plots omitted: shot 1, plane 1, traces 10-120]

Voronoi Representation
The solution space: velocities in some 3D domain; blocky model          piecewise constant
    Values on a grid/mesh: thousands of variables, uncorrelated
    vs. a variable unordered list of Voronoi sites, and the corresponding 'colored' Voronoi diagram