Stochastic Optimization
Marc Schoenauer
DataScience @ LHC Workshop 2015

Content
1 Setting the scene
2 Continuous Optimization
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion

Stochastic Optimization
Hypotheses
- Search space Ω with some topological structure
- Objective function F, assumed to satisfy some weak regularity

Hill-Climbing
  Randomly draw x_0 ∈ Ω and compute F(x_0)            initialisation
  Until(happy)
    y = Best_neighbor(x_t)                            neighbor structure on Ω
    Compute F(y)
    If F(y) ≤ F(x_t) then x_{t+1} = y                 accept if improvement
    else x_{t+1} = x_t
Comments
- Finds the closest local optimum, defined by the neighborhood structure
- Iterate from different x_0's                        Until(very happy)

Stochastic (Local) Search
  Randomly draw x_0 ∈ Ω and compute F(x_0)            initialisation
  Until(happy)
    y = Random_neighbor(x_t)                          neighbor structure on Ω
    Compute F(y)
    If F(y) ≤ F(x_t) then x_{t+1} = y                 accept if improvement
    else x_{t+1} = x_t
Comments
- Finds one close local optimum, defined by the neighborhood structure
- Iterate, leaving the current optimum                Iterated Local Search

Stochastic (Local) Search: alternative viewpoint
Same algorithm, but with
    y = Move(x_t)                                     stochastic variation on Ω
Comments
- Finds one close local optimum, now defined by the Move operator
- Iterate, leaving the current optimum                Iterated Local Search

Stochastic Search, aka Metaheuristics
  Randomly draw x_0 ∈ Ω and compute F(x_0)            initialisation
  Until(happy)
    y = Move(x_t)                                     stochastic variation on Ω
    Compute F(y)
    x_{t+1} = Select(x_t, y)                          stochastic selection, biased toward better points
Comments
- Can escape local optima                             e.g., Select = Boltzmann selection
- Iterate?

Population-based Metaheuristics
  Randomly draw x_0^i ∈ Ω and compute F(x_0^i), i = 1,...,µ        initialisation
  Until(happy)
    y^i = Move(x_t^1,...,x_t^µ), i = 1,...,λ          stochastic variations on Ω
    Compute F(y^i), i = 1,...,λ
    (x_{t+1}^1,...,x_{t+1}^µ) = Select(x_t^1,...,x_t^µ, y^1,...,y^λ)   (µ + λ) selection
                              = Select(y^1,...,y^λ)                     (µ, λ) selection
Evolutionary metaphor: 'natural' selection and 'blind' variations
- y^i = Move_m(x_t^i) for some i: mutation
- y^i = Move_c(x_t^i, x_t^j) for some (i, j): crossover

Metaheuristics: an Alternative Viewpoint
Population and variation operators define a distribution on Ω
→ directly evolve a parameterized distribution P(θ)

Distribution-based Metaheuristics
  Initialize distribution parameters θ_0
  Until(happy)
    Sample distribution P(θ_t) → y_1,...,y_λ ∈ Ω
    Evaluate y_1,...,y_λ on F
    Update parameters: θ_{t+1} = F_θ(θ_t, y_1,...,y_λ, F(y_1),...,F(y_λ))   includes selection
Covers
- Estimation of Distribution Algorithms
- Population-based metaheuristics: Evolutionary Algorithms, Particle Swarm Optimization, Differential Evolution, ...
- Deterministic algorithms

This is (mostly) Experimental Science
- Theory is lagging behind practice, though rapidly catching up in recent years    not discussed here
- Results need to be experimentally validated
Experiments
- Validation relies on grounded statistical validation          CPU costly
- Informed (meta)parameter setting/tuning                       CPU costly ++
- Reproducibility                                               Open Science

From Solution Space to Search Space
Choices
- Objective function
- Search space, representation and move operators / parameterized distribution
- Approximation bias vs optimization error
This talk
- Non-convex, non-smooth objective: parametric continuous optimization → rank-based ML, seeded identification of influencers
- Search spaces and crossover operators → the TSP
- Non-parametric representations → exploration vs optimization

Content
1 Setting the scene
2 Continuous Optimization
    The Evolution Strategy
    The CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
    The Rank-based DataScience
3 Discrete/Combinatorial Optimization
4 Non-Parametric Representations
5 Conclusion

The (µ, λ)-Evolution Strategy
Ω = Rⁿ, P(θ): normal distributions
  Initialize distribution parameters m, σ, C; set population size λ ∈ N
  While not terminate
    1 Sample distribution N(m, σ²C) → y_1,...,y_λ ∈ Rⁿ:   y_i ∼ m + σ N_i(0, C) for i = 1,...,λ
    2 Evaluate y_1,...,y_λ on F:   compute F(y_1),...,F(y_λ)
    3 Select the µ best samples y_1,...,y_µ
    4 Update distribution parameters: (m, σ, C) ← F(m, σ, C, y_1,...,y_µ, F(y_1),...,F(y_µ))

The (µ, λ)-Evolution Strategy (2)
Gaussian mutations
- The mean vector m ∈ Rⁿ is the current solution
- The so-called step-size σ ∈ R₊ controls the step length
- The covariance matrix C ∈ Rⁿˣⁿ determines the shape of the distribution ellipsoid
How to update m, σ, and C?
    m = (1/µ) Σ_{i=1}^{µ} x_{i:λ}        "crossover" of the best µ individuals
- Adaptive σ and C

Need for adaptation
[Figure: runs on the sphere function f(x) = Σ x_i², n = 10]

Cumulative Step-Size Adaptation (CSA)
Measure the length of the evolution path: the path of the mean vector m over the generation sequence
- Steps cancel each other out (short path): decrease σ
- Steps point in the same direction (long path): increase σ
Compare to a random walk without selection

Covariance Matrix Adaptation: Rank-One Update
    m ← m + σ y_w,   y_w = Σ_{i=1}^{µ} w_i y_{i:λ},   y_i ∼ N_i(0, C)
- Initial distribution: C = I
- y_w is the movement of the population mean m (disregarding σ)
- Mixture of the distribution C and the step y_w gives the new distribution:
    C ← 0.8 × C + 0.2 × y_w y_wᵀ
Ruling principle: the adaptation increases the probability of successful steps, y_w, to appear again

CMA-ES in One Slide
Input: m ∈ Rⁿ, σ ∈ R₊, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1,
     d_σ ≈ 1 + √(µ_w/n), and w_{i=1,...,λ} such that µ_w = 1 / Σ_{i=1}^{µ} w_i² ≈ 0.3 λ
While not terminate
    x_i = m + σ y_i,  y_i ∼ N_i(0, C),  for i = 1,...,λ                               sampling
    m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w,  where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}     update mean
    p_c ← (1 − c_c) p_c + 1{‖p_σ‖ < 1.5√n} √(1 − (1 − c_c)²) √µ_w y_w                 cumulation for C
    p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w C^(−1/2) y_w                          cumulation for σ
    C ← (1 − c_1 − c_µ) C + c_1 p_c p_cᵀ + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}ᵀ        update C
    σ ← σ × exp( (c_σ/d_σ) (‖p_σ‖ / E‖N(0, I)‖ − 1) )                                  update σ
Not covered on this slide: termination, restarts, active or mirrored sampling, outputs, and boundaries

Invariances: Guarantee for Generalization
Invariance properties of CMA-ES
- Invariance to order-preserving transformations in function space
[Figure]
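The stochastic (local) search template from the opening slides can be sketched in a few lines of pure Python. This is a minimal illustration, not code from the talk: the function names and the sphere test objective are chosen here for the example, and the "random neighbor" structure is a uniform perturbation of one coordinate.

```python
import random

def sphere(x):
    # Illustrative objective to minimize: f(x) = sum(x_i^2), optimum at 0
    return sum(xi * xi for xi in x)

def stochastic_local_search(f, x0, step=0.1, iters=2000, seed=0):
    """(1+1) stochastic local search: draw a random neighbor of x_t,
    accept it only if it improves F (the 'accept if improvement' rule)."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(iters):
        # Random neighbor: perturb one coordinate (the neighbor structure on Omega)
        y = list(x)
        i = rng.randrange(len(y))
        y[i] += rng.uniform(-step, step)
        fy = f(y)
        if fy <= fx:  # accept if improvement, else keep x_t
            x, fx = y, fy
    return x, fx

x0 = [1.0] * 5
x_best, f_best = stochastic_local_search(sphere, x0)
```

As the slides note, this finds one close local optimum defined by the neighbor structure; restarting from different x_0's (iterated local search) is what gives it global reach.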
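The (µ, λ)-Evolution Strategy template can likewise be sketched in plain Python. To keep the example self-contained the covariance is frozen at C = I and the step-size σ is fixed (no CSA or covariance adaptation), so only the mean m is updated; all parameter values below are illustrative choices, not the talk's settings.

```python
import random

def sphere(x):
    # Illustrative objective: f(x) = sum(x_i^2)
    return sum(xi * xi for xi in x)

def mu_lambda_es(f, n=5, mu=5, lam=20, sigma=0.3, gens=100, seed=1):
    """Minimal (mu, lambda)-ES with isotropic Gaussian mutation (C = I)
    and a fixed step-size sigma; only the mean m is adapted."""
    rng = random.Random(seed)
    m = [1.0] * n  # initial mean vector (current solution)
    for _ in range(gens):
        # 1. Sample lambda offspring: y_i ~ m + sigma * N(0, I)
        ys = [[mj + sigma * rng.gauss(0.0, 1.0) for mj in m] for _ in range(lam)]
        # 2.-3. Evaluate and select the mu best samples (comma selection:
        # parents are discarded, only offspring survive)
        ys.sort(key=f)
        best = ys[:mu]
        # 4. Update the mean: "crossover" (average) of the mu best individuals
        m = [sum(y[j] for y in best) / mu for j in range(n)]
    return m, f(m)

m_final, f_final = mu_lambda_es(sphere)
```

With σ held fixed the mean stalls at a distance proportional to σ from the optimum, which is exactly the "need for adaptation" the slides illustrate on the sphere function before introducing CSA and CMA.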
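The rank-one covariance update C ← 0.8 C + 0.2 y_w y_wᵀ from the slides is easy to verify numerically. Below is a small pure-Python sketch (helper names are mine) showing that the update is a convex mixture of the old covariance and the outer product of the successful step, so it stays symmetric and stretches the distribution along y_w.

```python
def outer(u, v):
    # Outer product u v^T as a nested list
    return [[ui * vj for vj in v] for ui in u]

def rank_one_update(C, yw, c1=0.2):
    """Rank-one covariance update: C <- (1 - c1) * C + c1 * yw yw^T.
    With c1 = 0.2 this is exactly the slides' 0.8*C + 0.2*yw yw^T."""
    n = len(C)
    Y = outer(yw, yw)
    return [[(1 - c1) * C[i][j] + c1 * Y[i][j] for j in range(n)]
            for i in range(n)]

# Start from the initial distribution C = I and apply one successful step yw
n = 3
C = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
yw = [1.0, 0.5, 0.0]
C = rank_one_update(C, yw)
```

The quadratic form y_wᵀ C y_w grows under the update, which is the slides' ruling principle: the adaptation increases the probability that the successful step y_w appears again.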