

On Some Statistical Properties of the Multivariate q-Gaussian Distribution and Its Application to Smoothed Functional Algorithms

Debarghya Ghoshdastidar

Ph.D. Candidate, Computer Science & Automation

Work done with: Dr. Ambedkar Dukkipati and Prof. Shalabh Bhatnagar


Outline
1. Multivariate q-Gaussian Distribution: Nonextensive Information Theory; q-Gaussian distribution; Moments of q-Gaussian; Generating multivariate q-Gaussian
2. Smoothed Functional Algorithms: Stochastic Optimization Framework; Smoothed Functional method; Optimization using SF
3. q-Gaussian based SF Algorithms: q-Gaussian as smoothing kernel; Proposed two-timescale algorithm; Convergence of algorithm
4. Discussions: Summary; Origin of q-Gaussian; q-Central Limit Theorem

Boltzmann-Gibbs-Shannon entropy
For a discrete probability mass function p,
$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \ln p(x).$$

Shannon entropy measures the uncertainty of a random variable. The maximum entropy principle states: among all distributions satisfying the given constraints, choose the one that maximizes entropy. The uniform, exponential, and normal distributions can all be formulated as maximum entropy distributions.

Tsallis entropy
For a discrete probability mass function p,
$$H_q(p) = -\sum_{x \in \mathcal{X}} p(x)^q \ln_q p(x),$$
where the q-logarithm is $\ln_q(x) = \frac{x^{1-q} - 1}{1-q}$, with $q \in \mathbb{R}$, $q \neq 1$.

Proposed in the context of thermodynamics¹. Tends to Shannon entropy as q → 1. Pseudo-additive in nature, i.e., for X and Y independent,
$$H_q(X,Y) = H_q(X) + H_q(Y) + (1-q)H_q(X)H_q(Y).$$
The q-logarithm is the same as the Box-Cox transformation used in statistics to "make the data more Gaussian-like".

¹C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics 52 (1-2), 1988.
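As a quick numerical check of the two identities above, here is a small Python sketch (the helper names `ln_q` and `tsallis_entropy` are ours) that evaluates Tsallis entropy through the q-logarithm and verifies pseudo-additivity for an independent pair:

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm: (x^(1-q) - 1)/(1 - q); reduces to ln(x) as q -> 1."""
    return np.log(x) if q == 1 else (x**(1 - q) - 1) / (1 - q)

def tsallis_entropy(p, q):
    """H_q(p) = -sum_x p(x)^q ln_q p(x); equals Shannon entropy at q = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p**q * ln_q(p, q))

# Pseudo-additivity check for independent X and Y:
q = 0.5
px = np.array([0.2, 0.8])
py = np.array([0.3, 0.3, 0.4])
pxy = np.outer(px, py).ravel()          # joint pmf of independent X, Y
lhs = tsallis_entropy(pxy, q)
rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
       + (1 - q) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
assert np.isclose(lhs, rhs)             # H_q(X,Y) = H_q(X)+H_q(Y)+(1-q)H_q(X)H_q(Y)
```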

Tsallis entropy functional
For a continuous probability distribution p,
$$H_q(p) = \frac{1 - \int_{\mathcal{X}} p(x)^q \, dx}{q - 1}, \quad q \in \mathbb{R}.$$

q-expectation
The corresponding generalization of expectation is
$$\langle f \rangle_q = \frac{\int f(x)\, p(x)^q \, dx}{\int p(x)^q \, dx} = E_{p_q}[f(X)],$$
where the escort distribution is $p_q(x) = \frac{p(x)^q}{\int p(x)^q \, dx}$.
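For a discrete p, the q-expectation is an ordinary expectation under the escort weights. A minimal sketch (function name ours):

```python
import numpy as np

def q_expectation(f_vals, p, q):
    """<f>_q = sum f(x) p(x)^q / sum p(x)^q, i.e., the expectation of f
    under the escort distribution p_q(x) = p(x)^q / sum p(x)^q."""
    p = np.asarray(p, dtype=float)
    w = p**q
    return np.sum(f_vals * w) / np.sum(w)

x = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])
print(q_expectation(x, p, q=0.5))   # q-mean of X
print(q_expectation(x, p, q=1.0))   # recovers the ordinary mean, 0.7
```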


q-Gaussian distribution:
- Nonextensive generalization of the Gaussian distribution.
- Associated with Lévy super-diffusion processes².
- Obtained from Tsallis entropy maximization under the constraints
  q-mean $\langle X \rangle_q = \mu_q$ and q-variance $\langle (X - \mu_q)^2 \rangle_q = \sigma_q^2$.

²D. Prato and C. Tsallis. Nonextensive foundation of Lévy distributions. Physical Review E 60 (2), 2398–2401, 1999.

q-Gaussian distribution
$$G_q(x) = \frac{1}{\sigma_q K_q}\left[1 - \frac{(1-q)}{(3-q)\sigma_q^2}(x - \mu_q)^2\right]_+^{\frac{1}{1-q}} = \frac{1}{\sigma_q K_q}\exp_q\left(-\frac{(x - \mu_q)^2}{(3-q)\sigma_q^2}\right) \quad \text{for all } x \in \mathbb{R},$$
where $y_+ = \max(y, 0)$ is the Tsallis cut-off condition and
$$K_q = \begin{cases} \sqrt{\dfrac{\pi(3-q)}{1-q}}\; \dfrac{\Gamma\left(\frac{2-q}{1-q}\right)}{\Gamma\left(\frac{5-3q}{2(1-q)}\right)} & \text{for } -\infty < q < 1, \\[2ex] \sqrt{\dfrac{\pi(3-q)}{q-1}}\; \dfrac{\Gamma\left(\frac{3-q}{2(q-1)}\right)}{\Gamma\left(\frac{1}{q-1}\right)} & \text{for } 1 < q < 3. \end{cases}$$

Properties:
- Probability distribution only for q < 3.
- Provides a family of distributions, with behavior controlled by the parameter q.
- $E[X] = \mu_q$ for $q < 2$.
- $\mathrm{Var}[X] = \frac{3-q}{5-3q}\,\sigma_q^2$ for $q < \frac{5}{3}$.
- Finite support for q < 1; infinite support and power-law tails for q > 1.
- One-to-one correspondence with Student's t for q > 1.

Special cases:
- Gaussian distribution as q → 1.
- Cauchy distribution for q = 2.
- Uniform distribution as q → −∞.
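The density and normalizing constant above translate directly into code. The sketch below (function names ours) evaluates the univariate $G_q$ and checks the q = 2 Cauchy special case:

```python
import math
import numpy as np

def K_q(q):
    """Normalizing constant of the univariate q-Gaussian (q < 3, q != 1)."""
    if q < 1:
        return (math.sqrt(math.pi * (3 - q) / (1 - q))
                * math.gamma((2 - q) / (1 - q))
                / math.gamma((5 - 3*q) / (2 * (1 - q))))
    else:  # 1 < q < 3
        return (math.sqrt(math.pi * (3 - q) / (q - 1))
                * math.gamma((3 - q) / (2 * (q - 1)))
                / math.gamma(1 / (q - 1)))

def q_gaussian_pdf(x, q, mu=0.0, sigma=1.0):
    """G_q(x), with the Tsallis cut-off [.]_+ enforcing the support for q < 1."""
    base = 1 - (1 - q) * (x - mu)**2 / ((3 - q) * sigma**2)
    return np.maximum(base, 0.0)**(1 / (1 - q)) / (sigma * K_q(q))

x = np.linspace(-5, 5, 11)
print(q_gaussian_pdf(x, q=2.0))     # should equal the Cauchy density...
print(1 / (math.pi * (1 + x**2)))   # ...1/(pi*(1+x^2)), the q = 2 special case
```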

Multivariate q-Gaussian distribution³:
$$G_{q,\sigma}(X) = \frac{1}{\sigma^N K_{q,N}}\left[1 - \frac{(1-q)}{((N+4)-(N+2)q)\,\sigma^2}\,\|X\|^2\right]_+^{\frac{1}{1-q}}$$
for all $X \in \mathbb{R}^N$, with mean $\mu = 0$, covariance $\Sigma = \sigma^2 I$, and normalizing constant
$$K_{q,N} = \begin{cases} \left(\dfrac{(N+4)-(N+2)q}{1-q}\right)^{N/2}\pi^{N/2}\;\dfrac{\Gamma\left(\frac{2-q}{1-q}\right)}{\Gamma\left(\frac{2-q}{1-q}+\frac{N}{2}\right)} & \text{for } q < 1, \\[2ex] \left(\dfrac{(N+4)-(N+2)q}{q-1}\right)^{N/2}\pi^{N/2}\;\dfrac{\Gamma\left(\frac{1}{q-1}-\frac{N}{2}\right)}{\Gamma\left(\frac{1}{q-1}\right)} & \text{for } 1 < q < \frac{N+4}{N+2}. \end{cases}$$

³C. Vignat and A. Plastino. Central limit theorem and deformed exponentials. Journal of Physics A: Mathematical and Theoretical 40(45), 2007.

Moments of q-Gaussian

Support set:
$$\Omega_q = \begin{cases} \left\{x \in \mathbb{R}^N : \|x\|^2 \leq \frac{((N+4)-(N+2)q)\,\sigma^2}{1-q}\right\} & \text{for } q < 1, \\ \mathbb{R}^N & \text{for } 1 < q < \frac{N+4}{N+2}. \end{cases}$$

Consider $X^{(1)}, X^{(2)}, \ldots, X^{(N)}$ q-Gaussian distributed with $q \in \left(-\infty, \frac{N+4}{N+2}\right)$, $q \neq 1$, normalized so that
- $E[(X^{(i)})^2] = 1$ for all $i = 1, \ldots, N$,
- $E[X^{(i)} X^{(j)}] = 0$ for all $i, j = 1, \ldots, N$, $i \neq j$.

Define $\rho(X) = 1 - \frac{(1-q)}{((N+4)-(N+2)q)}\,\|X\|^2$, and let $b, b_1, b_2, \ldots, b_N \in \mathbb{N}$.

Theorem (Generalized co-moments):
$$E_{G_q}\left[\frac{(X^{(1)})^{b_1}(X^{(2)})^{b_2}\cdots(X^{(N)})^{b_N}}{(\rho(X))^{b}}\right] = \begin{cases} \left(\dfrac{(N+4)-(N+2)q}{1-q}\right)^{\frac{1}{2}\sum_{i=1}^{N} b_i} \bar{K} \displaystyle\prod_{i=1}^{N} \frac{b_i!}{2^{b_i}\left(\frac{b_i}{2}\right)!} & \text{if } b_i \text{ is even for all } i = 1, 2, \ldots, N, \\ 0 & \text{otherwise}, \end{cases}$$
with
$$\bar{K} = \begin{cases} \dfrac{\Gamma\left(\frac{1}{1-q}-b+1\right)\Gamma\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}{\Gamma\left(\frac{1}{1-q}+1\right)\Gamma\left(\frac{1}{1-q}-b+1+\frac{N}{2}+\frac{1}{2}\sum_{i=1}^{N} b_i\right)} & \text{if } q \in (-\infty, 1), \\[2ex] \dfrac{\Gamma\left(\frac{1}{q-1}\right)\Gamma\left(\frac{1}{q-1}+b-\frac{N}{2}-\frac{1}{2}\sum_{i=1}^{N} b_i\right)}{\Gamma\left(\frac{1}{q-1}+b\right)\Gamma\left(\frac{1}{q-1}-\frac{N}{2}\right)} & \text{if } q \in \left(1,\, 1+\frac{2}{N+2}\right). \end{cases}$$

Limiting case:
$$\lim_{q \to 1} E_{G_q}\left[\frac{(X^{(1)})^{b_1}(X^{(2)})^{b_2}\cdots(X^{(N)})^{b_N}}{(\rho(X))^{b}}\right] = \prod_{i=1}^{N} E_{G}\left[(X^{(i)})^{b_i}\right]$$

Generalized moments and q-moments:
$$\left\langle \frac{(X^{(1)})^{b_1}(X^{(2)})^{b_2}\cdots(X^{(N)})^{b_N}}{(\rho(X))^{b}} \right\rangle_q = \frac{2}{(N+2)-Nq}\, E_{G_q}\left[\frac{(X^{(1)})^{b_1}(X^{(2)})^{b_2}\cdots(X^{(N)})^{b_N}}{(\rho(X))^{b+1}}\right]$$

m-th order moments and q-moments:
These relations hold only if the Gamma functions exist.
- If $b = 0$ and $\sum_{i=1}^{N} b_i = m$, then $\bar{K}$ is a function of m.
- m-th order moments and co-moments exist for $q < 1 + \frac{2}{N+m}$, $q \neq 1$.
- m-th order q-moments exist for $q < 1 + \frac{2}{N+m-2}$, $q \neq 1$.
- The distribution, as well as 1st and 2nd order q-moments, exist for all $q < 1 + \frac{2}{N}$.
- Usual mean: $\mu = \mu_q$ for $q < 1 + \frac{2}{N+1}$.
- Covariance: $\Sigma = \frac{(N+2)-Nq}{(N+4)-(N+2)q}\,\Sigma_q$ for $q < 1 + \frac{2}{N+2}$.

This motivates expressing the multivariate q-Gaussian in terms of its q-moments.

Generating multivariate q-Gaussian

Lemma
Given $Z \sim \mathcal{N}(0, I_{N \times N})$ and $a \sim \chi^2(\nu)$, $\nu > 0$, independent, let
$$Y = Z\sqrt{\frac{\nu}{a}}.$$
Then $Y \sim G_q(0, I_{N \times N})$, where $q = 1 + \frac{2}{N+\nu}$.

Lemma
Let $Y \sim G_q(0, I_{N \times N})$ for some $q \in \left(1,\, 1+\frac{2}{N+2}\right)$ and
$$X = \frac{\sqrt{\frac{2-q}{N+2-Nq}}\; Y}{\sqrt{1 + \frac{q-1}{N+2-Nq}\, Y^T Y}}.$$
Then $X \sim G_{q'}(0, I_{N \times N})$, where $q' = 1 - \frac{q-1}{(N+4)-(N+2)q} \in (-\infty, 1)$.

Sampling algorithm (given q, $\mu_q$ and $\Sigma_q$):
1. Generate an N-dimensional vector $Z \sim \mathcal{N}(0, I_{N \times N})$.
2. Generate a chi-squared random variate
$$a \sim \begin{cases} \chi^2\left(\frac{2(2-q)}{1-q}\right) & \text{for } -\infty < q < 1, \\[1ex] \chi^2\left(\frac{N+2-Nq}{q-1}\right) & \text{for } 1 < q < 1+\frac{2}{N}. \end{cases}$$
3. Compute
$$Y = \begin{cases} \sqrt{\frac{N+2-Nq}{1-q}}\; \frac{Z}{\sqrt{a + Z^T Z}} & \text{for } -\infty < q < 1, \\[1ex] \sqrt{\frac{N+2-Nq}{q-1}}\; \frac{Z}{\sqrt{a}} & \text{for } 1 < q < 1+\frac{2}{N}. \end{cases}$$
4. Return $X = \mu_q + \Sigma_q^{1/2} Y \sim G_q(\mu_q, \Sigma_q)$.
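A minimal NumPy rendering of these four steps (the function name is ours, and we take the Cholesky factor as $\Sigma_q^{1/2}$, which suffices because Y is spherically symmetric):

```python
import numpy as np

def sample_q_gaussian(q, mu, Sigma, rng=None):
    """One draw of an N-variate q-Gaussian via Steps 1-4 above
    (valid for q < 1 + 2/N, q != 1)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    N = mu.size
    if q == 1 or q >= 1 + 2.0 / N:
        raise ValueError("need q < 1 + 2/N and q != 1")
    Z = rng.standard_normal(N)                        # Step 1: Z ~ N(0, I)
    if q < 1:
        a = rng.chisquare(2 * (2 - q) / (1 - q))      # Step 2, compact-support case
        Y = np.sqrt((N + 2 - N*q) / (1 - q)) * Z / np.sqrt(a + Z @ Z)
    else:
        a = rng.chisquare((N + 2 - N*q) / (q - 1))    # Step 2, heavy-tailed case
        Y = np.sqrt((N + 2 - N*q) / (q - 1)) * Z / np.sqrt(a)
    L = np.linalg.cholesky(np.asarray(Sigma, dtype=float))  # Sigma_q^{1/2} (Step 4)
    return mu + L @ Y
```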

Stochastic Optimization Framework

System: A discrete event system $\{Y_m : m \geq 0\}$, controlled by a parameter $\theta \in C$, a closed and convex subset of $\mathbb{R}^N$.

Cost: The long-run average cost $J(\theta) = E_{\nu_\theta}[h(Y)]$, where $\nu_\theta$ is the stationary distribution of the process and $h(Y)$ is the single-stage cost.

Objective: Minimize $J(\theta)$ with respect to $\theta \in C$.

Issue: There is no analytical relationship between $J(\theta)$ and $\theta$.

Solution: Perform optimization with derivatives of J, estimated using the Smoothed Functional (SF) approach.

Assumptions:
- The process is ergodic for a given θ, i.e., for large L,
$$J(\theta) = E_{\nu_\theta}[h(Y)] \approx \frac{1}{L}\sum_{m=0}^{L-1} h(Y_m).$$
- $J(\cdot)$ is twice continuously differentiable for all $\theta \in C$.
- The process remains stable under the sequence of parameter updates (technically, we assume the existence of a stochastic Lyapunov function).

Smoothed Functional method

Let $f : C \to \mathbb{R}$ be any function. Then, given a function $G_\beta : \mathbb{R}^N \to \mathbb{R}$ satisfying the Rubinstein conditions⁴, we have

Definition (Smoothed Functional)
$$S_\beta[f(\theta)] = \int_{\mathbb{R}^N} G_\beta(\eta)\, f(\theta - \eta)\, d\eta$$

Figure: The unsmoothed function $f(x) = x^2 - \frac{1}{4}e^{-x^2}\cos(8\pi x)$ and its smoothed version $S_{0.1}[f(x)]$.

⁴R. Y. Rubinstein. Simulation and the Monte Carlo Method. John Wiley, 1981.

Rubinstein conditions:
- $G_\beta(\eta) = \frac{1}{\beta^N} G\left(\frac{\eta}{\beta}\right)$, where $G\left(\frac{\eta}{\beta}\right) = G_1\left(\frac{\eta^{(1)}}{\beta}, \frac{\eta^{(2)}}{\beta}, \ldots, \frac{\eta^{(N)}}{\beta}\right)$.
- $G_\beta(\eta)$ is piecewise differentiable in $\eta$.
- $G_\beta(\eta)$ is a probability density function, i.e., $S_\beta[f(\theta)] = E_{G_\beta(\eta)}[f(\theta - \eta)]$.
- $\lim_{\beta \to 0} G_\beta(\eta) = \delta(\eta)$, where $\delta(\eta)$ is the Dirac delta function.
- $\lim_{\beta \to 0} S_\beta[f(\theta)] = f(\theta)$.

Examples of smoothing kernels:
- Gaussian distribution with covariance matrix $\beta^2 I_{N \times N}$.
- Cauchy distribution with scale parameter $\beta$.
- Uniform distribution on the interval $\left[-\frac{\beta}{2}, \frac{\beta}{2}\right]^N$.
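Because $G_\beta$ is a probability density, $S_\beta[f(\theta)]$ can be estimated by plain Monte Carlo averaging. A one-dimensional sketch using the Gaussian kernel, one of the admissible kernels listed above (names ours):

```python
import numpy as np

def smoothed_functional(f, theta, beta, M=10_000, rng=None):
    """Monte Carlo estimate of S_beta[f(theta)] = E_{G_beta(eta)}[f(theta - eta)],
    here with the Gaussian kernel N(0, beta^2) as the smoothing density."""
    rng = np.random.default_rng() if rng is None else rng
    eta = beta * rng.standard_normal(M)      # eta ~ G_beta
    return np.mean(f(theta - eta))           # vectorized over the M samples

# The oscillating test function from the earlier figure:
f = lambda x: x**2 - 0.25 * np.exp(-x**2) * np.cos(8 * np.pi * x)
print(f(0.5), smoothed_functional(f, 0.5, beta=0.1))  # the ripples are averaged out
```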


Optimization methods:
Gradient descent algorithm:
$$x_{n+1} = P_{[-1,1]}\left(x_n - \frac{1}{n}\nabla_x f(x_n)\right)$$
Gradient descent on the smoothed functional:
$$x_{n+1} = P_{[-1,1]}\left(x_n - \frac{1}{n}\nabla_x S_\beta[f(x_n)]\right)$$

Figure: Optimum found using gradient descent (red) and SF (yellow).
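The figure's behaviour is easy to reproduce in a few lines. For the Gaussian kernel, $\nabla_x S_\beta[f(x)] = \frac{1}{\beta^2}E[\eta\, f(x+\eta)]$ with $\eta \sim \mathcal{N}(0, \beta^2)$; the sketch below (our own toy setup, not the thesis' experiments) compares the two updates on the rippled test function:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2 - 0.25 * np.exp(-x**2) * np.cos(8 * np.pi * x)

def grad_f(x, h=1e-5):                        # numerical gradient of f itself
    return (f(x + h) - f(x - h)) / (2 * h)

def grad_sf(x, beta=0.1, M=2000):             # sampled gradient of S_beta[f]
    eta = beta * rng.standard_normal(M)
    return np.mean((eta / beta**2) * f(x + eta))

x_gd = x_sf = 0.9
for n in range(1, 201):
    x_gd = np.clip(x_gd - grad_f(x_gd) / n, -1, 1)   # plain projected descent
    x_sf = np.clip(x_sf - grad_sf(x_sf) / n, -1, 1)  # descent on the smoothed f
print(x_gd, x_sf)   # x_gd is typically trapped in a ripple; x_sf tends toward 0
```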

Smoothed Gradient
$$\nabla_\theta S_\beta[f(\theta)] = E_{G(\eta)}\big[\, g_1(\eta)\, f(\theta + \beta\eta) \,\big|\, \theta \,\big]$$

Stochastic framework:
- Random vector $\eta \sim G = G_1$.
- Process $\{Y_m\}$ controlled by the parameter $(\theta + \beta\eta)$.
- The function $g_1$ depends on the nature of G.

q-Gaussian as smoothing kernel

Lemma
The N-variate q-Gaussian distribution with q-covariance $\beta^2 I_{N \times N}$ satisfies the Rubinstein conditions for all $q < 1 + \frac{2}{N}$, $q \neq 1$.

One-simulation q-SF Gradient
$$\nabla_\theta S_{q,\beta}[J(\theta)] = E_{G_q(\eta)}\big[\, g_1(\eta)\, J(\theta + \beta\eta) \,\big|\, \theta \,\big],$$
where
$$g_1(\eta) = \frac{2\eta}{\beta(N+2-Nq)\left(1 - \frac{(1-q)}{(N+2-Nq)}\|\eta\|^2\right)}.$$

Lemma
$$\nabla_\theta S_{q,\beta}[J(\theta)] - \nabla_\theta J(\theta) = o(\beta) \quad \text{for all } q < 1 + \frac{2}{N},\ q \neq 1.$$
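The estimator is a few lines of code; a draw $\eta \sim G_{q,1}$ from the sampler sketched earlier gives the single-sample gradient estimate $g_1(\eta)\, J(\theta + \beta\eta)$ (function name ours):

```python
import numpy as np

def g1(eta, q, beta):
    """Scaling vector in the one-simulation q-SF gradient estimate:
    grad_theta S_{q,beta}[J](theta) ~= average of g1(eta) * J(theta + beta*eta)."""
    N = eta.size
    rho = 1.0 - (1 - q) / (N + 2 - N*q) * (eta @ eta)
    return 2.0 * eta / (beta * (N + 2 - N*q) * rho)
```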

Proposed two-timescale algorithm

Approximation:
$$\nabla_\theta S_{q,\beta}[J(\theta)] \approx \frac{1}{ML}\sum_{n=0}^{M-1}\left( g_1(\eta_n)\sum_{m=0}^{L-1} h(Y_{nL+m}) \right),$$
where $\{Y_{nL+m}\}$ is controlled by $(\theta + \beta\eta_n)$, for large M and L.

Gradient descent method:
1. Fix M, L, q and β.
2. Set the parameter $\theta_0 = \theta_{\text{initial}}$.
3. For $k = 0$ to a fixed number of steps:
   1. Estimate $\nabla_{\theta_k} S_{q,\beta}[J(\theta_k)]$ using the above approximation.
   2. Update $\theta_{k+1} = P_C\big(\theta_k - a_k \nabla_{\theta_k} S_{q,\beta}[J(\theta_k)]\big)$.
4. Output the final parameter vector.

Remark: The parameter θ should lie in some bounded (usually convex) set; this is taken care of by the projection $P_C$.

Issue:
- Estimating $\nabla_{\theta_k} S_{q,\beta}[J(\theta_k)]$ requires considerable computation.
- The nested-loop structure increases the complexity further.

Two-timescale approach:
- Perform gradient estimation and parameter updates simultaneously.
- Update the gradient estimate with larger step-sizes.
- Update the parameter with smaller step-sizes.

Idea of two-timescale: Given the update rules
$$x_{n+1} = x_n + a_n f(x_n, y_n), \qquad y_{n+1} = y_n + b_n g(x_n, y_n),$$
where $\frac{a_n}{b_n} \downarrow 0$, the updates can be viewed as a coupled ODE system in which y evolves on the faster timescale:
$$\dot{y}(t) = g\big(x(t), y(t)\big), \qquad \dot{x}(t) = f\big(x(t), y(t)\big).$$

- Faster timescale: with x quasi-static, $y_n \to \lambda(x)$, the globally asymptotically stable equilibrium of $\dot{y}(t) = g\big(x, y(t)\big)$.
- Slower timescale: $y(t)$ tracks $\lambda(x(t))$, and so $x_n \to x^0$, a stable equilibrium of $\dot{x}(t) = f\big(x(t), \lambda(x(t))\big)$.

The updates converge to $\big(x^0, \lambda(x^0)\big)$.

Step-sizes:
$(a_n)_{n \geq 0}, (b_n)_{n \geq 0} \subset \mathbb{R}^+$ satisfying
$$\sum_{n=0}^{\infty} a_n^2 < \infty, \quad \sum_{n=0}^{\infty} b_n^2 < \infty, \quad \sum_{n=0}^{\infty} a_n = \sum_{n=0}^{\infty} b_n = \infty, \quad \text{and} \quad a_n = o(b_n), \text{ i.e., } \lim_{n \to \infty} \frac{a_n}{b_n} = 0.$$
For example, $a_n = \frac{1}{n+1}$ and $b_n = \frac{1}{(n+1)^{2/3}}$ satisfy all four conditions.

The Gq-SF1 Algorithm:
1. Fix M, L, q and β.
2. Set the gradient estimate $Z_0 = 0$ and the parameter $\theta_0 = \theta_{\text{initial}}$.
3. For $n = 0$ to $M - 1$:
   1. Generate $\eta_n \in \mathbb{R}^N$, $\eta_n \sim G_{q,1}$.
   2. For $m = 0$ to $L - 1$:
      - Simulate $Y_{nL+m}$ with parameter $(\theta_n + \beta\eta_n)$.
      - $Z_{nL+m+1} = (1 - b_n)Z_{nL+m} + b_n\, g_1(\eta_n)\, h(Y_{nL+m})$.
   3. Update $\theta_{n+1} = P_C\big(\theta_n - a_n Z_{(n+1)L}\big)$.
4. Output the final parameter vector.
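A compact sketch of Gq-SF1, reusing the `sample_q_gaussian` and `g1` helpers from the earlier sketches; here `simulate(theta)` is a hypothetical stand-in that runs one transition of the system at the given parameter and returns the single-stage cost, and the box projection is only an example of $P_C$:

```python
import numpy as np

def gq_sf1(simulate, q, theta0, beta=0.05, M=500, L=100,
           project=lambda th: np.clip(th, -1.0, 1.0), rng=None):
    """Sketch of Gq-SF1: the gradient estimate Z moves with the larger step
    b_n (faster timescale), the parameter theta with the smaller step a_n."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    N = theta.size
    Z = np.zeros(N)                                   # gradient estimate Z_0 = 0
    for n in range(M):
        a_n = 1.0 / (n + 1)                           # slower step-size, a_n = o(b_n)
        b_n = 1.0 / (n + 1) ** (2.0 / 3.0)            # faster step-size
        eta = sample_q_gaussian(q, np.zeros(N), np.eye(N), rng)   # eta_n ~ G_{q,1}
        for _ in range(L):
            h = simulate(theta + beta * eta)          # one transition at theta_n + beta*eta_n
            Z = (1 - b_n) * Z + b_n * g1(eta, q, beta) * h
        theta = project(theta - a_n * Z)              # theta_{n+1} = P_C(theta_n - a_n Z)
    return theta
```

For a deterministic sanity check one could pass, e.g., `simulate=lambda th: float(th @ th)`, whose minimum over the box is at the origin.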


Claim
The updates $\theta_n$ converge to a neighbourhood of a local minimum.

Faster timescale (averaging the natural timescale):
$$Z_{nL+m+1} = Z_{nL+m} + b_n\big(g_1(\eta_n)h(Y_{nL+m}) - Z_{nL+m}\big) = Z_{nL+m} + b_n\big(E[g_1(\eta_n)h(Y_{nL+m}) \,|\, \mathcal{G}_{nL+m-1}] - Z_{nL+m} + A_{nL+m}\big),$$
where $\mathcal{G}_{nL+m} = \sigma(\theta_k, \eta_k, Y_j : k \leq n,\ j \leq nL+m)$.

- $(A_{nL+m}, \mathcal{G}_{nL+m})$ is a martingale difference sequence with bounded variance.
- $\theta_n, \eta_n$ are quasi-static during the above update.

So we can say⁵ that $Z_{nL}$ tracks $g_1(\eta_n)J(\theta_n + \beta\eta_n)$.

⁵V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

Slower timescale:
$$\theta_{n+1} = P_C\big(\theta_n - a_n g_1(\eta_n)J(\theta_n + \beta\eta_n)\big) = P_C\big(\theta_n + a_n\left[-\nabla_{\theta_n}J(\theta_n) + \Delta(\theta_n) + \xi_n\right]\big).$$

Noise term:
$$\xi_n = \nabla_{\theta_n} S_{q,\beta}[J(\theta_n)] - g_1(\eta_n)J(\theta_n + \beta\eta_n);$$
$\xi_n$ is a martingale difference term with bounded variance.

Error term:
$$\Delta(\theta) = \nabla_\theta J(\theta) - \nabla_\theta S_{q,\beta}[J(\theta)];$$
the error satisfies $\|\Delta(\theta)\| = o(\beta)$.

Convergence of algorithm: The updates track the ODE⁶
$$\dot{\theta}(t) = \tilde{P}_C\big(-\nabla_{\theta(t)}J(\theta(t)) + \Delta(\theta(t))\big),$$
where
$$\tilde{P}_C\big(f(x)\big) = \lim_{\epsilon \downarrow 0} \frac{P_C\big(x + \epsilon f(x)\big) - x}{\epsilon}.$$
If $\Delta(\theta(t)) \to 0$, the updates would converge to the stable fixed points of
$$\dot{\theta}(t) = \tilde{P}_C\big(-\nabla_{\theta(t)}J(\theta(t))\big),$$
which are the local minima. Since $\|\Delta(\theta)\| = o(\beta)$, for small β the algorithm converges to some $\epsilon$-neighbourhood of a local minimum.

⁶H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.


Summary:
- Higher-order moments and q-moments of the multivariate q-Gaussian distribution.
- An algorithm to generate q-Gaussian random vectors.
- The q-Gaussian generalizes the existing class of smoothing kernels.
- Two-timescale q-SF gradient descent algorithms converge to a neighbourhood of a local minimum for all $q < 1 + \frac{2}{N}$, $q \neq 1$.
- Simulation results show that we can reach closer to the global minimum compared to other methods.


Lévy α-stable distributions: "When sharks can't find food, they abandon Brownian motion."

- $p_X$ is a stable distribution if, for independent $X, Y \sim p_X$, the distribution $p_{aX+bY}$ has the same shape as $p_X$ (up to location and scale).
- Commonly observed in nature and finance, with $\alpha \in (0, 2]$.
- Associated with Lévy flights and long-range interactions.
- Characteristic function: $\varphi(t) := E[\exp(itX)] = \exp(-|ct|^\alpha)$ for some $c > 0$.

Generalized Central Limit Theorem⁷:
- CLT: If $X_1, X_2, \ldots$ are i.i.d. with zero mean and variance $\sigma^2 < \infty$, then $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i$ tends to a normal distribution.
- GCLT: If $X_1, X_2, \ldots$ are i.i.d. with zero mean and infinite variance (i.e., with power-law tails), then $Z_n = \frac{1}{c_n}\sum_{i=1}^{n} X_i$, for suitable normalizing constants $c_n$, tends to an α-stable distribution.

⁷B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, 1968.

Origin of q-Gaussian:
- If the p.d.f. of a single jump is Gaussian, the p.d.f. of N jumps tends to a Gaussian.
- Question: when does the p.d.f. of N jumps tend to an α-stable distribution?
- Such a p.d.f. is obtained from maximization of BGS entropy with the constraint $E[\exp(-itX)] = \exp(-|ct|^\alpha)$ for some $c > 0$, i.e., prescribing an α-stable characteristic function.
- There is no physical interpretation for such a constraint. Better interpretations arise when Tsallis entropy is maximized with q-moment constraints.
- If the p.d.f. of a single jump is q-Gaussian with $q > \frac{5}{3}$, then the p.d.f. of N jumps tends to an α-stable distribution; if $q < \frac{5}{3}$, it tends to a normal distribution.

q-Central Limit Theorem

Vignat's approach⁸: Uses the fact that if $Z \sim \mathcal{N}(0, 1)$ and $a \sim \chi^2(\nu)$, then
$$Y = Z\sqrt{\frac{\nu}{a}} \sim G_q(0, 1), \quad \text{where } q = 1 + \frac{2}{1+\nu}$$
(the N = 1 case of the earlier lemma).

Theorem: If $X_1, X_2, \ldots$ are i.i.d. with zero mean and unit variance, and $a \sim \chi^2(\nu)$, then
$$Z_n = \frac{1}{\sqrt{an}}\sum_{i=1}^{n} X_i$$
converges weakly to a q-Gaussian distribution. This shows connections to Tsallis' results.

⁸C. Vignat and A. Plastino. Central limit theorem and deformed exponentials. Journal of Physics A: Mathematical and Theoretical 40(45), 2007.
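A quick toy simulation of the theorem (our own check, not from the talk): drawing one chi-squared variate per trial and normalizing a Rademacher sum by $\sqrt{an}$ produces the heavy Student-t-like tails of a q-Gaussian, unlike the ordinary CLT limit:

```python
import numpy as np
rng = np.random.default_rng(0)

n, nu, trials = 1000, 3, 10000
X = rng.choice([-1.0, 1.0], size=(trials, n))   # i.i.d., zero mean, unit variance
a = rng.chisquare(nu, size=trials)              # one chi-squared variate per trial
Z = X.sum(axis=1) / np.sqrt(a * n)              # Z_n = (1/sqrt(a n)) sum_i X_i
W = rng.standard_normal(trials)                 # the ordinary CLT limit, for contrast
print(np.percentile(np.abs(Z), [99, 99.9]))     # tail quantiles come out well above...
print(np.percentile(np.abs(W), [99, 99.9]))     # ...the Gaussian ones (~2.6, ~3.3)
```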

Tsallis' approach⁹: Generalize to q-algebra, q-calculus, q-Fourier transforms, etc.

q-independence: X and Y are q-independent if
$$\varphi_{q,X+Y}(t) = \varphi_{q,X}(t) \otimes_q \varphi_{q,Y}(t).$$

q-convergence: $X_1, X_2, \ldots$ are q-convergent to $X_\infty$ if
$$\lim_{n \to \infty} \varphi_{q,X_n}(t) = \varphi_{q,X_\infty}(t) \quad \text{locally uniformly in } t.$$

Theorem: For $q \in (1, 2)$, if $X_1, X_2, \ldots$ are q-independent and identically distributed with q-mean $\mu_q$ and $(2q-1)$-variance $\sigma^2_{2q-1}$, then
$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu_q}{C_{q,n,\sigma}}$$
q-converges to a $q_{-1}$-Gaussian distribution.

⁹S. Umarov, C. Tsallis and S. Steinberg. On a q-Central Limit Theorem Consistent with Nonextensive Statistical Mechanics. Milan Journal of Mathematics 76 (1), 307–328, 2008.