
UNIVERSIDAD CARLOS III DE MADRID

ESCUELA POLITÉCNICA SUPERIOR

MÁSTER EN MULTIMEDIA Y COMUNICACIONES

Duality between Probability and Optimization

Luca Martino, Gestión de Información Multimedia, March 2008

Contents

1 Introduction
2 Algebraic Structure
3 Tropical Semiring (Min-plus Algebra)
4 Probability Space
5 Cost Measure and Decision Variable
6 Conclusion

1 Introduction

In many applications related to statistical inference or machine learning, a loss function (also called a cost function) is associated with a probability density function. In general, the choice of a specific loss function presupposes a specific distribution of the data. In other words, the minimization of a loss function gives the best solution for the problem at hand if the distribution of the data is exactly the probability density associated with the chosen loss function. A classical example is the least squares solution; this method minimizes the mean square error, i.e., the following cost function:

C = Σ_i ξ_i²   (1)

This is the best solution if it is possible to assume that the noise perturbation has a Gaussian distribution:

ξ ∼ k · exp(−α · ξ²)   (2)

In the next sections, we will find the following correspondences:

P → K   (3)

where P and K denote, respectively, the probability measure and the cost measure; the same holds for the density functions:

p(x) → c(x) (4)

In general, it is logical to expect these relations between the operations:

× → + (5)

+ → min (6)

In other words, we pass from R to R_min (the tropical semiring).
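These correspondences can be illustrated with a small numerical sketch (a Python toy with arbitrary values, not part of the original text) using the map c = −log p: products of probabilities become sums of costs, and sums of probabilities become minima of costs in the zero-temperature limit of the log-sum-exp:

```python
import math

# Sketch of the duality via c(x) = -log p(x) (toy values):
#   x (product of probabilities)  <->  + (sum of costs)
#   + (sum of probabilities)      <->  min (of costs), as a limit

def prob_to_cost(p):
    return -math.log(p)

pa, pb = 0.2, 0.5
# a product of probabilities maps to a sum of costs
assert abs(prob_to_cost(pa * pb) - (prob_to_cost(pa) + prob_to_cost(pb))) < 1e-12

# -T*log(exp(-a/T) + exp(-b/T)) -> min(a, b) as the "temperature" T -> 0
def soft_min(a, b, T):
    return -T * math.log(math.exp(-a / T) + math.exp(-b / T))

a, b = 1.0, 3.0
assert abs(soft_min(a, b, 0.01) - min(a, b)) < 1e-3
```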

2 Algebraic Structure

An algebraic structure consists of one (or more) sets closed under one or more operations satisfying some axioms. The keystone for studying algebraic structures is the binary operation. A binary operation (also called a dyadic operation) is a calculation involving two operands, often written in infix notation such as a ∗ b, a + b, a · b rather than in the functional notation f(a, b). The basic kind of algebraic structure is the magma (or groupoid). Specifically, a magma consists of a set M equipped with a single binary operation M × M → M. The only hypothesis imposed on this operation is closure; no further axioms are assumed. A semigroup is a magma where the operation is associative (as opposed to a quasigroup, a nonempty magma where division is also possible), and a monoid is a semigroup with an identity element (for example, the identity element for addition is 0 and for multiplication is 1). A group is a monoid with inverses. We have just described the right branch of Fig. 1. Figure 2 contains further definitions; for example, an abelian group is a group where the operation is commutative. For the following sections we need the notions of ring and semiring.

Figure 1: From Magma to Group (M=magma, Q=quasigroup, L=loop, S=semigroup, N=monoid; d=divisibility, a=associativity, e=identity, i=invertibility).

Figure 2: Algebraic Structures.

Ring: an algebraic structure formed by a set M and two operations that are related to each other by the distributive property. Semiring: similar to a ring but without the requirement that each element have an additive inverse. In other words, a semiring (M, +, ·) is a set M with two binary operations + and ·, called addition and multiplication, such that:

1. (M, +) is a commutative monoid with identity element 0:

   • (a + b) + c = a + (b + c)
   • 0 + a = a + 0 = a
   • a + b = b + a

2. (M, ·) is a monoid with identity element 1:

   • (a · b) · c = a · (b · c)
   • 1 · a = a · 1 = a

3. Multiplication distributes over addition:

   • a · (b + c) = (a · b) + (a · c)
   • (a + b) · c = (a · c) + (b · c)

4. 0 annihilates M:

   • 0 · a = a · 0 = 0

3 Tropical Semiring (Min-plus Algebra)

The adjective tropical is given in honor of the Brazilian mathematician Imre Simon. The tropical semiring is usually denoted R_min = (R, ⊕, ⊙). The underlying set is the set R of real numbers, sometimes augmented by +∞ (R ∪ {+∞}). The arithmetic operations of tropical addition ⊕ and tropical multiplication ⊙ are:

x ⊕ y := min{x, y} (7)

x ⊙ y := x + y   (8)

The Tropical Semiring is idempotent in the sense that:

a ⊕ a ⊕ · · · ⊕ a = a   (9)
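As a sanity check (a Python sketch with arbitrary sample values), the semiring axioms of Section 2 can be verified for (R ∪ {+∞}, min, +), where the tropical zero is +∞ and the tropical one is 0:

```python
import itertools
import math

INF = math.inf  # tropical "0": identity for min and annihilator for +

def t_add(a, b):  # tropical addition = min
    return min(a, b)

def t_mul(a, b):  # tropical multiplication = ordinary +
    return a + b

samples = [-2.0, 0.0, 1.5, 3.0, INF]
for a, b, c in itertools.product(samples, repeat=3):
    assert t_add(t_add(a, b), c) == t_add(a, t_add(b, c))  # + associative
    assert t_add(a, b) == t_add(b, a)                      # + commutative
    assert t_add(INF, a) == a                              # +inf is the additive identity
    assert t_mul(t_mul(a, b), c) == t_mul(a, t_mul(b, c))  # . associative
    assert t_mul(0.0, a) == a                              # 0 is the multiplicative identity
    assert t_mul(a, t_add(b, c)) == t_add(t_mul(a, b), t_mul(a, c))  # distributivity
    assert t_mul(INF, a) == INF                            # the tropical zero annihilates
    assert t_add(a, a) == a                                # idempotency, Eq. (9)
```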

In the n-dimensional real space, tropical vector addition is defined componentwise:

(a_1, a_2, ..., a_n) ⊕ (b_1, b_2, ..., b_n) = (min{a_1, b_1}, ..., min{a_n, b_n})   (10)

and the tropical scalar multiplication is:

λ ⊙ (a_1, a_2, ..., a_n) = (λ + a_1, λ + a_2, ..., λ + a_n)   (11)
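Equations (10) and (11) translate directly into code; the vectors below are arbitrary examples:

```python
def t_vec_add(a, b):
    # componentwise tropical sum: (a (+) b)_i = min(a_i, b_i), Eq. (10)
    return tuple(min(x, y) for x, y in zip(a, b))

def t_scalar_mul(lam, a):
    # tropical scalar multiplication: lam (.) a = (lam + a_1, ..., lam + a_n), Eq. (11)
    return tuple(lam + x for x in a)

a = (1.0, 4.0, 0.0)
b = (2.0, 3.0, 5.0)
assert t_vec_add(a, b) == (1.0, 3.0, 0.0)
assert t_scalar_mul(2.0, a) == (3.0, 6.0, 2.0)
assert t_vec_add(a, a) == a  # idempotency also holds componentwise
```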

It is easy to define a linear tropical space over R_min. To explain the duality between the cost measure and the probability measure, we need the R_min semiring.

4 Probability Space

A probability space consists of a triplet (Ω, F, P), where:

• Ω is the set of all possible outcomes.

• F is a σ-algebra of subsets of Ω.

• P is a probability measure on F.

Definition: A family F of subsets of a set Ω is called a σ-algebra (or tribe) if:

1. ∅, Ω ∈ F.

2. If A ∈ F, then A^C ∈ F.

3. If A_1, A_2, ..., A_n, ... ∈ F, then ∪_i A_i ∈ F.

4. If A_1, A_2, ..., A_n, ... ∈ F, then ∩_i A_i ∈ F.

Definition: Given a set Ω and a σ-algebra F of subsets of Ω, a probability P is a map P : F → R^+ such that:

• P(Ω) = 1.

• P(∅) = 0.

• Given {A_n}_n with A_i ∩ A_j = ∅ ∀i ≠ j, then P(∪_n A_n) = Σ_n P(A_n).

Therefore, we will call the triplet (Ω, F, P) a probability space.
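A minimal concrete example (a fair coin, not in the original text): Ω = {H, T}, F the power set of Ω, and P the uniform measure. The axioms above can be checked directly:

```python
from itertools import chain, combinations

omega = frozenset({"H", "T"})
# F = powerset(omega): all subsets, from the empty set to omega itself
F = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

def P(A):
    # uniform (fair-coin) probability measure
    return len(A) / len(omega)

assert P(omega) == 1.0
assert P(frozenset()) == 0.0
# additivity on disjoint events
A, B = frozenset({"H"}), frozenset({"T"})
assert A & B == frozenset()
assert P(A | B) == P(A) + P(B)
# F is closed under complement
for S in F:
    assert (omega - S) in F
```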

5 Cost Measure and Decision Variable

It is possible to build a cost theory, dual to probability theory, by replacing the classical structure of the real numbers (R, +, ×) with the tropical semiring R_min, i.e., working on R ∪ {+∞} with the binary operations ⊕ = min and ⊙ = +. To the probability of an event corresponds the cost of a set of decisions; to the random variable corresponds the decision variable. The decision space is defined as the triplet (U, 𝒰, K), where U is a topological space (i.e., a set together with a collection of its subsets declared open), 𝒰 is the set of open sets of U, and K is a map 𝒰 → R_min such that:

• K(U) = 0

• K(∅) = +∞

• K(∪_n A_n) = inf_n K(A_n) ∀A_n ∈ 𝒰

The map K is called the cost measure. A cost density is a function c : U → R_min. For example:

χ_m(x) = 0 for x = m,  +∞ for x ≠ m   (12)

This is the equivalent of the delta probability density. Another example:

M^p_{m,σ}(x) = (1/p) |σ^{-1}(x − m)|^p   for p > 1   (13)

For p = 2 we find the cost density associated with the Gaussian distribution with parameters (m, σ²), and M^p_{m,0} = χ_m(x). Moreover, for two independent variables X and Y:

c_{X,Y}(x, y) = c_X(x) + c_Y(y)   (14)

It is also possible to define the conditional cost excess of X knowing Y:

c_X(x|y) = c_{X,Y}(x, y) − c_Y(y)   (15)

Similarly to the mode, we can define the optimum of a decision variable:

O(X) = arg min_x c_X(x)   (16)

If the decision variable X takes values in R^n, near the optimum we can write:

c_X(x) = (1/p) |σ^{-1}(x − O(X))|^p + o(‖x − O(X)‖^p)   (17)

and in this case X is called of order p, with sensitivity of order p, which clearly plays the role of the standard deviation:

S_p(X) = σ   (18)

The value of X, corresponding to the expectation in probability theory, is:

V(X) = inf_x (x + c_X(x))   (19)

Given a sequence of N i.i.c. decision variables (i.e., independent with the same cost density function), the joint cost density is defined as:

c_X(x_1, x_2, ..., x_N) = Σ_i c_{X_i}(x_i)   (20)
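The optimum (16) and value (19) can be approximated on a grid; this Python sketch uses the cost density M^p_{m,σ} of Eq. (13) with assumed parameters m = 1, σ = 2, p = 2:

```python
# Grid approximation (assumed discretization, not from the text) of the
# optimum O(X) = argmin_x c(x) and the value V(X) = inf_x (x + c(x)).

def M(x, m=1.0, sigma=2.0, p=2.0):
    # cost density of Eq. (13): (1/p) |(x - m)/sigma|^p
    return (1.0 / p) * abs((x - m) / sigma) ** p

xs = [i / 1000.0 for i in range(-5000, 5001)]  # grid on [-5, 5]
costs = [M(x) for x in xs]

O = xs[costs.index(min(costs))]                # optimum: where the cost vanishes
V = min(x + c for x, c in zip(xs, costs))      # value: inf_x (x + c(x))

assert abs(O - 1.0) < 1e-9    # the optimum sits at m = 1
assert abs(V - (-1.0)) < 1e-6 # analytic minimum of x + (x-1)^2/8 is -1 at x = -3
```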

Moreover, for a sequence of decision variables (X_n, n ∈ N) we can define three different types of convergence, analogous to probability theory:

• X_n ∈ L^p converges in p-norm to X (X_n →^p X) if:

lim_{n→+∞} ‖X_n − X‖ = 0   (21)

• X_n converges in cost to X (X_n →^K X) if, for every η > 0:

lim_{n→+∞} K{|X_n − X| > η} = +∞   (22)

• X_n converges almost surely to X (X_n →^{a.s.} X) if:

K{ lim_{n→+∞} X_n = X } = 0   (23)

Convergence in p-norm implies convergence in cost, but the converse is false; convergence in cost implies almost sure convergence, and again the converse is not true. In other words, convergence in p-norm is the strongest condition. In probability theory, almost sure convergence plays this role. Now it is easy to state the analogue of the law of large numbers: given a sequence (X_n, n ∈ N) of i.i.c. decision variables X_n ∈ L^p, p ≥ 1, the arithmetic mean of the X_n converges in p-norm to O(X_i) (all the variables have the same optimum):

lim_{N→+∞} (1/N) Σ_{n=0}^{N−1} X_n = O(X_0)   (24)

The role of the Laplace/Fourier transform in probability calculus is played by the Fenchel transform:

ĉ(ϑ) = F[c_X(x)](ϑ) = sup_x [⟨ϑ, x⟩ − c_X(x)]   (25)

or, equivalently:

ĉ(ϑ) = F[c_X(x)](ϑ) = − inf_x [c_X(x) − ⟨ϑ, x⟩]   (26)

For example, for a linear function:

f(x) = ⟨a, x⟩ − b   (27)

the Fenchel transform is:

F[f(x)](ϑ) = b if ϑ = a, and ∞ if ϑ ≠ a   (28)

We can see this easily from the definition:

F[f(x)](ϑ) = sup_x [⟨ϑ, x⟩ − ⟨a, x⟩ + b]   (29)

Other examples:

g(x) = |x|   (30)

h(x) = exp(x)   (31)

The Fenchel transforms are:

F[g(x)](ϑ) = 0 if |ϑ| ≤ 1, and ∞ if |ϑ| > 1   (32)

F[h(x)](ϑ) = ϑ ln(ϑ) − ϑ if ϑ > 0,  0 if ϑ = 0,  ∞ if ϑ < 0   (33)
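Equation (32) can be checked numerically with a discretized Fenchel transform (the grid and the test values of ϑ are assumptions of this sketch):

```python
# Numerical check of Eq. (32): the Fenchel transform of g(x) = |x|
# is 0 for |theta| <= 1 and grows without bound otherwise.

def fenchel(f, thetas, xs):
    # F[f](theta) = sup_x (theta*x - f(x)), approximated on a finite grid
    return {th: max(th * x - f(x) for x in xs) for th in thetas}

xs = [i / 100.0 for i in range(-1000, 1001)]   # grid on [-10, 10]
F = fenchel(abs, [-0.5, 0.0, 0.5, 1.0], xs)
for th, val in F.items():
    assert abs(val) < 1e-9                     # 0 inside |theta| <= 1

# outside [-1, 1] the sup is attained at the grid boundary and grows
# with the grid size, i.e. it tends to +inf
assert fenchel(abs, [2.0], xs)[2.0] == 10.0    # 2x - |x| = x, maximal at x = 10
```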

The Fenchel transform of c_X(x) is the characteristic function of the decision variable X. F is an involution, i.e., F(F[f]) = f (it is periodic of period 2; the Fourier transform has period 4). Now we recall the convolution of two functions:

(f ⊗ g)(τ) = C(τ) = ∫_{−∞}^{+∞} f(x) g(τ − x) dx   (34)

The corresponding operation in decision theory is the inf-convolution of f and g:

(f □ g)(τ) = Ĉ(τ) = inf_x {f(x) + g(τ − x)}   (35)

Similarly to the ordinary convolution with the Laplace or Fourier transform, the inf-convolution and the Fenchel transform have the following properties:

• F(f □ g) = F(f) + F(g)

• F(f + g) = F(f) □ F(g)

So for independent decision variables X and Y we have:

c_{X+Y} = c_X □ c_Y   (36)

and in the Fenchel domain:

F[c_{X+Y}] = F[c_X] + F[c_Y]   (37)
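For quadratic costs, Eq. (36) can be verified numerically: with c_i(x) = (x − m_i)²/(2 s_i), the inf-convolution is again quadratic with parameters m_1 + m_2 and s_1 + s_2, mirroring how variances of independent Gaussians add under ordinary convolution. A Python sketch with assumed parameters:

```python
# c_{X+Y} = c_X inf-conv c_Y for quadratic (Gaussian-like) cost
# densities; the parameters below are invented for illustration.

def inf_conv(f, g, tau, xs):
    # (f [] g)(tau) = inf_x { f(x) + g(tau - x) }, Eq. (35), on a grid
    return min(f(x) + g(tau - x) for x in xs)

def quad(m, s):
    return lambda x: (x - m) ** 2 / (2.0 * s)

xs = [i / 100.0 for i in range(-2000, 2001)]   # grid on [-20, 20]
cX, cY = quad(1.0, 2.0), quad(2.0, 3.0)
cSum = quad(3.0, 5.0)                          # means add, s-parameters add

for tau in (0.0, 3.0, 5.5):
    assert abs(inf_conv(cX, cY, tau, xs) - cSum(tau)) < 1e-3
```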

These considerations lead to a demonstration of the analogue of the central limit theorem: given a sequence X_n of i.i.c. centered decision variables (O(X_i) = 0) of order p, with 1/p + 1/p′ = 1:

(1/N^{1/p′}) Σ_{n=0}^{N−1} X_n →^p M^p_{0, S_p(X_0)}   (38)

The demonstration uses the fact that, as n → ∞:

F[X_n](ϑ) → ϕ(ϑ) = (1/p′) ‖σϑ‖^{p′}   (39)

It is possible to find an analogue of Markov chains, which we will call Bellman chains. Given a sequence X_n, a transition cost matrix C = [C_{x_i, x_j}] and an initial cost φ_{x_0}, we can write the cost of a path:

c_X(x_0, x_1, ..., x_N) = φ_{x_0} + Σ_{i=0}^{N−1} C_{x_i, x_{i+1}}   (40)

And the marginal cost v^n_x = K(X_n = x) satisfies:

v^{n+1}_x = (v^n ⊗ C)_x = min_y (v^n_y + C_{y, x}),   v^0_x = φ_x   (41)
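The recursion (41) is a min-plus matrix-vector product; the following Python sketch propagates it on a 3-state Bellman chain whose costs are invented for illustration:

```python
import math

# Min-plus propagation of Eq. (41). C[y][x] is the cost of moving
# from state y to state x; the chain and its numbers are toy values.
INF = math.inf
C = [[1.0, 4.0, INF],
     [INF, 0.5, 2.0],
     [3.0, INF, 1.0]]
phi = [0.0, INF, INF]  # initial cost: the chain starts in state 0

def step(v, C):
    # v^{n+1}_x = min_y (v^n_y + C[y][x])  -- a min-plus matrix product
    n = len(v)
    return [min(v[y] + C[y][x] for y in range(n)) for x in range(n)]

v = phi
for _ in range(2):
    v = step(v, C)

# v[x] is now the cheapest cost over all 2-step paths ending in state x
assert v == [2.0, 4.5, 6.0]
```

This is the same dynamic-programming recursion used by shortest-path and Viterbi-style algorithms, which is exactly the optimization side of the duality.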

6 Conclusion

Beyond their purely theoretical interest, these observations about the duality between probabilities and costs could be used immediately in machine learning, where loss/cost functions are routinely employed. Moreover, the observations above can find application in the wide world of stochastic processes and probabilistic inference. The CRPF (Cost Reference Particle Filter) is a clear example where the duality could play an interesting role. In the inference field, finding the maximum of a likelihood function corresponds to finding the minimum of a cost function. In many cases, to find a theoretical bound or to study convergence, for example, it is easier to reason in terms of costs than with traditional probabilities. In other words, it can be easier to reach a theoretical result with the cost approach than with the probability approach.
