
UNIVERSIDAD CARLOS III DE MADRID

ESCUELA POLITÉCNICA SUPERIOR

MÁSTER EN MULTIMEDIA Y COMUNICACIONES

Duality between Probability and Optimization

Luca Martino, Gestión de Información Multimedia, March 2008

Contents

1 Introduction
2 Algebraic Structure
3 Tropical Semiring (Min-plus Algebra)
4 Probability Space
5 Cost Measure and Decision Variable
6 Conclusion

1 Introduction

In many applications related to statistical inference or machine learning, a loss function (also called a cost function) is associated with a probability density function. In general, the choice of a specific loss function presupposes a specific distribution of the data. In other words, the minimization of a loss function gives the best solution for the problem at hand if the distribution of the data is exactly the probability density associated with the chosen loss function. A classical example is the least squares solution; this method minimizes the mean square error, i.e., the following cost function:

C = Σ_i ξ_i²   (1)

This is the best solution if it is possible to assume that the noise perturbation has a Gaussian distribution:

ξ ∼ k · exp(−α · ξ²)   (2)

In the next sections, we will find the following correspondences:

P → K   (3)

where P and K denote, respectively, the probability measure and the cost measure; the same holds for the density functions:

p(x) → c(x) (4)

In general, it is logical to expect these relations between the operations:

× → + (5)

+ → min (6)

In other words, we pass from R to R_min (the tropical semiring).
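These correspondences can be illustrated with a small numerical sketch (a Python toy with arbitrary values, not part of the original text) using the map c = −log p: products of probabilities become sums of costs, and sums of probabilities become minima of costs in the zero-temperature limit of the log-sum-exp:

```python
import math

# Sketch of the duality via c(x) = -log p(x) (toy values):
#   x (product of probabilities)  <->  + (sum of costs)
#   + (sum of probabilities)      <->  min (of costs), as a limit

def prob_to_cost(p):
    return -math.log(p)

pa, pb = 0.2, 0.5
# a product of probabilities maps to a sum of costs
assert abs(prob_to_cost(pa * pb) - (prob_to_cost(pa) + prob_to_cost(pb))) < 1e-12

# -T*log(exp(-a/T) + exp(-b/T)) -> min(a, b) as the "temperature" T -> 0
def soft_min(a, b, T):
    return -T * math.log(math.exp(-a / T) + math.exp(-b / T))

a, b = 1.0, 3.0
assert abs(soft_min(a, b, 0.01) - min(a, b)) < 1e-3
```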

2 Algebraic Structure

An algebraic structure consists of one (or more) sets closed under one or more operations satisfying some axioms. The keystone for studying algebraic structures is the binary operation. A binary operation (also called a dyadic operation) is a calculation involving two operands, often written in infix notation such as a ∗ b, a + b, a · b rather than in the functional notation f(a, b). The basic kind of algebraic structure is the magma (or groupoid). Specifically, a magma consists of a set M equipped with a single binary operation M × M → M. The only hypothesis imposed on this operation is closure; no further axioms are assumed. A semigroup is a magma where the operation is associative (as opposed to a quasigroup, a nonempty magma where division is also possible), and a monoid is a semigroup with an identity element (for example, the identity element for addition is 0 and for multiplication is 1). A group is a monoid with inverses. We have just described the right branch of Fig. 1. Figure 2 contains further definitions; for example, an abelian group is a group where the operation is commutative. For the following sections we need the notions of ring and semiring.

Figure 1: From Magma to Group (M=magma, Q=quasigroup, L=loop, S=semigroup, N=monoid; d=divisibility, a=associativity, e=identity, i=invertibility).

Figure 2: Algebraic Structures.

Ring: an algebraic structure formed by a set M and two operations that are related to each other by the distributive property. Semiring: similar to a ring but without the requirement that each element have an additive inverse. In other words, a semiring (M, +, ·) is a set M with two binary operations + and ·, called addition and multiplication, such that:

1. (M, +) is a commutative monoid with identity element 0:

   • (a + b) + c = a + (b + c)
   • 0 + a = a + 0 = a
   • a + b = b + a

2. (M, ·) is a monoid with identity element 1:

   • (a · b) · c = a · (b · c)
   • 1 · a = a · 1 = a

3. Multiplication distributes over addition:

   • a · (b + c) = (a · b) + (a · c)
   • (a + b) · c = (a · c) + (b · c)

4. 0 annihilates M:

   • 0 · a = a · 0 = 0

3 Tropical Semiring (Min-plus Algebra)

The adjective tropical is given in honor of the Brazilian mathematician Imre Simon. The tropical semiring is usually denoted R_min = (R, ⊕, ⊙). The underlying set is the set R of real numbers, sometimes augmented by +∞ (R ∪ {+∞}). The arithmetic operations of tropical addition ⊕ and tropical multiplication ⊙ are:

x ⊕ y := min{x, y} (7)

x ⊙ y := x + y   (8)

The Tropical Semiring is idempotent in the sense that:

a ⊕ a ⊕ · · · ⊕ a = a   (9)
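As a sanity check (a Python sketch with arbitrary sample values), the semiring axioms of Section 2 can be verified for (R ∪ {+∞}, min, +), where the tropical zero is +∞ and the tropical one is 0:

```python
import itertools
import math

INF = math.inf  # tropical "0": identity for min and annihilator for +

def t_add(a, b):  # tropical addition = min
    return min(a, b)

def t_mul(a, b):  # tropical multiplication = ordinary +
    return a + b

samples = [-2.0, 0.0, 1.5, 3.0, INF]
for a, b, c in itertools.product(samples, repeat=3):
    assert t_add(t_add(a, b), c) == t_add(a, t_add(b, c))  # + associative
    assert t_add(a, b) == t_add(b, a)                      # + commutative
    assert t_add(INF, a) == a                              # +inf is the additive identity
    assert t_mul(t_mul(a, b), c) == t_mul(a, t_mul(b, c))  # . associative
    assert t_mul(0.0, a) == a                              # 0 is the multiplicative identity
    assert t_mul(a, t_add(b, c)) == t_add(t_mul(a, b), t_mul(a, c))  # distributivity
    assert t_mul(INF, a) == INF                            # the tropical zero annihilates
    assert t_add(a, a) == a                                # idempotency, Eq. (9)
```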

In the n-dimensional real space, tropical vector addition is defined componentwise:

(a_1, a_2, ..., a_n) ⊕ (b_1, b_2, ..., b_n) = (min{a_1, b_1}, ..., min{a_n, b_n})   (10)

and the tropical scalar multiplication is:

λ ⊙ (a_1, a_2, ..., a_n) = (λ + a_1, λ + a_2, ..., λ + a_n)   (11)
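Equations (10) and (11) translate directly into code; the vectors below are arbitrary examples:

```python
def t_vec_add(a, b):
    # componentwise tropical sum: (a (+) b)_i = min(a_i, b_i), Eq. (10)
    return tuple(min(x, y) for x, y in zip(a, b))

def t_scalar_mul(lam, a):
    # tropical scalar multiplication: lam (.) a = (lam + a_1, ..., lam + a_n), Eq. (11)
    return tuple(lam + x for x in a)

a = (1.0, 4.0, 0.0)
b = (2.0, 3.0, 5.0)
assert t_vec_add(a, b) == (1.0, 3.0, 0.0)
assert t_scalar_mul(2.0, a) == (3.0, 6.0, 2.0)
assert t_vec_add(a, a) == a  # idempotency also holds componentwise
```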

It is easy to define a linear tropical space over R_min. To explain the duality between the cost measure and the probability measure, we need the R_min semiring.

4 Probability Space

A probability space consists of a triplet (Ω, F, P), where:

• Ω is the set of all possible outcomes.

• F is a σ-algebra of subsets of Ω.

• P is a probability measure on F.

Definition: A family F of subsets of a set Ω is called a σ-algebra (or tribe) if:

1. ∅, Ω ∈ F.

2. If A ∈ F, then A^C ∈ F.

3. If A_1, A_2, ..., A_n, ... ∈ F, then ∪_i A_i ∈ F.

4. If A_1, A_2, ..., A_n, ... ∈ F, then ∩_i A_i ∈ F.

Definition: Given a set Ω and a σ-algebra F of subsets of Ω, a probability P is a map P : F → R^+ such that:

• P(Ω) = 1.

• P(∅) = 0.

• Given {A_n}_n with A_i ∩ A_j = ∅ ∀i ≠ j, then P(∪_n A_n) = Σ_n P(A_n).

Therefore, we will call the triplet (Ω, F, P) a probability space.
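A minimal concrete example (a fair coin, not in the original text): Ω = {H, T}, F the power set of Ω, and P the uniform measure. The axioms above can be checked directly:

```python
from itertools import chain, combinations

omega = frozenset({"H", "T"})
# F = powerset(omega): all subsets, from the empty set to omega itself
F = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

def P(A):
    # uniform (fair-coin) probability measure
    return len(A) / len(omega)

assert P(omega) == 1.0
assert P(frozenset()) == 0.0
# additivity on disjoint events
A, B = frozenset({"H"}), frozenset({"T"})
assert A & B == frozenset()
assert P(A | B) == P(A) + P(B)
# F is closed under complement
for S in F:
    assert (omega - S) in F
```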

5 Cost Measure and Decision Variable

It is possible to build a cost theory, dual to probability theory, by replacing the classical structure of the real numbers (R, +, ×) with the tropical semiring R_min, i.e., working on R ∪ {+∞} with the binary operations ⊕ = min and ⊙ = +. To the probability of an event corresponds the cost of a set of decisions; to the random variable corresponds the decision variable. The decision space is defined as the triplet (U, 𝒰, K), where U is a topological space (i.e., a set together with a collection of its subsets declared open), 𝒰 is the set of open sets of U, and K is a map 𝒰 → R_min such that:

• K(U) = 0

• K(∅) = +∞

• K(∪_n A_n) = inf_n K(A_n) ∀A_n ∈ 𝒰

The map K is called the cost measure. A cost density is a function c : U → R_min. For example:

χ_m(x) = 0 for x = m,  +∞ for x ≠ m   (12)

This is the equivalent of the delta probability density. Another example:

M^p_{m,σ}(x) = (1/p) |σ^{-1}(x − m)|^p   for p > 1   (13)

For p = 2 we find the cost density associated with the Gaussian distribution with parameters (m, σ²), and M^p_{m,0} = χ_m(x). Moreover, for two independent variables X and Y:

c_{X,Y}(x, y) = c_X(x) + c_Y(y)   (14)

It is also possible to define the conditional cost excess of X knowing Y:

c_X(x|y) = c_{X,Y}(x, y) − c_Y(y)   (15)

Similarly to the mode, we can define the optimum of a decision variable:

O(X) = arg min_x c_X(x)   (16)

If the decision variable X takes values in R^n, near the optimum we can write:

c_X(x) = (1/p) |σ^{-1}(x − O(X))|^p + o(‖x − O(X)‖^p)   (17)

and in this case X is called of order p, with sensitivity of order p, which clearly plays the role of the standard deviation:

S_p(X) = σ   (18)

The value of X, corresponding to the expectation in probability theory, is:

V(X) = inf_x (x + c_X(x))   (19)

Given a sequence of N i.i.c. decision variables (i.e., independent with the same cost density function), the joint cost density is defined as:

c_X(x_1, x_2, ..., x_N) = Σ_i c_{X_i}(x_i)   (20)
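The optimum (16) and value (19) can be approximated on a grid; this Python sketch uses the cost density M^p_{m,σ} of Eq. (13) with assumed parameters m = 1, σ = 2, p = 2:

```python
# Grid approximation (assumed discretization, not from the text) of the
# optimum O(X) = argmin_x c(x) and the value V(X) = inf_x (x + c(x)).

def M(x, m=1.0, sigma=2.0, p=2.0):
    # cost density of Eq. (13): (1/p) |(x - m)/sigma|^p
    return (1.0 / p) * abs((x - m) / sigma) ** p

xs = [i / 1000.0 for i in range(-5000, 5001)]  # grid on [-5, 5]
costs = [M(x) for x in xs]

O = xs[costs.index(min(costs))]                # optimum: where the cost vanishes
V = min(x + c for x, c in zip(xs, costs))      # value: inf_x (x + c(x))

assert abs(O - 1.0) < 1e-9    # the optimum sits at m = 1
assert abs(V - (-1.0)) < 1e-6 # analytic minimum of x + (x-1)^2/8 is -1 at x = -3
```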

Moreover, for a sequence of decision variables (X_n, n ∈ N) we can define three different types of convergence, analogous to probability theory:

• X_n ∈ L^p converges in p-norm to X (X_n →^p X) if:

lim_{n→+∞} ‖X_n − X‖ = 0   (21)

• X_n converges in cost to X (X_n →^K X) if, for every η > 0:

lim_{n→+∞} K{|X_n − X| > η} = +∞   (22)

• X_n converges almost surely to X (X_n →^{a.s.} X) if:

K{ lim_{n→+∞} X_n = X } = 0   (23)

Convergence in p-norm implies convergence in cost, but the converse is false; convergence in cost implies almost sure convergence, and again the converse is not true. In other words, convergence in p-norm is the strongest condition. In probability theory, almost sure convergence plays this role. Now it is easy to state the analogue of the law of large numbers: given a sequence (X_n, n ∈ N) of i.i.c. decision variables X_n ∈ L^p, p ≥ 1, the arithmetic mean of the X_n converges in p-norm to O(X_i) (all the variables have the same optimum):

lim_{N→+∞} (1/N) Σ_{n=0}^{N−1} X_n = O(X_0)   (24)

The role of the Laplace/Fourier transform in probability calculus is played by the Fenchel transform:

ĉ(ϑ) = F[c_X(x)](ϑ) = sup_x [⟨ϑ, x⟩ − c_X(x)]   (25)

or, equivalently:

ĉ(ϑ) = F[c_X(x)](ϑ) = − inf_x [c_X(x) − ⟨ϑ, x⟩]   (26)

For example, for a linear function:

f(x) = ⟨a, x⟩ − b   (27)

the Fenchel transform is:

F[f(x)](ϑ) = b if ϑ = a, and ∞ if ϑ ≠ a   (28)

We can see this easily from the definition:

F[f(x)](ϑ) = sup_x [⟨ϑ, x⟩ − ⟨a, x⟩ + b]   (29)

Other examples:

g(x) = |x|   (30)

h(x) = exp(x)   (31)

The Fenchel transforms are:

F[g(x)](ϑ) = 0 if |ϑ| ≤ 1, and ∞ if |ϑ| > 1   (32)

F[h(x)](ϑ) = ϑ ln(ϑ) − ϑ if ϑ > 0,  0 if ϑ = 0,  ∞ if ϑ < 0   (33)
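Equation (32) can be checked numerically with a discretized Fenchel transform (the grid and the test values of ϑ are assumptions of this sketch):

```python
# Numerical check of Eq. (32): the Fenchel transform of g(x) = |x|
# is 0 for |theta| <= 1 and grows without bound otherwise.

def fenchel(f, thetas, xs):
    # F[f](theta) = sup_x (theta*x - f(x)), approximated on a finite grid
    return {th: max(th * x - f(x) for x in xs) for th in thetas}

xs = [i / 100.0 for i in range(-1000, 1001)]   # grid on [-10, 10]
F = fenchel(abs, [-0.5, 0.0, 0.5, 1.0], xs)
for th, val in F.items():
    assert abs(val) < 1e-9                     # 0 inside |theta| <= 1

# outside [-1, 1] the sup is attained at the grid boundary and grows
# with the grid size, i.e. it tends to +inf
assert fenchel(abs, [2.0], xs)[2.0] == 10.0    # 2x - |x| = x, maximal at x = 10
```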

The Fenchel transform of c_X(x) is the characteristic function of the decision variable X. F is an involution, i.e., F(F[f]) = f (it is periodic of period 2; the Fourier transform has period 4). Now we recall the convolution of two functions:

(f ⊗ g)(τ) = C(τ) = ∫_{−∞}^{+∞} f(x) g(τ − x) dx   (34)

The corresponding operation in decision theory is the inf-convolution of f and g:

(f □ g)(τ) = Ĉ(τ) = inf_x {f(x) + g(τ − x)}   (35)

Similarly to the ordinary convolution with the Laplace or Fourier transform, the inf-convolution and the Fenchel transform have the following properties:

• F(f □ g) = F(f) + F(g)

• F(f + g) = F(f) □ F(g)

So for independent decision variables X and Y we have:

c_{X+Y} = c_X □ c_Y   (36)

and in the Fenchel domain:

F[c_{X+Y}] = F[c_X] + F[c_Y]   (37)
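For quadratic costs, Eq. (36) can be verified numerically: with c_i(x) = (x − m_i)²/(2 s_i), the inf-convolution is again quadratic with parameters m_1 + m_2 and s_1 + s_2, mirroring how variances of independent Gaussians add under ordinary convolution. A Python sketch with assumed parameters:

```python
# c_{X+Y} = c_X inf-conv c_Y for quadratic (Gaussian-like) cost
# densities; the parameters below are invented for illustration.

def inf_conv(f, g, tau, xs):
    # (f [] g)(tau) = inf_x { f(x) + g(tau - x) }, Eq. (35), on a grid
    return min(f(x) + g(tau - x) for x in xs)

def quad(m, s):
    return lambda x: (x - m) ** 2 / (2.0 * s)

xs = [i / 100.0 for i in range(-2000, 2001)]   # grid on [-20, 20]
cX, cY = quad(1.0, 2.0), quad(2.0, 3.0)
cSum = quad(3.0, 5.0)                          # means add, s-parameters add

for tau in (0.0, 3.0, 5.5):
    assert abs(inf_conv(cX, cY, tau, xs) - cSum(tau)) < 1e-3
```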

These considerations lead to a demonstration of the analogue of the central limit theorem: given a sequence X_n of i.i.c. centered decision variables (O(X_i) = 0) of order p, with 1/p + 1/p′ = 1:

(1/N^{1/p′}) Σ_{n=0}^{N−1} X_n →^p M^p_{0, S_p(X_0)}   (38)

The demonstration uses the fact that, as n → ∞:

F[X_n](ϑ) → ϕ(ϑ) = (1/p′) ‖σϑ‖^{p′}   (39)

It is possible to find an analogue of Markov chains, which we will call Bellman chains. Given a sequence X_n, a transition cost matrix C = [C_{x_i, x_j}] and an initial cost φ_{x_0}, we can write the cost of a path:

c_X(x_0, x_1, ..., x_N) = φ_{x_0} + Σ_{i=0}^{N−1} C_{x_i, x_{i+1}}   (40)

And the marginal cost v^n_x = K(X_n = x) satisfies:

v^{n+1}_x = (v^n ⊗ C)_x = min_y (v^n_y + C_{y, x}),   v^0_x = φ_x   (41)
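The recursion (41) is a min-plus matrix-vector product; the following Python sketch propagates it on a 3-state Bellman chain whose costs are invented for illustration:

```python
import math

# Min-plus propagation of Eq. (41). C[y][x] is the cost of moving
# from state y to state x; the chain and its numbers are toy values.
INF = math.inf
C = [[1.0, 4.0, INF],
     [INF, 0.5, 2.0],
     [3.0, INF, 1.0]]
phi = [0.0, INF, INF]  # initial cost: the chain starts in state 0

def step(v, C):
    # v^{n+1}_x = min_y (v^n_y + C[y][x])  -- a min-plus matrix product
    n = len(v)
    return [min(v[y] + C[y][x] for y in range(n)) for x in range(n)]

v = phi
for _ in range(2):
    v = step(v, C)

# v[x] is now the cheapest cost over all 2-step paths ending in state x
assert v == [2.0, 4.5, 6.0]
```

This is the same dynamic-programming recursion used by shortest-path and Viterbi-style algorithms, which is exactly the optimization side of the duality.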

6 Conclusion

Beyond their purely theoretical interest, these observations about the duality between probabilities and costs could be used immediately in machine learning, where loss/cost functions are routinely employed. Moreover, the observations above can find application in the wide world of stochastic processes and probabilistic inference. The CRPF (Cost Reference Particle Filter) is a clear example where the duality could play an interesting role. In the inference field, finding the maximum of a likelihood function corresponds to finding the minimum of a cost function. In many cases, to find a theoretical bound or to study convergence, for example, it is easier to reason in terms of costs than with traditional probabilities. In other words, it can be easier to reach a theoretical result with the cost approach than with the probability approach.
