Maximum Probability Theorem: A Framework for Probabilistic Machine Learning

Amir Emad Marvasti, Ehsan Emad Marvasti, Ulas Bagci, Hassan Foroosh

Abstract—We present a theoretical framework of probabilistic learning derived from the Maximum Probability (MP) Theorem shown in the current paper. In this probabilistic framework, a model is defined as an event in the probability space, and a model or the associated event - either the true underlying model or the parameterized model - has a quantified probability measure. This quantification of a model's probability measure is derived by the MP Theorem, in which we show that an event's probability measure has an upper bound given its conditional distribution on an arbitrary random variable. Through this alternative framework, the notion of model parameters is encompassed in the definition of the model or the associated event. Therefore, this framework deviates from the conventional approach of assuming a prior on the model parameters. Instead, the regularizing effects of assuming a prior over parameters are imposed through maximizing the probabilities of models or, in terms of information theory, minimizing the information content of a model. The probability of a model in our framework is invariant to reparameterization and is solely dependent on the model's behavior. Also, rather than maximizing the posterior in a conventional Bayesian setting, the objective function in our alternative framework is defined as the probability of set operations (e.g. intersection) on the event of the true underlying model and the event of the model at hand. Our theoretical framework adds clarity to probabilistic learning through solidifying the definition of probabilistic models, quantifying their probabilities, and providing a visual understanding of objective functions.

Index Terms—Probabilistic Machine Learning, Regularization, Prior Knowledge, Uncertainty, Artificial Intelligence, Objective Functions.

Impact Statement—The choice of prior distribution over the parameters of probabilistic machine learning models determines the regularization of learning algorithms in the Bayesian perspective. The complexity of the choice of prior over the parameters and the form of regularization is relative to the complexity of the models being used. Thereby, finding priors for the parameters of complex models is often not tractable. We address this problem by uncovering the MP Theorem as a direct consequence of Kolmogorov's probability theory. Through the lens of the MP Theorem, the process of regularizing models is understood and automated. The regularization process is defined as the maximization of the probability of the model. The probability of the model is understood by the MP Theorem and is determined by the behavior of the model. The effects of maximizing the probability of the model can be backpropagated in a gradient-based optimization process. Consequently, the MP framework provides a form of black-box regularization and eliminates the need for case-by-case analysis of models to determine priors.

Amir Emad Marvasti is with the Department of Computer Science, University of Central Florida, Orlando, FL 32826 USA (e-mail: [email protected]). Ehsan Emad Marvasti is with the Department of Computer Science, University of Central Florida, Orlando, FL 32826 USA (e-mail: [email protected]). Ulas Bagci is with the Machine and Hybrid Intelligence Lab, Department of Radiology, Feinberg School of Medicine, and Department of BME, School of Engineering, Northwestern University, IL, USA (e-mail: [email protected]). Hassan Foroosh is with the Department of Computer Science, University of Central Florida, Orlando, FL 32826 USA (e-mail: [email protected]).

I. INTRODUCTION

A central problem in probabilistic learning and Bayesian statistics is choosing prior distributions of random variables and subsequently regularizing models. The importance of prior distributions is studied in Bayesian statistical inference since the choice affects the process of learning. However, the choice of prior distributions is not clearly dictated by the axioms of probability theory. In current applications of the Bayesian framework, the choice of prior differs from case to case, and still in many practical scenarios the choice of prior is justified by experimental results or intuitions. There have been substantial attempts to unify the choice of priors over random variables, e.g. Laplace's Principle of Indifference [15], Conjugate Priors [9, 10], the Principle of Maximum Entropy [14], Jeffreys priors [16] and Reference Priors [3]. The overall goal of the existing literature is to pinpoint a single distribution as a prior to unify and objectify the inference procedure.

A. The Proposed Maximum Probability Approach

Contrary to many classic problems of inference and statistics where a hidden parameter of interest needs to be estimated, machine learning does not necessarily follow this goal. The relevant solution to many complex problems in machine learning is the final likelihood of observable variables. The likelihood functions in many cases are not the familiar likelihood functions (e.g. Bernoulli, Gaussian, etc.) and may take complex forms. A good example of such complex likelihood functions is neural networks in the context of image classification [20]. If such networks are viewed through the Bayesian perspective, the model is the conditional distribution of labels given input images. Finding an analytical closed form for the prior over the parameters of complex models is not currently practical and, to the knowledge of the authors, an automatic and practical procedure for assuming a prior over the parameters of arbitrary models is not known to date.

A different approach is considered in this paper, where instead of assuming a prior over parameters, we focus on the likelihood functions. A preliminary approach is to assume a density over possible likelihood functions, but we can simplify the perspective further by paying attention to the probability space. Different likelihood functions on a random variable V can be formalized as P_{V|M_θ}, where M_θ is an event in the underlying probability space.

In this perspective, θ enumerates the events in the probability space and serves as a numerical representative, not necessarily a random variable. Through the MP theorem proved in this paper, by having P_V - a prior over the observables - an upper bound on the probability of M_θ can be calculated. Thereby, instead of considering the likelihood function directly, the underlying event M_θ, with a quantifiable probability measure, is considered.

Viewing models as events can be extended to the true underlying model. Consequently, the problem of learning the true underlying model reduces to maximizing the probability of similarity between the model and the true model. Since events are sets, the similarity between the model and the true underlying model is defined through set operations. As an example, the probability of the intersection of the parameterized model and the true underlying model could be maximized. The probability of the intersection event - as an objective function - is maximized through tuning the parameters of the model. In the case of the intersection objective function, we show that maximizing the probability of the model regularizes the model.

From the probability theory perspective, the MP Theorem and its consequences extend the ability of probability theory to assign probabilities to uncertain outcomes of random variables. This is because uncertainty in the outcomes of a random variable can be modeled with a conditional distribution of the random variable given some underlying event. We consider the probability of uncertain observations to be the probability of the underlying event. We show that considering the probability upper bound as the probability of the underlying event is consistent with existing definitions. In this paper, we only investigate finite-range observable random variables and leave the extension to continuous random variables for future work.
In probabilistic machine learning, the MP Theorem has the following desirable properties. (i) The complexity of choosing the prior is relative to the complexity of the observable random variables. As an example, in the MP framework, having a Bernoulli observable random variable, one needs to assume a prior over a set with 2 elements. Consequently, through the MP theorem the probabilities of different Bernoulli models can be calculated. In the conventional approach, it is required to determine a prior distribution over the real-valued parameters of a Bernoulli distribution. The conventional perspective ends up determining a prior on a disproportionately more complex set. (ii) The models in our framework are not mutually exclusive and can have non-empty intersections. It is intuitive to think that two Bernoulli models with parameters 0.9 and 0.8 should be related and do not represent disjoint events. Conversely, in the conventional perspective, two models with the parameters 0.9 and 0.8 are treated as disjoint events. (iii) Given the prior over the observables, in the MP framework the probabilities of models are determined by the characteristics of the likelihood functions. As such, per-case analysis of likelihood functions is not needed.

We start by presenting the MP theorem and its properties in Section II. The proofs for all the theorems can be found in Appendix A. The connection of the MP theorem to existing definitions and its interpretations are discussed in Section II-A. Finally, the Maximum Probability Framework and examples of objective functions are presented in Section III. The detailed derivations for Section III are presented in Appendix B and Appendix C.

B. Background

The background for this topic, also known as Objective Bayesian statistics, is broad enough to restrict the authors from including many important works. We refer the reader to the comprehensive review papers by Kass and Wasserman [18] and Consonni et al. [7]. Here, we briefly review some of the important works that most impacted the objective Bayesian topic.

Laplace's principle is one of the early works to define priors, leading to assuming a uniform prior over the possible outcomes of a random variable. The downfall of this principle is seen in the case of real-valued random variables, where it leads to an improper prior. Aside from the impropriety of the prior, the approach is not invariant to reparameterization.

The Principle of Maximum Entropy (MAXENT), introduced by Jaynes [12, 13, 15], provides a flexible framework in which testable information can be incorporated in the prior distribution of a random variable. The prior is obtained by solving a constrained optimization problem in which the prior distribution with the highest entropy is chosen subject to the constraints of (i) testable prior information (usually represented in the form of expectations) and (ii) the prior integrating to 1. The shortcoming of MAXENT is in the case of continuous random variables, where the maximization of entropy is shown to be the minimization of the Kullback-Leibler divergence between the prior distribution and a base distribution, which is obtained by the Limiting Density of Discrete Points [12, 22]. The choice of the base measure is a similar problem to that of choosing the prior and, as Kass and Wasserman [18] point out, there is a circular nature in finding the maximal entropy distribution. Furthermore, Seidenfeld [24] puts forward an example where MAXENT is not consistent with Bayesian updating. In short, choosing a prior with MAXENT and obtaining the posterior with Bayes rule given some observations is not necessarily the same as choosing the prior with MAXENT given similar observations.

Jeffreys priors are a class of priors that are invariant under one-to-one transformations. Laplace's principle for finite-range random variables could be seen as a form of invariance under the permutation group. The only distribution that is invariant under permutation of the states of a finite-range random variable is the uniform prior. The general form of Jeffreys rule for the prior p(θ) over a one-dimensional real parameter θ and a given likelihood function is p(θ) ∝ 1/√(det(I(θ))), where I is the Fisher Information. The Jeffreys prior is constructed as a function of the likelihood function and does not provide any guideline for choosing the likelihood function over the observables. This property is also present in the context of Reference Priors, introduced by Bernardo [5] and further developed in [1, 2, 3, 4]. Reference priors in the case of a one-dimensional parameter and some regularity conditions coincide with Jeffreys' general rule [18].

Reference priors construct the priors by maximizing the mutual information of the observable variable and the hidden variable (parameter). The solution density is not necessarily a proper prior, but the usage is justified by showing that the posterior is the limit case of posteriors obtained from proper priors [3].
The solution density, similar to the Jeffreys prior, is invariant under one-to-one transformations, which follows from the invariance of the mutual information. The mutual information between the observable and the parameter is a function of the likelihood function. Thereby, reference priors, similar to Jeffreys priors, are only dependent on the likelihood function and do not determine a clear guide to construct likelihood functions.

C. Notation

We use P and P_V for the probability measure and the probability distribution of some random variable V, respectively. L and L_V correspond to the logarithm of the probability measure and the log-probability distribution of some random variable V, respectively. We use R(V) to represent the range of some random variable V, and |R(V)| is the cardinality of the range of V. For simplicity of notation and depending on the context, we use P(v) as a shorthand notation for P(V⁻¹(v)). Lower case letters represent the outcomes of the random variable with the corresponding uppercase letter.

II. MAXIMUM PROBABILITY THEOREM

The following theorem is the foundation of our work, bounding probabilities of events using their conditional distributions.

Theorem 1: Consider the probability space (Ω, Σ, P), the random variable V with finite range R(V) and P_V(.) the probability distribution of V. For any event σ ∈ Σ with the conditional distribution P_{V|σ}(.) the following holds,

P(σ) ≤ inf_{v∈R(V)} { P_V(v) / P_{V|σ}(v) } ≜ P(σ ≺ V)    (1)

P(σ ≺ V) is read as the maximum probability of σ observed by V, and −L(σ ≺ V) = −log(P(σ ≺ V)) is read as the minimum information in σ observed by V.

The proof for Theorem 1 is short and simple, yet it is fundamental in understanding probabilistic models. In this view, every probability distribution over the random variable V corresponds to an event whose probability is bounded using Theorem 1. The bound for probability in Theorem 1 can be decreased by extending random variables (Definition 2).

Definition 1: The preimage of an outcome v of the random variable V, denoted by V⁻¹(v), is defined as

V⁻¹(v) ≜ {ω | ω ∈ Ω, V(ω) = v}, ∀v ∈ R(V)    (2)

Note that since V is a measurable function, V⁻¹(v) ∈ Σ.

Definition 2: A random variable W extends random variable V iff

W⁻¹(w) ⊂ V⁻¹(v) or W⁻¹(w) ∩ V⁻¹(v) = ∅, ∀w ∈ R(W), ∀v ∈ R(V).    (3)

A random variable can be extended by increasing the cardinality of its range. The following theorem shows that the probability upper bound is decreased by extending random variables.

Theorem 2: For any random variable W that extends V, the following inequality holds

P(σ ≺ W) ≤ P(σ ≺ V)    (4)

The simplest example of extending a random variable is by including additional random variables to describe the underlying event.

Corollary 1: For any random variables V and Z with concatenation H = (V, Z) the following holds

P(σ ≺ H) ≤ P(σ ≺ V)    (5)

Theorem 2 shows that the Maximum Probability bound is relative to the complexity of the random variable. In simple terms, the random variables are tools to observe the underlying event. By extending a random variable, its number of states is increased and the underlying event is more specified. Since the event is more specified, its probability decreases. The upper bound in (1) follows a similar logic. Therefore, the upper bound on the probabilities of events is relative to the characteristics of the random variables. We delve into the meaning of the upper bound in the maximum probability theorem in the next section.

A. Interpretation of The Probability Upper Bound

The upper bound in the MP theorem has a concrete connection to existing definitions related to random variables. The upper-bound nature of the MP theorem is embedded in existing definitions of the probability of outcomes of random variables. To demonstrate the former statement, we review the following existing definition.

Definition 3: The probability of an outcome v of random variable V, denoted by P_V(v), is defined as

P_V(v) = P(V⁻¹(v)), v ∈ R(V)    (6)

Definition 3 defines the probability of an outcome v ∈ R(V) to be the probability of the largest event, in the sense of number of elements, V⁻¹(v), that is mapped to v. It is possible to show an equivalent definition in the sense of probability.

Proposition 1: The probability of an outcome v of random variable V in the sense of maximum probability, defined as

P*_V(v) ≜ sup { P(σ) | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v }, v ∈ R(V)    (7)

has the following property

P*_V(v) = P_V(v), ∀v ∈ R(V).    (8)

The preimage of an outcome v coincides with the most probable event being mapped to v because of the monotonicity of the probability measure. While many events can be mapped to the outcome v, both definitions consider the largest underlying event, either in the sense of cardinality or probability. We can interpret the upper bound of probability in the spirit of P*_V(v): given some information about the underlying event, we assume that the underlying event is the one with the largest probability measure.
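As a concrete illustration of Theorem 1 and Corollary 1, the following minimal sketch computes the bound numerically. The helper name `mp_bound` and the particular distributions are our own, chosen only for illustration:

```python
import numpy as np

def mp_bound(p_v, p_v_given_sigma, eps=1e-12):
    """P(sigma < V) = inf_v P_V(v) / P_{V|sigma}(v), Eq. (1).
    Outcomes with zero conditional probability impose no constraint."""
    p, q = np.asarray(p_v, float), np.asarray(p_v_given_sigma, float)
    m = q > eps
    return np.min(p[m] / q[m])

# Theorem 1: a 4-state V with a uniform prior and a skewed conditional.
p_v = np.full(4, 0.25)
p_v_sigma = np.array([0.7, 0.1, 0.1, 0.1])
print(mp_bound(p_v, p_v_sigma))              # 0.25 / 0.7 ~= 0.357

# Corollary 1: concatenating an extra binary variable Z extends V,
# and the bound over H = (V, Z) can only decrease.
p_h = np.kron(p_v, [0.5, 0.5])               # P_H, Z fair and independent a priori
p_h_sigma = np.kron(p_v_sigma, [0.6, 0.4])   # P_{H|sigma}, Z informative given sigma
print(mp_bound(p_h, p_h_sigma))              # ~0.298 <= 0.357
```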

Fig. 1. The probability space in the MP framework: (a) the generic setup, (b) the likelihood solution, (c) the intersection solution. The model M and the oracle M* are events in the probability space (Ω, Σ, P). The dashed lines represent the random variable V with |R(V)| = 4, partitioning the probability space. The probability of the model M_θ depends on P_{V|M} and P_V and is bounded by P(M ≺ V). The parameters of the model M are tuned to mimic the underlying model M*. 1(a) depicts the general setup, with the red region representing M ∩ M*. 1(b) shows a candidate model M achieving the global maximum of P(M*|M). The grey region in 1(b) represents M, which is normalized out in the likelihood objective function. 1(c) shows a candidate model M achieving the global maximum of the intersection probability.

For example, having only the conditional probability distribution P_{V|σ}, we consider the upper bound as the probability of σ. We can also use Information Theory to interpret the probability upper bound. In Information Theory [8, 11, 21], the information content of an event σ ∈ Σ is quantified as −L(σ). The information content of σ is the minimum number of bits required to distinguish between σ and its complement σ̄. As the probability of σ decreases, its information content increases and therefore requires further description. Considering a lower probability than the maximum bound translates to more information content in an event, which is not based on the given information. Considering a lower probability may thus be interpreted as appending assumptions to the description of the observation.

Corollary 2: Considering Definition 3 and Theorem 1, given some set σ_v ∈ Σ where P_{V|σ_v}(v) = 1,

P_V(v) = P(σ_v ≺ V)    (9)

Corollary 2 shows that P(· ≺ V) is equivalent to P_V in the case of exact observations of V. We can define uncertain observations as outcomes that are not completely determined. Uncertain observations may be represented by a probability distribution that is conditioned on some underlying event. Exact observations may be considered as degenerate conditional distributions, special cases of uncertain observations. Definition 3 fails to address the probability of uncertain observations of V, but P(· ≺ V) extends to uncertain observations. We use an example to represent a scenario with uncertain observations and how the MP theorem extends to such scenarios. Imagine a fair coin is being flipped in a room. Alice asks Bob to investigate the room and tell her the outcome of the coin flip. Consider two scenarios: (i) Bob tells Alice that the coin is surely Heads. Alice, using existing definitions, concludes that the probability of the event is 0.5. The conclusion is similar if Alice uses P(· ≺ V) to calculate the probability. Alice can model the observation as a degenerate distribution conditioned on the event σ_1, i.e. P(H|σ_1) = 1. The maximum probability of σ_1 is inf{0.5/1, 0.5/0} = 0.5, since outcomes with zero conditional probability impose no constraint. (ii) Bob observes some underlying event and, using Bayes rule, comes to the conclusion that the coin is Heads with probability 0.9. Bob informs Alice about his conclusion. Alice cannot use existing definitions to calculate the probability of the underlying event. However, using the Maximum Probability bound, she concludes that the probability of the evidence that Bob observed is at most inf{0.5/0.9, 0.5/0.1} = 5/9.
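The two coin scenarios can be checked numerically; this sketch repeats the `mp_bound` helper from the earlier snippet so it runs on its own:

```python
import numpy as np

def mp_bound(p_v, p_v_given_sigma, eps=1e-12):
    p, q = np.asarray(p_v, float), np.asarray(p_v_given_sigma, float)
    m = q > eps                      # zero-probability outcomes drop out
    return np.min(p[m] / q[m])

prior = [0.5, 0.5]                   # fair coin: P(H), P(T)
print(mp_bound(prior, [1.0, 0.0]))   # scenario (i): 0.5
print(mp_bound(prior, [0.9, 0.1]))   # scenario (ii): 5/9 ~= 0.5556
```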

III. MAXIMUM PROBABILITY FRAMEWORK

In the current trend of Bayesian statistics and machine learning, a parameterized family of distributions is assumed. Subsequently, the underlying model is estimated by finding the parameter setting maximizing the posterior distribution, or by creating an ensemble of models with parameters drawn from the posterior. Other alternatives include Variational Inference [6, 17, 23], where an approximation for the posterior is chosen from a parametric family. The posterior approximation is chosen by finding the distribution in the family with minimum Kullback-Leibler divergence to the true posterior. All of the mentioned approaches treat the parameters as a random variable and require a prior distribution over the parameter space.

We introduce the Maximum Probability framework (MP framework) for probabilistic machine learning as a corollary of the MP theorem. In the MP framework, we define a model as an event, M_θ ∈ Σ, where θ ∈ R^d is the parameter, and the conditional distribution of the random variable V given the model is P_{V|M_θ}. Also, the true underlying model or oracle is represented as M* ∈ Σ, with the conditional distribution P_{V|M*}. M_θ and M* are not explicitly defined. Instead, the conditional distribution of the model is obtained through some explicit and deterministic parameterization function π, i.e. P_{V|M_θ}(v) = π(v, θ). Also, the oracle can be understood as an observation from a generative process. For example, we can define the conditional distribution of the oracle given some observation v_o ∈ R(V) as P_{V|M*}(v_o) = 1. Note that the random variable V could be modeled as the concatenation of multiple i.i.d. random variables (V^(i))_{i=1}^{N}. In the i.i.d. modeling case, V would represent a random variable corresponding to multiple observations that, conditioned on the oracle, are independent. The choice of modeling for the random variables is arbitrary, and we focus on V as a generic random variable.

Given a prior distribution P_V and through the maximum probability theorem, we can calculate the probability of models under different parameter settings. As opposed to the conventional approach, where a probability density is assumed on the parameters - an uncountably infinite set - the prior is chosen over the finite range of V. Furthermore, in the MP framework each model has a probability mass, and models potentially have intersections in the underlying probability space. A visual representation of the underlying probability space containing the model and the oracle is depicted in Fig 1. The goal is to increase the similarity of the underlying events corresponding to the model and the oracle by tuning the parameters. Since the model and the oracle are sets, we can construct objective functions using set operations between the model and the oracle. We explore maximizing two objective functions to demonstrate the properties of each, i.e. log-likelihood and intersection, corresponding to L(M*|M) and L(M* ∩ M). The probability of the outcome of such set operations reflects the similarity between the model and the oracle.
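To make the definitions concrete, the following sketch treats a two-outcome model family exactly as described: an explicit parameterization π(v, θ) together with a prior P_V over the observables yields the maximum log-probability of each model through the MP theorem. The function and parameter names are ours, chosen for illustration:

```python
import numpy as np

def pi(theta):
    # pi(v, theta): a sigmoid-parameterized likelihood over R(V) = {0, 1}
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - p1, p1])

def model_log_prob(theta, p_v):
    # L(M_theta < V) = min_v [log P_V(v) - log P_{V|M_theta}(v)]
    return np.min(np.log(p_v) - np.log(pi(theta)))

p_v = np.array([0.5, 0.5])            # prior over the finite range of V
for theta in [0.0, 1.0, 4.0]:
    print(theta, model_log_prob(theta, p_v))
# theta = 0 reproduces the prior and attains log-probability 0 (M_theta = Omega
# is admissible); sharper models have strictly smaller maximum probability.
```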
A. Log-Likelihood Objective Function

In this section, we investigate maximization of the log-likelihood of the oracle given the model, or L(M*|M_θ). We start by representing L(M*|M_θ) as the marginalization over the random variable V

L(M*|M) = log( Σ_{v∈R(V)} P(M*|v, M) P(v|M) ) − log( Σ_{v'∈R(V)} P(v'|M) ).    (10)

Taking the partial derivative of the log-likelihood with respect to L(v|M_θ),

∂L(M*|M_θ)/∂L(v|M_θ) = P(M*|v, M_θ) P(v|M_θ) / P(M*|M_θ) − P(v|M_θ) = P(v|M*, M_θ) − P(v|M_θ),    (11)

and setting the partial derivative to zero, we obtain the condition

P(v|M*, M_θ) = P(v|M_θ).    (12)

Considering the chain rule for gradients, any parameter setting θ̂ that satisfies the condition in (12) is necessarily a critical point of the objective function. The gradient of the log-likelihood objective function in (11) is the subtraction of two probability vectors. The stochastic nature of the gradient is suitable for stochastic gradient optimization of the parameters, especially for variables with a large number of states.

Note that in (10), P(M*|v, M) is determined by neither the conditional distribution of the model nor that of the oracle. We assume that the oracle and the model given any outcome of the variable V are conditionally independent, i.e. P(M*|v, M_θ) = P(M*|v). The model and the oracle are defined implicitly through their conditional distributions on V. Thereby, it is natural in the problem of learning to assume that, given the outcome of the random variable, M does not convey information about M*; hence the conditional independence. Nevertheless, other alternatives can be assumed and derived separately. The models maximizing the log-likelihood objective function have a special characteristic: their probability distribution is concentrated on the most probable outcome(s) of P_{V|M*}. The formal description and proof of this claim are brought in Appendix B. The solution of the optimization being a degenerate distribution is the analogue of overfitting in our framework. We show in the next section that the intersection objective function includes the log-probability of the model in the objective function. Considering the model's log-probability, as will be shown, induces regularizing effects on the final solution by preventing the final distribution from being degenerate. Thereby, the intersection objective function prevents the overfitting problem that occurs with the log-likelihood objective function.

B. Intersection Objective Function

Similar to our analysis of the log-likelihood objective function, we focus on the properties of the solutions of the intersection objective function. The log-probability of the intersection of the model and the oracle can be written as

L(M*, M_θ) = L(M*|M_θ) + L(M_θ) ≤ L(M*|M_θ) + L(M_θ ≺ V).    (13)

The maximum probability of the model, L(M_θ ≺ V) in (13), can be substituted with the logarithm of the so-called softmax probability family, L_α(M_θ ≺ V). The following proposition defines the family and shows its important property.

Proposition 2: For the softmax probability family of functions P_α defined as

P_α(σ ≺ V) ≜ ( Σ_{v∈R(V)} ( P_V(v)/P_{V|σ}(v) )^{−α} )^{−1/α}, α > 0,    (14)

the following is true,

P_α(σ ≺ V) ≤ P(σ ≺ V), ∀α > 0, ∀σ ∈ Σ,    (15)

and the equality holds as α → +∞. The softmin information family is defined as

−L_α(σ ≺ V) ≜ −log( P_α(σ ≺ V) ).    (16)

We define a flexible family of objective functions L_α(M*, M) by substituting the softmin information family L_α for L in (13)

L_α(M*, M_θ) ≜ L(M*|M_θ) + L_α(M_θ ≺ V).    (17)

Without loss of generality we assume a uniform prior over V. Since P_V(v) = 1/|R(V)| is constant for all v, we can represent L_α(M*, M_θ) as

L_α(M*, M_θ) = log( Σ_{v∈R(V)} P(M*, v|M_θ) ) − (1/α) log( Σ_{v'∈R(V)} P(v'|M_θ)^α ) + (1/α) log(|R(V)|).    (18)

Taking the partial derivative with respect to the log-probabilities we get

∂L_α(M*, M_θ)/∂L(v|M_θ) = P(v|M_θ, M*) − P(v|M_θ)^α / Σ_{v'∈R(V)} P(v'|M_θ)^α.    (19)

We call the second term on the right hand side of (19) the α-skeleton of a distribution for future reference.
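A small numeric sketch of Proposition 2 (our own illustration, not code from the paper): the softmin information L_α lower-bounds the exact minimum information and tightens as α grows.

```python
import numpy as np

def mp_log_bound(p_v, q):
    # exact L(sigma < V) = min_v [log P_V(v) - log P_{V|sigma}(v)]
    return np.min(np.log(p_v) - np.log(q))

def softmin_info(p_v, q, alpha):
    # L_alpha(sigma < V) from Eqs. (14) and (16)
    ratios = np.asarray(p_v, float) / np.asarray(q, float)
    return -np.log(np.sum(ratios ** (-alpha))) / alpha

p_v = np.full(4, 0.25)
q = np.array([0.7, 0.1, 0.1, 0.1])      # P_{V|sigma}
for alpha in [1.0, 2.0, 8.0, 64.0]:
    print(alpha, softmin_info(p_v, q, alpha))
print('exact', mp_log_bound(p_v, q))    # the family increases toward this value
```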

Fig. 2. Comparison of the objective functions in the Bernoulli example under the conditional independence assumption. θ* represents the oracle's log-probability ratio and θ represents the model's log-probability ratio. At α = 1 the objective functions become similar. The intersection has a unique solution for α > 1. As α → ∞, the solution approaches the trivial solution and tends to lose its uniqueness.

Definition 4 (α-skeleton): Given the conditional distribution P(v|σ), σ ∈ Σ, the α-skeleton distribution of V given σ is defined as

P_α(v|σ) ≜ P(v|σ)^α / Σ_{v'∈R(V)} P(v'|σ)^α, α ∈ R.    (20)

α-skeletons concentrate the probabilities of distributions depending on the sign and magnitude of α. In the limit case as α → +∞, the α-skeleton of a distribution concentrates the probability measure on the most probable outcome or outcomes. Setting the partial derivatives in (19) to zero, we conclude the following sufficient condition for the critical points

P(v|M_θ, M*) = P_α(v|M_θ).    (21)

Equation (21) is the sufficient condition and is satisfied when the α-skeleton of the model distribution matches the posterior distribution. The α-skeleton term in (21) acts as a regularizer suppressing the peaks in the model's distribution. Similar to the log-likelihood case, both terms in (19) are probability vectors and can be Monte Carlo approximated for stochastic optimization. Note that in the case of uniform priors and α = 1, the log-likelihood and the intersection objective functions are equal up to an additive constant. Setting α = 2 has the significant property that the solution coincides with the oracle's likelihood function. We can check the former claim by substituting P(v|M_θ) with P(v|M*) in (21) (see Appendix C). As α increases, the final solution is flatter than the solution of the likelihood objective function. In the case of α → +∞, the solution tends to the prior distribution P_V. This is consistent with our visual representation in Fig 1(c): setting M_θ = Ω is the trivial solution for maximizing the intersection in the underlying space. In general, α as a hyperparameter controls the regularization effect. Usage examples of the MP Framework are presented in the next section.
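Before moving on, a quick numeric sketch of Definition 4 and the limiting behavior described above (our own illustration):

```python
import numpy as np

def alpha_skeleton(p, alpha):
    # Definition 4: P_alpha(v|sigma) = P(v|sigma)^alpha / sum_v' P(v'|sigma)^alpha
    p = np.asarray(p, float) ** alpha
    return p / p.sum()

p = np.array([0.5, 0.3, 0.15, 0.05])
print(alpha_skeleton(p, 1.0))    # unchanged
print(alpha_skeleton(p, 2.0))    # peaks amplified
print(alpha_skeleton(p, 0.5))    # flattened toward uniform
print(alpha_skeleton(p, 100.0))  # ~one-hot on the most probable outcome
```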

Fig. 3. Comparison of the intersection and log-likelihood objective functions, assuming that M* ⊂ M, with respect to the parameters of the model for different values of α. Note that θ* = L(1|M*) − L(0|M*) and θ = L(1|M_θ) − L(0|M_θ).

IV. EXAMPLES

In this section we present the application of the MP Framework on two examples: (i) a Bernoulli random variable and (ii) Convolutional Neural Networks (CNNs). In the simple case of the Bernoulli random variable, we plot the objective functions to visualize their properties. Additionally, we examine an alternative to the assumption that the model and the oracle are conditionally independent. The alternative assumption is that the oracle is a subset of the model. As a practical example, we apply the MP framework to CNNs. We show that the log-likelihood objective function and the conventional cross entropy loss are similar. We further extend our derivation to the intersection loss function and obtain a regularization term in the loss. The training of a CNN with the intersection objective function for different values of α is presented and compared with the cross entropy loss with ℓ2 regularization.

A. Bernoulli Random Variable

Consider the Bernoulli random variable B with range R(B) = {0, 1}. Assume that the oracle M* has the log-probability ratio θ* = L(1|M*) − L(0|M*) and the model M_θ is parameterized by θ = L(B = 1|M_θ) − L(B = 0|M_θ). Note that θ and θ* fully characterize the probability distributions using the sigmoid parameterization function, P(1|M_θ) = 1/(1 + e^{−θ}). We assume that B has a uniform distribution over the sample space, i.e. P(1) = 0.5. We can plot the behaviour of these objective functions with respect to the parameters θ and θ*. Note that the only value that is not explicitly known is P(M*, M_θ|v), which, by assuming conditional independence, can be calculated in the following manner:

P(M*, M_θ|v) = P(M*|v) P(M_θ|v)    (22)

P(M*|v) = P(v|M*) P(M*) / P(v)    (23)

The visualization is presented in Fig 2. The maximum of the likelihood function is always achieved in the limit case, while the solution for the intersection function is finite and unique. The uniqueness of the solution for the intersection can be understood by following the plots as θ* changes.

Dependence Assumption: To demonstrate another example, instead of the conditional independence assumption we can assume that M* ⊂ M_θ and further analyze the behaviour of the objective functions. Since M* ⊂ M_θ, we can derive the log-likelihood and intersection objective functions using the MP theorem and the softmax probability family.

L_α(M*|M_θ) ≜ −(1/α) log( Σ_{v∈R(V)} e^{−α(L(v|M_θ) − L(v|M*))} )    (24)

L_α(M*, M_θ) = L_α(M*|M_θ) + L_α(M_θ ≺ V)    (25)

The behaviour of the objective functions for the Bernoulli example is presented in Fig 3. It is noteworthy that, in contrast to the independence assumption, the objective functions are well behaved for optimization purposes. In particular, in Fig 3 the intersection and likelihood objective functions are concave and the gradients do not vanish in the limit cases. In the case of intersection, the objective function tends to become flat in the open region between the value of the true underlying parameter and the parameter value of the prior distribution as α → ∞.
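The following sketch evaluates the intersection objective (17) for the Bernoulli example under the conditional independence assumption and a uniform prior, with additive constants dropped; it is our own illustration, and it reproduces the α = 2 property stated in Section III-B, where the finite maximizer sits at the oracle's parameter:

```python
import numpy as np

def bernoulli(theta):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - p1, p1])

def intersection_objective(theta, theta_star, alpha):
    # L_alpha(M*, M_theta) = L(M*|M_theta) + L_alpha(M_theta < V), Eq. (17),
    # under conditional independence and a uniform prior; constants dropped.
    q, q_star = bernoulli(theta), bernoulli(theta_star)
    loglik = np.log(np.sum(q_star * q))
    p_v = np.array([0.5, 0.5])
    l_alpha = -np.log(np.sum((p_v / q) ** (-alpha))) / alpha
    return loglik + l_alpha

thetas = np.linspace(-6.0, 6.0, 241)
vals = [intersection_objective(t, theta_star=2.0, alpha=2.0) for t in thetas]
print(thetas[int(np.argmax(vals))])   # finite maximizer near theta* = 2 for alpha = 2
```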

B. Application to CNNs

Here we demonstrate the usage of the MP framework in the context of CNNs and the task of image classification. In image classification, the set of labeled data {z^(i) ≜ (x^(i), y^(i))}_{i=1}^{N} is given, where x^(i) ∈ R^n is the image and y^(i) ∈ {1, ..., |R(Y)|} is the corresponding label. To formulate the problem in a probabilistic sense, consider the probability space (Ω, Σ, P). We consider the random variable Y^(i) with range R(Y) corresponding to the i-th label. Similarly, X^(i) is the random variable corresponding to the i-th image, and Z^(i) is the random variable obtained by concatenating the image and the label. We denote the range of all the image random variables by R(X) and that of the labels by R(Y). The CNN model with parameters θ ∈ R^d is the function f : R(X) × R^d → Δ^{|R(Y)|}, where Δ^{|R(Y)|} is the |R(Y)|-dimensional probability simplex. In the probabilistic sense, the CNN is modeled as M_θ ∈ Σ, where P(y|M_θ, x^(i)) is the y-th component of f(x^(i), θ). Since the CNN as a classifier does not model the input distribution, we assume for all x ∈ R(X) that P(x|M_θ) = P(x). Furthermore, we assume that P(x) is equal to the empirical distribution for mathematical convenience. We define the oracle M* to characterize the training data, namely

P(z^(1), ..., z^(N)|M*) = 1.    (26)

1) Log-Likelihood Objective Function: The log-likelihood objective function, assuming that the oracle and the CNN model are independent conditioned on the observables, can be written as

L(M*|M_θ) = log( Σ_{z_1,...,z_N∈R(Z)} P(M*|z_1, ..., z_N) P(z_1, ..., z_N|M_θ) ).    (27)

The CNN model determines the label of any given image independently of the rest of the images. Therefore, we can incorporate the former property as conditional independence of the observables

P(z_1, ..., z_N|M_θ) = Π_{i=1}^{N} P(z_i|M_θ).    (28)

Considering (28) and using Bayes' rule, we can rewrite (27) as

L(M*|M_θ) = log( Σ_{z_1,...,z_N∈R(Z)} [ P(z_1, ..., z_N|M*) P(M*) / P(z_1, ..., z_N) ] Π_{i=1}^{N} P(z_i|M_θ) ).    (29)

Since P(M*|z_1, ..., z_N) is non-zero only when z_1 = z^(1), ..., z_N = z^(N), we can simplify the summation term and obtain

L(M*|M_θ) = log( [ P(z^(1), ..., z^(N)|M*) P(M*) / P(z^(1), ..., z^(N)) ] Π_{i=1}^{N} P(z^(i)|M_θ) )    (30)

= Σ_{i=1}^{N} log P(z^(i)|M_θ) + L(M*) − L(z^(1), ..., z^(N))    (31)

= Σ_{i=1}^{N} log P(y^(i), x^(i)|M_θ) + L(M*) − L(z^(1), ..., z^(N))    (32)

= Σ_{i=1}^{N} log P(y^(i)|M_θ, x^(i)) + Σ_{i=1}^{N} log P(x^(i)|M_θ) + L(M*) − L(z^(1), ..., z^(N)).    (33)

The CNN only determines the first term in (33). Therefore, the log-likelihood objective function effectively reduces to the so-called cross entropy loss in CNNs (without regularization), i.e. the log-probability of the correct label given the model and the image.
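A short sketch of the reduction in (33) (ours, for illustration): with only the first term free, maximizing the log-likelihood is the same as minimizing the usual cross entropy loss over the batch.

```python
import numpy as np

def log_softmax(h):
    m = np.max(h)
    return h - (np.log(np.sum(np.exp(h - m))) + m)

# Two images, three classes: only sum_i log P(y_i | M_theta, x_i) depends
# on the CNN, so the objective reduces to the (negated) cross entropy.
logits = np.array([[2.0, -0.5, 0.3],
                   [0.1, 1.7, -1.2]])
labels = np.array([0, 1])
loglik = sum(log_softmax(h)[y] for h, y in zip(logits, labels))
print(-loglik)   # the familiar cross entropy loss for the batch
```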

2) Intersection Objective Function: We can use the results obtained from the log-likelihood objective function to calculate the intersection objective function. The intersection objective function is defined as

L_α(M*, M_θ) = L(M*|M_θ) + L_α(M_θ ≺ V)    (34)

The log-probability of the model can be calculated as

L_α(M_θ ≺ V) = −(1/α) log( Σ_{z_1,...,z_N∈R(Z)} ( P(z_1, ..., z_N) / P(z_1, ..., z_N|M_θ) )^{−α} ).    (35)

Note that z^(i) and z_i should not be confused. We represent the observations by z^(i), while z_i refers to the values iterated in the summation. Assuming the following properties about the prior

P(z_1, ..., z_N) = Π_{i=1}^{N} P(z_i),    (36)

P(z_i) = P(x_i, y_i) = P(x_i) P(y_i),    (37)

P(y_i) = 1/|R(Y)|,    (38)

and using the property in (28), we can write (35) as

L_α(M_θ ≺ V) = −(1/α) log( Σ_{z_1,...,z_N∈R(Z)} Π_{i=1}^{N} ( P(z_i)/P(z_i|M_θ) )^{−α} )    (39)

= −(1/α) log( Π_{i=1}^{N} Σ_{z_i∈R(Z)} ( P(z_i)/P(z_i|M_θ) )^{−α} )    (40)

= −(1/α) Σ_{i=1}^{N} log( Σ_{z_i∈R(Z)} ( P(z_i)/P(z_i|M_θ) )^{−α} )    (41)

= −(1/α) Σ_{i=1}^{N} log( Σ_{x_i∈R(X), y_i∈R(Y)} ( P(x_i) P(y_i) / [ P(y_i|x_i, M_θ) P(x_i|M_θ) ] )^{−α} ).    (42)

Since P(x|M_θ) = P(x) and is equal to the empirical distribution, we can further simplify the log-probability of the model into

L_α(M_θ ≺ V) = −(1/α) Σ_{i=1}^{N} log( Σ_{y_i∈R(Y)} P(y_i|x^(i), M_θ)^α ) + N log(|R(Y)|).    (43)

We conclude that the intersection objective function for the described CNN is

L_α(M_θ, M*) = Σ_{i=1}^{N} log P(y^(i)|M_θ, x^(i)) − (1/α) Σ_{i=1}^{N} log( Σ_{y_i∈R(Y)} P(y_i|x^(i), M_θ)^α ) + constant.    (44)

Note that by setting α = 1, the intersection objective function becomes equivalent to the log-likelihood objective function, considering that the normalization term appearing when we set α = 1 is implicit in the softmax layer. We can incorporate the intersection objective function by altering the softmax layer and continuing to use the cross entropy (likelihood) loss. The softmax layer exponentiates and normalizes its input. We can generalize its functionality by defining the HyperNormalization layer (HN), HN : R^m → Δ^m, as

HN(h; α)_i = e^{h_i} / ( Σ_{j=1}^{m} e^{α h_j} )^{1/α},    (45)

where HN(h; α)_i is the i-th component of the output. Note that the functionality of log(HN(h; α)) and the first two terms of (44) are similar. Also, by setting α = 1, the HyperNormalization layer becomes equivalent to the softmax layer.
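A minimal sketch of the HyperNormalization layer of (45), computed in log space for numerical stability; this is our own illustration, not the authors' released implementation:

```python
import numpy as np

def log_hn(h, alpha):
    # log HN(h; alpha)_i = h_i - (1/alpha) * log sum_j exp(alpha * h_j), Eq. (45)
    h = np.asarray(h, float)
    m = np.max(h)
    return h - (np.log(np.sum(np.exp(alpha * (h - m)))) / alpha + m)

logits = np.array([2.0, 0.5, -1.0])
print(np.exp(log_hn(logits, 1.0)))   # alpha = 1 recovers the ordinary softmax

# Per-sample intersection objective of Eq. (44), up to constants: evaluate
# the log-HN output at the true label in place of the log-softmax output.
y_true = 0
print(log_hn(logits, 8.0)[y_true])
```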

Fig. 4. The results of training a vanilla CNN using the intersection objective function. The loss represents the cross entropy loss (log-likelihood) commonly used in CNNs.

Fig. 5. The results of training a vanilla CNN with the conventional cross entropy loss and varying coefficients for ℓ2 regularization of parameters.

We experimented with the effects of using the intersection objective function with varying coefficients on CNNs. The experiments were done on the CIFAR10 image dataset [19], labeled into 10 classes, with 50000 training samples and 10000 test samples. The cross entropy loss and accuracy of the CNN on the test set and the training set were measured. Using the intersection objective function, the regularization effect is back-propagated through the network from the HyperNormalization layer. Using α = 1, the objective function is similar to the cross entropy loss without regularization. The effects of training a vanilla CNN (without BatchNorm or Dropout) using the intersection objective function with varying values of α are presented in Fig 4. To compare the results with the conventional cross entropy loss and ℓ2 regularization (also known as weight decay), refer to Fig 5. The ℓ2 regularization loss can be obtained by following the conventional Bayesian framework, namely considering the parameters as a random variable. The ℓ2 regularization loss appears when the normal distribution is considered as the prior distribution of the parameter random variables.

TABLE I
Quantitative comparison of ℓ2 weight regularization versus the MP framework. The regularization column shows the coefficient corresponding to each regularization type. The results of the best-case epoch and the final epoch for each measurement are presented in the table.

Regularization  | Test Loss       | Train Loss     | Test Accuracy (%) | Train Accuracy (%)
                | Best     Last   | Best     Last  | Best     Last     | Best     Last
α = 1           | 0.645    1.725  | 1e-6     1e-6  | 82.16    82.12    | 100      100
α = 8           | 0.695    0.696  | 0.21     0.21  | 78.66    78.55    | 100      100
α = 16          | 1.104    1.104  | 0.771    0.771 | 76.27    76.27    | 100      100
ℓ2 = 0.25e-4    | 0.657    1.3259 | 7e-6     9e-6  | 82.58    82.43    | 100      100
ℓ2 = 0.5e-4     | 0.693    1.211  | 2e-5     3e-5  | 82.29    82.43    | 100      100
ℓ2 = 1e-4       | 0.673    1.173  | 3e-5     1e-4  | 82.58    82.18    | 100      100
ℓ2 = 2e-4       | 0.673    1.651  | 8e-5     3e-4  | 82.58    79.63    | 100      100

Also, the coefficient of the regularization loss is determined by the variance of the normal distribution assumed on the parameters. The goal of the conventional Bayesian framework is to obtain the maximum a posteriori (MAP) estimate of the parameters.

By comparing the test loss among different values of α, we can see that overfitting is prevented by increasing the value of α, while affecting the test accuracy minimally. Also, CNNs trained with α > 1 generalize better than the ℓ2-regularized network when considering the test loss value. In Fig 5 the test loss of ℓ2-regularized networks increases as the training continues. We can make a similar observation in Fig 4 for the unregularized network with α = 1. In Fig 5 we can see that increasing the coefficient of the ℓ2 regularization does not prevent the test loss increase. The network with the largest ℓ2 coefficient achieves the largest error in the final epoch. In contrast with the ℓ2-regularized networks, the test loss of the hypernormalized networks converges to a stable minimum. On the other hand, ℓ2 regularization is more successful in the generalization of accuracy but performs worse in the generalization of loss. The quantitative results of the experiments are shown in Table I. Table I demonstrates the accuracy and loss of the tested methods during the best-case epoch and the final epoch. Table I shows that by increasing the value of α, the gap between the train loss and test loss decreases. Also, the gap between the best and last epoch test loss decreases as α increases, which shows the stability of the generalization process. In the case of ℓ2 regularization, increasing the coefficient does not affect the gap between training and test loss and mostly does not impact the test accuracy.

Note that the accuracy is a surrogate objective function and, although useful, does not fully characterize the performance. We hypothesize that the reason the test accuracy is not improved in our example (with the intersection objective function) is the depth of the network. The regularization effect may disappear, similar to the effect of gradient vanishing. Including more random variables in the middle layers and considering them in the objective function may improve the test accuracy results.

V. DISCUSSION

In the current paper, we have presented and proved the MP theorem. The MP theorem quantifies the upper bound for probabilities of events from their respective conditional distributions. We showed that considering the upper bound as the probability of the event agrees with the existing definitions and extends our ability to quantify the probability of uncertain observations. The MP theorem was used to define models, quantify their probability measure, and develop objective functions, resulting in the MP framework for probabilistic learning. The MP framework treats the parameterized model and the oracle as events in the underlying probability space. Considering the underlying space enables using set operations to represent similarities between two events and construct objective functions. We used the example of using likelihood and intersection as the objective function and showed the sufficient conditions of their solutions.

The MP framework requires a prior distribution over the observable random variable, while the choice of prior is not determined by the framework. Thereby, existing principles for determining priors need to be used in the framework, e.g. MAXENT and Laplace's Principle. The usage of the MP theorem is helpful since the prior only needs to be determined over the observable random variable. As a corollary, the complexity of developing the prior distribution will be relative to the complexity of the observable random variable. It is common that we have prior information about the observables rather than the hidden random variables. In such cases, determining the prior over the observable random variables is more convenient than over the hidden random variables. Our framework is developed for finite-range random variables, and the generalization to the continuous case is left for future works.

The MP framework allows development and analysis of objective functions other than likelihood and intersection, e.g. the Symmetric Difference of events. The motivation to explore other objective functions is to avoid the trivial solutions that are inherently present in likelihood and intersection. Note that the objective functions discussed so far have an important feature: the gradient vectors of the log-likelihood and log-probability of the model are probability vectors. This property enables us to approximate the gradients by Monte Carlo approximation, (i) connecting this framework with stochastic optimization theory and techniques, (ii) enabling black-box optimization of probabilistic models by sampling from their corresponding likelihood functions, and (iii) providing scalability to factorized random variables having an exponentially large number of states. In the MP framework, the formal treatment of probabilistic models, coupled with the ability to optimize large-scale models, helps probabilistic models move toward an axiomatic and practical approach to large-scale machine learning.

APPENDIX

A. Proofs

Theorem 1: Consider the probability space (Ω, Σ, P), the random variable V with finite range R(V) and P_V(.) the probability distribution of V. For any event σ ∈ Σ with the conditional distribution P_{V|σ}(.) the following holds,

P(σ) ≤ inf_{v∈R(V)} { P_V(v) / P_{V|σ}(v) } ≜ P(σ ≺ V)    (46)

P(σ ≺ V) is read as the maximum probability of σ observed by V, and −L(σ ≺ V) = −log(P(σ ≺ V)) is read as the minimum information in σ observed by V.

Proof: ∀σ ∈ Σ and ∀v ∈ R(V) the following is true

P(v) = P(v|σ)P(σ) + P(v|σ̄)P(σ̄)    (47)

Since P(v|σ̄)P(σ̄) ≥ 0, then

P(v) − P(v|σ)P(σ) ≥ 0    (48)

P(σ) ≤ P(v)/P(v|σ).    (49)

Since (49) is true for all v such that P(v|σ) ≠ 0, the following holds

P(σ) ≤ inf_{v∈R(V)} { P(v)/P(v|σ) }    (50)

Theorem 2: For any random variable W that extends V, the following inequality holds

P(σ ≺ W) ≤ P(σ ≺ V)    (51)

Proof: Since W extends V, for all w ∈ R(W) and v ∈ R(V), W⁻¹(w) is either a subset of V⁻¹(v) or does not intersect it. Similarly, P(w, v) is either 0 or P(w, v) = P(w). Defining the set R_v(W) ≜ {w' | w' ∈ R(W), W⁻¹(w') ⊂ V⁻¹(v)}, we can write

P(v) = Σ_{w'∈R(W)} P(w', v) = Σ_{w'∈R_v(W)} P(w', v) = Σ_{w'∈R_v(W)} P(w')    (52)

and similarly

P(v|σ) = Σ_{w'∈R(W)} P(w', v|σ) = Σ_{w'∈R_v(W)} P(w'|σ).    (53)

We know that P(w) ≥ P(w|σ)P(σ). Using Theorem 1, P(σ) can be replaced with P(σ ≺ W) to write

P(w) ≥ P(w|σ)P(σ ≺ W).    (54)

Using (52, 54) we can see that

P(v) = Σ_{w'∈R_v(W)} P(w') ≥ Σ_{w'∈R_v(W)} P(w'|σ)P(σ ≺ W) = P(σ ≺ W) Σ_{w'∈R_v(W)} P(w'|σ).    (55)

Substituting (53) we conclude that

P(v) ≥ P(v|σ)P(σ ≺ W),    (56)

P(v)/P(v|σ) ≥ P(σ ≺ W).    (57)

Since (57) is valid for all possible outcomes of V, the inequality is valid for the minimum of the left-hand side of (57). Consequently

inf_{v'∈R(V)} { P_V(v')/P_{V|σ}(v') } ≥ P(σ ≺ W)    (58)

P(σ ≺ V) ≥ P(σ ≺ W)    (59)

Corollary 1: For any random variables V and Z with concatenation H = (V, Z) the following holds

P(σ ≺ H) ≤ P(σ ≺ V)    (60)

Proof: The proof follows from Theorem 2 if H extends V. By definition of (V, Z), for all outcomes h ∈ R(H), H⁻¹(h) = V⁻¹(v_h) ∩ Z⁻¹(z_h) for some z_h ∈ R(Z) and v_h ∈ R(V) such that h = (v_h, z_h). Therefore every outcome of H is either a subset of some partition induced by V or does not intersect it. Since H extends V, we can directly use Theorem 2 to finalize the proof.

Proposition 1: The probability of an outcome v of random variable V in the sense of maximum probability, defined as

P*_V(v) ≜ sup { P(σ) | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v }, v ∈ R(V)    (61)

has the following property

P*_V(v) = P_V(v), ∀v ∈ R(V).    (62)

Proof: Consider the set Σ_v = {σ | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v}. For all σ_v ∈ Σ_v, σ_v ⊂ V⁻¹(v) by the definition of the preimage. By the monotonicity of probability we can conclude P(σ_v) ≤ P(V⁻¹(v)) for all σ_v. Therefore,

P(V⁻¹(v)) = sup { P(σ) | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v }    (63)

for all v ∈ R(V).

Corollary 2: Considering Definition 3 and Theorem 1, given some set σ_v ∈ Σ where P_{V|σ_v}(v) = 1,

P_V(v) = P(σ_v ≺ V)    (64)

Proof: We can write

P(σ_v ≺ V) = inf_{v'∈R(V)} { P_V(v')/P_{V|σ_v}(v') }    (65)

Note that outcomes with zero probability in P_{V|σ_v} do not pose constraints on the probability of σ_v (refer to (48)). Since P_{V|σ_v} is zero for every v' ≠ v, we get

P(σ_v ≺ V) = P_V(v)/1 = P_V(v).    (66)

Lemma 1: The following inequalities hold

sup_i {a_i} ≤ (1/α) log( Σ_i e^{α a_i} )    (67)

inf_i {a_i} ≥ −(1/α) log( Σ_i e^{−α a_i} ), a_i ∈ R, α > 0    (68)

where the equality holds as α → ∞.

Proof: Let us denote sup_i {a_i} = a*.

(1/α) log( Σ_i e^{α a_i} ) = (1/α) log( Σ_i e^{α(a_i − a* + a*)} ) = a* + (1/α) log( Σ_i e^{α(a_i − a*)} ).    (69)

Since there exists at least one element of the set attaining the supremum, log( Σ_i e^{α(a_i − a*)} ) = log(1 + C) ≥ 0 for some non-negative constant C. Since α > 0, (69) is greater than or equal to the supremum

(1/α) log( Σ_i e^{α a_i} ) = sup_i {a_i} + (1/α) log( Σ_i e^{α(a_i − a*)} ) ≥ sup_i {a_i}.    (70)

In the limit case, since

lim_{α→∞} [ a* + (1/α) log( Σ_i e^{α(a_i − a*)} ) ] = a* + lim_{α→∞} (1/α) log( Σ_i e^{α(a_i − a*)} )    (71)

and since (a_i − a*) ≤ 0,

a* + lim_{α→∞} (1/α) log( Σ_i e^{α(a_i − a*)} ) = a*.    (72)

We can use the fact that −sup_i {−a_i} = inf_i {a_i} to prove the lemma for the infimum case.
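A quick numeric check of Lemma 1 (our illustration): the log-sum-exp bounds tighten to the supremum and infimum as α grows.

```python
import numpy as np

a = np.array([-1.3, 0.2, 0.7, 0.4])
for alpha in [1.0, 4.0, 32.0, 256.0]:
    upper = np.log(np.sum(np.exp(alpha * a))) / alpha     # Eq. (67), >= sup a_i
    lower = -np.log(np.sum(np.exp(-alpha * a))) / alpha   # Eq. (68), <= inf a_i
    print(alpha, upper, lower)
print(a.max(), a.min())   # both bounds converge to these values
```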

Proposition 2: For the softmax probability family of functions P_α defined as

P_α(σ ≺ V) ≜ ( Σ_{v∈R(V)} ( P_V(v)/P_{V|σ}(v) )^{−α} )^{−1/α}, α > 0,    (73)

the following is true,

P_α(σ ≺ V) ≤ P(σ ≺ V), ∀α > 0, ∀σ ∈ Σ,    (74)

and the equality holds as α → +∞. The softmin information family is defined as

−L_α(σ ≺ V) ≜ −log( P_α(σ ≺ V) ).    (75)

Proof: By definition of L we have

L(σ ≺ V) = inf_{v∈R(V)} { L(v) − L(v|σ) }.    (76)

By directly using the inequality proved in Lemma 1 we can write

inf_{v∈R(V)} { L(v) − L(v|σ) } ≥ −(1/α) log( Σ_{v∈R(V)} e^{−α(L(v)−L(v|σ))} )    (77)

L(σ ≺ V) ≥ L_α(σ ≺ V)    (78)

and exponentiating the above inequality we get

P(σ ≺ V) ≥ P_α(σ ≺ V).    (79)

B. Log Likelihood Derivation

The sufficient condition for the global maximizer of L(M*|M) is brought in the following proposition.

Proposition 3: Consider the function L(M*|M_θ) and the following maximization problem

θ̂ = arg max_{θ∈R^d} { L(M*|M_θ) }.    (80)

If M and M* are independent conditioned on V, the sufficient condition for θ̂ to be the global maximizer of (80) is

P(v'|M_θ̂) = 0, ∀v' such that P(v'|M*) ≠ max_{v''∈R(V)} P(v''|M*).    (81)

Proof: We write the log-likelihood as

L(M*|M) = log( Σ_{v∈R(V)} P(M*|v, M) P(v|M) ) − log( Σ_{v'∈R(V)} P(v'|M) )    (82)

= log( Σ_{v∈R(V)} [ P(v|M*)/P(v) ] P(v|M) ) − log( Σ_{v'∈R(V)} P(v'|M) ) + log( P(M*) ).    (83)

The second and third terms on the right hand side of (83) are constants. The first term is the logarithm of a convex combination of the values P(v|M*)/P(v). We can bound the first term

log( Σ_{v∈R(V)} [ P(v|M*)/P(v) ] P(v|M) ) ≤ log( sup_{v∈R(V)} { P(v|M*)/P(v) } ).    (84)

We define R_max(V) ⊂ R(V) as R_max = { v | v ∈ R(V), P(v|M*)/P(v) = sup_{v'∈R(V)} P(v'|M*)/P(v') }. We can check that the equality in (84) holds for some θ' if the condition

P(v|M_θ') = 0 ⟺ v ∉ R_max(V), ∀v ∈ R(V)    (85)

is true. For θ', (84) reduces to

log( Σ_{v∈R_max(V)} [ P(v|M*)/P(v) ] P(v|M_θ') ) = log( sup_{v∈R(V)} { P(v|M*)/P(v) } Σ_{v∈R_max(V)} P(v|M_θ') ) = log( sup_{v∈R(V)} { P(v|M*)/P(v) } ).    (86)

θ' achieves the upper bound, and therefore the condition in (85) is a sufficient condition for the solution of (80).

C. Intersection Derivation

We start by rewriting the conditional independence assumption and the uniform prior

P(M, M*|v) = P(M|v) P(M*|v),    (87)

P(v) = 1/|R(V)|.    (88)

Remember that the sufficient condition for the solution of the intersection objective function was

P(v|M_θ, M*) = P_α(v|M_θ).    (89)

Using the assumptions in (87, 88), we can represent the left hand side of (89), i.e. the posterior, as

P(M_θ, M*|v) P(v) / P(M_θ, M*) = P(M_θ|v) P(M*|v) P(v) / P(M_θ, M*) = P(v|M_θ) P(v|M*) P(M_θ) P(M*) / [ P(M_θ, M*) P(v) ].    (90)

Having (88), we can see that (90) is the element-wise multiplication of two probability vectors followed by normalization; therefore the condition in (89) reduces to

P(v|M_θ) P(v|M*) / C = P_α(v|M_θ),    (91)

C = Σ_{v'∈R(V)} P(v'|M_θ) P(v'|M*).    (92)

1) Setting α = 2: In the case of α = 2, (91) is

P(v|M_θ) P(v|M*) / Σ_{v'∈R(V)} P(v'|M_θ) P(v'|M*) = P(v|M_θ)² / Σ_{v'∈R(V)} P(v'|M_θ)².    (93)

We can show that for all θ* ∈ R^d such that

P(v|M_θ*) = P(v|M*),    (94)

the condition in (91) is satisfied. By plugging θ* into (91) we get

P(v|M_θ*) P(v|M*) / Σ_{v'∈R(V)} P(v'|M_θ*) P(v'|M*) = P(v|M_θ*)² / Σ_{v'∈R(V)} P(v'|M_θ*)².    (95)

If we use (94), we can see that the condition is satisfied. Therefore, in the case of α = 2, the distribution of the oracle is a solution.
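The α = 2 case can be verified numerically; a sketch under the same conditional independence and uniform prior assumptions (distributions chosen by us for illustration):

```python
import numpy as np

q_star = np.array([0.6, 0.25, 0.1, 0.05])  # oracle conditional P(v|M*)
q = q_star.copy()                          # candidate model with P(v|M_theta*) = P(v|M*)

posterior = q * q_star / np.sum(q * q_star)  # left-hand side of Eq. (91)
skeleton = q ** 2 / np.sum(q ** 2)           # alpha-skeleton of the model, alpha = 2
print(np.allclose(posterior, skeleton))      # True, matching Eqs. (93)-(95)
```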

ACKNOWLEDGMENT

REFERENCES

[1] J. O. Berger and J. M. Bernardo. On the development of the reference prior method. Bayesian Statistics, 4(4):35–60, 1992.
[2] J. O. Berger and J. M. Bernardo. Ordered group reference priors with application to the multinomial problem. Biometrika, 79(1):25–37, 1992.
[3] J. O. Berger, J. M. Bernardo, D. Sun, et al. The formal definition of reference priors. The Annals of Statistics, 37(2):905–938, 2009.
[4] J. O. Berger, J. M. Bernardo, D. Sun, et al. Overall objective priors. Bayesian Analysis, 10(1):189–221, 2015.
[5] J. M. Bernardo. Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):113–128, 1979.
[6] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[7] G. Consonni, D. Fouskakis, B. Liseo, I. Ntzoufras, et al. Prior distributions for objective Bayesian analysis. Bayesian Analysis, 13(2):627–679, 2018.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[9] P. Diaconis, D. Ylvisaker, et al. Conjugate priors for exponential families. The Annals of Statistics, 7(2):269–281, 1979.
[10] A. Gelman et al. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, 2006.
[11] M. H. Hansen and B. Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96(454):746–774, 2001.
[12] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
[13] E. T. Jaynes. Information theory and statistical mechanics. II. Physical Review, 108(2):171, 1957.
[14] E. T. Jaynes. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227–241, 1968.
[15] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[16] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 186(1007):453–461, 1946.

[17] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[18] R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370, 1996.
[19] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] D. J. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[22] C. Marsh. Introduction to continuous entropy. 2013.
[23] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014.
[24] T. Seidenfeld. Why I am not an objective Bayesian; some reflections prompted by Rosenkrantz. Theory and Decision, 11(4):413–440, 1979.

Amir Emad Marvasti joined the Computer Science Department of the University of Central Florida as a Ph.D. student in 2014. He currently works at the Computational Imaging Lab (CIL) under the supervision of Professor Hassan Foroosh. He received his B.S. in Computer Engineering from Sharif University of Technology, Tehran, Iran. His research interests are probabilistic machine learning and probability theory. At CIL, he is focusing on the formalization of prior knowledge, regularization of probabilistic models, and development of finite-state probabilistic models.

Ehsan Emad Marvasti received the B.S. degree in computer engineering from the Sharif University of Technology, Tehran, Iran, in 2014. He is currently working toward the Ph.D. degree in computer science at the Department of Computer Science, University of Central Florida, Orlando, FL, USA. He was a member of the Image Processing Lab at Sharif University of Technology from 2012 to 2014 and the Computational Imaging Lab (CIL) at the University of Central Florida from 2014 to 2017. Currently, he is a member of the Connected and Autonomous Vehicle Research Lab (CAVREL). His research interests include vehicular cooperative perception and cognition, cooperative sensor fusion, and machine learning.

Dr. Ulas Bagci is a faculty member at the Center for Research in Computer Vision (CRCV) and the SAIC Chair Professor at the University of Central Florida (UCF). His research interests are artificial intelligence, machine learning, and their applications in biomedical and clinical imaging. Prof. Bagci has more than 200 peer-reviewed articles on related topics. Previously, he was a staff scientist and lab co-manager at the National Institutes of Health's Radiology and Imaging Sciences Department, Center for Infectious Disease Imaging. Dr. Bagci holds two NIH R01 grants (Principal Investigator) and serves as an external committee member of the AIR (Artificial Intelligence Resource) at the NIH. Dr. Bagci has served as an area chair for MICCAI for several years and is an associate editor of top-tier journals in his field, such as IEEE Transactions on Medical Imaging and Medical Physics, and an editorial board member of Medical Image Analysis.

Hassan Foroosh (Senior Member, IEEE) is currently a CAE Link Professor of computer science with the University of Central Florida (UCF), Orlando, FL, USA. He has authored or coauthored over 160 peer-reviewed scientific articles in the areas of computer vision, image processing, and machine learning. He received the Piero Zamperoni Award from the International Association of Pattern Recognition (IAPR) in 2004, the Best Scientific Paper Award from the IAPR International Conference on Pattern Recognition (IAPR-ICPR) in 2008, and the Best Paper Award from the IEEE International Conference on Image Processing (ICIP) in 2018. He is also the Principal Investigator and the Lead of the Science Data Center of the NASA GOLD Mission, which launched a satellite into Earth's geostationary orbit in 2018 to study space weather using ultraviolet imaging. He has been serving on the editorial boards and organizing committees of various IEEE transactions, conferences, and working groups.