Maximum Probability Theorem: A Framework for Probabilistic Machine Learning

Amir Emad Marvasti, Ehsan Emad Marvasti, Ulas Bagci, Hassan Foroosh

Abstract—We present a theoretical framework of probabilistic learning derived from the Maximum Probability (MP) Theorem shown in the current paper. In this probabilistic framework, a model is defined as an event in the probability space, and a model or the associated event - either the true underlying model or the parameterized model - has a quantified probability measure. This quantification of a model's probability measure is derived by the MP Theorem, in which we show that an event's probability measure has an upper bound given its conditional distribution on an arbitrary random variable. Through this alternative framework, the notion of model parameters is encompassed in the definition of the model or the associated event. Therefore, this framework deviates from the conventional approach of assuming a prior on the model parameters. Instead, the regularizing effects of assuming a prior over parameters are imposed through maximizing the probabilities of models or, in terms of information theory, minimizing the information content of a model. The probability of a model in our framework is invariant to reparameterization and is solely dependent on the model's behavior. Also, rather than maximizing the posterior in a conventional Bayesian setting, the objective function in our alternative framework is defined as the probability of set operations (e.g. intersection) on the event of the true underlying model and the event of the model at hand. Our theoretical framework adds clarity to probabilistic learning through solidifying the definition of probabilistic models, quantifying their probabilities, and providing a visual understanding of objective functions.

Index Terms—Probabilistic Machine Learning, Regularization, Prior Knowledge, Uncertainty, Artificial Intelligence, Objective Functions.

Impact Statement—The choice of prior distribution over the parameters of probabilistic machine learning models determines the regularization of learning algorithms in the Bayesian perspective. The complexity of the choice of prior over the parameters and the form of regularization is relative to the complexity of the models being used. Thereby, finding priors for the parameters of complex models is often not tractable. We address this problem by uncovering the MP Theorem as a direct consequence of Kolmogorov's probability theory. Through the lens of the MP Theorem, the process of regularizing models is understood and automated. The regularization process is defined as the maximization of the probability of the model. The probability of the model is understood by the MP Theorem and is determined by the behavior of the model. The effects of maximizing the probability of the model can be backpropagated in a gradient-based optimization process. Consequently, the MP framework provides a form of black-box regularization and eliminates the need for case-by-case analysis of models to determine priors.

Amir Emad Marvasti is with the Department of Computer Science, University of Central Florida, Orlando, FL 32826 USA (e-mail: [email protected]). Ehsan Emad Marvasti is with the Department of Computer Science, University of Central Florida, Orlando, FL 32826 USA (e-mail: [email protected]). Ulas Bagci is with the Machine and Hybrid Intelligence Lab, Department of Radiology, Feinberg School of Medicine, and Department of BME, School of Engineering, Northwestern University, IL, USA (e-mail: [email protected]). Hassan Foroosh is with the Department of Computer Science, University of Central Florida, Orlando, FL 32826 USA (e-mail: [email protected]).

I. INTRODUCTION

A central problem in probabilistic learning and Bayesian statistics is choosing prior distributions of random variables and subsequently regularizing models. The importance of prior distributions is studied in Bayesian statistical inference since the choice affects the process of learning. However, the choice of prior distributions is not clearly dictated by the axioms of probability theory. In current applications of the Bayesian framework, the choice of prior differs from case to case, and still in many practical scenarios the choice of prior is justified by experimental results or intuitions. There have been substantial attempts to unify the choice of priors over random variables, e.g. Laplace's Principle of Indifference [15], Conjugate Priors [9, 10], the Principle of Maximum Entropy [14], Jeffreys priors [16] and Reference Priors [3]. The overall goal of the existing literature is to pinpoint a single distribution as a prior to unify and objectify the inference procedure.

A. The Proposed Maximum Probability Approach

Contrary to many classic problems of inference and statistics where a hidden parameter of interest needs to be estimated, machine learning does not necessarily follow this goal. The relevant solution to many complex problems in machine learning is the final likelihood of observable variables. The likelihood functions in many cases are not the familiar likelihood functions (e.g. Bernoulli, Gaussian, etc.) and may take complex forms. A good example of such complex likelihood functions is neural networks in the context of image classification [20]. If such networks are viewed through the Bayesian perspective, the model is the conditional distribution of labels given input images. Finding an analytical closed form for the prior over the parameters of complex models is not currently practical and, to the knowledge of the authors, an automatic and practical procedure for assuming a prior over the parameters of arbitrary models is not known to date.

A different approach is considered in this paper, where instead of assuming a prior over parameters, we focus on the likelihood functions. A preliminary approach is to assume a density over possible likelihood functions, but we can simplify the perspective further by paying attention to the probability space. Different likelihood functions on a random variable V can be formalized as P_{V|M_θ}, where M_θ is an event in the underlying probability space.

In this perspective, θ enumerates the events in the probability space and serves as a numerical representative, not necessarily a random variable. Through the MP theorem proved in this paper, by having P_V - a prior over the observables - an upper bound on the probability of M_θ can be calculated. Thereby, instead of considering the likelihood function directly, the underlying event M_θ, with a quantifiable probability measure, is considered.

Viewing models as events can be extended to the true underlying model. Consequently, the problem of learning the true underlying model reduces to maximizing the probability of similarity between the model and the true model. Since events are sets, the similarity between the model and the true underlying model is defined through set operations. As an example, the probability of the intersection of the parameterized model and the true underlying model could be maximized. The probability of the intersection event - as an objective function - is maximized through tuning the parameters of the model. In the case of the intersection objective function, we show that maximizing the probability of the model regularizes the model.

From the probability theory perspective, the MP Theorem and its consequences extend the ability of probability theory to assign probabilities to uncertain outcomes of random variables. This is because uncertainty in the outcomes of a random variable can be modeled with a conditional distribution of the random variable given some underlying event. We consider the probability of uncertain observations to be the probability of the underlying event. We show that considering the probability upper bound as the probability of the underlying event is consistent with existing definitions. In this paper, we only investigate finite-range observable random variables and leave the extension to continuous random variables for future work.
In probabilistic machine learning, the MP Theorem has the following desirable properties. (i) The complexity of choosing the prior is relative to the complexity of the observable random variables. As an example, in the MP framework, having a Bernoulli observable random variable, one needs to assume a prior over a set with 2 elements. Consequently, through the MP theorem the probabilities of different Bernoulli models can be calculated. In the conventional approach, it is required to determine a prior distribution over the real-valued parameters of a Bernoulli distribution. The conventional perspective ends up determining a prior on a disproportionately more complex set. (ii) The models in our framework are not mutually exclusive and can have non-empty intersections. It is intuitive to think that two Bernoulli models with parameters 0.9 and 0.8 should be related and do not represent disjoint events. Conversely, in the conventional perspective, two models with the parameters 0.9 and 0.8 are treated as disjoint events. (iii) Given the prior over the observables, in the MP framework the probabilities of models are determined by the characteristics of the likelihood functions. As such, per-case analysis of likelihood functions is not needed.

We start by presenting the MP theorem and its properties in Section II. The proofs for all the theorems can be found in Appendix A. The connection of the MP theorem to existing definitions and its interpretations are discussed in Section II-A. Finally, the Maximum Probability Framework and examples of objective functions are presented in Section III. The detailed derivations for Section III are presented in Appendix B and Appendix C.

B. Background

The background for this topic, also known as Objective Bayesian statistics, is broad enough to restrict the authors from including many important works. We refer the reader to the comprehensive review papers by Kass and Wasserman [18] and Consonni et al. [7]. Here, we briefly review some of the important works that most impacted the objective Bayesian topic.

Laplace's principle is one of the early works to define priors, leading to assuming a uniform prior over the possible outcomes of a random variable. The downfall of this principle is seen in the case of real-valued random variables, where it leads to an improper prior. Aside from the impropriety of the prior, the approach is not invariant to reparameterization.

The Principle of Maximum Entropy (MAXENT), introduced by Jaynes [12, 13, 15], provides a flexible framework in which testable information can be incorporated in the prior distribution of a random variable. The prior is obtained by solving a constrained optimization problem in which the prior distribution with the highest entropy is chosen subject to the constraints of (i) testable prior information (usually represented in the form of expectations) and (ii) the prior integrating to 1. The shortcoming of MAXENT is in the case of continuous random variables, where the maximization of entropy is shown to be the minimization of the Kullback-Leibler divergence between the prior distribution and a base distribution, which is obtained by the Limiting Density of Discrete Points [12, 22]. The choice of the base measure is a similar problem to that of choosing the prior and, as Kass and Wasserman [18] point out, there is a circular nature in finding the maximal entropy distribution. Furthermore, Seidenfeld [24] puts forward an example where MAXENT is not consistent with Bayesian updating. In short, choosing a prior with MAXENT and obtaining the posterior with Bayes rule given some observations is not necessarily the same as choosing the prior with MAXENT given similar observations.

Jeffreys priors are a class of priors that are invariant under one-to-one transformations. Laplace's principle for finite-range random variables could be seen as a form of invariance under the permutation group. The only distribution that is invariant under permutation of the states of a finite-range random variable is the uniform prior. The general form of Jeffreys rule for the prior p(θ) over a one-dimensional real parameter θ and a given likelihood function is p(θ) ∝ 1/√(det(I(θ))), where I is the Fisher Information. The Jeffreys prior is constructed as a function of the likelihood function and does not provide any guideline for choosing the likelihood function over the observables. This property is also present in the context of Reference Priors, introduced by Bernardo [5] and further developed in [1, 2, 3, 4]. Reference priors in the case of a one-dimensional parameter and some regularity conditions coincide with Jeffreys' general rule [18].

Reference priors construct the priors by maximizing the mutual information of the observable variable and the hidden variable (parameter). The solution density is not necessarily a proper prior, but the usage is justified by showing that the posterior is the limit case of posteriors obtained from proper priors [3].
The solution density, similar to the Jeffreys prior, is invariant under one-to-one transformations, which follows from the invariance of the mutual information. The mutual information between the observable and the parameter is a function of the likelihood function. Thereby, reference priors, similar to Jeffreys priors, are only dependent on the likelihood function and do not determine a clear guide to construct likelihood functions.

C. Notation

We use P and P_V for the probability measure and the probability distribution of some random variable V, respectively. L and L_V correspond to the logarithm of the probability measure and the log-probability distribution of some random variable V, respectively. We use R(V) to represent the range of some random variable V, and |R(V)| is the cardinality of the range of V. For simplicity of notation and depending on the context, we use P(v) as a shorthand notation for P(V⁻¹(v)). Lower case letters represent the outcomes of the random variable with the corresponding uppercase letter.

II. MAXIMUM PROBABILITY THEOREM

The following theorem is the foundation of our work, bounding probabilities of events using their conditional distributions.

Theorem 1: Consider the probability space (Ω, Σ, P), the random variable V with finite range R(V) and P_V(.) the probability distribution of V. For any event σ ∈ Σ with the conditional distribution P_{V|σ}(.) the following holds,

P(σ) ≤ inf_{v∈R(V)} { P_V(v) / P_{V|σ}(v) } ≜ P(σ ≺ V)    (1)

P(σ ≺ V) is read as the maximum probability of σ observed by V, and −L(σ ≺ V) = −log(P(σ ≺ V)) is read as the minimum information in σ observed by V.

The proof for Theorem 1 is short and simple, yet it is fundamental in understanding probabilistic models. In this view, every probability distribution over the random variable V corresponds to an event whose probability is bounded using Theorem 1. The bound for probability in Theorem 1 can be decreased by extending random variables (Definition 2).

Definition 1: The preimage of an outcome v of the random variable V, denoted by V⁻¹(v), is defined as

V⁻¹(v) ≜ {ω | ω ∈ Ω, V(ω) = v}, ∀v ∈ R(V)    (2)

Note that since V is a measurable function, V⁻¹(v) ∈ Σ.

Definition 2: A random variable W extends random variable V iff

W⁻¹(w) ⊂ V⁻¹(v) or W⁻¹(w) ∩ V⁻¹(v) = ∅, ∀w ∈ R(W), ∀v ∈ R(V).    (3)

A random variable can be extended by increasing the cardinality of its range. The following theorem shows that the probability upper bound is decreased by extending random variables.

Theorem 2: For any random variable W that extends V, the following inequality holds

P(σ ≺ W) ≤ P(σ ≺ V)    (4)

The simplest example of extending a random variable is by including additional random variables to describe the underlying event.

Corollary 1: For any random variables V and Z with concatenation H = (V, Z) the following holds

P(σ ≺ H) ≤ P(σ ≺ V)    (5)

Theorem 2 shows that the Maximum Probability bound is relative to the complexity of the random variable. In simple terms, the random variables are tools to observe the underlying event. By extending a random variable, its number of states is increased and the underlying event is more specified. Since the event is more specified, its probability decreases. The upper bound in (1) follows a similar logic. Therefore, the upper bound on the probabilities of events is relative to the characteristics of the random variables. We delve into the meaning of the upper bound in the maximum probability theorem in the next section.

A. Interpretation of The Probability Upper Bound

The upper bound in the MP theorem has a concrete connection to existing definitions related to random variables. The upper-bound nature of the MP theorem is embedded in existing definitions of the probability of outcomes of random variables. To demonstrate the former statement, we review the following existing definition.

Definition 3: The probability of an outcome v of random variable V, denoted by P_V(v), is defined as

P_V(v) = P(V⁻¹(v)), v ∈ R(V)    (6)

Definition 3 defines the probability of an outcome v ∈ R(V) to be the probability of the largest event, in the sense of number of elements, V⁻¹(v), that is mapped to v. It is possible to show an equivalent definition in the sense of probability.

Proposition 1: The probability of an outcome v of random variable V in the sense of maximum probability, defined as

P*_V(v) ≜ sup { P(σ) | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v }, v ∈ R(V)    (7)

has the following property

P*_V(v) = P_V(v), ∀v ∈ R(V).    (8)

The preimage of an outcome v coincides with the most probable event being mapped to v because of the monotonicity of the probability measure. While many events can be mapped to the outcome v, both definitions consider the largest underlying event, either in the sense of cardinality or probability. We can interpret the upper bound of probability in the spirit of P*_V(v): given some information about the underlying event, we assume that the underlying event is the one with the largest probability measure.
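As a concrete illustration of Theorem 1 and Corollary 1, the following minimal sketch computes the bound numerically. The helper name `mp_bound` and the particular distributions are our own, chosen only for illustration:

```python
import numpy as np

def mp_bound(p_v, p_v_given_sigma, eps=1e-12):
    """P(sigma < V) = inf_v P_V(v) / P_{V|sigma}(v), Eq. (1).
    Outcomes with zero conditional probability impose no constraint."""
    p, q = np.asarray(p_v, float), np.asarray(p_v_given_sigma, float)
    m = q > eps
    return np.min(p[m] / q[m])

# Theorem 1: a 4-state V with a uniform prior and a skewed conditional.
p_v = np.full(4, 0.25)
p_v_sigma = np.array([0.7, 0.1, 0.1, 0.1])
print(mp_bound(p_v, p_v_sigma))              # 0.25 / 0.7 ~= 0.357

# Corollary 1: concatenating an extra binary variable Z extends V,
# and the bound over H = (V, Z) can only decrease.
p_h = np.kron(p_v, [0.5, 0.5])               # P_H, Z fair and independent a priori
p_h_sigma = np.kron(p_v_sigma, [0.6, 0.4])   # P_{H|sigma}, Z informative given sigma
print(mp_bound(p_h, p_h_sigma))              # ~0.298 <= 0.357
```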

Fig. 1. The probability space in the MP framework: (a) the generic setup, (b) the likelihood solution, (c) the intersection solution. The model M and the oracle M* are events in the probability space (Ω, Σ, P). The dashed lines represent the random variable V with |R(V)| = 4, partitioning the probability space. The probability of the model M_θ depends on P_{V|M} and P_V and is bounded by P(M ≺ V). The parameters of the model M are tuned to mimic the underlying model M*. 1(a) depicts the general setup, with the red region representing M ∩ M*. 1(b) shows a candidate model M achieving the global maximum of P(M*|M). The grey region in 1(b) represents M, which is normalized out in the likelihood objective function. 1(c) shows a candidate model M achieving the global maximum of the intersection probability.

For example, having only the conditional probability distribution P_{V|σ}, we consider the upper bound as the probability of σ. We can also use Information Theory to interpret the probability upper bound. In Information Theory [8, 11, 21], the information content of an event σ ∈ Σ is quantified as −L(σ). The information content of σ is the minimum number of bits required to distinguish between σ and its complement σ̄. As the probability of σ decreases, its information content increases and therefore requires further description. Considering a lower probability than the maximum bound translates to more information content in an event, which is not based on the given information. Considering a lower probability may thus be interpreted as appending assumptions to the description of the observation.

Corollary 2: Considering Definition 3 and Theorem 1, given some set σ_v ∈ Σ where P_{V|σ_v}(v) = 1,

P_V(v) = P(σ_v ≺ V)    (9)

Corollary 2 shows that P(· ≺ V) is equivalent to P_V in the case of exact observations of V. We can define uncertain observations as outcomes that are not completely determined. Uncertain observations may be represented by a probability distribution that is conditioned on some underlying event. Exact observations may be considered as degenerate conditional distributions, special cases of uncertain observations. Definition 3 fails to address the probability of uncertain observations of V, but P(· ≺ V) extends to uncertain observations. We use an example to represent a scenario with uncertain observations and how the MP theorem extends to such scenarios. Imagine a fair coin is being flipped in a room. Alice asks Bob to investigate the room and tell her the outcome of the coin flip. Consider two scenarios: (i) Bob tells Alice that the coin is surely Heads. Alice, using existing definitions, concludes that the probability of the event is 0.5. The conclusion is similar if Alice uses P(· ≺ V) to calculate the probability. Alice can model the observation as a degenerate distribution conditioned on the event σ_1, i.e. P(H|σ_1) = 1. The maximum probability of σ_1 is inf{0.5/1, 0.5/0} = 0.5, since outcomes with zero conditional probability impose no constraint. (ii) Bob observes some underlying event and, using Bayes rule, comes to the conclusion that the coin is Heads with probability 0.9. Bob informs Alice about his conclusion. Alice cannot use existing definitions to calculate the probability of the underlying event. However, using the Maximum Probability bound, she concludes that the probability of the evidence that Bob observed is at most inf{0.5/0.9, 0.5/0.1} = 5/9.
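The two coin scenarios can be checked numerically; this sketch repeats the `mp_bound` helper from the earlier snippet so it runs on its own:

```python
import numpy as np

def mp_bound(p_v, p_v_given_sigma, eps=1e-12):
    p, q = np.asarray(p_v, float), np.asarray(p_v_given_sigma, float)
    m = q > eps                      # zero-probability outcomes drop out
    return np.min(p[m] / q[m])

prior = [0.5, 0.5]                   # fair coin: P(H), P(T)
print(mp_bound(prior, [1.0, 0.0]))   # scenario (i): 0.5
print(mp_bound(prior, [0.9, 0.1]))   # scenario (ii): 5/9 ~= 0.5556
```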

III. MAXIMUM PROBABILITY FRAMEWORK

In the current trend of Bayesian statistics and machine learning, a parameterized family of distributions is assumed. Subsequently, the underlying model is estimated by finding the parameter setting maximizing the posterior distribution, or by creating an ensemble of models with parameters drawn from the posterior. Other alternatives include Variational Inference [6, 17, 23], where an approximation for the posterior is chosen from a parametric family. The posterior approximation is chosen by finding the distribution in the family with minimum Kullback-Leibler divergence to the true posterior. All of the mentioned approaches treat the parameters as a random variable and require a prior distribution over the parameter space.

We introduce the Maximum Probability framework (MP framework) for probabilistic machine learning as a corollary of the MP theorem. In the MP framework, we define a model as an event, M_θ ∈ Σ, where θ ∈ R^d is the parameter, and the conditional distribution of the random variable V given the model is P_{V|M_θ}. Also, the true underlying model or oracle is represented as M* ∈ Σ, with the conditional distribution P_{V|M*}. M_θ and M* are not explicitly defined. Instead, the conditional distribution of the model is obtained through some explicit and deterministic parameterization function π, i.e. P_{V|M_θ}(v) = π(v, θ). Also, the oracle can be understood as an observation from a generative process. For example, we can define the conditional distribution of the oracle given some observation v_o ∈ R(V) as P_{V|M*}(v_o) = 1. Note that the random variable V could be modeled as the concatenation of multiple i.i.d. random variables (V^(i))_{i=1}^{N}. In the i.i.d. modeling case, V would represent a random variable corresponding to multiple observations that, conditioned on the oracle, are independent. The choice of modeling for the random variables is arbitrary, and we focus on V as a generic random variable.

Given a prior distribution P_V and through the maximum probability theorem, we can calculate the probability of models under different parameter settings. As opposed to the conventional approach, where a probability density is assumed on the parameters - an uncountably infinite set - the prior is chosen over the finite range of V. Furthermore, in the MP framework each model has a probability mass, and models potentially have intersections in the underlying probability space. A visual representation of the underlying probability space containing the model and the oracle is depicted in Fig 1. The goal is to increase the similarity of the underlying events corresponding to the model and the oracle by tuning the parameters. Since the model and the oracle are sets, we can construct objective functions using set operations between the model and the oracle. We explore maximizing two objective functions to demonstrate the properties of each, i.e. log-likelihood and intersection, corresponding to L(M*|M) and L(M* ∩ M). The probability of the outcome of such set operations reflects the similarity between the model and the oracle.
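To make the definitions concrete, the following sketch treats a two-outcome model family exactly as described: an explicit parameterization π(v, θ) together with a prior P_V over the observables yields the maximum log-probability of each model through the MP theorem. The function and parameter names are ours, chosen for illustration:

```python
import numpy as np

def pi(theta):
    # pi(v, theta): a sigmoid-parameterized likelihood over R(V) = {0, 1}
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - p1, p1])

def model_log_prob(theta, p_v):
    # L(M_theta < V) = min_v [log P_V(v) - log P_{V|M_theta}(v)]
    return np.min(np.log(p_v) - np.log(pi(theta)))

p_v = np.array([0.5, 0.5])            # prior over the finite range of V
for theta in [0.0, 1.0, 4.0]:
    print(theta, model_log_prob(theta, p_v))
# theta = 0 reproduces the prior and attains log-probability 0 (M_theta = Omega
# is admissible); sharper models have strictly smaller maximum probability.
```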
A. Log-Likelihood Objective Function

In this section, we investigate maximization of the log-likelihood of the oracle given the model, or L(M*|M_θ). We start by representing L(M*|M_θ) as the marginalization over the random variable V

L(M*|M) = log( Σ_{v∈R(V)} P(M*|v, M) P(v|M) ) − log( Σ_{v'∈R(V)} P(v'|M) ).    (10)

Taking the partial derivative of the log-likelihood with respect to L(v|M_θ),

∂L(M*|M_θ)/∂L(v|M_θ) = P(M*|v, M_θ) P(v|M_θ) / P(M*|M_θ) − P(v|M_θ) = P(v|M*, M_θ) − P(v|M_θ),    (11)

and setting the partial derivative to zero, we obtain the condition

P(v|M*, M_θ) = P(v|M_θ).    (12)

Considering the chain rule for gradients, any parameter setting θ̂ that satisfies the condition in (12) is necessarily a critical point of the objective function. The gradient of the log-likelihood objective function in (11) is the subtraction of two probability vectors. The stochastic nature of the gradient is suitable for stochastic gradient optimization of the parameters, especially for variables with a large number of states.

Note that in (10), P(M*|v, M) is determined by neither the conditional distribution of the model nor that of the oracle. We assume that the oracle and the model given any outcome of the variable V are conditionally independent, i.e. P(M*|v, M_θ) = P(M*|v). The model and the oracle are defined implicitly through their conditional distributions on V. Thereby, it is natural in the problem of learning to assume that, given the outcome of the random variable, M does not convey information about M*; hence the conditional independence. Nevertheless, other alternatives can be assumed and derived separately. The models maximizing the log-likelihood objective function have a special characteristic: their probability distribution is concentrated on the most probable outcome(s) of P_{V|M*}. The formal description and proof of this claim are brought in Appendix B. The solution of the optimization being a degenerate distribution is the analogue of overfitting in our framework. We show in the next section that the intersection objective function includes the log-probability of the model in the objective function. Considering the model's log-probability, as will be shown, induces regularizing effects on the final solution by preventing the final distribution from being degenerate. Thereby, the intersection objective function prevents the overfitting problem that occurs with the log-likelihood objective function.

B. Intersection Objective Function

Similar to our analysis of the log-likelihood objective function, we focus on the properties of the solutions of the intersection objective function. The log-probability of the intersection of the model and the oracle can be written as

L(M*, M_θ) = L(M*|M_θ) + L(M_θ) ≤ L(M*|M_θ) + L(M_θ ≺ V).    (13)

The maximum probability of the model, L(M_θ ≺ V) in (13), can be substituted with the logarithm of the so-called softmax probability family, L_α(M_θ ≺ V). The following proposition defines the family and shows its important property.

Proposition 2: For the softmax probability family of functions P_α defined as

P_α(σ ≺ V) ≜ ( Σ_{v∈R(V)} ( P_V(v)/P_{V|σ}(v) )^{−α} )^{−1/α}, α > 0,    (14)

the following is true,

P_α(σ ≺ V) ≤ P(σ ≺ V), ∀α > 0, ∀σ ∈ Σ,    (15)

and the equality holds as α → +∞. The softmin information family is defined as

−L_α(σ ≺ V) ≜ −log( P_α(σ ≺ V) ).    (16)

We define a flexible family of objective functions L_α(M*, M) by substituting the softmin information family L_α for L in (13)

L_α(M*, M_θ) ≜ L(M*|M_θ) + L_α(M_θ ≺ V).    (17)

Without loss of generality we assume a uniform prior over V. Since P_V(v) = 1/|R(V)| is constant for all v, we can represent L_α(M*, M_θ) as

L_α(M*, M_θ) = log( Σ_{v∈R(V)} P(M*, v|M_θ) ) − (1/α) log( Σ_{v'∈R(V)} P(v'|M_θ)^α ) + (1/α) log(|R(V)|).    (18)

Taking the partial derivative with respect to the log-probabilities we get

∂L_α(M*, M_θ)/∂L(v|M_θ) = P(v|M_θ, M*) − P(v|M_θ)^α / Σ_{v'∈R(V)} P(v'|M_θ)^α.    (19)

We call the second term on the right hand side of (19) the α-skeleton of a distribution for future reference.
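A small numeric sketch of Proposition 2 (our own illustration, not code from the paper): the softmin information L_α lower-bounds the exact minimum information and tightens as α grows.

```python
import numpy as np

def mp_log_bound(p_v, q):
    # exact L(sigma < V) = min_v [log P_V(v) - log P_{V|sigma}(v)]
    return np.min(np.log(p_v) - np.log(q))

def softmin_info(p_v, q, alpha):
    # L_alpha(sigma < V) from Eqs. (14) and (16)
    ratios = np.asarray(p_v, float) / np.asarray(q, float)
    return -np.log(np.sum(ratios ** (-alpha))) / alpha

p_v = np.full(4, 0.25)
q = np.array([0.7, 0.1, 0.1, 0.1])      # P_{V|sigma}
for alpha in [1.0, 2.0, 8.0, 64.0]:
    print(alpha, softmin_info(p_v, q, alpha))
print('exact', mp_log_bound(p_v, q))    # the family increases toward this value
```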

Fig. 2. Comparison of the objective functions in the Bernoulli example under the conditional independence assumption. θ* represents the oracle's log-probability ratio and θ represents the model's log-probability ratio. At α = 1 the objective functions become similar. The intersection has a unique solution for α > 1. As α → ∞, the solution approaches the trivial solution and tends to lose its uniqueness.

Definition 4 (α-skeleton): Given the conditional distribution P(v|σ), σ ∈ Σ, the α-skeleton distribution of V given σ is defined as

P_α(v|σ) ≜ P(v|σ)^α / Σ_{v'∈R(V)} P(v'|σ)^α, α ∈ R.    (20)

α-skeletons concentrate the probabilities of distributions depending on the sign and magnitude of α. In the limit case as α → +∞, the α-skeleton of a distribution concentrates the probability measure on the most probable outcome or outcomes. Setting the partial derivatives in (19) to zero, we conclude the following sufficient condition for the critical points

P(v|M_θ, M*) = P_α(v|M_θ).    (21)

Equation (21) is the sufficient condition and is satisfied when the α-skeleton of the model distribution matches the posterior distribution. The α-skeleton term in (21) acts as a regularizer suppressing the peaks in the model's distribution. Similar to the log-likelihood case, both terms in (19) are probability vectors and can be Monte Carlo approximated for stochastic optimization. Note that in the case of uniform priors and α = 1, the log-likelihood and the intersection objective functions are equal up to an additive constant. Setting α = 2 has the significant property that the solution coincides with the oracle's likelihood function. We can check the former claim by substituting P(v|M_θ) with P(v|M*) in (21) (see Appendix C). As α increases, the final solution is flatter than the solution of the likelihood objective function. In the case of α → +∞, the solution tends to the prior distribution P_V. This is consistent with our visual representation in Fig 1(c): setting M_θ = Ω is the trivial solution for maximizing the intersection in the underlying space. In general, α as a hyperparameter controls the regularization effect. Usage examples of the MP Framework are presented in the next section.
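Before moving on, a quick numeric sketch of Definition 4 and the limiting behavior described above (our own illustration):

```python
import numpy as np

def alpha_skeleton(p, alpha):
    # Definition 4: P_alpha(v|sigma) = P(v|sigma)^alpha / sum_v' P(v'|sigma)^alpha
    p = np.asarray(p, float) ** alpha
    return p / p.sum()

p = np.array([0.5, 0.3, 0.15, 0.05])
print(alpha_skeleton(p, 1.0))    # unchanged
print(alpha_skeleton(p, 2.0))    # peaks amplified
print(alpha_skeleton(p, 0.5))    # flattened toward uniform
print(alpha_skeleton(p, 100.0))  # ~one-hot on the most probable outcome
```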

Fig. 3. Comparison of the intersection and log-likelihood objective functions, assuming that M* ⊂ M, with respect to the parameters of the model for different values of α. Note that θ* = L(1|M*) − L(0|M*) and θ = L(1|M_θ) − L(0|M_θ).

IV. EXAMPLES

In this section we present the application of the MP Framework on two examples: (i) a Bernoulli random variable and (ii) Convolutional Neural Networks (CNNs). In the simple case of the Bernoulli random variable, we plot the objective functions to visualize their properties. Additionally, we examine an alternative to the assumption that the model and the oracle are conditionally independent. The alternative assumption is that the oracle is a subset of the model. As a practical example, we apply the MP framework to CNNs. We show that the log-likelihood objective function and the conventional cross entropy loss are similar. We further extend our derivation to the intersection loss function and obtain a regularization term in the loss. The training of a CNN with the intersection objective function for different values of α is presented and compared with the cross entropy loss with ℓ2 regularization.

A. Bernoulli Random Variable

Consider the Bernoulli random variable B with range R(B) = {0, 1}. Assume that the oracle M* has the log-probability ratio θ* = L(1|M*) − L(0|M*) and the model M_θ is parameterized by θ = L(B = 1|M_θ) − L(B = 0|M_θ). Note that θ and θ* fully characterize the probability distributions using the sigmoid parameterization function, P(1|M_θ) = 1/(1 + e^{−θ}). We assume that B has a uniform distribution over the sample space, i.e. P(1) = 0.5. We can plot the behaviour of these objective functions with respect to the parameters θ and θ*. Note that the only value that is not explicitly known is P(M*, M_θ|v), which, by assuming conditional independence, can be calculated in the following manner:

P(M*, M_θ|v) = P(M*|v) P(M_θ|v)    (22)

P(M*|v) = P(v|M*) P(M*) / P(v)    (23)

The visualization is presented in Fig 2. The maximum of the likelihood function is always achieved in the limit case, while the solution for the intersection function is finite and unique. The uniqueness of the solution for the intersection can be understood by following the plots as θ* changes.

Dependence Assumption: To demonstrate another example, instead of the conditional independence assumption we can assume that M* ⊂ M_θ and further analyze the behaviour of the objective functions. Since M* ⊂ M_θ, we can derive the log-likelihood and intersection objective functions using the MP theorem and the softmax probability family.

L_α(M*|M_θ) ≜ −(1/α) log( Σ_{v∈R(V)} e^{−α(L(v|M_θ) − L(v|M*))} )    (24)

L_α(M*, M_θ) = L_α(M*|M_θ) + L_α(M_θ ≺ V)    (25)

The behaviour of the objective functions for the Bernoulli example is presented in Fig 3. It is noteworthy that, in contrast to the independence assumption, the objective functions are well behaved for optimization purposes. In particular, in Fig 3 the intersection and likelihood objective functions are concave and the gradients do not vanish in the limit cases. In the case of intersection, the objective function tends to become flat in the open region between the value of the true underlying parameter and the parameter value of the prior distribution as α → ∞.
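The following sketch evaluates the intersection objective (17) for the Bernoulli example under the conditional independence assumption and a uniform prior, with additive constants dropped; it is our own illustration, and it reproduces the α = 2 property stated in Section III-B, where the finite maximizer sits at the oracle's parameter:

```python
import numpy as np

def bernoulli(theta):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - p1, p1])

def intersection_objective(theta, theta_star, alpha):
    # L_alpha(M*, M_theta) = L(M*|M_theta) + L_alpha(M_theta < V), Eq. (17),
    # under conditional independence and a uniform prior; constants dropped.
    q, q_star = bernoulli(theta), bernoulli(theta_star)
    loglik = np.log(np.sum(q_star * q))
    p_v = np.array([0.5, 0.5])
    l_alpha = -np.log(np.sum((p_v / q) ** (-alpha))) / alpha
    return loglik + l_alpha

thetas = np.linspace(-6.0, 6.0, 241)
vals = [intersection_objective(t, theta_star=2.0, alpha=2.0) for t in thetas]
print(thetas[int(np.argmax(vals))])   # finite maximizer near theta* = 2 for alpha = 2
```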

B. Application to CNNs

Here we demonstrate the usage of the MP framework in the context of CNNs and the task of image classification. In image classification, the set of labeled data {z^(i) ≜ (x^(i), y^(i))}_{i=1}^{N} is given, where x^(i) ∈ R^n is the image and y^(i) ∈ {1, ..., |R(Y)|} is the corresponding label. To formulate the problem in a probabilistic sense, consider the probability space (Ω, Σ, P). We consider the random variable Y^(i) with range R(Y) corresponding to the i-th label. Similarly, X^(i) is the random variable corresponding to the i-th image, and Z^(i) is the random variable obtained by concatenating the image and the label. We denote the range of all the image random variables by R(X) and that of the labels by R(Y). The CNN model with parameters θ ∈ R^d is the function f : R(X) × R^d → Δ^{|R(Y)|}, where Δ^{|R(Y)|} is the |R(Y)|-dimensional probability simplex. In the probabilistic sense, the CNN is modeled as M_θ ∈ Σ, where P(y|M_θ, x^(i)) is the y-th component of f(x^(i), θ). Since the CNN as a classifier does not model the input distribution, we assume for all x ∈ R(X) that P(x|M_θ) = P(x). Furthermore, we assume that P(x) is equal to the empirical distribution for mathematical convenience. We define the oracle M* to characterize the training data, namely

P(z^(1), ..., z^(N)|M*) = 1.    (26)

1) Log-Likelihood Objective Function: The log-likelihood objective function, assuming that the oracle and the CNN model are independent conditioned on the observables, can be written as

L(M*|M_θ) = log( Σ_{z_1,...,z_N∈R(Z)} P(M*|z_1, ..., z_N) P(z_1, ..., z_N|M_θ) ).    (27)

The CNN model determines the label of any given image independently of the rest of the images. Therefore, we can incorporate the former property as conditional independence of the observables

P(z_1, ..., z_N|M_θ) = Π_{i=1}^{N} P(z_i|M_θ).    (28)

Considering (28) and using Bayes' rule, we can rewrite (27) as

L(M*|M_θ) = log( Σ_{z_1,...,z_N∈R(Z)} [ P(z_1, ..., z_N|M*) P(M*) / P(z_1, ..., z_N) ] Π_{i=1}^{N} P(z_i|M_θ) ).    (29)

Since P(M*|z_1, ..., z_N) is non-zero only when z_1 = z^(1), ..., z_N = z^(N), we can simplify the summation term and obtain

L(M*|M_θ) = log( [ P(z^(1), ..., z^(N)|M*) P(M*) / P(z^(1), ..., z^(N)) ] Π_{i=1}^{N} P(z^(i)|M_θ) )    (30)

= Σ_{i=1}^{N} log P(z^(i)|M_θ) + L(M*) − L(z^(1), ..., z^(N))    (31)

= Σ_{i=1}^{N} log P(y^(i), x^(i)|M_θ) + L(M*) − L(z^(1), ..., z^(N))    (32)

= Σ_{i=1}^{N} log P(y^(i)|M_θ, x^(i)) + Σ_{i=1}^{N} log P(x^(i)|M_θ) + L(M*) − L(z^(1), ..., z^(N)).    (33)

The CNN only determines the first term in (33). Therefore, the log-likelihood objective function effectively reduces to the so-called cross entropy loss in CNNs (without regularization), i.e. the log-probability of the correct label given the model and the image.
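A short sketch of the reduction in (33) (ours, for illustration): with only the first term free, maximizing the log-likelihood is the same as minimizing the usual cross entropy loss over the batch.

```python
import numpy as np

def log_softmax(h):
    m = np.max(h)
    return h - (np.log(np.sum(np.exp(h - m))) + m)

# Two images, three classes: only sum_i log P(y_i | M_theta, x_i) depends
# on the CNN, so the objective reduces to the (negated) cross entropy.
logits = np.array([[2.0, -0.5, 0.3],
                   [0.1, 1.7, -1.2]])
labels = np.array([0, 1])
loglik = sum(log_softmax(h)[y] for h, y in zip(logits, labels))
print(-loglik)   # the familiar cross entropy loss for the batch
```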

2) Intersection Objective Function: We can use the results obtained from the log-likelihood objective function to calculate the intersection objective function. The intersection objective function is defined as

L_α(M*, M_θ) = L(M*|M_θ) + L_α(M_θ ≺ V)    (34)

The log-probability of the model can be calculated as

L_α(M_θ ≺ V) = −(1/α) log( Σ_{z_1,...,z_N∈R(Z)} ( P(z_1, ..., z_N) / P(z_1, ..., z_N|M_θ) )^{−α} ).    (35)

Note that z^(i) and z_i should not be confused. We represent the observations by z^(i), while z_i refers to the values iterated in the summation. Assuming the following properties about the prior

P(z_1, ..., z_N) = Π_{i=1}^{N} P(z_i),    (36)

P(z_i) = P(x_i, y_i) = P(x_i) P(y_i),    (37)

P(y_i) = 1/|R(Y)|,    (38)

and using the property in (28), we can write (35) as

L_α(M_θ ≺ V) = −(1/α) log( Σ_{z_1,...,z_N∈R(Z)} Π_{i=1}^{N} ( P(z_i)/P(z_i|M_θ) )^{−α} )    (39)

= −(1/α) log( Π_{i=1}^{N} Σ_{z_i∈R(Z)} ( P(z_i)/P(z_i|M_θ) )^{−α} )    (40)

= −(1/α) Σ_{i=1}^{N} log( Σ_{z_i∈R(Z)} ( P(z_i)/P(z_i|M_θ) )^{−α} )    (41)

= −(1/α) Σ_{i=1}^{N} log( Σ_{x_i∈R(X), y_i∈R(Y)} ( P(x_i) P(y_i) / [ P(y_i|x_i, M_θ) P(x_i|M_θ) ] )^{−α} ).    (42)

Since P(x|M_θ) = P(x) and is equal to the empirical distribution, we can further simplify the log-probability of the model into

L_α(M_θ ≺ V) = −(1/α) Σ_{i=1}^{N} log( Σ_{y_i∈R(Y)} P(y_i|x^(i), M_θ)^α ) + N log(|R(Y)|).    (43)

We conclude that the intersection objective function for the described CNN is

L_α(M_θ, M*) = Σ_{i=1}^{N} log P(y^(i)|M_θ, x^(i)) − (1/α) Σ_{i=1}^{N} log( Σ_{y_i∈R(Y)} P(y_i|x^(i), M_θ)^α ) + constant.    (44)

Note that by setting α = 1, the intersection objective function becomes equivalent to the log-likelihood objective function, considering that the normalization term appearing when we set α = 1 is implicit in the softmax layer. We can incorporate the intersection objective function by altering the softmax layer and continuing to use the cross entropy (likelihood) loss. The softmax layer exponentiates and normalizes its input. We can generalize its functionality by defining the HyperNormalization layer (HN), HN : R^m → Δ^m, as

HN(h; α)_i = e^{h_i} / ( Σ_{j=1}^{m} e^{α h_j} )^{1/α},    (45)

where HN(h; α)_i is the i-th component of the output. Note that the functionality of log(HN(h; α)) and the first two terms of (44) are similar. Also, by setting α = 1, the HyperNormalization layer becomes equivalent to the softmax layer.
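A minimal sketch of the HyperNormalization layer of (45), computed in log space for numerical stability; this is our own illustration, not the authors' released implementation:

```python
import numpy as np

def log_hn(h, alpha):
    # log HN(h; alpha)_i = h_i - (1/alpha) * log sum_j exp(alpha * h_j), Eq. (45)
    h = np.asarray(h, float)
    m = np.max(h)
    return h - (np.log(np.sum(np.exp(alpha * (h - m)))) / alpha + m)

logits = np.array([2.0, 0.5, -1.0])
print(np.exp(log_hn(logits, 1.0)))   # alpha = 1 recovers the ordinary softmax

# Per-sample intersection objective of Eq. (44), up to constants: evaluate
# the log-HN output at the true label in place of the log-softmax output.
y_true = 0
print(log_hn(logits, 8.0)[y_true])
```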

Fig. 4. The results of training a vanilla CNN using the intersection objective function. The loss represents the cross entropy loss (log-likelihood) commonly used in CNNs.

Fig. 5. The results of training a vanilla CNN with the conventional cross entropy loss and varying coefficients for ℓ2 regularization of parameters.

We experimented with the effects of using the intersection objective function with varying coefficients on CNNs. The experiments were done on the CIFAR10 image dataset [19], labeled into 10 classes, with 50000 training samples and 10000 test samples. The cross entropy loss and accuracy of the CNN on the test set and the training set were measured. Using the intersection objective function, the regularization effect is back-propagated through the network from the HyperNormalization layer. Using α = 1, the objective function is similar to the cross entropy loss without regularization. The effects of training a vanilla CNN (without BatchNorm or Dropout) using the intersection objective function with varying values of α are presented in Fig 4. To compare the results with the conventional cross entropy loss and ℓ2 regularization (also known as weight decay), refer to Fig 5. The ℓ2 regularization loss can be obtained by following the conventional Bayesian framework, namely considering the parameters as a random variable. The ℓ2 regularization loss appears when the normal distribution is considered as the prior distribution of the parameter random variables.

TABLE I
Quantitative comparison of ℓ2 weight regularization versus the MP framework. The regularization column shows the coefficient corresponding to each regularization type. The results of the best-case epoch and the final epoch for each measurement are presented in the table.

Regularization  | Test Loss       | Train Loss     | Test Accuracy (%) | Train Accuracy (%)
                | Best     Last   | Best     Last  | Best     Last     | Best     Last
α = 1           | 0.645    1.725  | 1e-6     1e-6  | 82.16    82.12    | 100      100
α = 8           | 0.695    0.696  | 0.21     0.21  | 78.66    78.55    | 100      100
α = 16          | 1.104    1.104  | 0.771    0.771 | 76.27    76.27    | 100      100
ℓ2 = 0.25e-4    | 0.657    1.3259 | 7e-6     9e-6  | 82.58    82.43    | 100      100
ℓ2 = 0.5e-4     | 0.693    1.211  | 2e-5     3e-5  | 82.29    82.43    | 100      100
ℓ2 = 1e-4       | 0.673    1.173  | 3e-5     1e-4  | 82.58    82.18    | 100      100
ℓ2 = 2e-4       | 0.673    1.651  | 8e-5     3e-4  | 82.58    79.63    | 100      100

Also, the coefficient of the regularization loss is determined by the variance of the normal distribution assumed on the parameters. The goal of the conventional Bayesian framework is to obtain the maximum a posteriori (MAP) estimate of the parameters.

By comparing the test loss among different values of α, we can see that overfitting is prevented by increasing the value of α, while affecting the test accuracy minimally. Also, CNNs trained with α > 1 generalize better than the ℓ2-regularized network when considering the test loss value. In Fig 5 the test loss of ℓ2-regularized networks increases as the training continues. We can make a similar observation in Fig 4 for the unregularized network with α = 1. In Fig 5 we can see that increasing the coefficient of the ℓ2 regularization does not prevent the test loss increase. The network with the largest ℓ2 coefficient achieves the largest error in the final epoch. In contrast with the ℓ2-regularized networks, the test loss of the hypernormalized networks converges to a stable minimum. On the other hand, ℓ2 regularization is more successful in the generalization of accuracy but performs worse in the generalization of loss. The quantitative results of the experiments are shown in Table I. Table I demonstrates the accuracy and loss of the tested methods during the best-case epoch and the final epoch. Table I shows that by increasing the value of α, the gap between the train loss and test loss decreases. Also, the gap between the best and last epoch test loss decreases as α increases, which shows the stability of the generalization process. In the case of ℓ2 regularization, increasing the coefficient does not affect the gap between training and test loss and mostly does not impact the test accuracy.

Note that the accuracy is a surrogate objective function and, although useful, does not fully characterize the performance. We hypothesize that the reason the test accuracy is not improved in our example (with the intersection objective function) is the depth of the network. The regularization effect may disappear, similar to the effect of gradient vanishing. Including more random variables in the middle layers and considering them in the objective function may improve the test accuracy results.

V. DISCUSSION

In the current paper, we have presented and proved the MP theorem. The MP theorem quantifies the upper bound for probabilities of events from their respective conditional distributions. We showed that considering the upper bound as the probability of the event agrees with the existing definitions and extends our ability to quantify the probability of uncertain observations. The MP theorem was used to define models, quantify their probability measure, and develop objective functions, resulting in the MP framework for probabilistic learning. The MP framework treats the parameterized model and the oracle as events in the underlying probability space. Considering the underlying space enables using set operations to represent similarities between two events and construct objective functions. We used the example of using likelihood and intersection as the objective function and showed the sufficient conditions of their solutions.

The MP framework requires a prior distribution over the observable random variable, while the choice of prior is not determined by the framework. Thereby, existing principles for determining priors need to be used in the framework, e.g. MAXENT and Laplace's Principle. The usage of the MP theorem is helpful since the prior only needs to be determined over the observable random variable. As a corollary, the complexity of developing the prior distribution will be relative to the complexity of the observable random variable. It is common that we have prior information about the observables rather than the hidden random variables. In such cases, determining the prior over the observable random variables is more convenient than over the hidden random variables. Our framework is developed for finite-range random variables, and the generalization to the continuous case is left for future works.

The MP framework allows development and analysis of objective functions other than likelihood and intersection, e.g. the Symmetric Difference of events. The motivation to explore other objective functions is to avoid the trivial solutions that are inherently present in likelihood and intersection. Note that the objective functions discussed so far have an important feature: the gradient vectors of the log-likelihood and log-probability of the model are probability vectors. This property enables us to approximate the gradients by Monte Carlo approximation, (i) connecting this framework with stochastic optimization theory and techniques, (ii) enabling black-box optimization of probabilistic models by sampling from their corresponding likelihood functions, and (iii) providing scalability to factorized random variables having an exponentially large number of states. In the MP framework, the formal treatment of probabilistic models, coupled with the ability to optimize large-scale models, helps probabilistic models move toward an axiomatic and practical approach to large-scale machine learning.

APPENDIX

A. Proofs

Theorem 1: Consider the probability space (Ω, Σ, P), the random variable V with finite range R(V) and P_V(.) the probability distribution of V. For any event σ ∈ Σ with the conditional distribution P_{V|σ}(.) the following holds,

P(σ) ≤ inf_{v∈R(V)} { P_V(v) / P_{V|σ}(v) } ≜ P(σ ≺ V)    (46)

P(σ ≺ V) is read as the maximum probability of σ observed by V, and −L(σ ≺ V) = −log(P(σ ≺ V)) is read as the minimum information in σ observed by V.

Proof: ∀σ ∈ Σ and ∀v ∈ R(V) the following is true

P(v) = P(v|σ)P(σ) + P(v|σ̄)P(σ̄)    (47)

Since P(v|σ̄)P(σ̄) ≥ 0, then

P(v) − P(v|σ)P(σ) ≥ 0    (48)

P(σ) ≤ P(v)/P(v|σ).    (49)

Since (49) is true for all v such that P(v|σ) ≠ 0, the following holds

P(σ) ≤ inf_{v∈R(V)} { P(v)/P(v|σ) }    (50)

Theorem 2: For any random variable W that extends V, the following inequality holds

P(σ ≺ W) ≤ P(σ ≺ V)    (51)

Proof: Since W extends V, for all w ∈ R(W) and v ∈ R(V), W⁻¹(w) is either a subset of V⁻¹(v) or does not intersect it. Similarly, P(w, v) is either 0 or P(w, v) = P(w). Defining the set R_v(W) ≜ {w' | w' ∈ R(W), W⁻¹(w') ⊂ V⁻¹(v)}, we can write

P(v) = Σ_{w'∈R(W)} P(w', v) = Σ_{w'∈R_v(W)} P(w', v) = Σ_{w'∈R_v(W)} P(w')    (52)

and similarly

P(v|σ) = Σ_{w'∈R(W)} P(w', v|σ) = Σ_{w'∈R_v(W)} P(w'|σ).    (53)

We know that P(w) ≥ P(w|σ)P(σ). Using Theorem 1, P(σ) can be replaced with P(σ ≺ W) to write

P(w) ≥ P(w|σ)P(σ ≺ W).    (54)

Using (52, 54) we can see that

P(v) = Σ_{w'∈R_v(W)} P(w') ≥ Σ_{w'∈R_v(W)} P(w'|σ)P(σ ≺ W) = P(σ ≺ W) Σ_{w'∈R_v(W)} P(w'|σ).    (55)

Substituting (53) we conclude that

P(v) ≥ P(v|σ)P(σ ≺ W),    (56)

P(v)/P(v|σ) ≥ P(σ ≺ W).    (57)

Since (57) is valid for all possible outcomes of V, the inequality is valid for the minimum of the left-hand side of (57). Consequently

inf_{v'∈R(V)} { P_V(v')/P_{V|σ}(v') } ≥ P(σ ≺ W)    (58)

P(σ ≺ V) ≥ P(σ ≺ W)    (59)

Corollary 1: For any random variables V and Z with concatenation H = (V, Z) the following holds

P(σ ≺ H) ≤ P(σ ≺ V)    (60)

Proof: The proof follows from Theorem 2 if H extends V. By definition of (V, Z), for all outcomes h ∈ R(H), H⁻¹(h) = V⁻¹(v_h) ∩ Z⁻¹(z_h) for some z_h ∈ R(Z) and v_h ∈ R(V) such that h = (v_h, z_h). Therefore every outcome of H is either a subset of some partition induced by V or does not intersect it. Since H extends V, we can directly use Theorem 2 to finalize the proof.

Proposition 1: The probability of an outcome v of random variable V in the sense of maximum probability, defined as

P*_V(v) ≜ sup { P(σ) | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v }, v ∈ R(V)    (61)

has the following property

P*_V(v) = P_V(v), ∀v ∈ R(V).    (62)

Proof: Consider the set Σ_v = {σ | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v}. For all σ_v ∈ Σ_v, σ_v ⊂ V⁻¹(v) by the definition of the preimage. By the monotonicity of probability we can conclude P(σ_v) ≤ P(V⁻¹(v)) for all σ_v. Therefore,

P(V⁻¹(v)) = sup { P(σ) | σ ∈ Σ, ∀ω ∈ σ, V(ω) = v }    (63)

for all v ∈ R(V).

Corollary 2: Considering Definition 3 and Theorem 1, given some set σ_v ∈ Σ where P_{V|σ_v}(v) = 1,

P_V(v) = P(σ_v ≺ V)    (64)

Proof: We can write

P(σ_v ≺ V) = inf_{v'∈R(V)} { P_V(v')/P_{V|σ_v}(v') }    (65)

Note that outcomes with zero probability in P_{V|σ_v} do not pose constraints on the probability of σ_v (refer to (48)). Since P_{V|σ_v} is zero for every v' ≠ v, we get

P(σ_v ≺ V) = P_V(v)/1 = P_V(v).    (66)

Lemma 1: The following inequalities hold

sup_i {a_i} ≤ (1/α) log( Σ_i e^{α a_i} )    (67)

inf_i {a_i} ≥ −(1/α) log( Σ_i e^{−α a_i} ), a_i ∈ R, α > 0    (68)

where the equality holds as α → ∞.

Proof: Let us denote sup_i {a_i} = a*.

(1/α) log( Σ_i e^{α a_i} ) = (1/α) log( Σ_i e^{α(a_i − a* + a*)} ) = a* + (1/α) log( Σ_i e^{α(a_i − a*)} ).    (69)

Since there exists at least one element of the set attaining the supremum, log( Σ_i e^{α(a_i − a*)} ) = log(1 + C) ≥ 0 for some non-negative constant C. Since α > 0, (69) is greater than or equal to the supremum

(1/α) log( Σ_i e^{α a_i} ) = sup_i {a_i} + (1/α) log( Σ_i e^{α(a_i − a*)} ) ≥ sup_i {a_i}.    (70)

In the limit case, since

lim_{α→∞} [ a* + (1/α) log( Σ_i e^{α(a_i − a*)} ) ] = a* + lim_{α→∞} (1/α) log( Σ_i e^{α(a_i − a*)} )    (71)

and since (a_i − a*) ≤ 0,

a* + lim_{α→∞} (1/α) log( Σ_i e^{α(a_i − a*)} ) = a*.    (72)

We can use the fact that −sup_i {−a_i} = inf_i {a_i} to prove the lemma for the infimum case.
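A quick numeric check of Lemma 1 (our illustration): the log-sum-exp bounds tighten to the supremum and infimum as α grows.

```python
import numpy as np

a = np.array([-1.3, 0.2, 0.7, 0.4])
for alpha in [1.0, 4.0, 32.0, 256.0]:
    upper = np.log(np.sum(np.exp(alpha * a))) / alpha     # Eq. (67), >= sup a_i
    lower = -np.log(np.sum(np.exp(-alpha * a))) / alpha   # Eq. (68), <= inf a_i
    print(alpha, upper, lower)
print(a.max(), a.min())   # both bounds converge to these values
```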

Proposition 2: For the softmax probability family of functions P_α defined as

P_α(σ ≺ V) ≜ ( Σ_{v∈R(V)} ( P_V(v)/P_{V|σ}(v) )^{−α} )^{−1/α}, α > 0,    (73)

the following is true,

P_α(σ ≺ V) ≤ P(σ ≺ V), ∀α > 0, ∀σ ∈ Σ,    (74)

and the equality holds as α → +∞. The softmin information family is defined as

−L_α(σ ≺ V) ≜ −log( P_α(σ ≺ V) ).    (75)

Proof: By definition of L we have

L(σ ≺ V) = inf_{v∈R(V)} { L(v) − L(v|σ) }.    (76)

By directly using the inequality proved in Lemma 1 we can write

inf_{v∈R(V)} { L(v) − L(v|σ) } ≥ −(1/α) log( Σ_{v∈R(V)} e^{−α(L(v)−L(v|σ))} )    (77)

L(σ ≺ V) ≥ L_α(σ ≺ V)    (78)

and exponentiating the above inequality we get

P(σ ≺ V) ≥ P_α(σ ≺ V).    (79)

B. Log Likelihood Derivation

The sufficient condition for the global maximizer of L(M*|M) is brought in the following proposition.

Proposition 3: Consider the function L(M*|M_θ) and the following maximization problem

θ̂ = arg max_{θ∈R^d} { L(M*|M_θ) }.    (80)

If M and M* are independent conditioned on V, the sufficient condition for θ̂ to be the global maximizer of (80) is

P(v'|M_θ̂) = 0, ∀v' such that P(v'|M*) ≠ max_{v''∈R(V)} P(v''|M*).    (81)

Proof: We write the log-likelihood as

L(M*|M) = log( Σ_{v∈R(V)} P(M*|v, M) P(v|M) ) − log( Σ_{v'∈R(V)} P(v'|M) )    (82)

= log( Σ_{v∈R(V)} [ P(v|M*)/P(v) ] P(v|M) ) − log( Σ_{v'∈R(V)} P(v'|M) ) + log( P(M*) ).    (83)

The second and third terms on the right hand side of (83) are constants. The first term is the logarithm of a convex combination of the values P(v|M*)/P(v). We can bound the first term

log( Σ_{v∈R(V)} [ P(v|M*)/P(v) ] P(v|M) ) ≤ log( sup_{v∈R(V)} { P(v|M*)/P(v) } ).    (84)

We define R_max(V) ⊂ R(V) as R_max = { v | v ∈ R(V), P(v|M*)/P(v) = sup_{v'∈R(V)} P(v'|M*)/P(v') }. We can check that the equality in (84) holds for some θ' if the condition

P(v|M_θ') = 0 ⟺ v ∉ R_max(V), ∀v ∈ R(V)    (85)

is true. For θ', (84) reduces to

log( Σ_{v∈R_max(V)} [ P(v|M*)/P(v) ] P(v|M_θ') ) = log( sup_{v∈R(V)} { P(v|M*)/P(v) } Σ_{v∈R_max(V)} P(v|M_θ') ) = log( sup_{v∈R(V)} { P(v|M*)/P(v) } ).    (86)

θ' achieves the upper bound, and therefore the condition in (85) is a sufficient condition for the solution of (80).

C. Intersection Derivation

We start by rewriting the conditional independence assumption and the uniform prior

P(M, M*|v) = P(M|v) P(M*|v),    (87)

P(v) = 1/|R(V)|.    (88)

Remember that the sufficient condition for the solution of the intersection objective function was

P(v|M_θ, M*) = P_α(v|M_θ).    (89)

Using the assumptions in (87, 88), we can represent the left hand side of (89), i.e. the posterior, as

P(M_θ, M*|v) P(v) / P(M_θ, M*) = P(M_θ|v) P(M*|v) P(v) / P(M_θ, M*) = P(v|M_θ) P(v|M*) P(M_θ) P(M*) / [ P(M_θ, M*) P(v) ].    (90)

Having (88), we can see that (90) is the element-wise multiplication of two probability vectors followed by normalization; therefore the condition in (89) reduces to

P(v|M_θ) P(v|M*) / C = P_α(v|M_θ),    (91)

C = Σ_{v'∈R(V)} P(v'|M_θ) P(v'|M*).    (92)

1) Setting α = 2: In the case of α = 2, (91) is

P(v|M_θ) P(v|M*) / Σ_{v'∈R(V)} P(v'|M_θ) P(v'|M*) = P(v|M_θ)² / Σ_{v'∈R(V)} P(v'|M_θ)².    (93)

We can show that for all θ* ∈ R^d such that

P(v|M_θ*) = P(v|M*),    (94)

the condition in (91) is satisfied. By plugging θ* into (91) we get

P(v|M_θ*) P(v|M*) / Σ_{v'∈R(V)} P(v'|M_θ*) P(v'|M*) = P(v|M_θ*)² / Σ_{v'∈R(V)} P(v'|M_θ*)².    (95)

If we use (94), we can see that the condition is satisfied. Therefore, in the case of α = 2, the distribution of the oracle is a solution.
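The α = 2 case can be verified numerically; a sketch under the same conditional independence and uniform prior assumptions (distributions chosen by us for illustration):

```python
import numpy as np

q_star = np.array([0.6, 0.25, 0.1, 0.05])  # oracle conditional P(v|M*)
q = q_star.copy()                          # candidate model with P(v|M_theta*) = P(v|M*)

posterior = q * q_star / np.sum(q * q_star)  # left-hand side of Eq. (91)
skeleton = q ** 2 / np.sum(q ** 2)           # alpha-skeleton of the model, alpha = 2
print(np.allclose(posterior, skeleton))      # True, matching Eqs. (93)-(95)
```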

ACKNOWLEDGMENT

REFERENCES

[1] J. O. Berger and J. M. Bernardo. On the development of the reference prior method. Bayesian Statistics, 4(4):35–60, 1992.
[2] J. O. Berger and J. M. Bernardo. Ordered group reference priors with application to the multinomial problem. Biometrika, 79(1):25–37, 1992.
[3] J. O. Berger, J. M. Bernardo, D. Sun, et al. The formal definition of reference priors. The Annals of Statistics, 37(2):905–938, 2009.
[4] J. O. Berger, J. M. Bernardo, D. Sun, et al. Overall objective priors. Bayesian Analysis, 10(1):189–221, 2015.
[5] J. M. Bernardo. Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):113–128, 1979.
[6] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[7] G. Consonni, D. Fouskakis, B. Liseo, I. Ntzoufras, et al. Prior distributions for objective Bayesian analysis. Bayesian Analysis, 13(2):627–679, 2018.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[9] P. Diaconis, D. Ylvisaker, et al. Conjugate priors for exponential families. The Annals of Statistics, 7(2):269–281, 1979.
[10] A. Gelman et al. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, 2006.
[11] M. H. Hansen and B. Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96(454):746–774, 2001.
[12] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
[13] E. T. Jaynes. Information theory and statistical mechanics. II. Physical Review, 108(2):171, 1957.
[14] E. T. Jaynes. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227–241, 1968.
[15] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[16] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 186(1007):453–461, 1946.

[17] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[18] R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370, 1996.
[19] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] D. J. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[22] C. Marsh. Introduction to continuous entropy. 2013.
[23] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014.
[24] T. Seidenfeld. Why I am not an objective Bayesian; some reflections prompted by Rosenkrantz. Theory and Decision, 11(4):413–440, 1979.

Amir Emad Marvasti joined the Computer Science Department of the University of Central Florida as a Ph.D. student in 2014. He currently works at the Computational Imaging Lab (CIL) under the supervision of Professor Hassan Foroosh. He received his B.S. in Computer Engineering from Sharif University of Technology, Tehran, Iran. His research interests are probabilistic machine learning and probability theory. At CIL, he is focusing on the formalization of prior knowledge, regularization of probabilistic models, and development of finite-state probabilistic models.

Ehsan Emad Marvasti received the B.S. degree in computer engineering from the Sharif University of Technology, Tehran, Iran, in 2014. He is currently working toward the Ph.D. degree in computer science at the Department of Computer Science, University of Central Florida, Orlando, FL, USA. He was a member of the Image Processing Lab at Sharif University of Technology from 2012 to 2014 and the Computational Imaging Lab (CIL) at the University of Central Florida from 2014 to 2017. Currently, he is a member of the Connected and Autonomous Vehicle Research Lab (CAVREL). His research interests include vehicular cooperative perception and cognition, cooperative sensor fusion, and machine learning.

Dr. Ulas Bagci is a faculty member at the Center for Research in Computer Vision (CRCV) and the SAIC Chair Professor at the University of Central Florida (UCF). His research interests are artificial intelligence, machine learning, and their applications in biomedical and clinical imaging. Prof. Bagci has more than 200 peer-reviewed articles on related topics. Previously, he was a staff scientist and lab co-manager at the National Institutes of Health's Radiology and Imaging Sciences Department, Center for Infectious Disease Imaging. Dr. Bagci holds two NIH R01 grants (Principal Investigator) and serves as an external committee member of the AIR (Artificial Intelligence Resource) at the NIH. Dr. Bagci has served as an area chair for MICCAI for several years and is an associate editor of top-tier journals in his field, such as IEEE Transactions on Medical Imaging and Medical Physics, and an editorial board member of Medical Image Analysis.

Hassan Foroosh (Senior Member, IEEE) is currently a CAE Link Professor of computer science with the University of Central Florida (UCF), Orlando, FL, USA. He has authored or coauthored over 160 peer-reviewed scientific articles in the areas of computer vision, image processing, and machine learning. He received the Piero Zamperoni Award from the International Association of Pattern Recognition (IAPR) in 2004, the Best Scientific Paper Award from the IAPR International Conference on Pattern Recognition (IAPR-ICPR) in 2008, and the Best Paper Award from the IEEE International Conference on Image Processing (ICIP) in 2018. He is also the Principal Investigator and the Lead of the Science Data Center of the NASA GOLD Mission, which launched a satellite into Earth's geostationary orbit in 2018 to study space weather using ultraviolet imaging. He has been serving on the editorial boards and organizing committees of various IEEE transactions, conferences, and working groups.