Int. J. Appl. Math. Comput. Sci., 2019, Vol. 29, No. 4, 783–796
DOI: 10.2478/amcs-2019-0058

REVISITING THE OPTIMAL PROBABILITY ESTIMATOR FROM SMALL SAMPLES FOR DATA MINING

BOJAN CESTNIK a,b

a Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
b Temida d.o.o., Dunajska cesta 51, 1000 Ljubljana, Slovenia
e-mail: [email protected]

Estimation of probabilities from empirical data samples has drawn close attention in the scientific community and has been identified as a crucial phase in many machine learning and knowledge discovery research projects and applications. In addition to trivial and straightforward estimation with relative frequency, more elaborate probability estimation methods from small samples have been proposed and applied in practice (e.g., Laplace's rule, the m-estimate). Piegat and Landowski (2012) proposed a novel probability estimation method from small samples, Eph√2, that is optimal according to the mean absolute error of the estimation result. In this paper we show that, even though the articulation of Piegat's formula seems different, it is in fact a special case of the m-estimate, where $p_a = 1/2$ and $m = \sqrt{2}$. In the context of an experimental framework, we present an in-depth analysis of several probability estimation methods with respect to their mean absolute errors and demonstrate their potential advantages and disadvantages. We extend the analysis from single-instance samples to samples with a moderate number of instances. We define small samples for the purpose of estimating probabilities as samples containing either fewer than four successes or fewer than four failures, and justify the definition by analysing probability estimation errors on various sample sizes.

Keywords: probability estimation, small samples, minimal error, m-estimate.

1. Introduction

When dealing with uncertainty in various situations, probabilities often come into play (Starbird, 2006). Probabilities are used to assign meaningful numerical values (from the interval between 0 and 1) to express the likelihoods that certain random events will occur. Probabilities are the pillars of many scientific and business disciplines, including statistics (DeGroot and Schervish, 2012), game theory (Webb, 2007), quantum physics (Gudder, 1988), strategic and financial decision-making (Grover, 2012), as well as knowledge discovery and data mining (DasGupta, 2011; Flach, 2012).

Conceptually, there seems to be a controversial dispute in the scientific community about the existence and interpretation of various kinds of probabilities (Good, 1966). In particular, there is a substantial conceptual difference between the so-called frequentists and Bayesians. For frequentists, on the one hand, probability represents a frequency of occurrence of a particular event (Rudas, 2008). It can be estimated from a data sample as a single value by the maximum likelihood estimate. On the other hand, the community of Bayesians shares the view that probabilities themselves can be random variables distributed according to probability density functions and can, therefore, also be used to express more or less uncertain degrees of (subjective) belief that some events will occur (Berger, 1985).

In essence, the problem of probability estimation from data samples can be formulated as follows (Good, 1965). Suppose that we have conducted (observed) and recorded a sample of n independent experiments (instances), out of which there are s successes and f failures (s + f = n). How do we estimate the probability p of success in the domain under observation? Or, analogously, based on the collected experiments, how do we estimate the probability p that the next experiment will be a success?

In this paper we deal with probability estimation methods from small samples, which is a difficult and intricate task (Good, 1965; 1966; Chandra and Gupta, 2011; Chan and Kroese, 2011). The subject is not to be confused with the problem of detecting small samples and determining the optimal sample size for statistical analyses (Good and Hardin, 2012), which is beyond the scope of this paper. To define the concept of "small sample" for the purpose of estimating probabilities, we start with the definition of its opposite: "sufficiently large sample." Probability theory states that if the sample size n is large enough, the differences between diverse probability estimation methods are negligible (Good, 1965). Therefore, we are more interested in determining which sample size reduces the differences between various probability estimation methods than in determining which sample size reduces the estimation errors close to zero. For the study reported in this paper we presume, as asserted by Good (1965) in his book, the following.

Definition 1. (Sufficiently large sample) For the purpose of estimating probabilities, a sample is large enough if both s and f are greater than 3.

As explained and demonstrated in Section 6, on such large samples we can, for all practical purposes, more or less safely use any mathematically sound method for probability estimation, since the differences between the average mean absolute errors of different estimation methods tend to drop below the threshold of 0.01.

In addition to Definition 1, we define the concept of a small sample.

Definition 2. (Small sample) For the purpose of estimating probabilities, a sample is small if either s or f is less than 4.

Note that Definition 2 does not imply that samples containing more than 8 instances are not small samples. In fact, the expected size of a sufficiently large sample depends on the actual probability of success in the sample. The results of our analysis concerning the shortest sufficiently large sample size show that if we are sampling from a population where the actual probability of success is 0.5, the average expected shortest sufficiently large sample size is 10.2 with a median of 10 and a standard deviation of 2.3. If the actual probability of success is 0.2 (or its complement 0.8), the average expected shortest sufficiently large sample size is 20.1 with a median of 19 and a standard deviation of 8.8. More detailed results concerning the analysis of small sample sizes are presented in Section 6. Note, however, that according to Definitions 1 and 2, every sample containing less than or equal to 7 instances is a small sample, regardless of the actual probability of success in the sample.
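Definitions 1 and 2 are easy to operationalize. The following minimal Python sketch (our illustration; the function names and the simulation protocol are assumptions, not code from the paper) encodes both predicates and estimates by simulation the shortest sufficiently large sample size for a given true probability of success, in the spirit of the figures quoted above.

```python
import random
import statistics

def is_sufficiently_large(s: int, f: int) -> bool:
    """Definition 1: both successes and failures exceed 3."""
    return s > 3 and f > 3

def is_small(s: int, f: int) -> bool:
    """Definition 2: either successes or failures number below 4."""
    return s < 4 or f < 4

def shortest_sufficient_size(p: float, rng: random.Random) -> int:
    """Draw Bernoulli(p) trials until the sample first becomes
    sufficiently large; return its size n = s + f."""
    s = f = 0
    while not is_sufficiently_large(s, f):
        if rng.random() < p:
            s += 1
        else:
            f += 1
    return s + f

rng = random.Random(42)
for p in (0.5, 0.2):
    sizes = [shortest_sufficient_size(p, rng) for _ in range(100_000)]
    print(f"p={p}: mean={statistics.mean(sizes):.1f}, "
          f"median={statistics.median(sizes)}, "
          f"stdev={statistics.stdev(sizes):.1f}")
```

With enough runs, the printed statistics for p = 0.5 and p = 0.2 should land near the means, medians, and standard deviations reported above.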
2. Historical background and related work

According to the classical definition, the probability of obtaining a success in the next experiment can be estimated with the relative frequency (Feller, 1968)

$$\hat{p}_{\text{relfr}} = \frac{s}{s+f} = \frac{s}{n}. \qquad (1)$$

Probability estimation with relative frequency fits nicely into the frequentist paradigm and is perfect in the limit case, when the sample size n approaches infinity (or is acceptable if the sample is at least large enough). However, when the sample size is small, there are three major shortcomings of such probability estimates. First, the estimates may take the extreme values 0 and 1, which can be problematic in further numerical calculations. Second, the estimation errors can be considerably high. And third, estimates with relative frequency do not stabilize fast enough with increasing sample size (Piegat and Landowski, 2013).

To curtail the apparent shortcomings of relative frequency, Laplace (1814) proposed the formula

$$\hat{p}_{\text{Laplace}} = \frac{s+1}{s+f+2} = \frac{s+1}{n+2}, \qquad (2)$$

which is also called Laplace's rule of succession (e.g., Feller, 1968; Good, 1965). Laplace's formula (2) fits into the Bayesian probability estimation paradigm, since it assumes a uniform prior probability distribution by adding one fictitious success and one fictitious failure to the data sample used for the estimation.

Note that (2) solves the first problem of relative frequency (1), since $\hat{p}_{\text{Laplace}}$ can never be equal to 0 or 1. In addition, Laplace's rule of succession typically alleviates the second and third problems of relative frequency by reducing the average errors of the estimations (Feller, 1968; Good, 1965).

Inspired by the work of Good (1965), Cestnik (1990) proposed a more general Bayesian method for probability estimation called the m-estimate:

$$\hat{p}_{\text{Cestnik}}(p_a, m) = \frac{s + p_a m}{s+f+m} = \frac{s + p_a m}{n+m}. \qquad (3)$$

The m-estimate has two parameters, $p_a$ and $m$: $p_a$ is a prior probability and $m$ is a weight assigned to the prior probability. The m-estimate was first used in the context of machine learning and enabled several learning algorithms to significantly improve the classification accuracy of the generated models (Cestnik, 1990; Cestnik and Bratko, 1991; Džeroski et al., 1993).
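Estimators (1)–(3) are one-liners to implement. The sketch below (ours, not the paper's code) spells them out, verifies numerically that writing out (3) with $p_a = 1/2$ and $m = \sqrt{2}$ yields $(s + \sqrt{2}/2)/(n + \sqrt{2})$, i.e., the Eph√2 formula of Piegat and Landowski, and compares the estimators' MAE by Monte Carlo simulation under the assumption, made here only for illustration, that the true p is drawn uniformly from (0, 1) on each run.

```python
import math
import random
import statistics

def relfr(s: int, n: int) -> float:
    """(1) relative frequency."""
    return s / n

def laplace(s: int, n: int) -> float:
    """(2) Laplace's rule of succession."""
    return (s + 1) / (n + 2)

def m_estimate(s: int, n: int, p_a: float, m: float) -> float:
    """(3) the m-estimate with prior p_a and weight m."""
    return (s + p_a * m) / (n + m)

def eph_sqrt2(s: int, n: int) -> float:
    """Eph_sqrt2 written out from (3) with p_a = 1/2, m = sqrt(2)."""
    return (s + math.sqrt(2) / 2) / (n + math.sqrt(2))

# The two articulations agree for every (s, n):
assert all(
    math.isclose(eph_sqrt2(s, n), m_estimate(s, n, 0.5, math.sqrt(2)))
    for n in range(1, 20) for s in range(n + 1)
)

# Monte Carlo MAE of each estimator on small samples (here n = 3),
# with the true p drawn uniformly from (0, 1) on each run.
rng = random.Random(0)
n, runs = 3, 200_000
errors = {"relfr": [], "Laplace": [], "Eph_sqrt2": []}
for _ in range(runs):
    p = rng.random()
    s = sum(rng.random() < p for _ in range(n))  # successes in n trials
    errors["relfr"].append(abs(relfr(s, n) - p))
    errors["Laplace"].append(abs(laplace(s, n) - p))
    errors["Eph_sqrt2"].append(abs(eph_sqrt2(s, n) - p))
for name, errs in errors.items():
    print(f"{name:10s} MAE = {statistics.mean(errs):.4f}")
```

On small samples such as n = 3, the Eph√2/m-estimate values should come out with a visibly lower MAE than relative frequency, consistent with the optimality claim revisited in this paper.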
A typical data set used for classification tasks in machine learning (Breiman et al., 1984; Flach, 2012) consists of several learning instances described with attributes and their values. Each learning instance belongs to a designated class. The task of machine learning is to construct a model for predicting the value of the class given the values of the attributes. The m-estimate was intentionally designed for estimating conditional probabilities of a class given the values of selected attributes (Cestnik, 1990). The unconditional probability of the class, estimated from the whole sample, is taken as the prior $\hat{p}_a$. Since the whole data set is usually large enough for reliable probability estimation with any probability estimation method, such unconditional estimates are typically considered sufficiently accurate.

As an error measure of a probability estimation method we use the mean absolute error (abbreviated as MAE in this paper) for easier comparison with the findings reported by Piegat and Landowski (2012). Preliminary experiments with another error measure, the root mean squared error (RMSE), revealed that the general observations and conclusions remain the same regardless of the error measure used. However, in their subsequent papers, Piegat and Landowski (2013; 2014)