
Article

Assessing Probabilistic Inference by Comparing the Generalized Mean of the Model and Source Probabilities

Kenric P. Nelson 1,2

1 Electrical and Computer Engineering Department, Boston University, Boston, MA 02215, USA; [email protected]; Tel.: +1-781-645-8564
2 Raytheon Company, Woburn, MA 01808, USA

Academic Editor: Raúl Alcaraz Martínez Received: 31 May 2017; Accepted: 12 June 2017; Published: 19 June 2017

Abstract: An approach to the assessment of probabilistic inference is described which quantifies the performance on the probability scale. From both information and Bayesian theory, the average of an inference is proven to be the geometric mean of the probabilities reported for the actual outcome and is referred to as the "Accuracy". Upper and lower error bars on the accuracy are provided by the arithmetic mean and the −2/3 mean. The arithmetic mean is called the "Decisiveness" due to its similarity with the cost of a decision, and the −2/3 mean is called the "Robustness" due to its sensitivity to outlier errors. Visualization of inference performance is facilitated by plotting the reported model probabilities versus the histogram-calculated source probabilities. The visualization of the calibration between model and source is summarized on both axes by the arithmetic, geometric, and −2/3 means. From information theory, the performance of the inference is related to the cross-entropy between the model and source distribution. Just as cross-entropy is the sum of the entropy and the divergence, the accuracy of a model can be decomposed into a component due to the source uncertainty and the divergence between the source and model. Translated to the probability domain, these quantities are plotted as the model probability versus the average source probability. The divergence probability is the average model probability divided by the average source probability. When an inference is over/under-confident, the arithmetic mean of the model increases/decreases, while the −2/3 mean decreases/increases, respectively.

Keywords: probability; inference; information theory; Bayesian; generalized mean

1. Introduction

The challenges of assessing a probabilistic inference have led to the development of scoring rules [1–6], which project forecasts onto a scale that permits the arithmetic mean of the score to represent average performance. While any concave function (for positive scores) is permissible as a score, its mean bias must be removed in order to satisfy the requirements of a proper score [4,6]. Two widely used, but very different, proper scores are the logarithmic average and the mean-square average. The logarithmic average, based upon the Shannon information metric, is also a local scoring rule, meaning that only the forecast of the actual event needs to be assessed. As such, no compensation for mean bias is required for the logarithmic scoring rule. In contrast, the mean-square average is based on a Euclidean distance metric, and its local measure is not proper. The mean-square average is made proper by including the distance between non-events (i.e., p = 0) and the forecast of the non-event. In this paper, a misconception about the average probability is highlighted to demonstrate a simpler, more intuitive, and theoretically stronger approach to the assessment of probabilistic forecasts. Intuitively one is seeking the average performance of the probabilistic inference, which may be a

human forecast, a machine algorithm, or a combination. Unfortunately "average" has traditionally been assumed to be the arithmetic mean, which is incorrect for probabilities. Probabilities, as normalized ratios, must be averaged using the geometric mean, which is reviewed by Fleming and Wallace [7] and was discussed as early as 1879 by McAlister [8]. In fact, the geometric mean can be expressed as the log-average:

$$P_{avg} = \exp\left(\sum_{i=1}^{N} w_i \ln p_i\right) = \prod_{i=1}^{N} p_i^{w_i}, \qquad (1)$$

where p = {p_i : i = 1, ..., N} is the set of probabilities and w = {w_i : i = 1, ..., N} is the set of weights formed by such factors as the frequency and utility of event i. The connection with entropy and information theory [9–12] is established by considering a distribution in which the weights are also the probabilities, w_i = p_i, and the distribution of probabilities sums to one. In this case, the probability average is the transformation of the entropy function. For assessment of a probabilistic inference, the set of probabilities is specified as p = {p_ij : i = 1, ..., N; j = 1, ..., M}, with i representing the samples, j representing the classes and j* specifying the labeled true event which occurred. Assuming N independent trials of equal weight w_i = 1/N, the average probability (or forecast) of a probabilistic inference is then:

$$P_{avg} \equiv \prod_{i=1}^{N} p_{i,true}^{1/N}. \qquad (2)$$

A proof using Bayesian reasoning that the geometric mean represents the average probability will be completed in Section 2. In Section 3 the analysis will be extended to the generalized mean to provide assessment of the fluctuations in a probabilistic forecast. Section 4 develops a method of analyzing inference algorithms by plotting the average model probability versus the average probability of the data source.
This provides quantitative visualization of the relationship between the performance of the inference algorithm and the underlying uncertainty of the data source used for testing. In Section 5 the effects of model errors in the mean, variance, and tail decay of a two-class discrimination problem are demonstrated. Section 6 shows how the divergence probability can be visualized and its role as a figure of merit for the level of confidence in accurately forecasting uncertainty.
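As a concrete illustration of Equations (1) and (2), the following minimal Python sketch (not part of the original paper; the function name and sample values are illustrative) computes the average probability as a weighted log-average and contrasts it with the arithmetic mean, which overstates the average performance.

```python
import numpy as np

def average_probability(p_true, weights=None):
    """Geometric mean of the probabilities reported for the actual outcomes,
    computed as a weighted log-average (Equations (1) and (2))."""
    p_true = np.asarray(p_true, dtype=float)
    if weights is None:
        weights = np.full(p_true.shape, 1.0 / p_true.size)  # equal weights w_i = 1/N
    return float(np.exp(np.sum(weights * np.log(p_true))))

# Reported probabilities of the events that actually occurred in five trials
reported = [0.9, 0.9, 0.9, 0.9, 0.1]
print(average_probability(reported))  # geometric mean, approximately 0.58
print(np.mean(reported))              # arithmetic mean, 0.74
```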

2. Proof and Properties of the Geometric Mean as the Average Probability

The introduction showed the origin of the geometric mean of the probabilities as the translation of the entropy function and logarithmic score back to the probability domain. The geometric mean can also be shown to be the average probability from the perspective of averaging the Bayesian [10] total probability of multiple decisions and measurements.

Theorem 1. Given a definition of the average as the normalization of a total formed from independent samples, the average probability is the geometric mean of independent probabilities, since the total is the product of the sample probabilities.

Proof. Given M class hypotheses j and a measurement y which is used to assign a probability to each class, the Bayesian assignment for each class is:

$$p\left(j \mid y\right) = \frac{p\left(y \mid j\right) p(j)}{\sum_{j=1}^{M} p\left(y \mid j\right) p(j)}, \qquad (3)$$

where p(j) is the prior probability of hypothesis j.

If N independent measurements y = {y_1, ..., y_i, ..., y_N} are taken and for each measurement a probability is assigned to the M hypotheses, then the probability of the jth hypothesis for all of the measurements, ⊕_j ≡ {1_j, ..., i_j, ..., N_j}, is:


$$p\left(\oplus_j \mid \mathbf{y}\right) = \prod_{i=1}^{N} p\left(i_j \mid y_i\right), \qquad (4)$$

since each measurement is independent. Note, this is not the posterior probability of multiple probabilities for one decision, but the total probability of N decisions. Given a definition for the average as the normalization of the total, the normalization of Equation (4) must be the Nth root, since the total is a product of the individual samples. Thus the average probability for the jth hypothesis is the geometric mean of the individual probabilities:

$$\overline{p}\left(\oplus_j \mid \mathbf{y}\right) = \left(\prod_{i=1}^{N} p\left(i_j \mid y_i\right)\right)^{1/N}. \qquad (5)$$

Remark 1. The arithmetic mean is not a "robust statistic" [13] due to its sensitivity to outliers. The median is often used as a robust average instead of the arithmetic mean. The geometric mean is even more sensitive to outliers. A probability of zero for one sample will cause the average for all the samples to be zero. However, in evaluating the performance of an inference engine, the robustness of the system may be of higher importance. Thus, in using the geometric mean as a measure of performance it is necessary to determine a precision for the system and limit probabilities to be greater than this precision, and to ensure that the system uses methods which are not over-confident in the tails of the distribution. Section 3 explores this further by defining upper and lower error bars on the average.
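The sensitivity described in Remark 1 is easy to demonstrate. The short sketch below (illustrative values; q_lim = 0.01 follows the precision limit used later in the paper) shows that a single reported probability of zero collapses the geometric mean to zero, while clipping the reports to the precision limit retains a meaningful, though heavily penalized, average.

```python
import numpy as np

def geometric_mean(p):
    return float(np.exp(np.mean(np.log(p))))

reported = np.array([0.9, 0.8, 0.85, 0.0, 0.7])  # one over-confident report of zero
q_lim = 0.01                                      # assumed precision limit

# Without clipping, log(0) = -inf and the geometric mean collapses to 0.
# With clipping, the outlier still dominates but the average remains informative.
print(geometric_mean(np.clip(reported, q_lim, 1 - q_lim)))  # approximately 0.34
```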

The average probability of an inference can be divided into two components in a similar way to how cross-entropy has two components, the entropy and the divergence. Again, these two sources of uncertainty will be defined in the probability space rather than the entropy space. Consider first the comparison of two probability distributions, which are labeled the source and model (or quoted) distributions (p ≡ p_source, q ≡ q_model). The labels anticipate the relationship between a source distribution used to generate feature data and a model distribution which uses feature data to determine a probability inference. From information theory we know that the cross-entropy between two distributions is the sum of the entropy plus the divergence:

$$H(p, q) = H(p) + D(p\|q) = -\sum_{i=1}^{N} p_i \ln p_i - \sum_{i=1}^{N} p_i \ln\frac{q_i}{p_i} = -\sum_{i=1}^{N} p_i \ln q_i. \qquad (6)$$

The additive relationship transforms in the probability domain to a multiplicative relationship:

$$P_{CE}(p, q) \equiv \exp(-H(p, q)) = \prod_{i=1}^{N} p_i^{p_i} \prod_{i=1}^{N}\left(\frac{q_i}{p_i}\right)^{p_i} = P_{Ent}(p)\, P_{Div}(p\|q) = \prod_{i=1}^{N} q_i^{p_i}, \qquad (7)$$

where P_CE is the cross-entropy probability, P_Ent is the entropy probability and P_Div is the divergence probability. The divergence probability is the geometric mean of the ratio of the two distributions:

$$P_{Div}(p\|q) = \prod_{i=1}^{N}\left(\frac{q_i}{p_i}\right)^{p_i}. \qquad (8)$$

The assessment of a forecast (which will be referred to as the model probability) can be decomposed into a component regarding the probability of the actual events (which will be referred to as the source probability) and the discrepancy between the forecast and event probabilities (which will be referred to as the divergence probability) for the true class events i_{j*}:

$$\overline{q} = P_{Model}\left(p_{j^*}, q_{j^*}\right) = P_{Source}\left(p_{j^*}\right) P_{Div}\left(p_{j^*} \| q_{j^*}\right) = \left(\prod_{i=1}^{N} p_{ij^*}\right)^{1/N} \left(\prod_{i=1}^{N} \frac{q_{ij^*}}{p_{ij^*}}\right)^{1/N}. \qquad (9)$$

A variety of methods can be used to estimate the source probabilities p_{j*} = {p_{ij*} : i = 1, ..., N, j* the true class}. Histogram analysis will be used here by binning the nearest neighbors' model forecasts q, though this can be refined by methods such as kernel density estimation.
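A minimal sketch of the decomposition in Equation (9) follows; the probability values are purely illustrative, and the source probabilities are assumed to have already been estimated (e.g., by the histogram analysis described above). It verifies numerically that the model probability factors into the source probability times the divergence probability.

```python
import numpy as np

def geo_mean(x):
    return float(np.exp(np.mean(np.log(x))))

# q_true: model probabilities reported for the events that actually occurred
# p_true: estimated source probabilities of those same events (illustrative values)
q_true = np.array([0.70, 0.55, 0.80, 0.40, 0.65])
p_true = np.array([0.75, 0.60, 0.78, 0.55, 0.70])

P_model  = geo_mean(q_true)           # left-hand side of Equation (9)
P_source = geo_mean(p_true)           # source (entropy) probability component
P_div    = geo_mean(q_true / p_true)  # divergence probability component

assert np.isclose(P_model, P_source * P_div)  # multiplicative decomposition holds
print(P_model, P_source, P_div)
```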

3. Bounding the Average Probability Using the Generalized Mean

An average is necessarily a filter of the information in a collection of samples intended to represent an aspect of the central tendency. A secondary requirement is to estimate the fluctuations around the average. If entropy defines the average uncertainty, then one approach to bounding the average is to consider the standard deviation of the logarithm of the probabilities. Following the procedure of translating entropy analysis to a probability using the exponential function, the fluctuations about the average probability could be defined as:

$$p_\sigma(p) \equiv \exp\left(\left(\frac{1}{N}\sum_{i=1}^{N}(\ln p_i)^2 - \left(\frac{1}{N}\sum_{i=1}^{N}\ln p_i\right)^2\right)^{1/2}\right). \qquad (10)$$

Unfortunately, this function does not simplify into a function of the probabilities without the use of logarithms and exponentials, defeating the purpose of simplifying the analysis. A more important concern is that the lower error bound on entropy (the average of the log probability (the entropy) minus the standard deviation of the log probabilities) can be less than zero. This is equivalent to the upper probability bound, P_avg · p_σ(p), exceeding one, and thus p_σ does not provide a meaningful measure of the variations in the probabilities. An alternative to establishing an error bar on the average probability can be derived from the research on generalized probabilities. The Rényi [14] and Tsallis [15] entropies have been shown to utilize the generalized mean of the probabilities of a distribution [16,17]. The Rényi entropy translates this average to the entropy scale using the natural logarithm and the Tsallis entropy uses a deformed logarithm. Recently the power of the generalized mean m (the symbol m is used instead of p to distinguish it from probabilities) has been shown to be a function of the degree of nonlinear statistical coupling κ between the states of the distribution [18]. The coupling defines a dependence between the states of the distribution and is the inverse of the degree of freedom ν, more traditionally used in statistical analysis. For symmetric distributions with constraints on the mean and scale, the Student's t distribution is the maximum entropy distribution, and is here referred to as a Coupled Gaussian using κ = 1/ν. The exponential distributions, such as the Gaussian, are deformed into heavy-tail or compact-support distributions. The relationship between the generalized mean power, the coupling, and the Tsallis/Rényi parameter q is:

$$m = \frac{2\kappa}{1 + d\kappa} = q - 1, \qquad (11)$$

where d is the dimension of the distribution and will be considered one for this discussion, and the numeral 2 can be a real number for generalization of the Lévy distributions, and 1 for generalization of the exponential distribution. For simplicity, the approach will be developed from the Rényi entropy, but for a complete discussion see [18]. Using the power parameter m and separating out the weight w_i, which is p_i for the entropy of a distribution and 1/N for a scoring rule, the Rényi entropy is:

$$S^R(\mathbf{w}, \mathbf{p}) = -\frac{1}{m}\ln\left(\sum_{i=1}^{N} w_i\, p_i^m\right). \qquad (12)$$

Translating this to the probability domain using exp(−S^R) and setting w_i = 1/N results in the mth weighted mean of the probabilities:

$$P_{avg}^{(m)} = \exp\left(\frac{1}{m}\ln\left(\frac{1}{N}\sum_{i=1}^{N} p_i^m\right)\right) = \left(\frac{1}{N}\sum_{i=1}^{N} p_i^m\right)^{1/m}. \qquad (13)$$

An example of how the generalized mean can be used to assess the performance of an inference model is illustrated in Figure 1. In this example two 10-dimensional Gaussian distributions are sources for a set of random variables. The distributions have equal variance of 1 along each dimension and no correlation, and are separated by a mean of one along each dimension. Three models of the uncertainty are compared. For each model 25 samples are used to estimate the two means and single variance. The models vary the decay of the tail of a Coupled-Gaussian:

$$G_\kappa(x; \mu, \Sigma) \equiv \frac{1}{Z_\kappa(\Sigma)}\left(1 + \kappa\,(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)_+^{-\frac{1}{2}\left(\frac{1}{\kappa} + d\right)}, \qquad (14)$$

where µ and Σ are the location vector and scale matrix, κ is the degree of nonlinear coupling controlling the tail decay, d is the number of dimensions, Z_κ is the normalization, and (a)_+ = max(0, a). In Figure 1a, the tail decay is Gaussian, κ = 0. As more dimensions are modeled the accuracy of the classification performance (bar) and the accuracy of the reported probabilities (gray square at 0.63) measured by the geometric mean (m = 0) reach their maximum at six dimensions.

Figure 1. The sources of random variables are two independent 10-dimensional Gaussians. The model estimates the mean and variance of each dimension using 25 samples. The mean and variance are used as the location and scale for three Coupled-Gaussian models, (a) κ = 0 Gaussian, (b) κ = 0.162 Heavy-tail, (c) κ = 0.095 Compact-support.

For eight and 10 dimensions, the classification does not improve and the probability accuracy degrades. This is because the estimation errors in the mean and variance eventually result in an over-fit model. The decisiveness (upper gray error bar) approximates the percentage of correct classification (light brown bar) and the robustness (lower gray error bar) goes to zero. In contrast, (b) the heavy-tail model with κ = 0.162 is able to maintain an accuracy of 0.69 for dimensions 6, 8 and 10. The heavy-tail makes the model less susceptible to over-fitting. For (c) the compact-support model (κ = 0.095), even the 2-dimensional model has inaccurate probabilities (0), because a single probability report of zero causes the geometric mean to be zero.
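Equation (13) is straightforward to compute. The sketch below (an illustration under assumed sample values, not code from the paper) evaluates the generalized mean of a set of true-event probabilities at the three powers used throughout this paper, showing how the m = 1 and m = −2/3 means bracket the geometric mean and how a single low report pulls the −2/3 mean down.

```python
import numpy as np

def generalized_mean(p, m):
    """Weighted generalized mean of Equation (13) with equal weights 1/N.
    m = 1: Decisiveness, m = 0: Accuracy (geometric mean), m = -2/3: Robustness."""
    p = np.asarray(p, dtype=float)
    if m == 0:
        return float(np.exp(np.mean(np.log(p))))
    return float(np.mean(p ** m) ** (1.0 / m))

reported = [0.9, 0.8, 0.7, 0.3, 0.05]  # illustrative true-event probabilities
for m in (1, 0, -2/3):
    print(m, round(generalized_mean(reported, m), 3))
# The m = 1 mean (0.55) exceeds the geometric mean (0.38), while the m = -2/3
# mean (0.24) is pulled down by the single low report, giving upper and lower
# error bars on the accuracy.
```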

4. Comparing Probability Inference Performance with the Data Distribution

To better understand the sources of inaccuracy in a model, Equation (9) can be used to separate the error into the "source probability" of the underlying data source and the "divergence probability" between the model and the data source. Doing so requires an estimate of the source distribution of the data for a reported probability. This is accomplished using histogram analysis of the reported probabilities, though improved estimates are possible utilizing, for example, Bayesian priors for the bins, kernel methods, and k-nearest neighbors. In addition to the splitting of the components of the standard entropy measures, the generalized entropy equivalents using the generalized mean are also split into source and divergence components as described below. The histogram results and the overall metrics (arithmetic, geometric, and −2/3 mean) are visualized in a plot of the model probability versus the source probabilities.

Figure 2 shows: (a) the parameters, (b) the likelihoods, (c) the posterior probabilities, and (d) the receiver operating characteristic (ROC) curve of a perfectly matched model for two classes with one-dimensional Coupled Gaussian distributions. The distribution data source in Figure 2c is shown as solid green and orange curves. Overlaying the source distributions are the model distributions, shown as dashed curves. The heavy-tail decay (κ = 0.5) has the effect of keeping the posterior probabilities from saturating away from the mean shown in Figure 2c; instead, the posteriors return gradually to the prior distribution. The actual ROC curve in Figure 2d uses the model probabilities to classify and the source probabilities to determine the distribution. The predicted ROC curve uses the model probabilities for both the classification and the calculation of the distribution.

Figure 2. A two-class classification example with single-dimension Coupled Gaussian distributions. (a) The parameters for the orange class source and model are shown; (b) The likelihood distributions are shown. The solid (source) and dashed (model) overlap in this case, representing a perfectly matched model; (c) The posterior probabilities assuming equal priors for each class; (d) The receiver operating characteristic curve (solid) and a "predicted" ROC curve (dashed), which is determined using only the model distributions.

While the ROC curve is an important metric for assessing probabilistic inference algorithms, it does not fully capture the relationship between a model and the distribution of the test set. Figure 3 introduces a new quantitative visualization of the performance of a probabilistic inference. The approach builds upon histogram plots which are used to calibrate probabilities, shown as green and orange bubbles for each class. The histogram information is quantified into three metrics, the arithmetic mean (called Decisiveness), the geometric mean (called Accuracy) and the −2/3 mean (called Robustness), shown as brown marks. In the case of a model perfectly matched with the source, all the histogram bubbles and each of the metrics align along the 45° diagonal representing equality between the model probabilities and the source probabilities. In Section 5 illustrations will be provided which show how these metrics can show the gap in performance of a model, whether the model is under or over-confident, and the limitations to discrimination of the available information.

Because the goal is to maximize the overall performance of the algorithm, the model probability is shown on the y-axis. The distribution of the source or test data is treated as the independent variable on the x-axis. This orientation is the reverse of common practice [19,20], which bins the reported probabilities and shows the distribution vertically; however, the adjustment is helpful for focusing on increasing model performance. In Figure 3 the orange and green bubbles are individual histogram data. Neighboring model probabilities for the green or orange class are grouped to form a bin. The y-axis location is the geometric mean of the model probabilities. The x-axis location is the ratio of the samples with the matching green or orange class divided by the total number of samples.

Figure 3. The average model probability versus the average source probability for the model/source of Figure 1. The orange and green bubbles represent individual bins, with bins grouped by the model probability and the ratio of class to total distribution represented by the source probability. The brown marks are the overall averages (arithmetic, geometric, and −2/3). For a perfect model all are on the identity.

The process for creating the model versus source probability plot, given a test set of reported or model probabilities Q = {q_ij : i = 1, ..., N, j = 1, ..., M}, where i is the event number, j is the class label and q_{ij*} is the labeled true event, is as follows (a code sketch collecting these steps is given after the list). There are some distinctions between the defined process and that used to create the simulated data in Figure 3 through Figure 6 which will be noted.

(1) Sort data: The probabilities reported are sorted. As shown the sorting is separated by class, but could also be for all the classes. A precision limit q_lim is defined such that all the model probabilities satisfy q_lim ≤ q ≤ 1 − q_lim.


(2) Bin data: Bins with an equal amount of data are created to ensure that each bin has adequate data for the analysis. Alternatively, cluster analysis could be completed to ensure that each bin has elements with similar probabilities.

(3) Bin analysis: For each bin k the average model probability, the source probability, and the contribution to the overall accuracy are computed.

(a) Model probability (y-axis): Compute the arithmetic, geometric, and −2/3 mean of the model probabilities for the actual event:

$$Q_k^{(m)} = \begin{cases} \left(\prod_{i=1}^{N_k} q_{ij^*k}\right)^{1/N_k} & m = 0 \\ \left(\frac{1}{N_k}\sum_{i=1}^{N_k} q_{ij^*k}^m\right)^{1/m} & m = 1, -2/3, \end{cases} \qquad (15)$$

where m is the power of the weighted generalized mean.

(b) Source probability (x-axis): Compute the frequency of true events in each bin:

$$P_k = \frac{\sum_{j=1}^{M} N_{j^*k}}{\sum_{j=1}^{M} N_{jk}}, \qquad (16)$$

where N_{jk} and N_{j*k} are the number of class-j elements in bin k and the number of true class-j elements in bin k, respectively. If a source probability is calculated for each sample using k-nearest neighbor or kernel methods, then the arithmetic, geometric, and −2/3 mean can also be calculated for the source probabilities.

(c) Contribution to overall accuracy (size): Figure 3 through Figure 6 show bubble sizes proportional to the number of true events; however, a more useful visualization is the contribution each bin makes to the overall accuracy. This is determined by weighting the model average by the source average. Taking the logarithm of this quantity makes a linear scale for the bubble size:

$$\frac{\sum_{j=1}^{M} N_{j^*k} \Big/ \sum_{j=1}^{M} N_{jk}}{\sum_{k=1}^{b} \left( \sum_{j=1}^{M} N_{j^*k} \Big/ \sum_{j=1}^{M} N_{jk} \right)}\; \ln\left(\prod_{i=1}^{N_{j^*k}} q_{ij^*k}\right)^{1/N_{j^*k}}. \qquad (17)$$

(4) Overall Statistics: The overall statistics are determined by the generalized mean of the individual bin statistics.

(a) Overall Model Statistics:

$$\left(\frac{1}{B}\sum_{k=1}^{B}\frac{1}{N_k}\sum_{i=1}^{N_k} q_{ij^*k}^m\right)^{1/m}, \qquad (18)$$

where B is the number of bins.


(b) Overall Source Statistics:

$$\left(\frac{1}{B}\sum_{k=1}^{B}\left(\frac{\sum_{j=1}^{M} N_{j^*k}}{\sum_{j=1}^{M} N_{jk}}\right)^m\right)^{1/m}. \qquad (19)$$

In Figure 3 through Figure 6 the brown marks are located at the (x, y) point corresponding to the (model, source) mean for the three metrics Decisiveness (arithmetic, m = 1), Accuracy (geometric, m = 0), and Robustness (m = −2/3).
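The steps above can be collected into a short sketch. This is a simplified, single-class illustration under stated assumptions (the function names, the equal-count binning via numpy's array_split, and the q_lim guard on bins with no true events are choices of this sketch rather than requirements of the paper); with equal-count bins, Equation (18) reduces to the generalized mean over all samples.

```python
import numpy as np

def generalized_mean(x, m):
    x = np.asarray(x, dtype=float)
    return float(np.exp(np.mean(np.log(x)))) if m == 0 else float(np.mean(x ** m) ** (1 / m))

def model_vs_source(q_class, is_class, n_bins=10, q_lim=0.01):
    """Simplified single-class sketch of steps (1)-(4).

    q_class  : reported model probability that each sample belongs to the class
    is_class : True where the labelled class of the sample matches that class
    Returns per-bin (model, source) probabilities and the overall statistics
    for m = 1 (Decisiveness), 0 (Accuracy) and -2/3 (Robustness).
    """
    q = np.clip(np.asarray(q_class, dtype=float), q_lim, 1 - q_lim)  # step (1): precision limit
    order = np.argsort(q)                                            # step (1): sort
    q, truth = q[order], np.asarray(is_class)[order]
    bins = np.array_split(np.arange(q.size), n_bins)                 # step (2): equal-count bins

    per_bin = []
    for idx in bins:                                                 # step (3): bin analysis
        model_k = generalized_mean(q[idx], m=0)                      # Equation (15), m = 0
        source_k = truth[idx].mean()                                 # Equation (16)
        per_bin.append((model_k, source_k))

    overall = {m: (generalized_mean(q, m),                                    # Equation (18)
                   generalized_mean([max(s, q_lim) for _, s in per_bin], m))  # Equation (19)
               for m in (1, 0, -2/3)}
    return per_bin, overall
```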

5. Effect of Model Errors in Mean, Variance, Shape, and Amplitude

The three metrics (Decisiveness, Accuracy, and Robustness) provide insight about the discrepancy between an inference model and the source data. The accuracy (geometric mean) always degrades due to errors, while the changes in the decisiveness and robustness provide an indication of whether an algorithm is under or over-confident. The discrepancy in accuracy will be quantified as the "probability divergence" using Equation (8) and discussed more thoroughly in Section 6. The effect of four errors (mean, variance, shape, and amplitude) on one of the two class models will be examined.

In Figure 4 the mean of the model (a) is shifted from 2 to 1 for the orange class. The closer spacing of the hypotheses shown in (b) results in under-confident posterior probabilities shown in (c), where the dashed curves of the model are closer to 0.5 than the solid source curves. The actual ROC curve shown as a solid curve in (d) is better than the forecasted ROC curve (dashed). Each histogram bin for the analysis is a horizontal slice of the neighboring probabilities in Figure 4c. A source of error in estimating the source probabilities is that model probabilities are drawn from multiple features; two in the one-dimensional single modal distributions used here for demonstration. For more complex distributions with many modes and/or many dimensions the analysis would have to be segmented by feature regions.

Figure 4. (a,b) The orange model (dashed line) has a mean of 1 rather than the source mean of 2. This results in (c) under-confident posterior probabilities and a shift in the maximum posterior decision boundary. (d) While the ROC curve and Maximum Posterior classification performance are only modified slightly from the performance of the perfect model, the predicted ROC curve shows a significant discrepancy.


Figure 5a shows how the under-confident model affects the model probability versus the source probability plot. The green and orange class bubbles separate. For source probabilities greater/less than 0.5 the model probabilities tend to be lower/higher, reflecting the higher uncertainty reported by the model. The large brown mark, which represents the average probability reported by the model versus the average probability of the source distribution, is below the equal probability dashed line due to the error in the model. The decisiveness and robustness metrics (smaller brown marks) have an orientation with the x-axis which is less than 45°, indicative of the model being less decisive but more robust (i.e., smaller variations in the reported uncertainty). The decisive metric, which is the arithmetic mean, has a larger average, while the robust metric, which is the −2/3 mean, has a smaller average. Figure 5b shows the contrasting result when the orange class model has an over-confident mean shifted from 2 to 3. The orange and green class bubbles now have average model probabilities which are more extreme than the source probabilities. The overall reported probabilities still decrease relative to a perfect model (large brown mark), but the orientation of the decisive and robust metrics is now greater than 45°.

Figure 5. The model probability versus source probability for (a) the under-confident mean model pictured in Figure 4 and (b) an over-confident mean model. The green and orange class bubbles separate along with the three averages for each class. The large brown mark is the overall accuracy, which is now below the dashed line for both the under and over-confident models. For (a) the alignment of the three overall metrics (brown) is less than 45°, indicating under-confidence. For (b) the alignment of the three overall metrics is greater than 45°, indicating over-confidence.

Other model errors have a similar effect in lowering the model average below equality with the source average, and shifting the orientation of the decisiveness and robustness to be greater or less than 45°. The full spectrum of the generalized mean may provide approaches for more detailed analysis of an algorithm's performance, but will be deferred. In Figure 6 the effects of under and over-confident variance and tail decay are shown. When the model variance is inaccurate (a,b) the changes to the three metrics are similar to the effect of an inaccurate mean. The model accuracy decreases and the angle of orientation for the decisive/robust metrics increases or decreases for over or under confidence. When the tail decay is over-confident (c) the probabilities in the middle of the distribution are accurate, but the tail inaccuracy can significantly increase the angle of orientation of the decisive/robust metrics. When the tail decay is under-confident (d) the probabilities are truncated near the extremes of 0 and 1. An under-confident model of the tails can be beneficial in providing robustness against other errors.



Figure 6. The model versus source average probability when the orange class variance model is over or under confident. (a) The orange model has a variance of 0.5 rather than the source variance of 1.0. The model accuracy (large brown mark) is lower than the source accuracy and the orientation of the decisive/robust metrics (smaller brown marks) is greater than 45°; (b) When the variance is over-confident (2.0) the accuracy also decreases and the decisive/robust orientation is less than 45°; (c) The coupling or tail decay of the model is over-confident. κ = 0 is a Gaussian distribution. Only the probabilities near 0 or 1 are inaccurate, but this significantly increases the orientation of the decisive/robust metrics angle above 45°; (d) When the tail model is under-confident, reported probabilities near 0 or 1 are truncated. This can provide robustness against other sources of errors.
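The rotation of the Decisiveness and Robustness metrics described in this section can be reproduced with a small simulation. The sketch below is illustrative only: the sharpening transform used to mimic over- and under-confidence is an assumption of this example, not the Coupled-Gaussian models used in the figures.

```python
import numpy as np

def generalized_mean(x, m):
    x = np.asarray(x, dtype=float)
    return float(np.exp(np.mean(np.log(x)))) if m == 0 else float(np.mean(x ** m) ** (1 / m))

rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.8, size=20000)  # source probability of class A for each sample
y = rng.random(p.size) < p             # actual outcomes drawn from the source

def sharpen(p, gamma):
    """gamma > 1: over-confident reports, gamma < 1: under-confident (illustrative)."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma)

for gamma in (0.5, 1.0, 2.0):
    q = sharpen(p, gamma)
    q_true = np.clip(np.where(y, q, 1 - q), 0.01, 0.99)  # reported probability of the actual outcome
    print(gamma, [round(generalized_mean(q_true, m), 3) for m in (1, 0, -2/3)])
# Relative to the calibrated model (gamma = 1), over-confidence (gamma = 2) raises the
# m = 1 (Decisive) mean and lowers the m = -2/3 (Robust) mean; under-confidence does
# the opposite, while the m = 0 accuracy is highest for the calibrated model.
```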

6. The Divergence Probability as a Figure of Merit

The accuracy of a probabilistic inference, measured by the source probability, as described in the previous sections, is affected by two sources of error, the uncertainty in the distribution of the source data and the divergence between the model and source distributions. The divergence can be isolated by dividing out the probability associated with the source from the model probability. This is a rearrangement of Equation (9), P_Div(p_{j*}‖q_{j*}) = P_Model(p_{j*}, q_{j*})/P_Source(p_{j*}), where q_{j*} is the vector of true state reported probabilities and p_{j*} is the vector of estimated source probabilities. The divergence probability can be utilized as a figure of merit regarding the degree of confidence in an algorithm's ability to report accurate probabilities.

The source probability reflects the uncertainty in the data source based on issues such as the discrimination power of the features.

Figure 7 shows a geometric visualization of the divergence probability for the example when the tail is over-confident (model κ = 0 and source κ = 0.5). The model probability corresponds to the y-axis location of the algorithm accuracy shown as a brown mark. The source probability is the x-axis location of the accuracy. The ratio of these two quantities (divergence probability) is equivalent to projecting the y-axis onto the point where the x-axis is length 1. The resulting quantities are P_Model(p_{j*}, q_{j*}) = 0.56, P_Source(p_{j*}) = 0.62, and P_Div(p_{j*}‖q_{j*}) = P_Model(p_{j*}, q_{j*})/P_Source(p_{j*}) = 0.90.


Figure 7. The accuracy of an inference algorithm measured by the model probability (y-axis) consists of two components, the source probability (x-axis) and the divergence probability (projection of the cross-entropy probability onto the 0 to 1 scale).

7. Discussion

The role and importance of the geometric mean in defining the average probabilistic inference has been shown through a derivation from information theory, a proof using Bayesian analysis, and via examples of a simple two-class inference problem. The geometric mean of a set of probabilities can be understood as the central tendency because probabilities, as ratios, reside on a multiplicative rather than an additive scale.

Two issues have limited the role of the geometric mean as an assessment of the performance of probabilistic forecasting. Firstly, information theoretic results have been framed in terms of the arithmetic mean of the logarithm of the probabilities. While optimizing the logarithmic score is mathematically equivalent to optimizing the geometric mean of the probabilities, use of the logarithmic scale obscures the connection with the original set of probabilities. Secondly, because the log score and the geometric mean are very sensitive averages, algorithms which do not properly account for the role of tail decay in achieving accurate models suffer severely against these metrics. Thus metrics, such as the quadratic mean, which keep the score for a forecast of p = 0 finite and are made proper by subtracting the bias, have been attractive alternatives.


Two recommendations are suggested when using the geometric mean to assess the average performance of a probabilistic inference algorithm. First, careful consideration of the tail decay in the source of the data should be accounted for in the models. Secondly, a precision should be specified which restricts the reported probabilities to the range q_lim ≤ q ≤ 1 − q_lim. Unlike an arithmetic scale, on the multiplicative scale of probabilities precision affects the range of meaningful quantities. In the examples visualizing the model versus source probabilities, the precision is q_lim = 0.01 and matches the bin widths. The issue of precision is particularly important when forecasting rare events. Assessment of the accuracy needs to consider the size of the available data and how this affects the precision which can be assumed.

The role of the quadratic score can be understood to include an assessment of the "Decisiveness" of the probabilities rather than purely the accuracy. In the approach described here this is fulfilled by using the arithmetic mean of the true state probabilities. Removing the bias in the arithmetic mean via incorporation of the non-true state probabilities is equivalent to the quadratic score; however, by keeping it a local mean the method is simplified and insight is provided regarding the degree of fluctuation in the reported probabilities. Balancing the arithmetic mean, the −2/3 mean provides an assessment of the "Robustness" of the algorithm. In this case, the metric is highly sensitive to reported probabilities of the actual event which are near zero. This provides an important assessment of how well the algorithm minimizes fluctuations in reporting the uncertainty and provides robustness against events which may not have been included in the testing. Together the arithmetic and −2/3 metrics provide an indication of the degree of variation around the central tendency measured by the geometric mean.

The performance of an inference algorithm, including its tendency for being under or over-confident, can be visualized by plotting the model versus source averages. As just described, the model averages are based on the generalized mean of the actual events. The contribution to the uncertainty from the source is determined by binning similar forecasts. The source probability for each bin is the frequency of probabilities associated with a true event in each bin. The arithmetic, geometric, and −2/3 means of the source probability for each bin are used to determine the overall source probability means. The geometric mean of the model probabilities is called the model probability and is always less than the source probability. The ratio of these quantities is the divergence probability, Equation (9), and is always less than one. A perfect model has complete alignment between the source and model probabilities, and in turn the three metrics are aligned along the 45° line of the model versus source probability plot. If the algorithm is over-confident the arithmetic mean of the reported probabilities increases and the −2/3 mean decreases, causing a rotation in the orientation of the three metrics to a higher angle. In contrast an under-confident algorithm causes a rotation to a lower angle. This was illustrated using a two-class discrimination problem as an example, showing the effects of errors in the mean, variance, and tail decay. The three metrics provide a quantitative measure of the algorithm's performance as a probability and an indication of whether the algorithm is under or over-confident. Distinguishing between the types of error (mean, variance, and decay) may be possible by analyzing the full spectrum of the generalized mean and is recommended for future investigations.

Acknowledgments: Reviews of this work with Fred Daum and Carlos Padilla of Raytheon assisted in refining the methods.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Jose, V.R.R.; Nau, R.F.; Winkler, R.L. Scoring rules, generalized entropy, and utility maximization. Oper. Res. 2008, 56, 1146. [CrossRef]
2. Jewson, S. The problem with the Brier score. arXiv 2004, arXiv:physics/0401046.
3. Ehm, W.; Gneiting, T. Local Proper Scoring Rules; Department of Statistics, University of Washington: Seattle, WA, USA, 2009.
4. Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [CrossRef]
5. Dawid, A.P. The geometry of proper scoring rules. Ann. Inst. Stat. Math. 2007, 59, 77–93. [CrossRef]
6. Dawid, A.; Musio, M. Theory and applications of proper scoring rules. Metron 2014. [CrossRef]
7. Fleming, P.J.; Wallace, J.J. How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM 1986, 29, 218–221. [CrossRef]
8. McAlister, D. The law of the geometric mean. Proc. R. Soc. Lond. 1879, 29, 367–376. [CrossRef]
9. Khinchin, A.I. Mathematical Foundations of Information Theory; Courier Corporation: North Chelmsford, MA, USA, 1957.
10. Bernardo, J.M.; Smith, A.F.M. Inference and Information. In Bayesian Theory; John Wiley & Sons: Hoboken, NJ, USA, 2009; pp. 67–81.
11. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1949; p. 114.
12. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
13. Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2009.
14. Rényi, A. On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Stat. Probab. 1961, 1, 547–561.
15. Tsallis, C. Nonadditive entropy and nonextensive statistical mechanics—an overview after 20 years. Braz. J. Phys. 2009, 39, 337–356. [CrossRef]
16. Nelson, K.P.; Scannell, B.J.; Landau, H. A Risk Profile for Information Fusion Algorithms. Entropy 2011, 13, 1518–1532. [CrossRef]
17. Oikonomou, T. Tsallis, Renyi and nonextensive Gaussian entropy derived from the respective multinomial coefficients. Phys. A Stat. Mech. Appl. 2007, 386, 119–134. [CrossRef]
18. Nelson, K.P.; Umarov, S.R.; Kon, M.A. On the average uncertainty for systems with nonlinear coupling. Phys. A Stat. Mech. Its Appl. 2017, 468, 30–43. [CrossRef]
19. Cohen, I.; Goldszmidt, M. Properties and benefits of calibrated classifiers. Lect. Notes Comput. Sci. 2004, 3202, 125–136.
20. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), Bonn, Germany, 7–11 August 2005; pp. 625–632.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).