On the Statistical Differences Between Binary Forecasts and Real World

Home , Nassim Nicholas Taleb

On the Statistical Differences between Binary Forecasts and Real World Payoffs Nassim Nicholas Taleb∗† ∗ NYU Tandon School of Engineering † Universa Investments Forthcoming, International Journal of Forecasting, 2020

I.INTRODUCTION/ABSTRACT 3) Machine Learning: Some nonlinear payoff functions, while not lending themselves to verbalistic expressions and There can be considerable mathematical and statistical dif- "forecasts", can be well captured by ML or expressed in option ferences between the following two items: contracts. 1) (univariate) binary predictions, bets and "beliefs" (ex- 4) Fattailedness: The difference is exaggerated when the pressed as a specific "event" will happen/will not hap- variable under consideration lies in the power law classes of pen) and, on the other, probability distributions. 2) real-world continuous payoffs (that is, numerical benefits 5) Model error: Binary forecasts are not particularly prone or harm from an event). to model error; real world payoffs are. Way too often, the decision science and economics literature The paper is organized as follows. We first present the uses one as a proxy for another. Some results, say overestima- difference in statistical properties thanks to precise mathe- tion of tail probability, by humans can be stated in one result1 matical definitions of the two types in section II. The text and unwarranted conclusions that people overestimate tail risk is structured with (numbered) "definitions","comments", and have been chronically made since.23 "examples". Section III presents the differences in the context In this paper we show the mischaracterization as made in of Gaussian-like and fat tailed environments (that is, the class the decision-science literature and presents the effect of their of distributions dominated by remove events), a separation conflation. We also examine the differences under thin and based on the presence or absence of a characteristic scale. Sec- fat tails –for under Gaussian distributions the effect can be tion IV develops the mathematics of spurious overestimation, marginal, which may have lulled the psychology literature into comparing the properties of payoffs under thin tails (section A) the conflation. and Fat Tails (section B), discusses the conflation and presents The net effects are: the impact of model error (section D). Section V applies 1) Spuriousness of many psychological results: This af- to the calibration in psychological experiments. Section VI fects risk management claims, particularly the research results presents a catalogue of scoring metrics. Section VII shows the to the effect that humans overestimate the risks of rare events. loss functions of machine learning and how they fit nonlinear Many perceived "biases" are shown to be just mischaracteri- payoffs in practical applications. zations by psychologists. We quantify such conflations with a The appendix shows the mathematical derivations and exact metric for "pseudo-overestimation". distribution of the various payoffs, along with an exact explicit 2) Being a "good forecaster" in binary space doesn’t functions for the Brier score helpful for other applications lead to having a good actual performance : The reverse such as significance testing and sample sufficiency (new to is also true, and the effect is exacerbated under nonlinearities. the literature).

arXiv:1907.11162v3 [q-fin.GN] 29 Dec 2019 A binary forecasting record is likely to be a reverse indicator under some classes of distributions or deeper uncertainty. II.CONTINUOUSVS.DISCRETE PAYOFFS:DEFINITIONS AND COMMENTS December 4, 2019. A version of this article was presented at the M4 competition conference in New York in October 2018. The author thanks Example 1 ("One does not eat beliefs and (binary) forecasts"). Marcos Carreira, Peter Carr, Pasquale Cirillo, Zhuo Xi, two invaluable and In the first volume of the Incerto ( Fooled by Randomness, anonymous referees, and, of course, the very, very patient Spyros Makridakis. 2001 [9]), the narrator, a trader, is asked by the manager A particular thanks to Raphael Douady who took the thankless task of reviewing the mathematical definitions and derivations. "do you predict that the market is going up or down?" "Up", 1The notion of "calibration" is presented and explained in Fig. 1 and its he answered, with confidence. Then the boss got angry when, captions; for its origin, see Lichtenstein et al ([1], and more explicitly [2] "The looking at the firm’s exposures, he discovered that the narrator primary bias is the "overestimation of low frequencies and underestimation of ...high frequencies") and most notably since the influential Kahneman and was short the market, i.e., would benefit from the market going Tversky paper on prospect theory [3]). Miscalibration and the point of the down. article are shown in Fig. 2. The trader had difficulties conveying the idea that there 2 See further [4] , A more recent instance of such conflation is in a review was no contradiction, as someone could hold the (binary) paper by Barberis [5]). 3Some alternative mechanisms in say [6] and more generally Gigerenzer’s belief that the market had a higher probability of going up group, see a summary in [7]. than down, but that, should it go down, there is a very 2

Correct Frequency ''Actual'' Pr

1.0 1.0 ''Normative'' 0.8 0.8 ''Descriptive''

0.6 0.6

0.4 0.4 Perfect calibration over 0.2 0.2 under

Forecast Judged Pr 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0

Fig. 1. Probabilistic calibration (simpliﬁed) as seen in the psychology Fig. 3. "Typical patterns" as stated and described in Baron’s textbook literature. The x axis shows the estimated probability produced by the [8], a representative claim in psychology of decision making that people forecaster, the y axis the actual realizations, so if a weather forecaster predicts overestimate small probability events. We note that to the left, in the estimation 30% chance of rain, and rain occurs 30% of the time, they are deemed part, 1) events such as ﬂoods, tornados, botulism, mostly patently fat tailed "calibrated". We hold that calibration in frequency (probability) space is an variables, matters of severe consequences that agents might have incorporated academic exercise (in the bad sense of the word) that mistracks real life in the probability, 2) these probabilities are subjected to estimation error that, outcomes outside narrow binary bets. It is particularly fallacious under fat when endogenized, increase the true probability of remote events. tails.

If exposures and beliefs do not go in the same direction, Corresponding MisCalibration in Probability it is because beliefs are verbalistic reductions that contract a higher dimensional object into a single dimension. To express 1.0 the manager’s error in terms of decision-making research, there can be a conﬂation in something as elementary as the 0.8 notion of a binary event (related to the zeroth moment) or the

0.6 probability of an event and expected payoff from it (related to the ﬁrst moment and, when nonlinear, to all higher moments) 0.4 as the payoff functions of the two can be similar in some circumstances and different in others. 0.2 Comment 1. In short, probabilistic calibration requires esti-

0.2 0.4 0.6 0.8 1.0 mations of the zeroth moment while the real world requires all moments (outside of gambling bets or artiﬁcial environments MisCalibration in Payoff such as psychological experiments where payoffs are necessarily truncated), and it is a central property of fat tails that 4000 higher moments are explosive (even "inﬁnite") and count more and more. 3000 A. Away from the Verbalistic 2000 While the trader story is mathematically trivial (though the mistake is committed a bit too often), more serious gaps are 1000 present in decision making and risk management, particularly when the payoff function is more complicated, or nonlinear 0 (and related to higher moments). So once we map the contracts 0 1000 2000 3000 4000 or exposures mathematically, rather than focus on words and verbal descriptions, some serious distributional issues arise.

Fig. 2. How miscalibration in probability corresponds to miscalibration in payoff under power laws. The distribution under consideration is Pareto with tail index α = 1.15 Definition 1: Event A (real-valued) random variable X:Ω → R defined on the probability space (Ω, F,P ) is a function X(ω) of the outcome ω ∈ Ω. An event is a measurable subset small probability that it could go down considerably, hence a (countable or not) of Ω, measurable meaning that it can short position had a positive expected return and the rational be defined through the value(s) of one of several random response was to engage in a short exposure. "You do not eat variable(s). forecasts, but P/L" (or "one does not monetize forecasts") goes the saying among traders. 3

Definition 2: Binary forecast/payoff Definition 3: Real world open continuous payoff A binary forecast (belief, or payoff) is a random variable X :Ω → [a, ∞) ∨ (−∞, b] ∨ (−∞, ∞) taking two values A continuous payoff "lives" in an interval, not a finite set. X :Ω → {X ,X }, 1 2 It corresponds to an unbounded random variable either with realizations X1,X2 ∈ R. doubly unbounded or semi-bounded, with the bound on one side (one tailed variable).

In other words, it lives in the binary set (say {0, 1}, {−1, 1}, Caveat: etc.), i.e., the specified event will or will not take place and, if We are limiting for the purposes of our study the there is a payoff, such payoff will be mapped into two finite consideration to binary vs. continuous and open-ended (i.e., numbers (a fixed sum if the event happened, another one if no compact support). Many discrete payoffs are subsumed it didn’t). Unless otherwise specified, in this discussion we into the continuous class using standard arguments of ap- default to the {0, 1} set. proximation. We are also omitting triplets, that is, payoffs in, say {−1, 0, 3}, as these obey the properties of binaries (and Example of situations in the real world where the payoff is can be constructed using a sum of binaries). Further, many binary: variable with a floor and a remote ceiling (hence, formally with compact support), such as the number of victims or a • Casino gambling, lotteries , coin flips, "ludic" environments, or binary options paying a fixed sum if, say, catastrophe, are analytically and practically treated as if they the stock market falls below a certain point and nothing were open-ended [12]. otherwise –deemed a form of gambling4. Example of situations in the real world where the payoff is continuous: • Elections where the outcome is binary (e.g., referenda, U.S. Presidential Elections), though not the economic • Wars casualties, calamities due to earthquake, medical effect of the result of the election.5 bills, etc. • Medical prognoses for a single patient entailing survival • Magnitude of a market crash, severity of a recession, rate or cure over a specified duration, though not the duration of inflation itself as variable, or disease-specific survival expressed • Income from a strategy in time, or conditional life expectancy. Also exclude • Sales and profitability of a new product anything related to epidemiology. • In general, anything covered by an insurance contract • Whether a given person who has an online profile will buy or not a unit or more of a specific product at a given Payoff time (not the quantity or units). 3.5

Comment 2 (A binary belief is equivalent to a payoff). A 3.0 g(x) binary "belief" should map to an economic payoff (under 2.5 θ(x) some scaling or normalization necessarily to constitute a probability), an insight owed to De Finetti [11] who held that a 2.0 Mistracking "belief" and a "prediction" (when they are concerned with two 1.5 distinct outcomes) map into the equivalent of the expectation of a binary random variable and bets with a payoff in {0, 1}. 1.0 An "opinion" becomes a choice price for a gamble, and one 0.5 at which one is equally willing to buy or sell. Inconsistent x opinions therefore would lead to a violation of arbitrage rules, -1 1 2 3 4 such as the "Dutch book", where a combination of mispriced bets can guarantee a future loss. Fig. 4. Comparing the payoff of a binary bet (The Heaviside θ(.)) to a continuous open-ended exposure g(x). Visibly there is no way to match the (mathematical) derivatives for any form of hedging.

Most natural and socio-economic variables are continuous 4Retail binary options are typically used for gambling and have been banned in many jurisdictions, such as, for instance, by the European Securities and their statistical distribution does not have a compact and Markets Authority (ESMA), www.esma.europa.eu, as well as the United support in the sense that we do not have a handle of an exact States where it is considered another form of internet gambling, triggering upper bound. a complaint by a collection of decision scientists, see Arrow et al. [10]. We consider such banning as justiﬁed since bets have practically no economic Example 2. Predictive analytics in binary space {0, 1} can value, compared to ﬁnancial markets that are widely open to the public, where natural exposures can be properly offset. be successful in forecasting if, from his online activity, online 5Note the absence of spontaneously forming gambling markets with binary consumer Iannis Papadopoulos will purchase a certain item, payoffs for continuous variables. The exception might have been binary say a wedding ring, based solely on computation of the options but these did not remain in fashion for very long, from the experiences of the author, for a period between 1993 and 1998, largely motivated by tax probability. But the probability of the "success" for a potential gimmicks. new product might be –as with the trader’s story– misleading. 4

Given that company sales are typically fat tailed, a very low matching derivatives for the purposes of offsetting variations probability of success might still be satisfactory to make a is not a possible strategy.6 The point is illustrated in Fig 4. decision. Consider venture capital or option trading –an out of the money option can often be attractive yet may have less than 1 in 1000 probability of ever paying off. More significantly, the tracking error for probability guesses III.THERE IS NO DEFINED "COLLAPSE","DISASTER", OR will not map to that of the performance. "SUCCESS" UNDER FAT TAILS This difference is well known by option traders as there The fact that an "event" has some uncertainty around its are financial derivative contracts called "binaries" that pay magnitude carries some mathematical consequences. Some in the binary set {0, 1} (say if the underlying asset S, say, verbalistic papers still commit in 2019 the fallacy of binarizing exceeds a strike price K), while others called "vanilla" that an event in [0, ∞): A recent paper on calibration of beliefs, pay in [0, ∞), i.e. max(S − K, 0) (or, worse, in (−∞, 0) [14] says "...if a person claims that the United States is on for the seller can now be exposed to bankruptcy owing to the verge of an economic collapse or that a climate disaster is the unbounded exposure). The considerable mathematical and imminent..." An economic "collapse" or a climate "disaster" economic difference between the two has been discussed and must not be expressed as an event in {0, 1} when in the real is the subject of Dynamic Hedging: Managing Vanilla and world it can take many values. For that, a characteristic scale Exotic Options [13]. Given that the former are bets paying is required. In fact under fat tails, there is no "typical" collapse a fixed amount and the latter have full payoff, one cannot be or disaster, owing to the absence of characteristic scale, hence properly replicated (or hedged) using another, especially under verbal binary predictions or beliefs cannot be used as gauges. fat tails and parametric uncertainty –meaning performance in one does not translate to performance into the other. While this We present the difference between thin tailed and fat tailed knowledge is well known in mathematical finance it doesn’t domains (illustrated for the intuition of the differences in Fig seem to have been passed on to the decision-theory literature. 5) as follows. Comment 3 (Derivatives theory). Our approach here is in- spired from derivatives (or option) theory and practice where Definition 4: Characteristic scale there are different types of derivative contracts, 1) those with Let X be a random variable that lives in either (0, ∞) binary payoffs (that pay a fixed sum if an event happens) or (−∞, ∞) and E the expectation operator under "real and 2) "vanilla" ones (standard options with continuous pay- world" (physical) distribution. By classical results [15]: offs). It is practically impossible to hedge one with another 1 lim E(X|X>K ) = λ, (2) [13]. Furthermore a binary bet with a strike price K and K→∞ K a call option with same strike K, with K in the tails of • If λ = 1 , X is said to be in the thin tailed class the distribution, almost always have their valuations react D and has a characteristic scale in opposite ways when one increases the kurtosis of the 1 • If λ > 1 , X is said to be in the fat tailed regular distribution, (while preserving the first three moments) or, in variation class D and has no characteristic scale an example further down in the lognormal environment, when 2 • If one increases uncertainty via the scale of the distribution. lim E(X|X>K ) − K = µ Comment 4 (Term sheets). Note that, thanks to "term sheets" K→∞ that are necessary both legally and mathematically, financial where µ > 0, then X is in the borderline exponen- derivatives practice provides precise legalistic mapping of tial class payoffs in a way to make their mathematical, statistical, and economic differences salient. The point can be made clear as follows. One cannot have There has been a tension between prediction markets and a binary contract that adequately hedges someone against a real financial markets. As we can show here, prediction "collapse", given that one cannot know in advance the size of markets may be useful for gamblers, but they cannot hedge the collapse or how much the face value or such contract needs economic exposures. to be. On the other hand, an insurance contract or option with The mathematics of the difference and the impossibility of continuous payoff would provide a satisfactory hedge. Another hedging can be shown in the following. Let X be a random way to view it: reducing these events to verbalistic "collapse", variable in R, we have the payoff of the bet or the prediction "disaster" is equivalent to a health insurance payout of a lump θK : R → {0, 1}, sum if one is "very ill" –regardless of the nature and gravity 1, x ≥ K of the illness – and 0 otherwise. θ (x) = (1) K 0 otherwise, And it is highly flawed to separate payoff and probability R R ∂ and g : → that of the natural exposure. Since ∂x θK (x) is ∂) 6To replicate an open-ended continuous payoff with binaries, one needs an a Dirac delta function at K, δ(K) and gk(x) is at least once ∂x infinite series of bets, which cancels the entire idea of a prediction market differentiable for x ≥ K (or constant in case the exposure is by transforming it into a financial market. Distributions with compact support globally linear or, like an option, piecewise linear above K), always have finite moments, not the case of those on the real line. 5 in the integral of expected payoff.7 Some experiments of the indexed by (g)), where f (g)(.) is the PDF: type shown in Fig. I-5 ask agents what is their estimates R ∞ (g) of deaths from botulism or some such disease: agents are K xf (x) dx I1 lim ∞ = = 1. (5) K→∞ R (g) I blamed for misunderstanding the probability. This is rather K K f (x) dx 2 a problem with the experiment: people do not necessarily separate probabilities from payoffs. Special case of a Gaussian: Let g(.) be the PDF of predomi- nantly used Gaussian distribution (centered and normalized),

2 IV. SPURIOUS OVERESTIMATION OF TAIL PROBABILITY IN ∞ − K Z e 2 THEPSYCHOLOGYLITERATURE xg(x) dx = √ (6) K 2π Deﬁnition 5: Substitution of integral 1 K + and K = erfc √ , where erfc is the complementary Let K ∈ R be a threshold, f(.) a density function and p 2 2 K pK ∈ [0, 1] the probability of exceeding it, and g(x) an error function, and p is the threshold corresponding to the p impact function. Let I1 be the expected payoff above K: probability . I1 We note that Kp corresponds to the inverse Mills ratio Z ∞ I2 I1 = g(x)f(x) dx, used in insurance. K

and Let I2 be the impact at K multiplied by the probability of exceeding K: B. Fat tails Z ∞ For all distributions in the regular variation class, defined I2 = g(K) f(x) dx = g(K)pK . by their tail survival function: for K large, K The substitution comes from conflating I and I , which −α 1 2 P(X > K) ≈ LK , α > 1, becomes an identity if and only if g(.) is constant above (p) K (say g(x) = θK (x), the Heaviside theta function). For where L > 0 and f is the PDF of a member of that class: g(.) a variable function with positive first derivative, I1 R ∞ (p) can be close to I only under thin-tailed distributions, xf (x) dx α 2 lim K = > 1 (7) R ∞ (p) not under the fat tailed ones. Kp→∞ K f (x) dx α − 1 Kp

For the discussions and examples in this section assume g(x) = x as we will consider the more advanced nonlinear case in Section VII. C. Conﬂations

1) Conﬂation of I1 and I2: In numerous experiments, Theorem 1: Convergence of I1 I2 which include the prospect theory paper by Kahneman and If X is in the thin tailed class D1 as described in 2, Tversky (1978) [3], it has been repeatedly established that agents overestimate small probabilities in experiments where I lim 1 = 1 (3) the odds are shown to them, and when the outcome corre- K→∞ I 2 sponds to a single payoff. The well known Kahneman-Tversky If X is in the regular variation class D2, result proved robust, but interpretations make erroneous claims

I1 from it. Practically all the subsequent literature, relies on lim = λ > 1. (4) I and conflates it with I , what this author has called the K→∞ I2 2 1 ludic fallacy in The Black Swan [9], as games are necessarily truncating a dimension from reality. The psychological results Proof. From Eq. 2. Further comments: might be robust, in the sense that they replicate when repeated in the exact similar conditions, but all the claims outside these A. Thin tails conditions and extensions to real risks will be an exceedingly dubious generalization –given that our exposures in the real By our very definition of a thin tailed distribution (more world rarely map to I1. Furthermore, one can overestimate the generally any distribution outside the subexponential class, probability yet underestimate the expected payoff. 7 Practically all economic and informational variables have been shown 2) Stickiness of the conflation: The misinterpretation is since the 1960s to belong to the D2 class, or at least the intermediate still made four decades after Kahneman-Tversky (1979). In subexponential class (which includes the lognormal), [16], [17], [18], [19], a review of behavioral economics, with emphasis on miscac- [9], along with social variables such as size of cities, words in languages, connections in networks, size of firms, incomes for firms, macroeconomic ulation of probability, Barberis (2003) [5] treats I1 = I2. And data, monetary data, victims from interstate conflicts and civil wars[20], [12], Arrow et al. [10], a long list of decision scientists pleading operational risk, damage from earthquakes, tsunamis, hurricanes and other for deregulation of the betting markets also misrepresented the natural calamities, income inequality [21], etc. Which leaves us with the more rational question: where are Gaussian variables? These appear to be at best fitness of these binary forecasts to the real world (particularly one order of magnitude fewer in decisions entailing formal predictions. in the presence of real financial markets). 6

3) The Value at Risk Problem: Another stringent –and TABLE I dangerous –example is the "default VaR" (Value at risk, GAUSSIANPSEUDO-OVERESTIMATION ∗ i.e. the minimumm one is expected to lose within, say, 1% R ∞ R ∞ ∗ p p Kp xf(x)dx Kp f(x)dx p Kp Kp p probability, over a certain period, that is the quantile for 1 −1 −1 −1 10 1.28 1.75 × 10 1.28 × 10 1.36 × 10 1.36 1%, see [22]) which is explicitly given as I2, i.e. default 1 −2 −2 −2 100 2.32 2.66 × 10 2.32 × 10 1.14 × 10 1.14 probability x(1 − expected recovery rate). This quantity can 1 −3 −3 −3 1000 3.09 3.36 × 10 3.09 × 10 1.08 × 10 1.08 1 −4 −4 −4 be quite different from the actual loss expectation in case of 10000 3.71 3.95 × 10 3.71 × 10 1.06 × 10 1.06 default. The Conditional Value at Risk, CVar, measures the expected total losses conditional on default, which is altogether TABLE II another entity, particularly under fat tails (and, of course, PARETAN PSEUDO-OVERESTIMATION ∗ considerably higher). Further, financial regulators presents R ∞ R ∞ ∗ p p Kp xf(x)dx Kp f(x)dx p Kp Kp p erroneous approximations of CVaR, and the approximation 1 10 8.1 8.92 0.811 1.1 (sic) 11. is the risk-management flaw that may have caused the crisis 1 100 65.7 7.23 0.65 0.11 11. of 2008 [23].8 The fallacious argument is that they compute 1 1000 533 5.87 0.53 0.011 11. 1 the recovery rate as the expected value of collateral, without 10000 4328 4.76 0.43 0.0011 11. conditioning by the default event. The expected value of the collateral conditionally to a default is often far less then its unconditional expectation. In 2007, after a massive series of probability to the binary one p∗ α foreclosures, the value of most collaterals dropped to about = 1/3 of the expected value! p α − 1 Furthermore, VaR is not a coherent risk measure, owing which can allow in absurd cases p∗ to exceed 1 when the to its violating the subadditivity property [24]: for random distribution is grossly misspecified. variables (or assets) X and Y and a threshold α, a measure Tables I and II show, for a probability level p, the corre- must meet the inequality µα(X + Y ) ≤ µα(X) + µα(Y ). On sponding tail level Kp, such as the other hand, CVaR satisfies the coherence bound; likewise Kp = {inf K : P(X > K) > p} , I2, because its construction is similar to VaR, violates the inequality. and the corresponding adjusted probability p∗ that de-binarize 4) Misunderstanding of Hayek’s knowledge arguments: the event 910– probabilities here need to be in the bottom "Hayekian" arguments for the consolidation of beliefs via half, i.e., p < .5. Note that we are operating under the mild prices does not lead to prediction markets as discussed in such case of known probability distributions, as it gets worse under pieces as [25], or Sunstein’s [26]: prices exist in financial and parametric uncertainty.11 commercial markets; prices are not binary bets. For Hayek [27] The most commonly known distribution among the public, consolidation of knowledge is done via prices and arbitrageurs the "Pareto 80/20" (based on Pareto discovering that 20 percent (his words)–and arbitrageurs trade products, services, and of the people in Italy owned 80 percent of the land), maps to financial securities, not binary bets. a tail index α = 1.16, so the adjusted probability is > 7 times the naive one. 5) Example of probability and expected payoff reacting in Definition 6: Corrected probability in binarized ex- opposite direction under increase in uncertainty: An example periments ∗ showing how, under a skewed distribution, the binary and Let p be the equivalent probability to make I1 = I2 and the expectation reacting in opposite directions is as follows. eliminate the effect of the error, so 1 Consider the risk-neutral lognormal distribution L(X0 − σ2 , σ) ∗ σ2 2 p = {p : I1 = I2} with pdf fL(.), mean X0 and variance e − 1 X0 . We can increase its uncertainty with the parameter σ. We have the expectation of a contract above X0, E>X : K 0 Now let’ s solve for p "in the tails",√ working with a ∞ −1 Z 1 σ probability p. For the Gaussian, Kp = 2erfc (2p); for the E = xf (x) dx = X 1 + erf √ −1/α >X0 L 0 Paretan tailed distribution, Kp = p . X0 2 2 2

Hence, for a Paretan distribution, the ratio of real continuous and the probability of exceeding X0, 1 σ 8The mathematical expression of the Value at Risk, VaR, for a random P(X > X0) = 1 − erf √ , variable X with distribution function F and threshold α ∈ [0, 1] 2 2 2

9The analysis is invariant to whether we use the right or left tail .By VaR (X) = − inf {x ∈ : F (x) > α}, α R X convention, finance uses negative value for losses, whereas other areas of risk management express the negative of the random variance, hence focus and the corresponding CVar on the right tail. 10K is equivalent to the Value at Risk V aR in finance, where p is the p p ESα(X) = E −X |X≤− VaRα(X) probability of loss. 11Note the van der Wijk’s law, see Cirillo [28]: I1 is related to what is I2 called in finance the expected shortfall for Kp. 7

Binary goes to 0.

10 D. Distributional Uncertainty We can gauge the effect of distributional uncertainty by 5 examining the second order effect with respect to a parameter, which shows if errors induce either biases or an acceleration 0 of divergence. A stable model needs to be necessarily linear to errors, see [29]. 12

-5 Remark 1: Distributional uncertainty

Owing to Jensen’s inequality, the discrepancy (I1 − I2) -10 increases under parameter uncertainty, expressed in higher kurtosis, via stochasticity of σ the scale of the Thin Tails thin-tailed distribution, or that of α the tail index of the Paretan one. 10

Proof. First, the Gaussian world. We consider the effect of 5 R ∞ (g) R ∞ (g) I1 − I2 = K xf (x) − K f (x) under stochastic volatility, i.e. the parameter from increase of volatility. Let σ be the 0 scale of the Gaussian, with K constant:

2 R ∞ (g) 2 R ∞ (g) ∂ ( K xf (x)dx) ∂ ( K f (x)dx) -5 − ∂σ2 ∂σ2 2 (8) − K 3 2 e 2σ2 (K − 1)K − (K − 2)Kσ -10 = √ , 2πσ5 4 Pareto 80/20 which is positive for all values of K > 0 (given that K − K3 − K2 + 2K > 0 for K positive). Second, consider the sensitivity of the ratio I1 to parameter 1000 I2 uncertainty for α in the Paretan case (for which we can get a streamlined expression compared to the difference). For α > 1 500 (the condition for a ﬁnite mean): ∞ ∞ ∂2 R xf (p)(x)dx/R f (p)(x)dx 2K 0 K K = (9) ∂α2 (α − 1)3 which is positive and increases markedly at lower values of -500 α, meaning the fatter the tails, the worse the uncertainty about the expected payoff and the larger the difference between I1 -1000 and I2.

Fig. 5. Comparing the three payoff time series under two distributions –the binary has the same profile regardless of whether the distribution is thin or fat tailed. The first two subfigures are to scale, the third (representing the V. CALIBRATION AND MISCALIBRATION Pareto 80/20 with α = 1.16 requires multiplying the scale by two orders of The psychology literature also examines the "calibration" of magnitude. probabilistic assessment –an evaluation of how close someone providing odds of events turns out to be on average (under some operation of the law of large number deemed satisfac- where erf is the error function. As σ rises erf √σ → 1, tory) [1], [32], see Figures 1 , 2, and I-5. The methods, for 2 2 E P the reasons we have shown here, are highly flawed except in with >X0 → X0 and (X > X0) → 0. This example is well known by option traders (see Dynamic Hedging [13]) as narrow circumstances of purely binary payoffs (such as those entailing a "win/lose" outcome) –and generalizing from these the binary option struck at X0 goes to 0 while the standard call of the same strike rises considerably to reach the level payoffs is either not possible or produces misleading results. of the asset –regardless of strike. This is typically the case Accordingly, Fig. I-5 makes little sense empirically. with venture capital: the riskier the project, the less likely it 12Furthermore, distributional uncertainty is in by itself a generator of fat is to succeed but the more rewarding in case of success. So, tails: heteroskedasticity or variability of the variance produces higher kurtosis the expectation can go to +∞ while to probability of success in the resulting distribution [13],[30], [31]. 8

M4 TABLE III 4) The M4 point estimate score λn for n observations SCORING METRICSFOR PERFORMANCE EVALUATION used in the M4 competition, and its sequel M5.13 14 1) P/L in Payoff Space (under survival condition): The Metric Name Fitness to reality "P/L" is short for the natural profit and loss index, that is, a cumulative account of performance. Let X be realizations of P (r)(T ) Cumulative P/L Adapted to real world distributions, par- i ticularly under a survival filter an unidimensional generic random variable X with support in (p) P (n) Tally of bets Misrepresents the performance under R and t = 1, 2, . . . n. Real world payoffs Pr(.) are expressed fat tails, works only for binary bets in a simplified way as and/or thin tailed domains. λ(n) Brier score Misrepresents performance precision X under fat tails, ignores higher moments. Pr(n) = P (0) + g(xt), (10) (M4)1,2 λn M4 first moment Represents precision, not exactly real k≤N score (point esti- world performance, but maps to real mate) distribution of underlying variables. where gt : R → R is a measurable function representing the (M4)3 λn M4 dispersion Represents precision in the assessment payoff; g may be path dependent (to accommodate a survival Score of confidence intervals. condition), that is, it is a function of the preceding period (M5) P λn Proposed M5 Represents both precision and survival τ < t or on the cumulative sum τ≤t g(xτ ) to introduce an score conditions by predicting extrema of absorbing barrier, say, bankruptcy avoidance, in which case time series. g(.) Machine learning Expresses exposures without verbalism we write: nonlinear payoff and reflects true economic or other P/L. function (not a Resembles financial derivatives term metric) sheets. (r) (r) X P (T ) = P (0) + 1 P g(xt), (11) ( τb) t≤n

At the core, calibration metrics such as the Brier score are always thin-tailed, when the variable under measurement is where b is any arbitrary number in R that we call the survival fat-tailed, which worsens the tractability. mark and 1(.) an indicator function ∈ {0, 1}. To use again the saying "You do not eat forecasts", most The last condition from the indicator function in Eq. 11 is businesses have severely skewed payoffs, so being calibrated meant to handle ergodicity or lack of it [9]. in probability is meaningless. Comment 5. P/L tautologically corresponds to the real world distribution, with an absorbing barrier at the survival condi- Remark 2: Distributional differences tion. Binary forecasts and calibration metrics via the Brier 2) Frequency Space, : The standard psychology literature score belong to the thin-tailed class. has two approaches. A–When tallying forecasts as a counter:

1 X P (p)(n) = 1 , (12) VI.SCORING METRICS n Xt∈χ i≤n This section, summarized in Table III, compares the probability distributions of the various metrics used to measure 1 where Xt∈χ ∈ {0, 1} is an indicator that the random variable performance, either by explicit formulation or by linking it to x ∈ χt in in the "forecast range", and T the total number of a certain probability class. Clearly one may be mismeasuring such forecasting events. performance if the random variable is in the wrong probability B–When dealing with a score (calibration method): in the class. Different underlying distributions will require a different absence of a visible net performance, researchers produce number of sample sizes owing to the differences in the way some more advanced metric or score to measure calibra- the law of numbers operates across distributions. A series of tion. We select below the "gold standard", De Finetti’s Brier binary forecasts will converge very rapidly to a thin-tailed score(DeFinetti, [34]). It is favored since it doesn’t allow ar- Gaussian even if the underlying distribution is fat-tailed, but bitrage and requires perfect probabilistic calibration: someone an economic P/L tracking performance for someone with a betting than an event has a probability 1 of occurring will real exposure will require a considerably larger sample size get a perfect score only if the event occurs all the time. Let if, say, the underlying is Pareto distributed [33]. ft ∈ [0, 1] be the probability announced by the forecaster for We start by precise expressions for the four possible ones: event t,

1) Real world performance under conditions of survival, or, 13https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4- in other words, P/L or a quantitative cumulative score. Competitors-Guide.pdf 2) A tally of bets, the naive sum of how often a person’s 14The M4 competition also entails a range estimate, evaluated according to techniques for the Mean Scaled Interval Score (MSIS), which reﬂects binary prediction is correct uncertainty about the variable and maps to the distribution of the mean 3) De Finetti’s Brier score λ(B)n absolute deviation. 9

1 X λ(B) = (f − 1 )2. (13) Theorem 2 n n t Xt∈χ t≤n Regardless of the distribution of the random variable X, without even assuming independence of (f1 − 1A1),..., (fn − 1An), for n < +∞, the score λn has E q which needs to be minimized for a perfect probability assessor. all moments of order q, (λn) < +∞. 3) Applications: M4 and M5 Competitions: The M series (Makridakis [35]) evaluate forecasters using various methods 2 Proof. For all i, (fi − 1Ai) ≤ 1 . to predict a point estimate (along with a range of possible values). The last competition in 2018, M4, largely relied on a We can get actually closer to a full distribution of the score series of scores, λM4j , which works well in situations where across independent betting policies. Assume binary predictions one has to forecast the first moment of the distribution and the fi are independent and follow a beta distribution B(a, b) dispersion around it. (which approximates or includes all unimodal distributions in [0, 1] (plus a Bernoulli via two Dirac functions), and let p be E 1 Definition 7: The M4 first moment forecasting scores the rate of success p = ( Ai), the characteristic function of λn for n evaluations of the Brier score is The M4 competition precision score (Makridakis et al. [35]) judges competitors on the following metrics in- ϕ (t) = πn/2 2−a−b+1Γ(a + b) dexed by j = 1, 2 n n 1 |X − X | ˜ b + 1 b a + b 1 it (M4)j X fi ri p 2F2 , ; , (a + b + 1); λn = (14) 2 2 2 2 n n sj i a + 1 a a + b 1 it 1 − (p − 1) 2F˜2 , ; , (a + b + 1); where s1 = 2 (|Xfi |+|Xri |) and s2 is (usually) the raw 2 2 2 2 n mean absolute deviation for the observations available (15) up to period i (i.e., the mean absolute error from either ˜ "naive" forecasting or that from in sample tests), Xfi is Here 2F2 is the generalized hypergeometric function reg- ˜ 2F2(a;b;z) the forecast for variable i as a point estimate, Xri is ularized 2F2(., .; ., .; .) = and pFq(a; b; z) has (Γ(b1)...Γ(bq )) the realized variable and n the number of experiments P∞ (a1)k...(ap)k k series expansion z /k!, were (a)(.) is the under scrutiny. k=0 (b1)k...(bp)k Pockhammer symbol. Hence we can prove the following: under the conditions of independence of the summands stated above, In other word, it is an application of the Mean Absolute Scaled

Error (MASE) and the symmetric Mean Absolute Percentage D Error (sMAPE) [36]. λn −→N (µ, σn) (16) The suggest M5 score (expected for 2020) adds the forecasts of extrema of the variables under considerations and repeats where N denotes the Gaussian distribution with for ﬁrst the same tests as the one for raw variables in Deﬁnition 7. argument the mean and for second argument the standard deviation. The proof and parametrization of µ and σn are in the appendix. 3) Distribution of the economic P/L or quantitative measure A. Deriving Distributions Pr:

(p) 1) Distribution of P (n): Remark 4 Conditional on survival to time T , the distribution of the Remark 3 quantitative measure P (r)(T ) will follow the distribution The average tally of the binary forecast (expressed as of the underlying variable g(x). "higher" or "lower" than a threshold), P (p)(n) is, (using p as a shortcut), asymptotically normal with mean p q The discussion is straightforward if there is no absorbing and standard deviation 1 (p − p2) regardless of the n barrier (i.e., no survival condition). distribution class of the random variable X. 4) Distribution of the M4 score: The distribution of an absolute deviation is in the same probability class as the variable itself. Thee Brier score is in the norm L2 and is based The results are quite standard, but see appendix for the re- on the second moment (which always exists) as De Finetti has derivations. shown that it is more efﬁcient to just a probability in square 2) Distribution of the Brier Score λn: deviations. However for nonbinaries, it is vastly more efﬁcient 10

under fat tails to rely on absolute deviations, even when the 20 second moment exists [37].

VII.NON-VERBALISTIC PAYOFF FUNCTIONSAND THE 20 40 60 80 100 120 GOOD NEWSFROM MACHINE LEARNING -20 Earlier examples focused on simple payoff functions, with some cases where the conflation I1 and I2 can be benign (under the condition of being in a thin tailed environment). -40 Serious Starting Inseparability of probability under nonlinear payoff func- Benefits from Harm from Point tion: Now when we introduce a payoff function g(.) that -60 Decline is nonlinear, that is that the economic or other quantifiable Decline response to the random variable X varies with the levels of X, the discrepancy becomes greater and the conflation worse. Fig. 6. The Morgan Stanley Story: an example of an elementary nonlinear Comment 6 (Probability as an integration kernel). Probability payoff that cannot be described verbalistically. The x axis represents the variable, the vertical one the payoff. This exposure is called in derivatives is just a kernel inside an integral or a summation, not a real traders jargon a "Christmas Tree", achieved by purchasing a put with strike thing on its own. The economic world is about quantitative K and selling a put with lower strike K − ∆1 and another with even lower payoffs. strike K − ∆2, with ∆2 ≥ ∆1 ≥ 0.

Figure 7 shows a more complicated payoff, dubbed a Remark 5: Inseparability of probability "butterfly". Let F : A → [0, 1] be a probability distribution (with 2) g and machine learning: We note that g maps to various R R derivative f) and g : → a measurable function, machine learning functions that produce exhaustive nonlinear- 0 the "payoff"". Clearly, for A a subset of A: ities via the universal universal approximation theorem (Cy- Z Z benko [38]), or the generalized option payoff decompositions g(x)dF (x) = f(x)g(x)dx (see Dynamic Hedging [13]).15 A0 A0 Z Z Consider the function ρ :(−∞, ∞) → [K, ∞), with K, the 6= f(x)dx g dx r.v. X ∈ R: A0 A0 p(x−K) In discrete terms, with π(.) a probability mass func- log e + 1 ρK,p(x) = k + (18) tion: p ! We can express all nonlinear payoff functions g as, with the X X 1 X R π(x)g(x) 6= π(x) g x weighting ωi ∈ : n x ∈A0 x∈A0 x∈A0 (17) X g(x) = ω ρ (x) (19) = probability of event i Ki,p i × payoff of average event by some similarity, ρK,p(x) maps to the value a call price with strike K and time t to expiration normalized to 1, all rates set Proof. Immediate by Jensen’s inequality. at 0, with sole other parameter σ the standard deviation of the underlying. In other words, the probability of an event is an expected We note that the expectation of g(.) is the sum of expecta- payoff only when, as we saw earlier, g(x) is a Heaviside theta tion of the ReLu functions: function. E X E Next we focus on functions tractable mathematically or (g(x)) = ωi (ρKi,p(x)) (20) legally but not reliable verbalistically via "beliefs" or "pre- i dictions". The variance and other higher order statistical measurements 1) Misunderstanding g: Figure 6 showing the mishedging are harder to obtain in closed or simple form. story of Morgan Stanley is illustrative of verbalistic notions such as "collapse" mis-expressed in nonlinear exposures. In Comment 7. Risk management is about changing the payoff 2007 the Wall Street firm Morgan Stanley decided to "hedge" function g(.) rather than making "good forecasts". against a real estate "collapse", before the market in real estate We note that g(.) is not a metric but a target to which one started declining. The problem is that they didn’t realize that can apply various metrics. "collapse" could take many values, some worse than they expected, and set themselves up to benefit if there were a mild 15We note the machine learning use of cross-entropy to gauge the difference decline, but lose much if there is a larger one. They ended up between the distribution of a variable and the forecast, a technique that in effect captures the nonlinearities of g(.). In effect on Principal Component right in predicting the crisis, but lose $10 billion from the Maps, distances using cross entropy are more effectively shown as the L^2 "hedge". norm boosts extremes. For a longer discussion, [37]. 11

f(x) different from that of a single gambler over n days, owing to 40 the conditioning. In that sense, measuring the static performance of an 30 agent who will eventually go bust (with probability one) is meaningless.17 20

10 VIII.CONCLUSION:

x Finally, that in the real world, it is the net performance 50 100 150 200 (economic or other) that counts, and making "calibration" -10 mistakes where it doesn’t matter or can be helpful should be encouraged, not penalized. The bias variance argument is -20 well known in machine learning [40] as means to increase performance, in discussions of rationality (see Skin in the Fig. 7. A butterfly (built via a sum of options or ReLu ρ ), with open Ki Game [9]) as a necessary mechanism for survival, and a very tails on both sides and flipping first and second derivatives. Again, the x axis represents the variable, the vertical one the payoff. This example is particularly helpful psychological adaptation (Brighton and Gigerenzer potent as it has no verbalistic correspondence but can be understood by option [41] show a potent argument that if it is a bias, it is a pretty traders and machine learning. useful one.) If a mistake doesn’t cost you anything –or helps you survive or improve your outcomes– it is clearly not a Example 3. XIV was a volatility-linked exchange traded note mistake. And if it costs you something, and has been present that went bust, in February 2018, in the wake of a sudden in society for a long time, consider that there may be hidden jump in volatility, while having been able to "predict" such evolutionary advantages to these types of mistakes –of the variable correctly. Simply, their g(.) function was nonlinear following sort: mistaking a bear for a stone is worse than not properly adapted to naive forecasting of the first moment mistaking a stone for a bear. of the random variable. 16 The following simplified example We have shown that, in risk management (and while fore- can show us their error and the mismatch between correct casting socio-economic and financial variables), one should forecasting and performance as well as what happens under never operate in probability space. fat tails. Consider the following "short volatility" payoff function, APPENDIX g(x) = 1 − x2 evaluated daily, where x is a daily variation of an asset price, meaning if x moves by up to 1 unit (say, A. Distribution of Binary Tally P (p)(n) standard deviation), there is a profit, and losses beyond that. We are dealing with an average of Bernoulli random vari- Such terms are found in a typical contract called "variance ables, with well known results but worth redoing. The charac- swap". Assume the note (or fund) are predicting that average teristic function of a Bernoulli distribution with parameter p is daily (normalized) volatility will be "lower than 1". Now ψ(t) = 1−p+E(It)p. We are concerned with the N-summed consider the following two types successions of deviations of 0 ω N cumulant generating function ψ (ω) = log ψ( N ) . We have x for 7 days (expressed in normalized standard deviations). κ(p) the cumulant of order p: Succession 1 (thin tails): {1, 1, 1, 1, 1, 0, 0}. Mean variation= 0.71. P/L: g(x) = 2. Prediction is correct. ∂pψ0 κ(p) = −ip Succession 2 (fat tails): {0, 0, 0, 0, 0, 0, 5}. Mean variation= p ∂t t→0 0.71 (same). P/L: g(x) = −18 (bust, really bust, but prediction is correct). (1−p)p (p−1)p(2p−1) So: κ(1) = p, κ(2) = N , κ(3) = N 2 , κ(4) = So in both cases a forecast of < 1 is correct, but the lumping (1−p)p(6(p−1)p+1) P (p)(N) N 3 , which proves that√ converges by of the volatility –the fatness of tails– made a huge difference. the law of large numbers at speed N, and by the central This in a nutshell shows how, in the world of finance, "bad" 1 limit theorem arrives to the Gaussian at a rate of N , (since forecasters can make great traders and decision makers, and 6(p−1)p+1 from the cumulants above, its kurtosis = 3 − ). vice versa. n(p−1)p

Survival B. Distribution of the Brier Score Decision making is sequential. Accordingly, miscalibration Base probability f: First, we consider the distribution of may be a good idea if it reduces the odds of being absorbed. f the base probability. We use a beta distribution that covers See [39] and the appendix of [9], which shows the differ- both the conditional and unconditional case (it is a matter of ence between ensemble probability and time probability. The parametrization of a and b in Eq. 15). expectation of the sum of n gamblers over a given day is 17One suggestion by the author is for the M5 competition to correct for 16see Born to Die: Inside XIV, the Busted Volatility ETN, Wall Street that by making "predictors" predict the minimum (or maximum) in a time Journal, Feb. 6, 2018). series. This insures the predictor meets the constraint of staying above water. 12

2 Distribution of the probability: Let us refresh a standard From here we get the characteristic function for yj = (fj − 2 result behind nonparametric discussions and tests, dating from 1Aj) :

Kolmogorov [42] to show the rationale behind the claim that (y2) √ −a−b+1 the probability distribution of probability (sic) is robust –in Ψ (t) = π2 Γ(a other words the distribution of the probability of X doesn’t b + 1 b a + b 1 + b) p 2F˜2 , ; , (a + b + 1); it depend on the distribution of X, ([43] [32]). 2 2 2 2 The probability integral transform is as follows. Let X have − (p a continuous distribution for which the cumulative distribution a + 1 a a + b 1 − 1) 2F˜2 , ; , (a + b + 1); it , function (CDF) is FX . Then –in the absence of additional 2 2 2 2 information –the random variable U deﬁned as U = FX (X) (22) is uniform between 0 and 1.The proof is as follows: For t ∈ ˜ [0, 1], where 2F2 is the generalized hypergeometric function reg- 2F2(a;b;z) ularized 2F˜2(., .; ., .; .) = and pFq(a; b; z) has (Γ(b1)...Γ(bq )) P∞ (a1)k...(ap)k k series expansion z /k!, were (a)(.) is the P(Y ≤ u) = k=0 (b1)k...(bp)k −1 −1 Pockhammer symbol. P (FX (X) ≤ u) = P (X ≤ FX (u)) = FX (FX (u)) = u We can proceed to prove directly from there the convergence (21) 1 Pn 2 in distribution for the average n i yi : n which is the cumulative distribution function of the uniform. lim Ψy2 (t/n) = This is the case regardless of the probability distribution of n →∞ it(p(a − b)(a + b + 1) − a(a + 1)) (23) X. exp − , (a + b)(a + b + 1) Clearly we are dealing with 1) f beta distributed (either as a special case the uniform distribution when purely random, which is that of a degenerate Gaussian (Dirac) with location p(b−a)+ a(a+1) as derived above, or a beta distribution when one has some parameter a+b+1 . 1 a+b accuracy, for which the uniform is a special case), and 2) At We can ﬁnally assess the speed of convergence, the rate a Bernoulli variable with probability p. at which higher moments map to those of a Gaussian dis- th Let us consider the general case. Let ga,b be the PDF of the tribution: consider the behavior of the 4 cumulant κ4 = 4 Beta distribution: ∂ log Ψ.(.) −i ∂t4 |t→0: 1) in the maximum entropy case of a = b = 1: xa−1(1 − x)b−1 6 g (x) = , 0 < x < 1. κ4|a=1,b=1= − , a,b B(a, b) 7n regardless of p. The results, a bit unwieldy but controllable: 2) In the maximum variance case, using l’Hôpital: 6(p − 1)p + 1 a2(−(p − 1)) − ap + a + b(b + 1)p Γ(a + b) lim κ4 = − . µ = , a→0 n(p − 1)p Γ(a + b + 2) b→0 κ4 −1 Se we have 2 →0 at rate n . κ2 n→∞ Further, we can extract its probability density function of 1 σ2 = − a2(p − 1) + a(p − 1) the Brier score for N = 1: for 0 < z < 1, n n(a + b)2(a + b + 1)2 p(z) 2 1 √ √ − b(b + 1)p + (a + b)(a + b Γ(a + b) (p − 1)za/2 (1 − z)b − p (1 − z)a zb/2 (a + b + 2)(a + b + 3) = √ . 2 ( z − 1) zΓ(a)Γ(b) + 1)(p(a − b)(a + b + 3)(a(a + 3) + (b + 1)(b + 2)) (24) − a(a + 1)(a + 2)(a + 3)). REFERENCES We can further verify that the Brier score has thinner tails [1] S. Lichtenstein, B. Fischhoff, and L. D. Phillips, “Calibration of proba- than the Gaussian as its kurtosis is lower than 3. bilities: The state of the art,” in Decision making and change in human affairs. Springer, 1977, pp. 275–324. [2] S. Lichtenstein, P. Slovic, B. Fischhoff, M. Layman, and B. Combs, “Judged frequency of lethal events.” Journal of experimental psychology: Proof. We start with yj = (f − 1Aj), the difference between Human learning and memory, vol. 4, no. 6, p. 551, 1978. a continuous Beta distributed random variable and a discrete [3] D. Kahneman and A. Tversky, “Prospect theory: An analysis of decision Bernoulli one, both indexed by j. The characteristic function under risk,” Econometrica, vol. 47, no. 2, pp. 263–291, 1979. [4] E. J. Johnson and A. Tversky, “Affect, generalization, and the perception (y) −it of yj, Ψf = 1 + p −1 + e 1F1(a; a + b; it) where of risk.” Journal of personality and social psychology, vol. 45, no. 1, 1F1(.; .; .) is the hypergeometric distribution 1F1(a; b; z) = p. 20, 1983. k [5] N. Barberis, “The psychology of tail events: Progress and challenges,” ∞ a z ! P k k . American Economic Review, vol. 103, no. 3, pp. 611–16, 2013. k=0 bk 13

[6] R. Hertwig, T. Pachur, and S. Kurzenhäuser, “Judgments of risk frequen- [36] R. J. Hyndman and A. B. Koehler, “Another look at measures of forecast cies: tests of possible cognitive mechanisms.” Journal of Experimental accuracy,” International journal of forecasting, vol. 22, no. 4, pp. 679– Psychology: Learning, Memory, and Cognition, vol. 31, no. 4, p. 621, 688, 2006. 2005. [37] N. N. Taleb, The Statistical Consequences of Fat Tails. STEM [7] G. Gigerenzer, “Fast and frugal heuristics: The tools of bounded ratio- Publishing, 2020. nality,” Blackwell handbook of judgment and decision making, vol. 62, [38] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” p. 88, 2004. Mathematics of control, signals and systems, vol. 2, no. 4, pp. 303–314, [8] J. Baron, Thinking and deciding, 4th Ed. Cambridge University Press, 1989. 2008. [39] O. Peters and M. Gell-Mann, “Evaluating gambles using dynamics,” [9] N. N. Taleb, Incerto: Antifragile, The Black Swan , Fooled by Random- arXiv preprint arXiv:1405.0585, 2014. ness, the Bed of Procrustes, Skin in the Game. Random House and [40] T. Hastie, R. Tibshirani, and J. Friedman, “The elements of statistical Penguin, 2001-2018. learning: data mining, inference, and prediction, springer series in [10] K. J. Arrow, R. Forsythe, M. Gorham, R. Hahn, R. Hanson, J. O. statistics,” 2009. Ledyard, S. Levmore, R. Litan, P. Milgrom, F. D. Nelson et al., “The [41] H. Brighton and G. Gigerenzer, “Homo heuristicus and the bias–variance promise of prediction markets,” Science, vol. 320, no. 5878, p. 877, dilemma,” in Action, Perception and the Brain. Springer, 2012, pp. 68– 2008. 91. [11] B. De Finetti, “Probability, induction, and statistics,” 1972. [42] A. Kolmogorov, “Sulla determinazione empirica di una lgge di dis- [12] P. Cirillo and N. N. Taleb, “On the statistical properties and tail risk of tribuzione,” Inst. Ital. Attuari, Giorn., vol. 4, pp. 83–91, 1933. violent conflicts,” Physica A: Statistical Mechanics and its Applications, [43] C. F. Dietrich, Uncertainty, calibration and probability: the statistics of vol. 452, pp. 29–45, 2016. scientific and industrial measurement. Routledge, 2017. [13] N. N. Taleb, Dynamic Hedging: Managing Vanilla and Exotic Options. John Wiley & Sons (Wiley Series in Financial Engineering), 1997. [14] J. Dana, P. Atanasov, P. Tetlock, and B. Mellers, “Are markets more accurate than polls? the surprising informational value of “just asking”,” Judgment and Decision Making, vol. 14, no. 2, pp. 135–147, 2019. [15] P. Embrechts, Modelling extremal events: for insurance and finance. Springer, 1997, vol. 33. [16] B. Mandelbrot, “The pareto-levy law and the distribution of income,” International Economic Review, vol. 1, no. 2, pp. 79–106, 1960. [17] ——, “The stable paretian income distribution when the apparent exponent is near two,” International Economic Review, vol. 4, no. 1, pp. 111–115, 1963. [18] B. B. Mandelbrot, “New methods in statistical economics,” in Fractals and Scaling in Finance. Springer, 1997, pp. 79–104. [19] X. Gabaix, “Power laws in economics and finance,” National Bureau of Economic Research, Tech. Rep., 2008. [20] L. F. Richardson, “Frequency of occurrence of wars and other fatal quarrels,” Nature, vol. 148, no. 3759, p. 598, 1941. [21] D. G. Champernowne, “A model of income distribution,” The Economic Journal, vol. 63, no. 250, pp. 318–351, 1953. [22] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, “Coherent measures of risk,” Mathematical finance, vol. 9, no. 3, pp. 203–228, 1999. [23] N. N. Taleb and G. A. Martin, “How to prevent other financial crises,” SAIS Review of International Affairs, vol. 32, no. 1, pp. 49–60, 2012. [24] F. Delbaen and W. Schachermayer, “A general version of the fundamen- tal theorem of asset pricing,” Mathematische annalen, vol. 300, no. 1, pp. 463–520, 1994. [25] G. Bragues, “Prediction markets: The practical and normative possibil- ities for the social production of knowledge,” Episteme, vol. 6, no. 1, pp. 91–106, 2009. [26] C. R. Sunstein, “Deliberating groups versus prediction markets (or hayek’s challenge to habermas),” Episteme, vol. 3, no. 3, pp. 192–213, 2006. [27] F. A. Hayek, “The use of knowledge in society,” The American economic review, vol. 35, no. 4, pp. 519–530, 1945. [28] P. Cirillo, “Are your data really pareto distributed?” Physica A: Statis- tical Mechanics and its Applications, vol. 392, no. 23, pp. 5947–5962, 2013. [29] N. N. Taleb and R. Douady, “Mathematical definition, mapping, and detection of (anti) fragility,” Quantitative Finance, 2013. [30] J. Gatheral, The Volatility Surface: a Practitioner’s Guide. John Wiley & Sons, 2006. [31] K. Demeterifi, E. Derman, M. Kamal, and J. Zou, “More than you ever wanted to know about volatility swaps,” Working paper, Goldman Sachs, 1999. [32] G. Keren, “Calibration and probability judgements: Conceptual and methodological issues,” Acta Psychologica, vol. 77, no. 3, pp. 217–273, 1991. [33] N. N. Taleb, “How much data do you need? an operational, pre- asymptotic metric for fat-tailedness,” International Journal of Forecast- ing, 2018. [34] B. De Finetti, Philosophical Lectures on Probability: collected, edited, and annotated by Alberto Mura. Springer Science & Business Media, 2008, vol. 340. [35] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The m4 competition: Results, findings, conclusion and way forward,” International Journal of Forecasting, vol. 34, no. 4, pp. 802–808, 2018.