Bayes' Theorem/Rule: A First Intro

Until the mid-1700s, the theory of probability (as distinct from theories of valuation like expected utility theory) was focused almost entirely on estimating the likelihood of uncertain future events: lotteries, coin flips, or life expectancies. This class of estimate is often called aleatory probability, from the Latin aleator, meaning gambler. In law, aleatory contracts are those in which the signatories to the contract both risk loss or gain in the face of future uncertainty. A life insurance policy is an example of an aleatory contract.

Aleatory uncertainties are exactly the kind of probabilistic events that Pascal had envisioned as the subject of a calculus of probabilities. Regardless of whether or not the world is as truly deterministic as Descartes and Galileo hoped, we often do not know what will happen in the future. We do not know when a particular individual will die or whether a particular coin will land heads or tails up if it is flipped. Pascal's probability theory was designed to model events of this type. In the second half of the eighteenth century, two men revolutionized the calculus of probability when they realized that one could apply this probability theory not just to assess the likelihood of future events, but also to assess the likelihood of past events. While this may seem a small thing, it changed the way Europeans thought about the mathematics of probability and opened the way to a more formal theory of decision making.

Consider an uncertain situation which was of tremendous interest to both the English reverend Thomas Bayes and the French mathematician Pierre-Simon Laplace. An astronomer measures the angular altitude of Jupiter six times in rapid succession and gets six slightly different numbers each time. Jupiter has a single altitude, but we have six imperfect observations of that altitude, all of which differ. What, we might ask, was the most likely actual altitude of Jupiter at the time that we made our observations? It was Thomas Bayes' insight, published posthumously in 1763, that probability theory could be extended to answer questions of this type as well. Bayes reasoned that if one knew the distribution of errors induced by the astronomer's instruments, then one could mathematically infer the most likely true altitude of Jupiter when the observations were made. It is important to note that there is nothing aleatory about this kind of probability. At the time the measurement was made Jupiter certainly had an altitude. The only uncertainty derives from our own lack of knowledge. The limitation that we face in this example is entirely epistemological. Bayes was suggesting that probability theory could be used to describe epistemological uncertainty as well as aleatory uncertainty.

Unfortunately little is known about the historical Thomas Bayes. We do know that he was a rural Protestant theologian and minister who was not a member of the Church of England, a Dissenter. He published only two works during his life: a theological work entitled Divine Benevolence, or an Attempt to Prove That the Principal End of the Divine Providence and Government Is the Happiness of His Creatures, and a more mathematical work, An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of The Analyst, in which he defended Newton's calculus against an attack by the philosopher Bishop George Berkeley. After his death, Bayes' friend and executor Richard Price discovered amongst his papers a third manuscript entitled Essay Towards Solving a Problem in the Doctrine of Chances. Price presented that paper at the Royal Society in London in 1763, and it is entirely upon that work that Bayes' quite considerable fame rests.

Today Bayes is such a towering name in mathematics that it seems astonishing that we know so little about him. We do not, for example, know why he was elected a fellow of the Royal Society before his death. In fact, the only picture of Bayes that we have may not even be a portrait of him. The historical Bayes is almost a total mystery. To his contemporaries that may not have been terribly surprising; the posthumous publication of Bayes' essay in The Proceedings had almost no impact until Laplace rediscovered it about 10 years later.

Bayes' insight was profound. He realized that there are many events about which we have only partial or inaccurate knowledge; events that truly happened but about which, because of our limited knowledge, we are uncertain. It was Bayes who first realized that a mathematically complete kind of inverse probability could be used to infer the most likely values or properties of those events.1

The Bayesian theorem provides the basis for a fundamentally statistical approach to this kind of epistemological uncertainty. It does this by putting, on rigorous mathematical footing, the process of predicting the likelihood of all possible previous states of the world given one's available observations. Put in English, Bayes' theorem allows us to ask the following question: given my knowledge of how often I have observed that the world appeared to be in state x, and my knowledge of how well correlated my current sensory data is with the actual world state x, precisely how likely is it that the world was actually in state x?

Bayes' theorem is so important that I want to digress here to present a fairly complete example of how the mathematics of the theorem works. Imagine that you are a monkey trained to fixate a spot of light while two eccentric spots of light are also illuminated, just as in the example presented in chapter five. In this experiment, however, the central fixation light changes color to indicate which of the two eccentric target lights, the left one or the right one, will serve as your goal on this trial. If you can decide which target is the goal, and look at it, you receive a raisin as a reward. However, the color of the central fixation light (or more precisely the wavelength of the light emitted by the central stimulus) can be any one of a hundred different hues (or wavelengths). We can begin our Bayesian description of this task by saying that there are two possible world states: one state in which a leftward eye movement will be rewarded and one state in which a rightward eye movement will be rewarded.

In mathematical notation we designate these two world states as w1 and w2. State w1 is when a leftward eye movement, or saccade, will be rewarded and state w2 is when a rightward saccade will be rewarded. After observing 100 trials we discover that on 25% of trials a leftward movement was rewarded, irrespective of the color of the fixation light, and on 75% of trials the rightward movement was rewarded. Based upon this observation we can say that the prior probability that world state w1 will occur (known as P(w1)) is 0.25, and the prior probability of world state w2 is 0.75.
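For readers who like to see the arithmetic laid out explicitly, the priors amount to a simple frequency count. The sketch below (in Python; the variable names are my own, and the trial counts are simply the ones from the example above) shows the calculation:

```python
# Estimate the prior probability of each world state from 100 observed
# trials: w1 = leftward saccade rewarded, w2 = rightward saccade rewarded.
n_trials = 100
n_left_rewarded = 25    # trials on which the leftward movement paid off
n_right_rewarded = 75   # trials on which the rightward movement paid off

p_w1 = n_left_rewarded / n_trials    # P(w1) = 0.25
p_w2 = n_right_rewarded / n_trials   # P(w2) = 0.75
```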

1. As Stephen Stigler has pointed out, Thomas Stimpson was really the first mathematician to propose the idea of inverse proba- bilities, but it was Bayes who developed the mathematical approach on which modern inverse probabilities are based (Stigler, 1989).

To make these prior probabilities more accurate estimates of the state of the world we next have to take into account the color of the central fixation stimulus and the correlation of that stimulus color with each of the world states. To do that we need to generate a graph which plots the probability that we will encounter a particular stimulus wavelength (which we will call λ) when the world is in state w1. Figure 8.5a plots an example of such a probability density function, showing the likelihood of each value of λ when the world is in state w1, and when in state w2. We refer to this as the density function for λ in world state w1, or P(λ|w1).
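To make the rest of the worked example concrete, the running sketch continues below by inventing a specific shape for these two conditional distributions: Gaussian-shaped probability functions over 100 discrete wavelength bins. The centers and widths are purely illustrative assumptions; the actual curves in Figure 8.5a could take any shape.

```python
import numpy as np

# The 100 possible hues of the central fixation light, indexed 0..99.
wavelengths = np.arange(100)

def discrete_likelihood(center, width):
    """A normalized, Gaussian-shaped probability function over the bins."""
    p = np.exp(-0.5 * ((wavelengths - center) / width) ** 2)
    return p / p.sum()

# Hypothetical conditional distributions: suppose bluish hues tend to
# accompany state w1 (left rewarded) and reddish hues state w2.
p_lambda_given_w1 = discrete_likelihood(center=30, width=12)  # P(λ|w1)
p_lambda_given_w2 = discrete_likelihood(center=65, width=12)  # P(λ|w2)
```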

Next, in order to get the two graphs in Figure 8.5a to tell us how likely it is that we see a given λ and the world is in a given state, we have to correct these graphs for the overall likelihood that the world is in either state w1 or state w2. To do that we multiply each point on the graphs by the prior probability of that world state. The graph on the left thus becomes: P(λ|w1)P(w1), where P(w1) is the prior probability for world state w1 as described above. Note in Figure 8.5b that this has the effect of re-scaling the graphs that appeared in Figure 8.5a.
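Continuing the sketch, this rescaling is a single elementwise multiplication of each conditional curve by its prior (using p_w1 and p_w2 from the first snippet):

```python
# Scale each conditional curve by its prior, as in Figure 8.5b.
scaled_w1 = p_lambda_given_w1 * p_w1  # P(λ|w1)P(w1)
scaled_w2 = p_lambda_given_w2 * p_w2  # P(λ|w2)P(w2)
```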

Finally, we have to determine how likely it is that any given value of λ will occur regardless of world state. To do this we need simply to count up all the times that we have seen λ at a specific value and then plot the probability density function1 for all values of λ (irrespective of which movement was rewarded) as shown in Figure 8.5c.
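In the sketch, because w1 and w2 are the only two possible world states, counting up every occurrence of λ amounts to summing the two rescaled curves from the previous snippet:

```python
# Overall probability of seeing each wavelength, regardless of world
# state: P(λ) = P(λ|w1)P(w1) + P(λ|w2)P(w2).
p_lambda = scaled_w1 + scaled_w2
```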

Now we are ready to ask: when we see a given wavelength of light, what is the likelihood that on this trial a leftward movement will be rewarded (that we are in world state w1), and what is the likelihood that a rightward movement will be rewarded (world state w2)? To compute these likelihoods we divide the curves shown in Figure 8.5b by the curve shown in Figure 8.5c. This essentially corrects the likelihood that one would see a particular λ in a particular world state for the overall likelihood that one would ever have seen that wavelength λ. This is the essence of the Bayesian theorem, given by the equation:

P(w1|λ) = P(λ|w1)P(w1) / P(λ)
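In the running sketch, this posterior is a single elementwise division of the rescaled curves by the overall curve, and the two posteriors necessarily sum to one at every wavelength:

```python
# Posterior probability of each world state at every wavelength bin.
posterior_w1 = scaled_w1 / p_lambda  # P(w1|λ)
posterior_w2 = scaled_w2 / p_lambda  # P(w2|λ)

# Sanity check: at every wavelength the two posteriors sum to 1.
assert np.allclose(posterior_w1 + posterior_w2, 1.0)
```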

To restate the equation in English one could say: the best possible estimate of the probability that a leftward movement will be rewarded is equal to the probability that the central stimulus would be this color on a leftward trial, times the overall probability of a leftward trial, divided by the probability that this particular color would ever be observed. The result is usually referred to as a posterior probability and it reports, in principle, the best estimate that you can derive for this likelihood. Therein lies the absolute beauty of Bayes' theorem. Bayes' theorem provides a mechanical tool which can report the best possible estimate of the likelihood of an event. No other method, no matter how sophisticated, can provide a more accurate estimate of the likelihood of an uncertain event. The Bayesian theorem is a critical advance because no decision process which must estimate the likelihood of an uncertain outcome can ever do better than a Bayesian estimate of that probability.

1. I should point out that P(λ) in this specific example is actually a probability function, not a probability density function, because wavelength is treated as a discrete variable. In practice this makes little difference to this exposition but it is, in fairness, an abuse of notation which more mathematical readers may find annoying.

The Bayesian theorem is a tool for reducing epistemological uncertainty to a minimal level and then for assigning probabilities to world states.

[Figure 8.5. A: The probability of seeing each wavelength λ given each world state, P(λ|w1) and P(λ|w2). B: The same curves scaled by the prior probability of each state, P(λ|w1)P(w1) and P(λ|w2)P(w2), where P(w1) = 0.25 and P(w2) = 0.75. C: The total probability of seeing each wavelength regardless of world state, P(λ). D: The likelihood that you are in a particular world state given that you saw a particular wavelength, computed by Bayes' theorem as P(λ|w1)P(w1)/P(λ) and P(λ|w2)P(w2)/P(λ). Each panel plots probability (or probability density) as a function of wavelength. Source: PWG.]