Belief functions: A gentle introduction
Seoul National University

Professor Fabio Cuzzolin

School of Engineering, Computing and Mathematics Oxford Brookes University, Oxford, UK

Seoul, Korea, 30/05/18

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 1 / 125 Outline

1 Uncertainty
  Second-order uncertainty
  Classical probability
2 Beyond probability
  Set-valued observations
  Propositional evidence
  Scarce data
  Representing ignorance
  Rare events
  Uncertain data
3 Belief theory
  A theory of evidence
  Belief functions
  Semantics
  Dempster's rule
  Multivariate analysis
  Misunderstandings
4 Reasoning with belief functions
  Statistical inference
  Combination
  Conditioning
  Belief vs Bayesian reasoning
  Generalised Bayes Theorem
  The total belief theorem
  Decision making
5 Theories of uncertainty
  Imprecise probability
  Monotone capacities
  Probability intervals
  Fuzzy and possibility theory
  Probability boxes
  Rough sets
6 Belief functions on reals
  Continuous belief functions
  Random sets
7 Conclusions

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 2 / 125 Uncertainty Second-order uncertainty Orders of uncertainty

the difference between predictable and unpredictable variation is one of the fundamental issues in the philosophy of probability
second-order uncertainty: being uncertain about our very model of uncertainty
has a consequence on human behaviour: people are averse to unpredictable variations (as in Ellsberg's paradox)
how good are Kolmogorov's measure-theoretic probability, or Bayesian and frequentist approaches, at modelling second-order uncertainty?

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 3 / 125 Uncertainty Classical probability Probability measures

mainstream mathematical theory of (first order) uncertainty: mathematical (measure-theoretical) probability, mainly due to Russian mathematician Andrey Kolmogorov
probability is an application of measure theory, the theory of assigning numbers to sets
additive probability measure → mathematical representation of the notion of chance
assigns a probability value to every subset of a collection of possible outcomes (of a random experiment, of a decision problem, etc)
collection of outcomes Ω → sample space, universe
subset A of the universe → event

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 4 / 125 Uncertainty Classical probability Probability measures

probability measure µ: a real-valued function on a probability space that satisfies countable additivity
probability space: it is a triplet (Ω, F, P) formed by a universe Ω, a σ-algebra F of its subsets, and a probability measure on F

I not all subsets of Ω belong necessarily to F
axioms of probability measures:

I µ(∅) = 0, µ(Ω) = 1
I 0 ≤ µ(A) ≤ 1 for all events A ∈ F
I additivity: for every countable collection of pairwise disjoint events Ai: µ(∪i Ai) = Σi µ(Ai)
probabilities have different interpretations: we consider frequentist and Bayesian (subjective) probability

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 5 / 125 Uncertainty Classical probability Frequentist inference

in the frequentist interpretation, the (aleatory) probability of an event is its relative frequency in time
the frequentist interpretation offers guidance in the design of practical 'random' experiments
developed by Fisher, Pearson, Neyman
three main tools:

I statistical hypothesis testing
I model selection
I confidence interval analysis

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 6 / 125 Uncertainty Classical probability Statistical hypothesis testing

1 state the research hypothesis
2 state the relevant null and alternative hypotheses
3 state the statistical assumptions being made about the sample, e.g. assumptions about the statistical independence or about the form of the distributions of the observations
4 state the relevant test statistic T (a quantity derived from the sample)
5 derive the distribution of the test statistic under the null hypothesis from the assumptions
6 set a significance level (α), i.e. a probability threshold below which the null hypothesis will be rejected

7 compute from the observations the observed value tobs of the test statistic T
8 calculate the p-value, the probability (under the null hypothesis) of sampling a test statistic at least as extreme as the observed value
9 reject the null hypothesis, in favor of the alternative hypothesis, if and only if the p-value is less than the significance level threshold (a numerical sketch of steps 7-9 follows below)
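A minimal sketch of steps 7-9 (not from the slides): the sample, the test statistic (number of heads) and the significance level are assumptions chosen purely for illustration, anticipating the coin-toss example used later in the talk.

```python
# Hypothetical illustration of the hypothesis-testing recipe above.
# H0: the coin is fair (p = 0.5); test statistic T = number of heads in n tosses.
from scipy.stats import binom

n, alpha = 10, 0.05            # sample size and significance level (assumed)
t_obs = 7                      # observed value of the test statistic (7 heads)

# p-value: probability, under H0, of a statistic at least as extreme as t_obs
p_value = binom.sf(t_obs - 1, n, 0.5)      # P(T >= t_obs | H0) ~ 0.172

print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```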

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 7 / 125 Uncertainty Classical probability P-values

[Figure: probability density over the set of possible results, with the observed data point marked; the p-value is the probability of the very unlikely observations at least as extreme as the one observed]

the p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false: frequentist statistics does not and cannot attach probabilities to hypotheses

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 8 / 125 Uncertainty Classical probability Maximum Likelihood Estimation (MLE)

the term 'likelihood' was popularized in mathematical statistics by Ronald Fisher in 1922: 'On the mathematical foundations of theoretical statistics'
Fisher argued against 'inverse' (Bayesian) probability as a basis for statistical inferences, and instead proposed inferences based on likelihood functions
likelihood principle: all of the evidence in a sample relevant to model parameters is contained in the likelihood function

I this is hotly debated, still [Mayo, Gandenberger]
maximum likelihood estimation:

{θ̂mle} ⊆ {arg max_{θ∈Θ} L(θ ; x1, ..., xn)}, where L(θ ; x1, ..., xn) = f(x1, x2, ..., xn | θ) and {f(·|θ), θ ∈ Θ} is a parametric model
consistency: the sequence of MLEs converges in probability, for a sufficiently large number of observations, to the (actual) value being estimated
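A minimal sketch of maximum-likelihood estimation under an assumed Bernoulli model (the sample and model are illustration-only, not taken from the slides): the numerical maximiser of the log-likelihood coincides with the closed-form estimate k/n.

```python
# Hypothetical MLE sketch: Bernoulli model f(x|theta) = theta^x (1-theta)^(1-x)
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 1])   # assumed sample (1 = head)

def neg_log_likelihood(theta):
    # L(theta; x1..xn) = prod_i f(x_i | theta); we minimise its negative log
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", res.x)          # ~0.7
print("closed form k/n:", x.mean())     # 0.7
```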

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 9 / 125 Uncertainty Classical probability Subjective probability

(epistemic) probability = degrees of belief of an individual assessing the state of the world
Ramsey and de Finetti → subjective beliefs must follow the laws of probability if they are to be coherent (if this 'proof' were watertight we would not be here in front of you!)
also, evidence casts doubt that humans will have coherent beliefs or behave rationally

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 10 / 125 Uncertainty Classical probability Bayesian inference

prior distribution: the distribution of the parameter(s) before any data is observed, i.e. p(θ | α); it depends on a vector of hyperparameters α
likelihood: the distribution of the observed data conditional on its parameters, i.e. p(X | θ)
marginal likelihood (sometimes also termed the evidence): the distribution of the observed data marginalized over the parameter(s): p(X | α) = ∫_Θ p(X | θ) p(θ | α) dθ

posterior distribution: the distribution of the parameter(s) after taking into account the observed data, as determined by Bayes' rule:
p(θ | X, α) = p(X | θ) p(θ | α) / p(X | α) ∝ p(X | θ) p(θ | α)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 11 / 125 Beyond probability Outline

1 Uncertainty
  Second-order uncertainty
  Classical probability
2 Beyond probability
  Set-valued observations
  Propositional evidence
  Scarce data
  Representing ignorance
  Rare events
  Uncertain data
3 Belief theory
  A theory of evidence
  Belief functions
  Semantics
  Dempster's rule
  Multivariate analysis
  Misunderstandings
4 Reasoning with belief functions
  Statistical inference
  Combination
  Conditioning
  Belief vs Bayesian reasoning
  Generalised Bayes Theorem
  The total belief theorem
  Decision making
5 Theories of uncertainty
  Imprecise probability
  Monotone capacities
  Probability intervals
  Fuzzy and possibility theory
  Probability boxes
  Rough sets
6 Belief functions on reals
  Continuous belief functions
  Random sets
7 Conclusions

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 12 / 125 Beyond probability Something is wrong?

measure-theoretical mathematical probability is not general enough:

I cannot (properly) model missing data
I cannot (properly) model propositional data
I cannot really model unusual data (second order uncertainty)

the frequentist approach to probability:

I cannot really model pure data (without 'design')
I in a way, cannot even model properly continuous data
I models scarce data only asymptotically

Bayesian reasoning has several limitations:

I cannot model no data (ignorance)
I cannot model 'uncertain' data
I cannot model pure data (without prior)
I again, cannot properly model scarce data (only asymptotically)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 13 / 125 Beyond probability Fisher has not got it all right

the setting of hypothesis testing is (arguably) arguable

I the scope is quite narrow: rejecting or not rejecting a hypothesis (although it can provide confidence intervals)
I the criterion is arbitrary: who decides what an 'extreme' realisation is (choice of α)? what is the deal with 0.05 and 0.01?
I the whole 'tail' idea comes from the fact that, under measure theory, the conditional probability (p-value) of a point outcome x is zero – it seems an attempt to patch an underlying problem with the way probability is mathematically defined
cannot cope with pure data, without assumptions on the process (experiment) which generated them (we will come back to this later)
deals with scarce data only asymptotically

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 14 / 125 Beyond probability The problem(s) with Bayes

pretty bad at representing ignorance

I Jeffreys' 'uninformative' priors are just not adequate
I different results on different parameter spaces
Bayes' rule assumes the new evidence comes in the form of certainty: "A is true"

I in the real world, often this is not the case ('uncertain' or 'vague' evidence)
beware the prior! → model selection in Bayesian statistics

I results from a confusion between the original subjective interpretation, and the objectivist view of a rigorous objective procedure
I why should we 'pick' a prior? either there is prior knowledge (beliefs) or there is not
I all will be fine, in the end! asymptotically, the choice of the prior does not matter (really!)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 15 / 125 Beyond probability Set-valued observations The die as a random variable

[Figure: the six faces of the die in Ω mapped to the values 1, 2, 3, 4, 5, 6 of the random variable X]

a die is a simple example of (discrete) random variable
there is a probability space Ω = {face1, face2, ..., face6} which maps to the values 1, 2, ..., 6 (no need to worry about measurability here)
now, imagine that face1 and face2 are cloaked, and we roll the die

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 16 / 125 Beyond probability Set-valued observations The cloaked die: set-valued observations

[Figure: the cloaked die, with face1 and face2 both mapped to the set of values {1, 2}]

the same probability space Ω = {face1, face2, ..., face6} is still there (nothing has changed in the way the die works)
however, now the mapping is different: both face1 and face2 are mapped to the set of possible values {1, 2} (since we cannot observe the outcome)
this is a random set [Matheron, Kendall, Nguyen, Molchanov]: a set-valued random variable
whenever data are missing, observations are inherently set-valued

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 17 / 125 Beyond probability Propositional evidence Reliable witnesses Evidence supporting propositions

suppose there is a murder, and three people are under trial for it: Peter, John and Mary
our hypothesis space is therefore

Θ = {Peter, John, Mary}

there is a witness: he testifies that the person he saw was a man
this amounts to supporting the proposition A = {Peter, John} ⊂ Θ
should we take this testimony at face value? in fact, the witness was tested, and the machine reported only a 20% chance that he was drunk when he witnessed the crime
we should partly support the (vacuous) hypothesis that any one among Peter, John and Mary could be the murderer: it is natural to assign 80% chance to proposition A, and 20% chance to proposition Θ

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 18 / 125 Beyond probability Propositional evidence Dealing with propositional evidence

even when evidence (data) supports propositions, Kolmogorov's probability forces us to specify support for individual outcomes
this is unreasonable – an artificial constraint due to a mathematical model that is not general enough

I we have no elements to assign this 80% probability to either Peter or John, nor to distribute it among them
the cause is the additivity of probability measures: but this is not the most general type of measure for sets
under a minimal requirement of monotonicity, measures can potentially be suitable to describe probabilities of events: these objects are called capacities
in particular, random sets are capacities in which the numbers assigned to subsets are given by a probability distribution over the subsets themselves (a mass assignment)

Belief functions and propositional evidence As capacities (and random sets in particular), belief functions allow us to assign mass directly to propositions.

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 19 / 125 Beyond probability Scarce data Machines that learn Generalising from scarce data

machine learning: designing algorithms that can learn from data
BUT, we train them on a ridiculously small amount of data: how can we make sure they are robust to new situations never encountered before (model adaptation)?
statistical learning theory [Vapnik] is based on traditional probability

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 20 / 125 Beyond probability Scarce data Dealing with scarce data

a somewhat naive objection: probability distributions assume an infinite amount of evidence, so in reality finite evidence can only provide a constraint on the ‘true’ probability values

I unfortunately, those who believe probabilities to be limits of relative frequencies (the frequentists) never really 'estimate' a probability from the data – they only assume ('design') probability distributions for their p-values
I Fisher: fine, I can never compute probabilities, but I can use the data to test my hypotheses on them
I in opposition, those who do estimate probability distributions from the data (the Bayesians) do not think of probabilities as infinite accumulations of evidence (but as degrees of belief)
I Bayes: I only need to be able to model a likelihood function of the data
well, actually, frequentists do estimate probabilities from scarce data when they do stochastic regression (e.g., logistic regression)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 21 / 125 Beyond probability Scarce data Asymptotic happiness

what is true, is that both frequentists and Bayesians seem to be happy with solving their problems 'asymptotically'
I limit properties of ML estimates
I Bernstein-von Mises theorem
what about the here and now? e.g. smart cars?

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 22 / 125 Beyond probability Representing ignorance Modelling pure data Bayesian inference

Bayesian reasoning requires modelling the data and a prior (actually, you need to pick the proper hypothesis space too!)

I prior is just a name for beliefs built over a long period of time, from the evidence you have observed – so long a time has passed that all track record of observations is lost, and all that is left is a probability distribution
why should we 'pick' a prior? either there is prior knowledge or there is not
nevertheless we are compelled to pick one, because the mathematical formalism requires it

I this is the result of a confusion between the original subjective interpretation (where prior beliefs always exist), and the objectivist view of a rigorous objective procedure (where in most cases we do not have any prior knowledge)
Bayesians then go in 'damage limitation' mode, and try to pick the least damaging prior (see 'ignorance' later)
all will be fine, in the end! (Bernstein-von Mises theorem) asymptotically, the choice of the prior does not matter (really!)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 23 / 125 Beyond probability Representing ignorance Dangerous priors Bayesian inference

the prior distribution is typically hard to determine – ‘solution’ → pick an ‘uninformative’ probability

I Jeffreys' prior → square root of the determinant of the Fisher information matrix
I can be improper (unnormalised), and
I it violates the strong version of the likelihood principle: inferences depend not just on the data likelihood but also on the universe of all possible experimental outcomes
uniform priors can lead to different results on different spaces, given the same likelihood functions
the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior (Bernstein-von Mises theorem)
A. W. F. Edwards: "It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better."

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 24 / 125 Beyond probability Representing ignorance Modelling pure data Frequentist inference

the frequentist approach is inherently unable to describe pure data, without making additional assumptions on the data-generating process
in Nature one cannot 'design' an experiment: data come your way, whether you want them or not – you cannot set the 'stopping rules'

I again, it recalls the old image of a scientist 'analysing' (from Greek 'ana'+'lysis', breaking up) a specific aspect of the world in their lab
the same data can lead to opposite conclusions

I different experiments can lead to the same data, whereas the parametric model employed (family of probability distributions) is linked to a specific experiment
I apparently, however, frequentists are just fine with this

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 25 / 125 Beyond probability Representing ignorance Dealing with ignorance Shafer vs Bayes

'uninformative' priors can be dangerous (Andrew Gelman): they violate the strong likelihood principle, may be unnormalised
wrong priors can kill a Bayesian model
priors in general cannot handle multiple hypothesis spaces in a coherent way (families of frames, in Shafer's terminology)

Belief functions and priors Reasoning with belief functions does not require any prior.

Belief functions and ignorance Belief functions naturally represent ignorance via the ‘vacuous’ belief function, assigning mass 1 to the whole hypothesis space.

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 26 / 125 Beyond probability Rare events Extinct dinosaurs The statistics of rare events

dinosaurs probably were worrying about overpopulation risks.. .. until it hit them!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 27 / 125 Beyond probability Rare events What’s a rare event?

what is a 'rare' event? clearly we are interested in them because they are not so rare, after all!
examples of rare events, also called 'tail risks' or 'black swans', are: volcanic eruptions, meteor impacts, financial crashes ..
mathematically, an event is 'rare' when it covers a region of the hypothesis space which is seldom sampled – it is an issue with the quality of the sample

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 28 / 125 Beyond probability Rare events Rare events and second-order uncertainty probability distributions for the system’s behaviour are built in ‘normal’ times (e.g. while a nuclear plant is working just fine), then used to extrapolate results at the ‘tail’ of the distribution P(Y=1|x) 'rare' event 1 popular statistical procedures (e.g. logistic regression) can sharply underestimate the probability of rare events 0.5 training Harvard’s G. King [2001] has samples proposed corrections based on oversampling the ‘rare’ events w.r.t 0 the ‘normal’ ones −6−4−2 0 2 4 6 x the issue is really one with the reliability of the model! we need to explictly model second-order uncertainty Belief functions and rare events Belief functions can model second-order uncertainty: rare events are a form of lack of information in certain regions of the sample space.

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 29 / 125 Beyond probability Uncertain data Uncertain data

concepts themselves can be not well defined, e.g. ‘dark’ or ‘somewhat round’ object (qualitative data)

I fuzzy theory accounts for this via the concept of graded membership
unreliable sensors can generate faulty (outlier) measurements: can we still treat these data as 'certain'? or is it more natural to attach to them a degree of reliability, based on the past track record of the 'sensor' (data generating process)? but then, can we still apply Bayes' rule?
people ('experts', e.g. doctors) tend to express themselves in terms of likelihoods directly (e.g. 'I think diagnosis A is most likely, otherwise either A or B')

I if the doctors were frequentists, and were provided with the same data, they would probably apply logistic regression and come up with the same prediction on P(disease|symptoms): unfortunately doctors are not statisticians
multiple sensors can provide as output a PDF on the same space

I e.g., two Kalman filters based one on color, the other on motion (optical flow), providing a normal predictive PDF on the location of the target in the image plane

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 30 / 125 Belief theory Outline

1 Uncertainty
  Second-order uncertainty
  Classical probability
2 Beyond probability
  Set-valued observations
  Propositional evidence
  Scarce data
  Representing ignorance
  Rare events
  Uncertain data
3 Belief theory
  A theory of evidence
  Belief functions
  Semantics
  Dempster's rule
  Multivariate analysis
  Misunderstandings
4 Reasoning with belief functions
  Statistical inference
  Combination
  Conditioning
  Belief vs Bayesian reasoning
  Generalised Bayes Theorem
  The total belief theorem
  Decision making
5 Theories of uncertainty
  Imprecise probability
  Monotone capacities
  Probability intervals
  Fuzzy and possibility theory
  Probability boxes
  Rough sets
6 Belief functions on reals
  Continuous belief functions
  Random sets
7 Conclusions

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 31 / 125 Belief theory A theory of evidence A mathematical theory of evidence

Shafer called his proposal 'A mathematical theory of evidence'
the mathematical objects it deals with are called 'belief functions'
where do these names come from? what interpretation of probability do they entail?

it is a theory of epistemic probability: it is about probabilities as a mathematical representation of knowledge (a human's knowledge, or a machine's)
it is a theory of evidential probability: such probabilities representing knowledge are induced ('elicited') by the available evidence

[Figure: diagram relating evidence, probability, knowledge, belief and truth]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 32 / 125 Belief theory A theory of evidence Evidence supporting hypotheses

in probabilistic logic, statements such as "hypothesis H is probably true" mean that the empirical evidence E supports H to a high degree
this degree of support is called the epistemic probability of H given E

Rationale There exists evidence in the form of probabilities, which supports degrees of belief on a certain matter.

the space where the evidence lives is different from the hypothesis space
they are linked by a one-to-many map: but this is a random set!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 33 / 125 Belief theory Belief functions Dempster's multivalued mappings

Dempster's work formalises random sets via multivalued (one-to-many) mappings Γ from a probability space (Ω, F, P) to the domain of interest Θ

[Figure: the multivalued mapping Γ from Ω = {drunk (0.2), not drunk (0.8)} to subsets of Θ = {Mary, Peter, John}]

example taken from a famous 'trial' example [Shafer]
elements of Ω are mapped to subsets of Θ: once again this is a random set
I in the example Γ maps {not drunk} ∈ Ω to {Peter, John} ⊂ Θ
the probability distribution P on Ω induces a mass assignment m : 2^Θ → [0, 1] on the power set 2^Θ = {A ⊆ Θ} via the multivalued mapping Γ : Ω → 2^Θ

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 34 / 125 Belief theory Belief functions Belief and plausibility measures

the belief in A as the probability that the evidence implies A:

Bel(A) = P({ω ∈ Ω|Γ(ω) ⊆ A})

the plausibility of A as the probability that the evidence does not contradict A:

Pl(A) = P({ω ∈ Ω | Γ(ω) ∩ A ≠ ∅}) = 1 − Bel(Ā)

originally termed by Dempster 'lower and upper probabilities'
belief and plausibility values can (but this is disputed) be interpreted as lower and upper bounds to the values of an unknown, underlying probability measure: Bel(A) ≤ P(A) ≤ Pl(A) for all A ⊆ Θ

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 35 / 125 Belief theory Belief functions Basic probability assignments Mass functions

belief functions (BF) are functions from 2^Θ, the set of all subsets of Θ, to [0, 1], assigning values to subsets of Θ
it can be proven that each belief function has the form
Bel(A) = Σ_{B⊆A} m(B)

where m is a mass function or basic probability assignment on Θ, defined as a function 2^Θ → [0, 1] such that:
m(∅) = 0, Σ_{A⊆Θ} m(A) = 1

any subset A of Θ such that m(A) > 0 is called a focal element (FE) of m
working with belief functions reduces to manipulating focal elements
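A minimal sketch (with an assumed frame and masses echoing the trial example) of how Bel and Pl are computed directly from the focal elements of a mass function.

```python
# Hypothetical sketch: Bel(A) = sum of m(B) over focal elements B contained in A,
# Pl(A) = sum of m(B) over focal elements B intersecting A.
theta = {"Peter", "John", "Mary"}                        # frame of discernment
m = {frozenset({"Peter", "John"}): 0.8,                  # focal elements (assumed)
     frozenset(theta): 0.2}

def bel(A, m):
    return sum(v for B, v in m.items() if B <= A)

def pl(A, m):
    return sum(v for B, v in m.items() if B & A)

A = frozenset({"Peter", "John"})
print(bel(A, m), pl(A, m))                                        # 0.8 1.0
print(bel(frozenset({"Mary"}), m), pl(frozenset({"Mary"}), m))    # 0.0 0.2
```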

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 36 / 125 Belief theory Belief functions A generalisation of sets, fuzzy sets, probabilities

belief functions generalise traditional (‘crisp’) sets:

I a logical (or "categorical") mass function has one focal set A, with m(A) = 1
belief functions generalise standard probabilities:

I a Bayesian mass function has as only focal sets elements (rather than subsets) of Θ
complete ignorance is represented by the vacuous mass function: m(Θ) = 1
belief functions generalise fuzzy sets (see possibility theory later), which are assimilated to consonant BFs, whose focal elements are nested: A1 ⊂ ... ⊂ Am

[Figure: examples of consonant, Bayesian and vacuous mass functions]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 37 / 125 Belief theory Semantics Semantics of belief functions Modelling second-order uncertainty

[Figure: the probability simplex, with a belief function Bel depicted as a convex set of probability distributions and the belief values Bel(A), Bel(B) as lower bounds]

belief functions have multiple interpretations
as set-valued random variables (random sets)
as (completely monotone) capacities (functions from the power set to [0, 1])
as a special class of credal sets (convex sets of probability distributions) [Levi, Kyburg]
as such, they are a very expressive way of modelling uncertainty on the model itself, due to lack of data quantity or quality, or both

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 38 / 125 Belief theory Semantics Axiomatic definition

belief functions can also be defined in axiomatic terms, just like Kolmogorov's additive probability measures
this is the definition proposed by Shafer in 1976

Belief function

A function Bel : 2^Θ → [0, 1] from the power set 2^Θ to [0, 1] such that:
Bel(∅) = 0, Bel(Θ) = 1;
for every n and for every collection A1, ..., An ∈ 2^Θ we have that:
Bel(A1 ∪ ... ∪ An) ≥ Σ_i Bel(Ai) − Σ_{i<j} Bel(Ai ∩ Aj) + ··· + (−1)^{n+1} Bel(A1 ∩ ... ∩ An)

makes clearer that belief measures generalise standard probability measures: replace additivity with superadditivity (third axiom)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 39 / 125 Belief theory Dempster's rule Jeffrey's rule of conditioning

belief measures include probability ones as a special case: what replaces Bayes' rule?
Jeffrey's rule of conditioning: a step forward from certainty and Bayes' rule
an initial probability P stands corrected by a second probability P′, defined only on a number of events
suppose P is defined on a σ-algebra A
there is a new probability measure P′ on a sub-algebra B of A, and the updated probability P″ has to:
1 meet the probability values specified by P′ for events in B
2 be such that ∀ B ∈ B, X, Y ⊂ B, X, Y ∈ A:
P″(X)/P″(Y) = P(X)/P(Y) if P(Y) > 0, and P″(X)/P″(Y) = 0 if P(Y) = 0

there is a unique solution: P″(A) = Σ_{B∈B} P(A|B) P′(B)
generalises Bayes' conditioning! (obtained when P′(B) = 1 for some B)
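A small numerical sketch of Jeffrey's rule on a finite space (all numbers are assumptions for illustration): the updated P″ matches P′ on the partition generating B and preserves the internal ratios of P inside each block.

```python
# Hypothetical sketch of Jeffrey's rule: P''(A) = sum_B P(A|B) P'(B)
P = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 0.1}             # initial probability (assumed)
partition = [{"a", "b"}, {"c", "d"}]                     # blocks generating B
P_prime = {frozenset({"a", "b"}): 0.7,                   # new evidence on B (assumed)
           frozenset({"c", "d"}): 0.3}

def jeffrey_update(P, partition, P_prime):
    P2 = {}
    for block in partition:
        pb = sum(P[x] for x in block)                    # P(B)
        for x in block:
            # P(x|B) * P'(B); if P(B) = 0 nothing can be redistributed inside B
            P2[x] = (P[x] / pb) * P_prime[frozenset(block)] if pb > 0 else 0.0
    return P2

print(jeffrey_update(P, partition, P_prime))
# {'a': 0.28, 'b': 0.42, 'c': 0.24, 'd': 0.06}
```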

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 40 / 125 Belief theory Dempster’s rule Conditioning versus combination

what if I have a new probability on the same σ-algebra A? Jeffrey's rule cannot be applied!
as we saw, this happens when multiple sensors provide predictive PDFs
belief functions deal with uncertain evidence by moving away from the concept of conditioning (via Bayes' rule) ..
.. to that of combining pieces of evidence supporting multiple (intersecting) propositions to various degrees

Belief functions and evidence Belief reasoning works by combining existing belief functions with new ones, which are able to encode uncertain evidence.

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 41 / 125 Belief theory Dempster's rule Dempster's combination

new piece of evidence: a blond hair has been found; also, there is a probability 0.6 that the room has been cleaned before the crime

[Figure: a second multivalued mapping from Ω2 = {cleaned (0.6), not cleaned (0.4)} to subsets of Θ = {Mary, Peter, John}, alongside the witness mapping from Ω1 = {drunk (0.2), not drunk (0.8)}]

the assumption is that pairs of outcomes in the source spaces, ω1 ∈ Ω1 and ω2 ∈ Ω2, support the intersection of their images in 2^Θ: θ ∈ Γ1(ω1) ∩ Γ2(ω2)

if this is done independently, then the probability that the pair (ω1, ω2) is selected is P1({ω1})P2({ω2}), yielding Dempster's rule of combination:
(m1 ⊕ m2)(A) = (1/(1 − κ)) Σ_{B∩C=A} m1(B) m2(C), ∀ ∅ ≠ A ⊆ Θ,
where κ = Σ_{B∩C=∅} m1(B) m2(C) is the conflict
Bayes' rule is a special case of Dempster's rule

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 42 / 125 Belief theory Dempster’s rule Dempster’s combination A simple numerical example

[Figure: focal elements A1, A2 of Bel1 and B1, B2 of Bel2 on the frame Θ = {θ1, θ2, θ3, θ4}, and the focal elements X1, X2, X3 of their Dempster combination]

with conflict κ = 0.42, the combined masses are:
m({θ1}) = 0.7 ∗ 0.4 / (1 − 0.42) ≈ 0.48
m({θ2}) = 0.3 ∗ 0.6 / (1 − 0.42) ≈ 0.31
m({θ1, θ2}) = 0.3 ∗ 0.4 / (1 − 0.42) ≈ 0.21
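A minimal sketch of Dempster's rule that reproduces the numbers above, under the assumption (made here only for illustration) that on a binary frame m1 assigns 0.7 to {θ1} and 0.3 to Θ, while m2 assigns 0.6 to {θ2} and 0.4 to Θ, so that the conflict is κ = 0.7 · 0.6 = 0.42.

```python
# Hypothetical sketch of Dempster's rule of combination
def dempster(m1, m2):
    combined, conflict = {}, 0.0
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            A = B & C
            if A:
                combined[A] = combined.get(A, 0.0) + v1 * v2
            else:
                conflict += v1 * v2                      # mass sent to the empty set
    return {A: v / (1.0 - conflict) for A, v in combined.items()}, conflict

theta = frozenset({"t1", "t2"})
m1 = {frozenset({"t1"}): 0.7, theta: 0.3}                # assumed masses
m2 = {frozenset({"t2"}): 0.6, theta: 0.4}

m12, kappa = dempster(m1, m2)
print(kappa)                                             # 0.42
print({tuple(sorted(A)): round(v, 2) for A, v in m12.items()})
# {('t1',): 0.48, ('t2',): 0.31, ('t1', 't2'): 0.21}
```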

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 43 / 125 Belief theory Dempster’s rule A generalisation of Bayesian inference

belief theory generalises Bayesian probability (it contains it as a special case), in that:

I classical probability measures are a special class of belief functions (in the finite case) or random sets (in the infinite case)
I Bayes' 'certain' evidence is a special case of Shafer's bodies of evidence (general belief functions)
I Bayes' rule of conditioning is a special case of Dempster's rule of combination

F it also generalises set-theoretical intersection: if mA and mB are logical mass functions and A ∩ B ≠ ∅, then mA ⊕ mB = mA∩B
however, it overcomes its limitations

I you do not need a prior: if you are ignorant, you will use the vacuous BF mΘ which, when combined with new BFs m′ encoding data, will not change the result: mΘ ⊕ m′ = m′

I however, if you do have prior knowledge you are welcome to use it!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 44 / 125 Belief theory Multivariate analysis Multivariate analysis Refinements and coarsenings

the theory allows us to handle evidence impacting on different but related domains
assume we are interested in the nature of an object in a road scene. We could describe it, e.g., in the frame Θ = {vehicle, pedestrian}, or in the finer frame Ω = {car, bicycle, motorcycle, pedestrian}
other example: different image features in pose estimation
a frame Ω is a refinement of a frame Θ (or, equivalently, Θ is a coarsening of Ω) if elements of Ω can be obtained by splitting some or all of the elements of Θ

[Figure: a refining ρ mapping each element θ1, θ2, θ3 of the coarser frame Θ to a subset of the finer frame Ω]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 45 / 125 Belief theory Multivariate analysis Families of compatible frames Multivariate analysis

when Ω is a refinement for a collection Θ1, ..., ΘN of other frames it is called their common refinement
two frames are said to be compatible if they do have a common refinement
compatible frames can be associated with different variables/attributes/features:

I let ΘX = {red, blue, green} and ΘY = {small, medium, large} be the domains of attributes X and Y describing, respectively, the color and the size of an object
I in such a case the common refinement ΘX ⊗ ΘY = ΘX × ΘY is simply the Cartesian product
or, they can be descriptions of the same variable at different levels of granularity (as in the road scene example)
evidence can be moved from one frame to another within a family of compatible frames

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 46 / 125 Belief theory Multivariate analysis Families of compatible frames Pictorial illustration

[Figure: pictorial illustration of a family of compatible frames Θ1, ..., Θn, with focal elements A1, ..., An, and their common refinement]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 47 / 125 Belief theory Multivariate analysis Marginalisation

let ΘX and ΘY be two compatible frames
let m^XY be a mass function on ΘX × ΘY
it can be expressed in the coarser frame ΘX by transferring each mass m^XY(A) to the projection of A on ΘX:

[Figure: a focal element A ⊆ ΘX × ΘY and its projection B = A↓ΘX on ΘX]

we obtain a marginal mass function on ΘX :

m^{XY↓X}(B) = Σ_{A ⊆ ΘXY : A↓ΘX = B} m^XY(A)   ∀B ⊆ ΘX

(again, it generalizes both set projection and probabilistic marginalization)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 48 / 125 Belief theory Multivariate analysis Vacuous extension

the "inverse" of marginalization
a mass function m^X on ΘX can be expressed in ΘX × ΘY by transferring each mass m^X(B) to the cylindrical extension of B:

[Figure: a focal element B ⊆ ΘX and its cylindrical extension A = B × ΘY in ΘX × ΘY]

this operation is called the vacuous extension of m^X in ΘX × ΘY:
m^{X↑XY}(A) = m^X(B) if A = B × ΘY, 0 otherwise

a strong feature of belief theory: the vacuous belief function (our representation of ignorance) is left unchanged when moving from one space to another!
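A minimal sketch (with assumed frames and masses) of marginalisation and vacuous extension on a product frame ΘX × ΘY, representing each focal element as a set of (x, y) pairs.

```python
# Hypothetical sketch: marginalisation and vacuous extension on Theta_X x Theta_Y
ThetaX = {"red", "blue"}
ThetaY = {"small", "large"}

# a mass function on the product frame (focal elements are sets of (x, y) pairs)
mXY = {frozenset({("red", "small"), ("red", "large")}): 0.6,
       frozenset({("red", "small"), ("blue", "small")}): 0.4}

def marginalise_X(mXY):
    """m^{XY down X}(B) = sum of m^{XY}(A) over A whose projection on Theta_X is B."""
    out = {}
    for A, v in mXY.items():
        B = frozenset(x for (x, _) in A)                 # projection of A on Theta_X
        out[B] = out.get(B, 0.0) + v
    return out

def vacuous_extension_X(mX, ThetaY):
    """m^{X up XY}(A) = m^X(B) when A is the cylindrical extension B x Theta_Y."""
    return {frozenset((x, y) for x in B for y in ThetaY): v for B, v in mX.items()}

print({tuple(sorted(B)): v for B, v in marginalise_X(mXY).items()})
# {('red',): 0.6, ('blue', 'red'): 0.4}
print(vacuous_extension_X({frozenset(ThetaX): 1.0}, ThetaY))
# the vacuous BF on Theta_X extends to the vacuous BF on the product frame
```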

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 49 / 125 Belief theory Misunderstandings Belief functions are not (general) credal sets

[Figure: the probability simplex, with a generic credal set Cre and the (smaller) credal set induced by a belief function Bel]

a belief function on Θ is in 1-1 correspondence with a convex set of probability distributions there (a credal set)
however, belief functions are a special class of credal sets, those induced by a random set mapping

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 50 / 125 Belief theory Misunderstandings Belief functions are not parameterised families of distributions, or confidence intervals

[Figure: the probability simplex, with a parameterised family of distributions Fam and the credal set of a belief function Bel as different kinds of subsets]

obviously, a parameterised family of distributions on Θ is a subset of the set of all possible distributions (just like belief functions)
not all families of distributions correspond to belief functions
example: the family of Gaussian PDFs with 0 mean and arbitrary variance, {N(0, σ), σ ∈ R+}, is not a belief function
they are not confidence intervals either: confidence intervals are one-dimensional, and their interpretation is entirely different. Confidence intervals are interval estimates

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 51 / 125 Belief theory Misunderstandings Belief functions are not second-order distributions

[Figure: a Dirichlet distribution vs a belief function seen as a uniform meta-distribution over a set of probabilities]

unlike hypothesis testing, general Bayesian inference leads to probability distributions over the space of parameters
these are second order probabilities, i.e. probability distributions on hypotheses which are themselves probabilities
belief functions can be defined on the hypothesis space Ω, or on the parameter space Θ

I when defined on Ω they are sets of PDFs and can then be seen as 'indicator' second order distributions (see figure)
I when defined on the parameter space Θ, they amount to families of second-order distributions
in the two cases they generalise MLE/MAP and general Bayesian inference, respectively

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 52 / 125 Reasoning with belief functions Outline

1 Uncertainty
  Second-order uncertainty
  Classical probability
2 Beyond probability
  Set-valued observations
  Propositional evidence
  Scarce data
  Representing ignorance
  Rare events
  Uncertain data
3 Belief theory
  A theory of evidence
  Belief functions
  Semantics
  Dempster's rule
  Multivariate analysis
  Misunderstandings
4 Reasoning with belief functions
  Statistical inference
  Combination
  Conditioning
  Belief vs Bayesian reasoning
  Generalised Bayes Theorem
  The total belief theorem
  Decision making
5 Theories of uncertainty
  Imprecise probability
  Monotone capacities
  Probability intervals
  Fuzzy and possibility theory
  Probability boxes
  Rough sets
6 Belief functions on reals
  Continuous belief functions
  Random sets
7 Conclusions

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 53 / 125 Reasoning with belief functions Reasoning with belief functions

1 inference: building a belief function from data (either statistical or qualitative)
2 reasoning: updating belief representations when new data arrives

I either by combination with another belief function
I or by conditioning with respect to new events/observations
3 manipulating conditional belief functions

I via a generalisation of Bayes' theorem
I via network propagation
I via a generalisation of the total probability theorem
4 using the resulting belief function(s) for:

I decision making
I regression
I classification
I etc (estimation, optimisation..)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 54 / 125 Reasoning with belief functions Reasoning with belief functions

[Figure: overview diagram of reasoning with belief functions: statistical data/opinions feed inference, which produces belief functions; combination and conditioning yield combined and conditional belief functions; manipulation produces total/marginal belief functions, which support decision making; efficient computation, measuring uncertainty and a continuous formulation appear as cross-cutting themes]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 55 / 125 Reasoning with belief functions Statistical inference Dempster’s approach to statistical inference Fiducial argument

consider a statistical model {f(x|θ), x ∈ X, θ ∈ Θ},

where X is the sample space and Θ the parameter space
having observed x, how to quantify the uncertainty about the parameter θ, without specifying a prior probability distribution?
suppose that we know a data-generating mechanism [Fisher]

X = a(θ, U)

where U is an (unobserved) auxiliary variable with known probability distribution µ : U → [0, 1] independent of θ
for instance, to generate a continuous random variable X with cumulative distribution function (CDF) Fθ, one might draw U from U([0, 1]) and set

X = F_θ^{−1}(U)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 56 / 125 Reasoning with belief functions Statistical inference Dempster's approach to statistical inference

the equation X = a(θ, U) defines a multi-valued mapping Γ : U → 2^{X×Θ}:
Γ : u ↦ Γ(u) = {(x, θ) ∈ X × Θ : x = a(θ, u)} ⊂ X × Θ

under the usual measurability conditions, the probability space (U, B(U), µ) and

the multi-valued mapping Γ induce a belief function BelX×Θ on X × Θ

I conditioning it on θ yields BelX(·|θ) ∼ f(·|θ) on X
I conditioning it on X = x gives BelΘ(·|x) on Θ

[Figure: the auxiliary variable U, with distribution µ : U → [0, 1], induces Bel_{X×Θ} on X × Θ; conditioning on θ yields Bel_X(·|θ), conditioning on X = x yields Bel_Θ(·|x)]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 57 / 125 Reasoning with belief functions Statistical inference Inference from classical likelihood [Shafer76, Denoeux]

consider a statistical model {L(θ; x) = f(x|θ), x ∈ X, θ ∈ Θ}, where X is the sample space and Θ the parameter space

Bel_Θ(θ|x) is the consonant belief function (with nested focal elements) with plausibility of the singletons equal to the normalized likelihood:
pl(θ|x) = L(θ; x) / sup_{θ′∈Θ} L(θ′; x)

takes the empirical normalised likelihood to be the upper bound to the probability density of the sought parameter! (rather than the actual PDF)

the corresponding plausibility function is Pl_Θ(A|x) = sup_{θ∈A} pl(θ|x)
the plausibility of a composite hypothesis A ⊂ Θ

Pl_Θ(A|x) = sup_{θ∈A} L(θ; x) / sup_{θ∈Θ} L(θ; x)
is the usual likelihood ratio statistic
compatible with the likelihood principle

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 58 / 125 Reasoning with belief functions Statistical inference Coin toss example Inference with belief functions

consider a coin toss experiment
we toss the coin n = 10 times, obtaining the sample
X = {H, H, T, H, T, H, T, H, H, H}

with k = 7 successes (heads H) and n − k = 3 fails (tails T)
parameter of interest: the probability θ = p of heads in a single toss
the inference problem consists then in gathering information on the value of p

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 59 / 125 Reasoning with belief functions Statistical inference Coin toss example General Bayesian inference

trials are typically assumed to be independent (they are equally distributed)
the likelihood of the sample is binomial: P(X|p) = p^k (1 − p)^{n−k}
apply Bayes' rule to get the posterior
P(p|X) = P(X|p)P(p) / P(X) ∼ P(X|p) = p^k (1 − p)^{n−k}
as we do not have a-priori information on the prior

[Figure: the likelihood function p^k (1 − p)^{n−k} as a function of p, peaked at the maximum likelihood estimate]
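A minimal sketch of the Bayesian route for this coin example, assuming a uniform (Beta(1,1)) prior so that the posterior is the conjugate Beta(k+1, n−k+1); the credible interval shown is an illustration, not part of the slides.

```python
# Hypothetical sketch: Bayesian inference for the coin toss with a uniform prior
from scipy.stats import beta

n, k = 10, 7                            # 7 heads out of 10 tosses
posterior = beta(k + 1, n - k + 1)      # Beta(8, 4), proportional to p^k (1-p)^(n-k)

print("posterior mean:", posterior.mean())               # ~0.667
print("posterior mode (MAP):", k / n)                    # 0.7
print("95% credible interval:", posterior.interval(0.95))
```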

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 60 / 125 Reasoning with belief functions Statistical inference Coin toss example Frequentist inference

what would a frequentist do?
it is reasonable that p be equal to p = k/n, i.e., the fraction of successes
we can then test this hypothesis in the classical frequentist setting
this implies assuming independent and equally distributed trials, so that the conditional distribution of the sample is the binomial
we can then compute the p-value for, say, a confidence level of α = 0.05
the right-tail p-value for the hypothesis p = k/n (the integral area in pink) is equal to 1/2 >> α = 0.05. Hence, the hypothesis cannot be rejected

[Figure: the likelihood function over p, with the right tail whose area (the p-value) equals 1/2 highlighted]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 61 / 125 Reasoning with belief functions Statistical inference Coin toss example Inference with likelihood-based belief functions

likelihood-based belief function inference yields the following belief measure, conditioned on the observed sample X, over Θ = [0, 1]:
Pl_Θ(A|X) = sup_{p∈A} L̂(p|X);   Bel_Θ(A|X) = 1 − Pl_Θ(A^c|X),   ∀A ⊆ Θ
where L̂(p|X) is the normalised version of the traditional likelihood

[Figure: the random set induced by the normalised likelihood L̂(p|X): it determines an entire envelope of PDFs on the parameter space Θ = [0, 1] (a belief function there)]

the random set associated with this belief measure is:
ω ∈ Ω = [0, 1] ↦ Γ_X(ω) = {θ ∈ Θ : Pl_Θ({θ}|X) ≥ ω} ⊂ Θ = [0, 1]
which is an interval centered around the ML estimate of p

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 62 / 125 Reasoning with belief functions Statistical inference Coin toss example Inference with likelihood-based belief functions
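A minimal, grid-based sketch (for illustration only) of this likelihood-based belief function on Θ = [0, 1]: the contour function pl(p|X) is the normalised binomial likelihood, Pl of a set is its supremum over the set, and Γ_X(ω) is the corresponding plausibility level set.

```python
# Hypothetical sketch: consonant belief function induced by the normalised likelihood
import numpy as np

n, k = 10, 7
p_grid = np.linspace(0.0, 1.0, 1001)
lik = p_grid**k * (1 - p_grid)**(n - k)
pl_contour = lik / lik.max()                 # pl(p|X) = L(p)/L(p_hat), p_hat = 0.7

def Pl(interval):
    """Plausibility of A = [a, b]: sup of the contour function over A."""
    a, b = interval
    mask = (p_grid >= a) & (p_grid <= b)
    return pl_contour[mask].max()

def Gamma(omega):
    """Focal element Gamma_X(omega) = {p : pl(p|X) >= omega}."""
    inside = p_grid[pl_contour >= omega]
    return inside.min(), inside.max()

print(Pl((0.0, 0.5)))    # ~0.44: plausibility that the coin is biased towards tails
print(Gamma(0.5))        # ~ (0.52, 0.85): an interval around the MLE 0.7
```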

the same procedure can be applied to the normalised empirical counts f̂(H) = 7/7 = 1, f̂(T) = 3/7, rather than to the normalised likelihood function
imposing Pl_Ω(H) = 1, Pl_Ω(T) = 3/7 on Ω = {H, T}, and looking for the least committed belief function there with these plausibility values
we get the mass assignment: m(H) = 4/7, m(T) = 0, m(Ω) = 3/7
(the least committed belief function maximises the mass of Ω: m(Ω) = Pl_Ω(T) = 3/7, hence m(H) = Pl_Ω(H) − m(Ω) = 4/7)

[Figure: the corresponding credal set, the interval of probabilities p ∈ [4/7, 1], with the MLE p = 7/10 marked]
p = 1 needs to be excluded, as the available sample evidence reports that we had n(T) = 3 counts already, so that 1 − p ≠ 0
this outcome (a belief function on Ω = {H, T}) 'robustifies' classical MLE

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 63 / 125 Reasoning with belief functions Statistical inference Summary on inference

general Bayesian inference → continuous PDF on the parameter space Θ (a second-order distribution)
MLE/MAP estimation → a single parameter value = a single PDF on Ω
generalised maximum likelihood → a belief function on Ω (a convex set of PDFs on Ω)

I generalises MAP/MLE
likelihood-based / Dempster-based belief function inference → a belief function on Θ = a convex set of second-order distributions

I generalises general Bayesian inference
Dempster's approach requires a data-generating process
likelihood approach produces only consonant BFs

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 64 / 125 Reasoning with belief functions Combination Combining vs conditioning Reasoning with belief functions

belief theory is a generalisation of Bayesian reasoning
whereas in Bayesian theory evidence is of the kind 'A is true' (e.g. a new datum is available) ..
.. in belief theory, new evidence can assume the more general form of a belief function

I a proposition A is a very special case of belief function with m(A) = 1
in most cases, reasoning needs then to be performed by combining belief functions, rather than by conditioning with respect to an event
nevertheless, conditional belief functions are of interest, especially for statistical inference

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 65 / 125 Reasoning with belief functions Combination Dempster’s rule under fire Zadeh’s paradox

question is: is Dempster's sum the only possible rule of combination?
seems to have paradoxical behaviour in certain circumstances
doctors have opinions about the condition of a patient, Θ = {M, C, T}, where M stands for meningitis, C for concussion and T for tumor
two doctors provide the following diagnoses:

I D1: "I am 99% sure it's meningitis, but there is a small chance of 1% that it is concussion".
I D2: "I am 99% sure it's a tumor, but there is a small chance of 1% that it is concussion".
can be encoded by the following mass functions:
m1({M}) = 0.99, m1({C}) = 0.01, m1(A) = 0 otherwise;
m2({T}) = 0.99, m2({C}) = 0.01, m2(A) = 0 otherwise.  (1)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 66 / 125 Reasoning with belief functions Combination Dempster’s rule under fire Zadeh’s paradox

their (unnormalised) Dempster’s combination is:

m(∅) = 0.9999, m({C}) = 0.0001

as the two masses are highly conflicting, normalisation yields the belief function focussed on C → "it is definitely concussion", although both experts had left it as only a fringe possibility
objections:

I the belief functions in the example are really probabilities, so this is a problem with Bayesian representations, if anything!
I diseases are never exclusive, so that it may be argued that Zadeh's choice of a frame of discernment is misleading → open world approaches with no normalisation
I doctors disagree so much that any person would conclude that one of them is just wrong → reliability of sources needs to be accounted for

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 67 / 125 Reasoning with belief functions Combination Dempster’s rule under fire Tchamova’s paradox

this time, the two doctors generate the following mass assignments over Θ = {M, C, T}:
m1({M}) = a, m1({M, C}) = 1 − a, m1(A) = 0 otherwise;
m2({M, C}) = b1, m2(Θ) = b2, m2({T}) = 1 − b1 − b2.  (2)
assuming equal reliability of the two doctors, Dempster's combination yields m1 ⊕ m2 = m1, i.e., Doctor 2's diagnosis is completely absorbed by that of Doctor 1!
here the 'paradoxical' behaviour is not a consequence of conflict
in Dempster's combination, every source of evidence has a 'veto' power over the hypotheses it does not believe to be possible
if any of them gets it wrong, the combined belief function will never give support to the 'correct' hypothesis

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 68 / 125 Reasoning with belief functions Combination Yager’s and Dubois’ rules

first answer to Zadeh's objections
based on the view that conflict is generated by non-reliable information sources
the conflicting mass m(∅) = Σ_{B∩C=∅} m1(B) m2(C) should be re-assigned to the whole frame Θ

let m∩(A) = Σ_{B∩C=A} m1(B) m2(C); then
mY(A) = m∩(A) for ∅ ≠ A ⊊ Θ,   mY(Θ) = m∩(Θ) + m(∅).  (3)

Dubois and Prade's idea: similar to Yager's, BUT conflicting mass is not transferred all the way up, but to B ∪ C (due to applying the minimum specificity principle)
mD(A) = m∩(A) + Σ_{B∪C=A, B∩C=∅} m1(B) m2(C).  (4)

the resulting BF dominates Yager’s combination: mD(A) ≥ mY (A) ∀A

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 69 / 125 Reasoning with belief functions Combination Conjunctive and disjunctive rules

rather than normalising (as in Dempster's rule) or re-assigning the conflicting mass m(∅) to other non-empty subsets (as in Yager's and Dubois' proposals), Smets' conjunctive rule leaves the conflicting mass with the empty set:
m∩(A) = Σ_{B∩C=A} m1(B) m2(C)  (5)

applicable to unnormalised belief functions in an open world assumption: the current frame only approximately describes the set of possible hypotheses
disjunctive rule of combination:
m∪(A) = Σ_{B∪C=A} m1(B) m2(C)  (6)

consensus between two sources is expressed by the union of the supported propositions, rather than by their intersection

I note that (Bel1 ∪ Bel2)(A) = Bel1(A) ∗ Bel2(A): belief values are simply multiplied!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 70 / 125 Reasoning with belief functions Combination Combination: some conclusions

Yager's rule is rather unjustified ..
Dubois' is kinda intermediate between conjunction and disjunction
my take on this: Dempster's (conjunctive) combination and disjunctive combination are the two extrema of a spectrum of possible results

Proposal: combination tubes? Meta-uncertainty on the sources generating the input belief functions (their independence and reliability) induces uncertainty on the result of the combination, represented by a bracket of combination rules, which produce a ‘tube’ of BFs.

fits well with the belief likelihood concept, and was already hinted at by Pearl in "Reasoning with belief functions: An analysis of compatibility"
we should probably work with intervals of belief functions then?

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 71 / 125 Reasoning with belief functions Conditioning Conditional belief functions Approaches

in Bayesian theory conditioning is done via Bayes' rule: P(A|B) = P(A ∩ B) / P(B)
for belief functions, many approaches to conditioning have been proposed (just as for combination!)

I original Dempster's conditioning
I Fagin and Halpern's lower envelopes
I "geometric conditioning" [Suppes]
I unnormalized conditional belief functions [Smets]
I generalised Jeffrey's rules [Smets]
I sets of equivalent events under multi-valued mappings [Spies]
several of them are special cases of combination rules: Dempster's, Smets' ..
others are the unique solution when interpreting belief functions as convex sets of probabilities (Fagin's)
once again, a duality emerges between the most and least cautious conditioning approaches

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 72 / 125 Reasoning with belief functions Conditioning Dempster’s conditioning

Dempster's rule of combination induces a conditioning operator
given a new event B, the "logical" belief function such that m(B) = 1 ..
... is combined with the a-priori belief function Bel using Dempster's rule
the resulting BF is the conditional belief function given B

[Figure: the belief function Bel, the conditioning event B, and the conditional belief function Bel(·|B)]

in terms of belief and plausibility values, Dempster’s conditioning yields

Bel⊕(A|B) = (Bel(A ∪ B̄) − Bel(B̄)) / (1 − Bel(B̄)) = (Pl(B) − Pl(B \ A)) / Pl(B),   Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)

obtained by Bayes’ rule by replacing probability with plausibility measures!
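A minimal sketch (assumed masses, reusing the trial example) showing Dempster conditioning implemented as combination with the categorical mass function m_B(B) = 1, checked against the plausibility formula Pl⊕(A|B) = Pl(A ∩ B)/Pl(B).

```python
# Hypothetical sketch: Dempster conditioning as combination with the logical BF on B
def dempster(m1, m2):
    out, conflict = {}, 0.0
    for X, v1 in m1.items():
        for Y, v2 in m2.items():
            Z = X & Y
            if Z:
                out[Z] = out.get(Z, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    return {Z: v / (1.0 - conflict) for Z, v in out.items()}

def pl(A, m):
    return sum(v for X, v in m.items() if X & A)

theta = frozenset({"Peter", "John", "Mary"})
m = {frozenset({"Peter", "John"}): 0.8, theta: 0.2}      # assumed prior BF
B = frozenset({"John", "Mary"})                          # conditioning event

m_given_B = dempster(m, {B: 1.0})
A = frozenset({"Mary"})
print(pl(A, m_given_B))             # conditioned plausibility of {Mary}: 0.2
print(pl(A & B, m) / pl(B, m))      # Pl(A ∩ B) / Pl(B): same value, 0.2
```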

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 73 / 125 Reasoning with belief functions Conditioning Lower envelopes of conditional probabilities

we know that a belief function can be seen as the lower envelope of the family of probabilities consistent with it: Bel(A) = inf_{P∈P[Bel]} P(A)
conditional belief function as the lower envelope (the inf) of the family of conditional probability functions P(A|B), where P is consistent with Bel:
Bel_Cr(A|B) = inf_{P∈P[Bel]} P(A|B),   Pl_Cr(A|B) = sup_{P∈P[Bel]} P(A|B)

quite incompatible with the random set interpretation
nevertheless, whereas lower/upper envelopes of arbitrary sets of probabilities are not in general belief functions, these actually are belief functions:

Bel_Cr(A|B) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(Ā ∩ B)),   Pl_Cr(A|B) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(Ā ∩ B))

they provide a more conservative estimate than Dempster's conditioning

BelCr (A|B) ≤ Bel⊕(A|B) ≤ Pl⊕(A|B) ≤ PlCr (A|B)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 74 / 125 Reasoning with belief functions Conditioning Geometric conditioning

Suppes and Zanotti proposed a 'geometric' conditioning approach
Bel_G(A|B) = Bel(A ∩ B) / Bel(B),   Pl_G(A|B) = (Bel(B) − Bel(B \ A)) / Bel(B)

retains only the masses of focal elements inside B, and normalises them:
m_G(A|B) = m(A) / Bel(B),   A ⊆ B

it is a consequence of the focussing approach to belief update: no new information is introduced, we merely focus on a specific subset of the original set
replaces probability with belief measures in Bayes' rule

Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)   ↔   Bel_G(A|B) = Bel(A ∩ B) / Bel(B)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 75 / 125 Reasoning with belief functions Conditioning Conjunctive rule of conditioning

it is induced by the conjunctive rule of combination: m∩(A|B) = m ∩ mB (mB is the logical BF focussed on B) [Smets]
its belief and plausibility values are:
Bel∩(A|B) = Bel(A ∪ B̄) if A ∩ B ≠ ∅, 0 if A ∩ B = ∅
Pl∩(A|B) = Pl(A ∩ B) if A ⊉ B, 1 if A ⊇ B

it is compatible with the principles of belief revision [Gilboa, Perea]: a state of belief is modified to take into account a new piece of information

I in probability theory, both focussing and revision are expressed by Bayes' rule, but they are conceptually different operations which produce different results on BFs
it is more committal than Dempster's rule!

Bel⊕(A|B) ≤ Bel ∩ (A|B) ≤ Pl ∩ (A|B) ≤ Pl⊕(A|B)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 76 / 125 Reasoning with belief functions Conditioning Disjunctive rule of conditioning

induced by the disjunctive rule of combination: m∪(A|B) = m ∪ mB
obviously dual to conjunctive conditioning
assigns mass only to subsets containing the conditioning event B
belief and plausibility values:

Bel∪(A|B) = Bel(A) if A ⊇ B, 0 if A ⊉ B
Pl∪(A|B) = Pl(A) if A ∩ B = ∅, 1 if A ∩ B ≠ ∅

it is less committal not only than Dempster’s rule, but also than credal conditioning

Bel ∪ (A|B) ≤ BelCr (A|B) ≤ PlCr (A|B) ≤ Pl ∪ (A|B)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 77 / 125 Reasoning with belief functions Conditioning Conditioning - an overview

                 belief                                      plausibility
Dempster's ⊕     (Pl(B) − Pl(B \ A)) / Pl(B)                 Pl(A ∩ B) / Pl(B)
Credal Cr        Bel(A ∩ B) / (Bel(A ∩ B) + Pl(Ā ∩ B))       Pl(A ∩ B) / (Pl(A ∩ B) + Bel(Ā ∩ B))
Geometric G      Bel(A ∩ B) / Bel(B)                         (Bel(B) − Bel(B \ A)) / Bel(B)
Conjunctive ∩    Bel(A ∪ B̄), A ∩ B ≠ ∅                       Pl(A ∩ B), A ⊉ B
Disjunctive ∪    Bel(A), A ⊇ B                               Pl(A), A ∩ B = ∅

Nested conditioning operators
Conditioning operators form a nested family, from the most committal to the least committal!

Bel∪(·|·) ≤ BelCr(·|·) ≤ Bel⊕(·|·) ≤ Bel∩(·|·) ≤ Pl∩(·|·) ≤ Pl⊕(·|·) ≤ PlCr(·|·) ≤ Pl∪(·|·)
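To make the table concrete, here is a small Python sketch (a toy frame, hand-picked mass values and my own helper names, not anything from the slides) that implements the closed forms above and checks the nested chain numerically for every event A:

```python
from itertools import chain, combinations

# toy frame and mass function (values invented for illustration)
THETA = frozenset({'a', 'b', 'c'})
m = {frozenset({'c'}): 0.2, frozenset({'a'}): 0.3,
     frozenset({'a', 'b'}): 0.2, THETA: 0.3}

def bel(A): return sum(v for F, v in m.items() if F <= A)
def pl(A):  return sum(v for F, v in m.items() if F & A)

def subsets(S):
    s = list(S)
    return map(frozenset, chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

B = frozenset({'a', 'b'})          # the conditioning event
Bc = THETA - B

def dempster(A):                   # Bel_plus(A|B), Pl_plus(A|B)
    return (pl(B) - pl(B - A)) / pl(B), pl(A & B) / pl(B)

def credal(A):
    Ac = THETA - A
    db, dp = bel(A & B) + pl(Ac & B), pl(A & B) + bel(Ac & B)
    return (bel(A & B) / db if db else 0.0), (pl(A & B) / dp if dp else 0.0)

def geometric(A):                  # defined when Bel(B) > 0
    return bel(A & B) / bel(B), (bel(B) - bel(B - A)) / bel(B)

def conjunctive(A):
    return (bel(A | Bc) if A & B else 0.0), (1.0 if B <= A else pl(A & B))

def disjunctive(A):
    return (bel(A) if B <= A else 0.0), (pl(A) if not (A & B) else 1.0)

# check the nested family: Bel_disj <= Bel_credal <= Bel_Dempster <= Bel_conj
#                          <= Pl_conj <= Pl_Dempster <= Pl_credal <= Pl_disj
for A in subsets(THETA):
    bu, pu = disjunctive(A); bc, pc = credal(A)
    bd, pd = dempster(A);    bj, pj = conjunctive(A)
    vals = [bu, bc, bd, bj, pj, pd, pc, pu]
    assert all(x <= y + 1e-12 for x, y in zip(vals, vals[1:])), (A, vals)

print("nested chain verified for every A, conditioning on B =", set(B))
print("geometric conditioning for A = {'a'}:", geometric(frozenset({'a'})))
```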

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 78 / 125 Reasoning with belief functions Belief vs Bayesian reasoning Belief vs Bayesian reasoning A toy example

suppose we want to estimate the class of an object appearing in an image, based on feature measurements extracted from the image (e.g. by convolutional neural networks)
we capture a training set of images, complete with annotated object labels
assuming a PDF of a certain family (e.g. mixture of Gaussians), we can learn from the training data a likelihood function p(x|y), where y is the object class and x the image feature vector

suppose n different 'sensors' extract n features xi from each image: x1, ..., xn
let us compare how data fusion works under the Bayesian and the belief function paradigms!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 79 / 125 Reasoning with belief functions Belief vs Bayesian reasoning (Naive) Bayesian data fusion Belief vs Bayesian reasoning

the likelihoods of the individual features are computed using the n likelihood functions learned during training: p(xi|y), for all i = 1, ..., n
measurements are typically assumed to be conditionally independent, yielding the product likelihood p(x|y) = ∏_i p(xi|y)
Bayesian inference is applied, typically assuming uniform priors (for there is no reason to think otherwise), yielding p(y|x) ∝ p(x|y) = ∏_i p(xi|y)

[Figure: naive Bayesian fusion pipeline. Each feature xi is fed to its likelihood function p(xi|y); under conditional independence the likelihoods are multiplied, and Bayes' rule with a uniform prior yields p(y|x) ∝ ∏_i p(xi|y)]
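A minimal sketch of the naive Bayesian pipeline, assuming the per-feature likelihoods have already been evaluated at the measured features (class names and numbers are invented for illustration):

```python
import numpy as np

# hypothetical per-feature likelihoods p(x_i|y), already evaluated at the measured
# features x_1, ..., x_n (class names and values invented for illustration)
likelihoods = {
    'car':        [0.7, 0.6, 0.8],     # p(x_1|car), p(x_2|car), p(x_3|car)
    'pedestrian': [0.2, 0.5, 0.1],
}

def naive_bayes_fusion(likelihoods):
    """Product of per-feature likelihoods, uniform prior, normalisation."""
    classes = list(likelihoods)
    joint = np.array([np.prod(likelihoods[y]) for y in classes])   # p(x|y) = prod_i p(x_i|y)
    posterior = joint / joint.sum()                                 # uniform prior cancels out
    return dict(zip(classes, posterior))

print(naive_bayes_fusion(likelihoods))
# {'car': ~0.97, 'pedestrian': ~0.03}
```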

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 80 / 125 Reasoning with belief functions Belief vs Bayesian reasoning Dempster-Shafer data fusion Belief vs Bayesian reasoning

with belief functions, for each feature type i a BF is learned from the individual likelihood p(xi|y), e.g. via the likelihood-based approach by Shafer

this yields n belief functions Bel(y|xi ), on the range of possible object classes Y

a combination rule is applied to compute an overall BF (e.g. ∩ , ⊕, ∪ ), obtaining

Bel(Y|x) = Bel(Y|x1) ⊗ ... ⊗ Bel(Y|xn), Y ⊆ Y (where ⊗ denotes the chosen combination rule)

[Figure: Dempster-Shafer fusion pipeline. Each feature xi is fed to its likelihood function p(xi|y); likelihood-based inference yields a belief function Bel(Y|xi) on the class frame, and belief function combination produces the overall Bel(Y|x)]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 81 / 125 Reasoning with belief functions Belief vs Bayesian reasoning Inference under partially reliable data Belief vs Bayesian reasoning

in the fusion example we have assumed that the data are measured correctly: what if the data-generating process is not completely reliable?
problem: suppose we want to just detect an object (binary decision: yes Y or no N)

two sensors produce image features x1 and x2, but we learned from the training data that both are reliable only 20% of the time

at test time we get an image, measure x1 and x2, and unluckily sensor 2 got it wrong! the object is actually there
we get the following normalised likelihoods:

p(x1|Y ) = 0.9, p(x1|N) = 0.1; p(x2|Y ) = 0.1, p(x2|N) = 0.9

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 82 / 125 Reasoning with belief functions Belief vs Bayesian reasoning Inference under partially reliable data Belief vs Bayesian reasoning

how do the two fusion pipelines cope with this?
the Bayesian scholar assumes the two sensors/processes are conditionally independent, and multiplies the likelihoods, obtaining

p(x1, x2|Y ) = 0.9 ∗ 0.1 = 0.09, p(x1, x2|N) = 0.1 ∗ 0.9 = 0.09

so that p(Y|x1, x2) = 1/2, p(N|x1, x2) = 1/2
Shafer's faithful follower discounts the likelihoods by assigning mass 0.2 to the whole hypothesis space Θ = {Y, N}:

m(Y |x1) = 0.9 ∗ 0.8 = 0.72, m(N|x1) = 0.1 ∗ 0.8 = 0.08, m(Θ|x1) = 0.2; m(Y |x2) = 0.1 ∗ 0.8 = 0.08, m(N|x2) = 0.9 ∗ 0.8 = 0.72 m(Θ|x2) = 0.2

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 83 / 125 Reasoning with belief functions Belief vs Bayesian reasoning Inference under partially reliable data Belief vs Bayesian reasoning thus, when we combine them by Dempster’s rule we get the BF Bel on {Y , N}:

m(Y |x1, x2) = 0.458, m(N|x1, x2) = 0.458, m(Θ|x1, x2) = 0.084

when combined using the disjunctive rule (the least committal one) we get Bel0:

m′(Y|x1, x2) = 0.09, m′(N|x1, x2) = 0.09, m′(Θ|x1, x2) = 0.82

the corresponding (credal) sets of probabilities are

[Figure: credal intervals for P(Y|x1, x2) on [0, 1]: Bel yields [0.46, 0.54], Bel′ yields [0.09, 0.91], while Bayesian fusion returns the single value 0.5]
the credal interval for Bel is quite narrow: reliability is assumed to be 80%, and yet one measurement out of two was faulty (50%)
the disjunctive rule is much more cautious about the correct inference
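The toy example can be reproduced in a few lines of Python (a sketch; the two combination routines are generic, and the slide's Bel′ numbers are recovered when the disjunctive rule is applied to the raw, undiscounted likelihood masses):

```python
from itertools import product

Y, N = frozenset({'y'}), frozenset({'n'})
THETA = Y | N

def discount(p_y, p_n, reliability):
    """Likelihood-based masses discounted by the sensor reliability."""
    return {Y: p_y * reliability, N: p_n * reliability, THETA: 1.0 - reliability}

def dempster(m1, m2):
    out, conflict = {}, 0.0
    for (A, v1), (B, v2) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            out[C] = out.get(C, 0.0) + v1 * v2
        else:
            conflict += v1 * v2
    return {C: v / (1.0 - conflict) for C, v in out.items()}

def disjunctive(m1, m2):
    out = {}
    for (A, v1), (B, v2) in product(m1.items(), m2.items()):
        C = A | B
        out[C] = out.get(C, 0.0) + v1 * v2
    return out

m1 = discount(0.9, 0.1, 0.8)                 # {Y: 0.72, N: 0.08, Theta: 0.2}
m2 = discount(0.1, 0.9, 0.8)                 # {Y: 0.08, N: 0.72, Theta: 0.2}
m_demp = dempster(m1, m2)                    # {Y: 0.458, N: 0.458, Theta: 0.084}

# Bel' is recovered by combining the raw (undiscounted) likelihood masses disjunctively
m_disj = disjunctive({Y: 0.9, N: 0.1}, {Y: 0.1, N: 0.9})   # {Y: 0.09, N: 0.09, Theta: 0.82}

for name, mm in [('Bel', m_demp), ("Bel'", m_disj)]:
    lower = mm.get(Y, 0.0)
    upper = lower + mm.get(THETA, 0.0)
    print(f"{name}: credal interval for P(Y|x1,x2) = [{lower:.2f}, {upper:.2f}]")
# Bel : [0.46, 0.54]    Bel': [0.09, 0.91]    (Bayesian fusion: the single value 0.5)
```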

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 84 / 125 Reasoning with belief functions Generalised Bayes Theorem Generalised Bayes Theorem Generalising full Bayesian inference

in Smets' generalised Bayesian theorem setting, the input is a set of 'conditional' belief functions on X, rather than likelihoods p(x|θ) there

BelX(X|θ), X ⊂ X, θ ∈ Θ each associated with a value θ of the parameter

I (these are not the same conditional belief functions we saw, where a conditioning event B ⊂ Θ alters a prior belief function BelΘ, mapping it to BelΘ(·|B))
they can be seen as a parameterised family of BFs on the data
the desired output is another family of belief functions on Θ, parameterised by all sets of measurements X on X:

BelΘ(A|X), ∀X ⊂ X

each piece of evidence mX(X|θ) has an effect on our beliefs on the parameters
coherent with the random set setting, as we condition on set-valued observations

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 85 / 125 Reasoning with belief functions Generalised Bayes Theorem Generalised Bayes Theorem

Generalised Bayes Theorem

Implements this inference BelX(X|θ) ↦ BelΘ(A|X) by:
1 computing an intermediate family of BFs on X, parameterised by sets A of parameter values:

BelX(X|A) = ∪_{θ∈A} BelX(X|θ) = ∏_{θ∈A} BelX(X|θ)

via the disjunctive rule of combination ∪

2 assuming that PlΘ(A|X) = PlX(X|A) ∀A ⊂ Θ, X ⊂ X

3 this yields BelΘ(A|X) = ∏_{θ∈Ā} BelX(X̄|θ)

generalises Bayes’ rule (by replacing P with Pl) when priors are uniform
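A small sketch of this recipe in Python, using the formula in step 3 directly (the observation space, parameter values and conditional mass functions are all invented for illustration):

```python
from itertools import chain, combinations

# toy observation space and parameter values (invented for illustration)
X_SPACE = frozenset({'x1', 'x2'})
THETA = ('t1', 't2', 't3')

# conditional mass functions m_X(.|theta) on the observation space, one per theta
m_cond = {
    't1': {frozenset({'x1'}): 0.7, X_SPACE: 0.3},
    't2': {frozenset({'x2'}): 0.6, X_SPACE: 0.4},
    't3': {frozenset({'x1'}): 0.2, frozenset({'x2'}): 0.5, X_SPACE: 0.3},
}

def bel(m, A):
    return sum(v for F, v in m.items() if F and F <= A)

def subsets(S):
    s = list(S)
    return map(frozenset, chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

def gbt(X_obs):
    """Bel_Theta(A|X) = prod over theta not in A of Bel_X(complement of X | theta)."""
    X_c = X_SPACE - X_obs
    out = {}
    for A in subsets(THETA):
        prod = 1.0
        for theta in THETA:
            if theta not in A:
                prod *= bel(m_cond[theta], X_c)
        out[A] = prod
    return out

posterior = gbt(frozenset({'x1'}))          # we observed X = {x1}
for A, v in sorted(posterior.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(A), round(v, 3))
```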

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 86 / 125 Reasoning with belief functions The total belief theorem The total belief theorem Generalising the law of total probability
conditional belief functions are crucial for our approach to inference
complementary link of the chain: generalisation of the law of total probability
recall that a refining is a mapping from elements of one set Ω to elements of a disjoint partition of a second set Θ

[Figure: the total belief theorem setting: a prior belief function Bel0 : 2^Ω → [0,1] on Ω, and conditional belief functions Beli : 2^Πi → [0,1] on the elements Πi of the partition of Θ induced by the refining]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 87 / 125 Reasoning with belief functions The total belief theorem The total belief theorem Statement

Total belief theorem
Suppose Θ and Ω are two finite sets, and ρ : 2^Ω → 2^Θ the unique refining between them. Let Bel0 be a belief function defined over Ω = {ω1, ..., ω|Ω|}. Suppose there exists a collection of belief functions Beli : 2^Πi → [0, 1], where Π = {Π1, ..., Π|Ω|}, Πi = ρ({ωi}), is the partition of Θ induced by Ω. Then, there exists a belief function Bel : 2^Θ → [0, 1] such that:

1 Bel0 is the marginal of Bel to Ω (Bel0(A) = Bel(ρ(A)));

2 Bel ⊕ BelΠi = Beli ∀i = 1, ..., |Ω|, where BelΠi is the logical belief function:

mΠi(A) = 1 if A = Πi, 0 otherwise

several distinct solutions exist, and they likely form a graph with symmetries
one such solution is easily identifiable

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 88 / 125 Reasoning with belief functions The total belief theorem The total belief theorem Existence of a solution [Zhou & Cuzzolin, UAI 2017]

assume Θ′ ⊇ Θ, and let m be a mass function over Θ
m can be identified with a mass function m→Θ′ over the larger frame Θ′: for any E′ ⊆ Θ′, m→Θ′(E′) = m(E) if E′ = E ∪ (Θ′ \ Θ), and m→Θ′(E′) = 0 otherwise
such an m→Θ′ is called the conditional embedding of m into Θ′
let Bel→i be the conditional embedding of Beli into Θ, for all Beli : 2^Πi → [0, 1], and let Bel→ = Bel→1 ⊕ · · · ⊕ Bel→|Ω|

Total belief theorem: existence

The belief function Bel := Bel0↑Θ ⊕ Bel→, the Dempster combination of the vacuous extension Bel0↑Θ of the prior with the combined conditional embeddings Bel→, is a valid total belief function.
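A sketch of the construction in Python (frames, refining and mass values are invented; Dempster's rule is implemented naively); the final assert checks condition 1 of the theorem on this toy example:

```python
from itertools import product

# refining: Omega = {w1, w2}, partition of Theta into Pi_1 = {a,b}, Pi_2 = {c,d}
PI = {'w1': frozenset({'a', 'b'}), 'w2': frozenset({'c', 'd'})}
THETA = frozenset.union(*PI.values())

# prior Bel_0 on Omega and conditional BFs Bel_i on each partition element (invented values)
m0 = {frozenset({'w1'}): 0.7, frozenset({'w1', 'w2'}): 0.3}
m_cond = {'w1': {frozenset({'a'}): 0.6, PI['w1']: 0.4},
          'w2': {frozenset({'c'}): 0.5, PI['w2']: 0.5}}

def embed(m, frame):
    """Conditional embedding: each focal set E becomes E united with (frame minus original frame)."""
    orig = frozenset.union(*m)
    return {E | (frame - orig): v for E, v in m.items()}

def vacuous_extension(m0, rho):
    """Each focal set A of Omega is mapped to the union of its images under the refining."""
    return {frozenset.union(*(rho[w] for w in A)): v for A, v in m0.items()}

def dempster(m1, m2):
    out, conflict = {}, 0.0
    for (A, v1), (B, v2) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            out[C] = out.get(C, 0.0) + v1 * v2
        else:
            conflict += v1 * v2
    return {C: v / (1.0 - conflict) for C, v in out.items()}

# combine the conditional embeddings, then combine with the vacuously extended prior
m_arrow = None
for w in PI:
    emb = embed(m_cond[w], THETA)
    m_arrow = emb if m_arrow is None else dempster(m_arrow, emb)
total = dempster(vacuous_extension(m0, PI), m_arrow)

bel_total = lambda A: sum(v for F, v in total.items() if F <= A)
assert abs(bel_total(PI['w1']) - m0[frozenset({'w1'})]) < 1e-9   # condition 1: marginal = Bel_0
for F, v in sorted(total.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(F), round(v, 3))
```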

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 89 / 125 Reasoning with belief functions Decision making Decision making with belief functions

a decision problem can be formalised by defining:

I a set Ω of possible states of the world
I a set X of consequences
I and a set F of acts, where an act is a function f : Ω → X mapping a world state to a consequence
problem: to select an act f from an available list F (i.e., to make a decision), which optimises a certain objective function
various approaches to decision making with belief functions; among those:

I decision making in the TBM is based on expected utility via the pignistic transform
I generalised expected utility [Gilboa], based on classical expected utility theory [Savage, von Neumann]
also a lot of interest in multicriteria decision making (based on a number of attributes)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 90 / 125 Reasoning with belief functions Decision making Decision making with the pignistic probability

classical expected utility theory is due to von Neumann
in Smets' Transferable Belief Model, decision making is done by maximising the expected utility of actions based on the 'pignistic' transform
this maps a belief function Bel on Ω to a probability distribution there:

BetP[Bel](ω) = Σ_{A∋ω} m(A)/|A|   ∀ω ∈ Ω

the set of possible actions F and the set Ω of possible outcomes are distinct, and the utility function u is defined on F × Ω
the optimal decision maximises E[u] = Σ_{ω∈Ω} u(f, ω) BetP[Bel](ω)
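A minimal sketch of pignistic decision making (frame, mass values, acts and utilities are all invented for illustration):

```python
# pignistic transform and expected-utility decision (all numbers invented for illustration)
OMEGA = frozenset({'rain', 'sun'})
m = {frozenset({'rain'}): 0.5, OMEGA: 0.5}

def pignistic(m):
    bet = {w: 0.0 for w in OMEGA}
    for A, v in m.items():
        for w in A:
            bet[w] += v / len(A)           # BetP(w) = sum over A containing w of m(A)/|A|
    return bet

# utility of each act in each world state
utility = {('umbrella', 'rain'): 1.0, ('umbrella', 'sun'): 0.4,
           ('no umbrella', 'rain'): 0.0, ('no umbrella', 'sun'): 1.0}

bet = pignistic(m)                          # {'rain': 0.75, 'sun': 0.25}
for act in ('umbrella', 'no umbrella'):
    eu = sum(utility[(act, w)] * bet[w] for w in OMEGA)
    print(act, round(eu, 2))
# umbrella: 0.85, no umbrella: 0.25  ->  choose 'umbrella'
```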

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 91 / 125 Reasoning with belief functions Decision making Savage’s sure thing principle

let ≽ be a preference relation on F, such that f ≽ g means that f is at least as desirable as g
Savage (1954) showed that ≽ verifies some rationality requirements iff there exists a probability measure P on Ω and a utility function u : X → R s.t.

∀f, g ∈ F,  f ≽ g ⇔ EP(u ◦ f) ≥ EP(u ◦ g)

does that mean that using belief functions is irrational?
given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by

(fEh)(ω) = f(ω) if ω ∈ E, h(ω) if ω ∉ E

then the Sure Thing Principle states that ∀E, ∀f, g, h, h′:

fEh ≽ gEh ⇒ fEh′ ≽ gEh′

Ellsberg's paradox: empirically the Sure Thing Principle is violated!
this is because people are averse to second-order uncertainty

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 92 / 125 Reasoning with belief functions Decision making Ellsberg’s paradox

suppose you have an urn containing 30 red balls and 60 balls, either black or yellow

I f1: you receive 100 euros if you draw a red ball
I f2: you receive 100 euros if you draw a black ball
I f3: you receive 100 euros if you draw a red or yellow ball
I f4: you receive 100 euros if you draw a black or yellow ball

in this example Ω = {R, B, Y }, fi :Ω → R and X = R

empirically most people strictly prefer f1 to f2, but they strictly prefer f4 to f3

        R     B     Y
  f1   100    0     0
  f2    0    100    0
  f3   100    0    100
  f4    0    100   100

now, pick E = {R, B}: by definition f1{R, B}0 = f1, f2{R, B}0 = f2, f1{R, B}100 = f3, f2{R, B}100 = f4

since f1 ≻ f2, i.e. f1{R, B}0 ≻ f2{R, B}0, the Sure Thing Principle would imply f1{R, B}100 ≻ f2{R, B}100, i.e., f3 ≻ f4
empirically the Sure Thing Principle is violated!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 93 / 125 Reasoning with belief functions Decision making Lower and upper expected utilities

Gilboa (1987) proposed a modification of Savage's axioms
a preference relation ≽ meets these weaker requirements iff there exists a (not necessarily additive) measure µ and a utility function u : X → R such that

∀f, g ∈ F,  f ≽ g ⇔ Cµ(u ◦ f) ≥ Cµ(u ◦ g),

where Cµ is the Choquet integral, defined for X : Ω → R as

Cµ(X) = ∫_0^{+∞} µ(X(ω) ≥ t) dt + ∫_{−∞}^0 [µ(X(ω) ≥ t) − 1] dt.

given a belief function Bel on Ω and a utility function u, this theorem supports making decisions based on the Choquet integral of u with respect to Bel
for finite Ω, it can be shown that

CBel(u ◦ f) = Σ_{B⊆Ω} m(B) min_{ω∈B} u(f(ω)),   CPl(u ◦ f) = Σ_{B⊆Ω} m(B) max_{ω∈B} u(f(ω))

(lower and upper expectations of u ◦ f with respect to Bel)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 94 / 125 Reasoning with belief functions Decision making Decision making Possible strategies let P(Bel) as usual be the set of probability measures P compatible with Bel, i.e., such that Bel ≤ P. Then, it can be shown that

CBel(u ◦ f) = min_{P∈P(Bel)} EP(u ◦ f) =: E_(u ◦ f),   CPl(u ◦ f) = max_{P∈P(Bel)} EP(u ◦ f) =: Ē(u ◦ f)

two expected utilities E_(f) and Ē(f): how do we make a decision?
possible decision criteria based on interval dominance:
1 f ≽ g iff E_(u ◦ f) ≥ Ē(u ◦ g) (conservative strategy)
2 f ≽ g iff E_(u ◦ f) ≥ E_(u ◦ g) (pessimistic strategy)
3 f ≽ g iff Ē(u ◦ f) ≥ Ē(u ◦ g) (optimistic strategy)
4 f ≽ g iff αE_(u ◦ f) + (1 − α)Ē(u ◦ f) ≥ αE_(u ◦ g) + (1 − α)Ē(u ◦ g) for some α ∈ [0, 1] called a pessimism index (Hurwicz criterion)
the conservative strategy yields only a partial preorder: f and g are not comparable if E_(u ◦ f) < Ē(u ◦ g) and E_(u ◦ g) < Ē(u ◦ f)
Ellsberg's paradox is actually explained by the pessimistic strategy
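As a sanity check, here is a short Python sketch computing the lower and upper expected utilities of the four Ellsberg acts, with m({R}) = 1/3, m({B,Y}) = 2/3 and utility equal to the monetary payoff; the pessimistic (maximin) ordering reproduces the empirically observed preferences:

```python
# Ellsberg urn as a belief function: 30 red balls, 60 black-or-yellow balls
m = {frozenset({'R'}): 1/3, frozenset({'B', 'Y'}): 2/3}

# payoff (utility) of each act in each world state
acts = {
    'f1': {'R': 100, 'B': 0,   'Y': 0},
    'f2': {'R': 0,   'B': 100, 'Y': 0},
    'f3': {'R': 100, 'B': 0,   'Y': 100},
    'f4': {'R': 0,   'B': 100, 'Y': 100},
}

def lower_expectation(f):    # C_Bel(u o f) = sum_B m(B) min_{w in B} u(f(w))
    return sum(v * min(f[w] for w in B) for B, v in m.items())

def upper_expectation(f):    # C_Pl(u o f)  = sum_B m(B) max_{w in B} u(f(w))
    return sum(v * max(f[w] for w in B) for B, v in m.items())

for name, f in acts.items():
    print(name, [round(lower_expectation(f), 1), round(upper_expectation(f), 1)])
# f1 [33.3, 33.3]   f2 [0.0, 66.7]   f3 [33.3, 100.0]   f4 [66.7, 66.7]
# pessimistic (maximin) strategy: f1 > f2 (33.3 > 0) and f4 > f3 (66.7 > 33.3),
# exactly the preference pattern observed empirically in Ellsberg's paradox
```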

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 95 / 125 Theories of uncertainty Outline


Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 96 / 125 Theories of uncertainty Theories of uncertainty

several different mathematical theories of uncertainty compete to be adopted by practitioners
the consensus is that there is no such thing as the best mathematical description of uncertainty
random sets are not the most general framework; however, we argue here, they naturally arise from set-valued observations
scholars have extensively discussed and compared the various approaches to uncertainty theory [Klir, Destercke]
theoretical and empirical comparisons between belief functions and other theories have been conducted [Lee, Yager, Helton, Regan, ...]
some attempts have been made to unify most approaches to uncertainty theory [Klir, Zadeh, Walley]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 97 / 125 Theories of uncertainty A hierarchy of uncertainty theories

[Figure: left, relation between belief functions and other uncertainty measures: lower/upper previsions, credal sets / lower/upper probabilities, monotone and 2-monotone capacities, belief functions (∞-monotone capacities) / random sets, feasible probability intervals; right, Destercke's partial hierarchies of uncertainty theories: random sets, probability intervals, generalised p-boxes, p-boxes, normalised sum functions, possibilities, probabilities]
Left: relation between BFs and other uncertainty measures; Right: Destercke's partial hierarchies of uncertainty theories

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 98 / 125 Theories of uncertainty Imprecise probability Coherent lower probabilities
Walley's Imprecise Probability: a behavioural approach to probability
a lower probability P_ is a function from a sigma-algebra to the unit interval [0, 1] such that: P_(A ∪ B) ≥ P_(A) + P_(B) ∀ A ∩ B = ∅ (super-additivity)
a lower probability P_ avoids sure loss if

P(P_) := {P : P(A) ≥ P_(A), ∀A ⊆ Ω} ≠ ∅

(the lower bound constraints P_(A) can be satisfied by some probability measure)
it is coherent if inf_{P∈P(P_)} P(A) = P_(A) (P_ is the lower envelope, on events, of P(P_))
not all convex sets of probabilities can be described by merely focusing on events [Walley] → notion of gamble
following de Finetti, imprecise probability equates 'belief' with 'inclination to act': an agent believes in an outcome to the extent it is willing to accept a bet on it

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 99 / 125 Theories of uncertainty Imprecise probability Desirable gambles Gamble

A gamble is a bounded real-valued function on Θ: X : Θ → R, θ ↦ X(θ).
a lower probability can be seen as a functional defined on the class of all indicator functions of sets (the traditional events)
we denote an agent's set of desirable gambles by D ⊆ L(Ω), where L(Ω) is the set of all bounded real-valued functions on Ω
since whether a gamble is desirable depends on the agent's belief on the outcome, D can be used as a model of the agent's uncertainty about the problem
[Figure: the set D of desirable gambles depicted as a convex cone in the space of gambles]
Coherence of desirable gambles

A set D of desirable gambles is coherent iff it is a convex cone (it is closed under addition and multiplication by positive scalars).

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 100 / 125 Theories of uncertainty Imprecise probability Lower and upper previsions

suppose the agent buys a gamble X for a price µ: this yields a new gamble X − µ
lower prevision P_(X) of a gamble X: P_(X) := sup{µ : X − µ ∈ D}, the supremum acceptable price for buying X
selling a gamble X for a price µ also yields a new gamble µ − X
upper prevision P̄(X) of a gamble X: P̄(X) := inf{µ : µ − X ∈ D}, the infimum acceptable price for selling X
when lower and upper prevision coincide, P(X) = P_(X) = P̄(X) is called the precise prevision of X (what de Finetti called 'fair price')

for prices in [P_(X), P̄(X)] we are undecided as to whether to buy or sell gamble X

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 101 / 125 Theories of uncertainty Imprecise probability Rules of rational behaviour
Rational behaviour
the agent does not specify betting rates such that they lose utility whatever the outcome (avoiding sure loss)
the agent is fully aware of the consequences of its betting rates (coherence)

if the first condition is not met, there exists a positive combination of gambles, each of which the agent finds individually desirable, which is not desirable to them
one consequence of avoiding sure loss is that P_(A) ≤ P̄(A)
a consequence of coherence is that lower previsions are superadditive
a precise prevision P is coherent iff: (i) P(λX + µY) = λP(X) + µP(Y); (ii) if X > 0 then P(X) ≥ 0; (iii) P(Ω) = 1; such a P coincides with de Finetti's notion of coherent prevision

A powerful theory Generalises probability measures, de Finetti previsions, 2-monotone capacities, Choquet capacities, possibility/necessity measures, belief/plausibility measures, random sets but also probability boxes, credal sets, and robust Bayesian models.

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 102 / 125 Theories of uncertainty Monotone capacities Monotone capacities Choquet [1953], Sugeno [1974] the theory of capacity is a generalisation of classical measure theory

Monotone capacity

Given a domain Θ and a non-empty family F of subsets of Θ, a monotone capacity or fuzzy measure is a function µ : F → [0, 1] such that:
µ(∅) = 0
if A ⊆ B then µ(A) ≤ µ(B), for every A, B ∈ F ('monotonicity')

for any nonnegative measurable function f on (Θ, F), the Choquet integral of f on any A ∈ F is defined as:

Cµ(f) := ∫_0^∞ µ(Fα ∩ A) dα,

where Fα = {x ∈ Θ | f(x) ≥ α}, α ∈ [0, ∞)
both the Choquet integral of monotone capacities and the natural extension of lower probabilities are generalisations of the Lebesgue integral

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 103 / 125 Theories of uncertainty Monotone capacities Order of a capacity Special types of capacities Order of a capacity

A capacity µ is said to be of order k if

µ(∪_{j=1}^k Aj) ≥ Σ_{∅≠K⊆{1,...,k}} (−1)^{|K|+1} µ(∩_{j∈K} Aj)

for all collections of k subsets A1, ..., Ak of Θ

if k′ > k, the resulting theory is less general than the theory of capacities of order k

Capacities and belief functions Belief functions are infinitely monotone capacities.

just compare the definition of order with the third axiom of belief functions

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 104 / 125 Theories of uncertainty Probability intervals Probability intervals
Set of probability intervals
A system of constraints on a probability distribution p : Θ → [0, 1] of the form:

P(l, u) := {p : l(x) ≤ p(x) ≤ u(x), ∀x ∈ Θ}

probability intervals typically arise through measurement errors, or from measurements inherently of interval nature
a set of probability intervals also determines a credal set, a sub-class of all credal sets generated by lower and upper probabilities
each belief function induces a set of probability intervals

Belief functions and probability intervals The minimal probability interval containing a pair of belief/plausibility functions is that whose lower bound is the belief of singletons, the upper bound is their plausibility:

l∗(x) = Bel(x), u∗(x) = Pl(x) ∀x ∈ Θ

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 105 / 125 Theories of uncertainty Fuzzy and possibility theory Fuzzy sets and possibility theory Zadeh,Dubois and Prade

the concept of fuzzy set was introduced by Lotfi A. Zadeh [1965]: elements belong to a set with a certain 'degree of membership'
the theory was further developed by Didier Dubois and Henri Prade into a mathematical theory of partial belief, called possibility theory
a possibility measure on Θ is a function Π : 2^Θ → [0, 1] such that Π(∅) = 0, Π(Θ) = 1 and Π(∪_i Ai) = sup_i Π(Ai) for every family of subsets {Ai ∈ 2^Θ}
each possibility measure is uniquely characterised by a membership function π : Θ → [0, 1], π(x) := Π({x}), via the formula Π(A) = sup_{x∈A} π(x)
the dual quantity N(A) = 1 − Π(A^c) is called a necessity measure

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 106 / 125 Theories of uncertainty Fuzzy and possibility theory Possibility and belief measures

call plausibility assignment pl the restriction of the plausibility function to singletons, pl(x) = Pl({x})
then [Shafer]: Bel is a necessity measure iff Bel is consonant (its focal elements are nested)
in that case the membership function coincides with the plausibility assignment


a finite fuzzy set is equivalent to a consonant belief function
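A small sketch of this equivalence (the nested focal elements and masses are invented):

```python
# a consonant mass function: nested focal elements {a} ⊂ {a,b} ⊂ {a,b,c} (invented values)
THETA = frozenset({'a', 'b', 'c'})
m = {frozenset({'a'}): 0.5, frozenset({'a', 'b'}): 0.3, THETA: 0.2}

def bel(A): return sum(v for F, v in m.items() if F <= A)
def pl(A):  return sum(v for F, v in m.items() if F & A)

# membership / possibility distribution = plausibility of the singletons
pi = {x: pl(frozenset({x})) for x in THETA}        # {'a': 1.0, 'b': 0.5, 'c': 0.2}

# for a consonant BF, Pl is a possibility measure (Pl(A) = max of pi over A)
# and Bel is the dual necessity measure (Bel(A) = 1 - Pl(complement of A))
for A in [frozenset({'a', 'c'}), frozenset({'b', 'c'}), frozenset({'b'})]:
    assert abs(pl(A) - max(pi[x] for x in A)) < 1e-12
    assert abs(bel(A) - (1 - pl(THETA - A))) < 1e-12
print('possibility (membership) distribution:', pi)
```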

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 107 / 125 Theories of uncertainty Fuzzy and possibility theory Belief functions on fuzzy sets

belief functions defined on fuzzy sets have also been proposed
basic idea: belief measures are generalised to fuzzy sets as follows: Bel(X) = Σ_{A∈M} I(A ⊆ X) m(A), where X is a fuzzy set defined on Θ, m is a mass function defined on the collection M of fuzzy sets on Θ, and I(A ⊆ X) is a measure of how much fuzzy set A is included in fuzzy set X
various measures of inclusion in [0, 1] can be proposed:

I Lukasiewicz: I(x, y) = min{1, 1 − x + y} [Ishizuka]
I Kleene-Dienes: I(x, y) = max{1 − x, y} [Yager]
from which one can get: I(A ⊆ B) = ∧_{x∈Θ} I(A(x), B(x)) [Wu 2009]

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 108 / 125 Theories of uncertainty Probability boxes Probability boxes and random sets

a probability box or p-box [Ferson and Hajagos] ⟨F_, F̄⟩ is a class of cumulative distribution functions (CDFs): ⟨F_, F̄⟩ = {F CDF : F_ ≤ F ≤ F̄}

every pair Bel, Pl defined on the real line R (a random set) generates a unique p-box: F_(x) = Bel((−∞, x]), F̄(x) = Pl((−∞, x])

conversely, every p-box generates an entire equivalence class of random intervals, e.g. the one with focal elements

Γ(α) = [F̄⁻¹(α), F_⁻¹(α)]   ∀α ∈ [0, 1]

where F̄⁻¹(α) := inf{x : F̄(x) ≥ α} and F_⁻¹(α) := inf{x : F_(x) ≥ α} are the 'quasi-inverses' of the upper and lower CDFs
[Figure: a p-box, the region between the lower and upper CDFs]
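A quick sketch of the first direction, computing the p-box induced by a finite random interval (focal intervals and masses invented for illustration):

```python
# a finite random interval: focal closed intervals [l, u] with masses (invented values)
focal = [((0.0, 2.0), 0.4), ((1.0, 3.0), 0.4), ((2.5, 4.0), 0.2)]

def lower_cdf(x):    # F_lower(x) = Bel((-inf, x]): mass of the intervals entirely below x
    return sum(mass for (l, u), mass in focal if u <= x)

def upper_cdf(x):    # F_upper(x) = Pl((-inf, x]): mass of the intervals intersecting (-inf, x]
    return sum(mass for (l, u), mass in focal if l <= x)

for x in [0.5, 1.5, 2.0, 3.0, 4.0]:
    print(f"x = {x}:  F_lower = {lower_cdf(x):.2f}  F_upper = {upper_cdf(x):.2f}")
# every CDF F with F_lower <= F <= F_upper is consistent with this random interval
```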

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 109 / 125 Theories of uncertainty Rough sets Rough sets

first described by Polish computer scientist Zdzislaw I. Pawlak [1991]
strongly linked to the idea of a partition of the universe of hypotheses
rough sets provide a formal approximation of a traditional set in terms of a pair of lower and upper approximating sets
let R ⊆ Θ × Θ be an equivalence relation which partitions Θ into a family of disjoint subsets Θ/R, called elementary sets
measurable sets σ(Θ/R): the unions of one or more elementary sets, plus the empty set ∅
we can then approximate any subset A of Θ using those measurable sets X:

apr_(A) = ∪{X ∈ σ(Θ/R) : X ⊆ A},   apr̄(A) = ∩{X ∈ σ(Θ/R) : X ⊇ A}

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 110 / 125 Theories of uncertainty Rough sets Rough sets and belief functions

[Figure: a universe partitioned into elementary sets; an event A, its lower approximation (union of the elementary sets contained in A) and its upper approximation (union of the elementary sets intersecting A)]

any probability P on F = σ(Θ/R) can be extended to 2^Θ using inner measures:

P∗(A) = sup{P(X) | X ∈ σ(Θ/R), X ⊆ A} = P(apr_(A))

these are belief functions! (as was recognised before)
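The link can be illustrated in a few lines of Python (universe, partition and probabilities invented): the inner measure P∗ induced by the partition coincides with the belief function whose mass assigns the probability of each elementary set to that set.

```python
from itertools import chain, combinations

THETA = frozenset({1, 2, 3, 4, 5, 6})
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]   # elementary sets
p_elem = {partition[0]: 0.5, partition[1]: 0.3, partition[2]: 0.2}      # P on sigma(Theta/R)

def lower_approx(A):
    """Union of the elementary sets contained in A."""
    inside = [X for X in partition if X <= A]
    return frozenset().union(*inside) if inside else frozenset()

def prob(B):               # P, defined on the measurable sets sigma(Theta/R)
    return sum(v for X, v in p_elem.items() if X <= B)

def inner_measure(A):      # P_*(A) = P(lower approximation of A)
    return prob(lower_approx(A))

def bel(A):                # belief function with mass p_elem on the elementary sets
    return sum(v for X, v in p_elem.items() if X <= A)

def subsets(S):
    s = list(S)
    return map(frozenset, chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

for A in subsets(THETA):
    assert abs(inner_measure(A) - bel(A)) < 1e-12
print("P_* coincides with the belief function whose focal elements are the elementary sets")
print("example: A = {1,2,3} -> lower approximation", set(lower_approx(frozenset({1, 2, 3}))),
      ", P_*(A) =", inner_measure(frozenset({1, 2, 3})))
```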

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 111 / 125 Belief functions on reals Outline


Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 112 / 125 Belief functions on reals Continuous formulations of the theory of belief functions

in the original formulation by Shafer [1976], belief functions are defined on finite sets only
the need for generalising this to arbitrary domains was soon recognised
main approaches to a continuous formulation:

I Shafer's allocations of probability [1982]
I 'continuous' belief functions on Borel intervals of the real line [Strat90, Smets]
I belief functions as random sets [Nguyen78, Molchanov06]
other approaches, with limited (so far) impact:
I generalised evidence theory
I MV algebras
I several others

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 113 / 125 Belief functions on reals Continuous belief functions Continuous belief functions [Strat, Smets] take as frame of discernment Θ the set of possible closed intervals

[Figure: closed intervals [a, b] ⊆ [0, 1] represented as points (left extremum, right extremum) in a triangle; Bel([a,b]) collects the mass of the intervals contained in [a,b], Pl([a,b]) the mass of the intervals intersecting [a,b]]

Bel([a, b]) = ∫_a^b ∫_x^b m(x, y) dy dx,   Pl([a, b]) = ∫_0^b ∫_{max(a,x)}^1 m(x, y) dy dx

Dempster's rule generalises in terms of double integrals
continuous pignistic PDF: Bet(a) = lim_{ε→0} ∫_0^a ∫_{a+ε}^1 m(x, y)/(y − x) dy dx

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 114 / 125 Belief functions on reals Continuous belief functions Special cases of random closed intervals Fuzzy sets and p-boxes

[Figure: left, the consonant random interval induced by a membership function π(x), with nested focal elements Γ(α); right, the random interval induced by a p-box, with focal elements Γ(α) = [U(α), V(α)] cut from the lower and upper CDFs]

a fuzzy set on the real line induces a mapping to a collection of nested intervals, parameterised by the membership level
a p-box, i.e., upper and lower bounds on a cumulative distribution function, also induces a family of intervals (as we already saw)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 115 / 125 Belief functions on reals Random sets Belief functions as random sets [Nguyen,1978], [Hestir,1991], [Shafer,1987]

given a multi-valued mapping Γ, a straightforward step is to consider the probability value P(ω) as attached to the subset Γ(ω) ⊆ Θ: this is a random set in Θ, i.e., a probability measure on a collection of subsets
the degree of belief Bel(A) of an event A becomes the cumulative distribution function (CDF) of the open interval of sets {B ⊆ A} in 2^Θ
the lower inverse and upper inverse of Γ are:

Γ∗(A) := {ω ∈ Ω : Γ(ω) ⊆ A, Γ(ω) ≠ ∅},   Γ^∗(A) := {ω ∈ Ω : Γ(ω) ∩ A ≠ ∅}

given two σ-fields A, B on Ω and Θ respectively, Γ is said to be strongly measurable iff ∀B ∈ B, Γ∗(B) ∈ A
the lower probability measure on B is defined as P∗(B) := P(Γ∗(B)) for all B ∈ B: this is nothing but a belief function!

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 116 / 125 Belief functions on reals Random sets Belief functions as random sets Molchanov’s work

recently, strong renewed interest in a theory of random sets, thanks to Molchanov [2006,2017] and others

I theory of calculus with capacities and random sets
I Radon-Nikodym theorems for capacities and random sets, and derivatives of capacities
I (conditional) expectations of random sets
I limit theorems: strong law of large numbers, central limit theorem, Gaussian RSs
I examined set-valued random processes
a powerful mathematical framework! the way forward for the theory, in my view

I no mention of conditioning and combination yet
I connections with mathematical statistics still to develop
a special case of random element [Frechet]: a random variable with structured output

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 117 / 125 Belief functions on reals Random sets Random closed sets

the family of all sets is too large: we typically restrict ourselves to the case of random elements in the space of closed subsets of a certain topological space E
the family of closed subsets of E is denoted by C; K denotes the family of all compact subsets of E
let (Ω, F, P) be a probability space
a map X : Ω → C is called a random closed set if, for every compact set K in E: {ω : X(ω) ∩ K ≠ ∅} ∈ F

this is equivalent to strong measurability, whenever the σ-field on Θ is replaced by the family K of compact subsets of Θ
the consequence is that the upper probability of K exists for all K ∈ K

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 118 / 125 Belief functions on reals Random sets Random closed sets Some examples

if ξ is a random variable, then X = (−∞, ξ] is a random closed set
if ξ1, ξ2 and ξ3 are three random vectors in R^d, then the triangle with vertices ξ1, ξ2 and ξ3 is a random closed set

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 119 / 125 Belief functions on reals Random sets Capacity functionals Random closed sets

a functional TX : K → [0, 1] given by

TX (K ) = P({X ∩ K 6= ∅}), K ∈ K

is said to be the capacity functional of X

in particular, if X = {ξ} is a classical random variable, then TX (K ) = P({ξ ∈ K }) is the probability distribution of the random variable ξ

the name ‘capacity functional’ follows from the fact that TX is a functional on K which takes values in [0, 1], equals 0 on the empty set, is monotone and upper semicontinuous (i.e., TX is a capacity, and also completely alternating on K)

TX(K) is the plausibility measure induced by the multivalued mapping X, restricted to compact subsets
the links between random closed sets and belief/plausibility functions, upper and lower probabilities, and contaminated models in statistics are very briefly hinted at in [Molchanov 2005], Chapter 1, Section 9
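A Monte Carlo sketch of the first example above, X = (−∞, ξ] with ξ a standard Gaussian (sample size and test intervals are my own choices):

```python
import math
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]    # xi ~ N(0, 1)

def capacity_functional(a, b):
    """T_X([a,b]) = P(X ∩ [a,b] ≠ ∅) for the random closed set X = (-inf, xi]."""
    # (-inf, xi] meets [a, b] exactly when xi >= a
    return sum(xi >= a for xi in samples) / len(samples)

def gaussian_sf(a):        # exact P(xi >= a) for a standard Gaussian, for comparison
    return 0.5 * math.erfc(a / math.sqrt(2.0))

for a, b in [(-1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]:
    print(f"T_X([{a}, {b}]) ~ {capacity_functional(a, b):.3f}   (exact: {gaussian_sf(a):.3f})")
```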

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 120 / 125 Conclusions Outline


Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 121 / 125 Conclusions A summary

the theory of belief functions is grounded in the beautiful mathematics of random sets
it has strong relationships with other theories of uncertainty
(it can be efficiently implemented by Monte-Carlo approximation)
statistical evidence may be represented in several ways:

I by likelihood-based belief functions, generalizing both likelihood-based and Bayesian inference
I by Dempster's idea of using auxiliary variables
I in the framework of the Generalised Bayesian Theorem
(propagation on graphical models can be performed)
decision making strategies based on intervals of expected utilities can be formulated that are more cautious than traditional ones
the extension to continuous domains can be tackled via the Borel interval representation, and in the more general case using the theory of random sets
(a toolbox of estimation, classification and regression tools based on the theory of belief functions is available)

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 122 / 125 Conclusions What still needs to be resolved

clarify once and for all the epistemic interpretation of belief function theory → random variables for set-valued observations
the mechanism for evidence combination is still debated, and depends on meta-information on the sources which is hardly accessible → working with intervals of belief functions may be the way forward

I this acknowledges the meta-uncertainty on the nature of the sources generating the evidence
the same holds for conditioning (as we showed)
what about computational complexity? → not an issue, just apply sampling for approximate inference

I we do not need to assign mass to all subsets, but we need to be allowed to do so when observations are indeed sets
belief functions on reals → Borel intervals are nice, but the way forward is grounding the theory in the mathematics of random sets

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 123 / 125 Conclusions Future of random set/belief function theory

fully developed theory of statistical inference with random sets

I generalised likelihood, logistic regression
I limit theorems, total probability for random sets
I random set random variables and processes
I frequentist inference with random sets
propose solutions to high impact problems:

I rare event prediction
I robust foundations for machine learning
I robust climatic change predictions
further development of machine learning tools:

I random set random forests
I generalised max entropy classification
I robust statistical learning theory

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 124 / 125 Appendix For Further Reading For Further ReadingI

G. Shafer. A mathematical theory of evidence. Princeton University Press, 1976.

I. Molchanov. Theory of Random Sets. Springer, 2017.

F. Cuzzolin. Visions of a generalized probability theory. Lambert Academic Publishing, 2014.

F. Cuzzolin. The geometry of uncertainty - The geometry of imprecise probabilities. Springer-Verlag (in press).

Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 125 / 125