Pathological Properties of Deep Bayesian Hierarchies
Jacob Steinhardt and Zoubin Ghahramani

Overview
• In hierarchical models, we need a distribution over the latent parameters at each node.
• Common solution: recursively draw from a distribution such as a Dirichlet process (DP), beta process (BP), gamma process (GammaP), or Pitman-Yor process (PYP).
• We show that for DP, BP, and GammaP, this won't work for deep hierarchies.
• But Pitman-Yor is okay.

Pathologies of the Gamma Distribution for Small Parameters
• For small settings of the parameters, samples from a gamma distribution can end up very close to zero.
• Lemma 1: If y ~ Gamma(cx, c) and cx ≤ 1, then P[y ≤ (1/c) 2^(-1/(cx))] ≥ 1/(2e).
• [Figure 1: The pdf (left) and cdf (right) of log(Y), where Y ~ Gamma(0.2, 1.0). Note the relatively large amount of probability mass placed on values as small as exp(-20).]
• So we should avoid choosing such small parameters. But for deep hierarchies, this turns out to be unavoidable!
• Gamma, beta, and Dirichlet sequences all decay towards 0 or 1 at a rate governed by a tower of exponentials: 1/e^(e^(e^(e^(...)))).

Why Call This Behavior Pathological?
• Practically: if the parameters converge extremely rapidly, then posterior inference is extremely sensitive to parameter values deep in the tree, which are too small to represent accurately on a computer.
• The difference between a parameter value of 0, 10^(-1000000000), and 10^(-100) matters significantly to the conditional distribution of a parameter 3 levels up.
• Philosophically: as Bayesians, we would never report confidences as high as exp(exp(...(1))), so our models should not, either.

Convergence of Martingale Sequences
• Consider the following sequences (thought of as parameters on a path down a hierarchy):
  θ_{n+1} | θ_n ~ DP(cθ_n), θ_{n+1} | θ_n ~ BP(cθ_n), θ_{n+1} | θ_n ~ GammaP(cθ_n), θ_{n+1} | θ_n ~ PYP(cθ_n).
• All have the property that E[θ_{n+1} | θ_n] = θ_n, called the martingale property.
• The martingale property is philosophically desirable because it means that information is preserved as we move down the hierarchy.
• Theorem (Doob): All non-negative martingale sequences have a limit with probability 1.

Examples of Martingale Sequences
• Example 1: parameters of a hierarchical Beta process, θ_{n+1} | θ_n ~ Beta(50θ_n, 50(1-θ_n)). [Plot of a sample path]
• Example 2: parameters of a hierarchical Gamma process, x_{n+1} | x_n ~ Gamma(x_n, 1). [Plot of a sample path]
• Example 3: a martingale given by θ_n = α_n/(α_n + β_n), where α_{n+1} | α_n ~ α_n + Gamma(α_n, 1) and β_{n+1} | β_n ~ β_n + Gamma(β_n, 1). This construction can be used to rectify the problems with HBPs and HDPs. [Plot of a sample path]

Computing the Limit
• The limiting variance of the distributions in a martingale must be 0, which implies:
  • θ converges to a single atom (DP and PYP);
  • all masses converge to 0 or 1 (beta process);
  • θ converges to 0 (gamma process).
• DP, BP, and GammaP all involve draws from a gamma random variable, so we will necessarily run into the pathology described in Lemma 1!
• See Example 3 for a martingale that can converge to an arbitrary value in [0,1] (also used in Solution 2).

Proving That the Decay Rate is a Tower of Exponentials
• Theorem: If x_{n+1} ~ Gamma(c_n x_n, c_n), where {c_n} is bounded, then x_k ≤ (exp)^M(1) with probability 1-ε, where k = bM and b depends only on ε.
• Note: (exp)^M means exponentiation composed M times.
• Proof sketch: x_{n+1} << x_n with non-negligible probability by Lemma 1, but the martingale property together with Markov's inequality bounds the probability that x_{n+1} is ever more than a constant greater than x_n.
• Similar convergence properties (tower of exponentials) hold for Beta and Dirichlet.

Naive Solution: Mixing with Noise
• Break the martingale property and take, e.g., θ_{n+1} ~ DP(c[(1-ε)θ_n + εµ_0]), where µ_0 is some global base measure.
• Issue: with N atoms, µ_0 places mass 1/N on some atom, so the DP has at least one parameter as small as cε/N.
• Even more trouble with infinitely many atoms.
• Forgets information after 1/ε steps.
• [Figure 2: The mass assigned to an atom (log scale) against depth for a hierarchical Dirichlet process with noise mixed in. Here we have parameters c = 10.0, ε = 0.1, and µ_0 a uniform distribution over 10 atoms.]

Solution 1: Pitman-Yor Processes
• Pitman-Yor processes have the following consistency property: if G_1 | G_0 ~ PYP(α, d_0, G_0) and G_2 | G_1 ~ PYP(αd_1, d_1, G_1), then G_2 | G_0 ~ PYP(αd_1, d_0 d_1, G_0).
• In general, G_n | G_0 ~ PYP(αd_1···d_{n-1}, d_0···d_{n-1}, G_0).
• If G_0({p}) = ε, then G_n({p}) is approximately [expression in α and d_0···d_n not recoverable].
• [Figure 3: The cdf of the mass placed on an atom of base mass 0.1, for a draw from PYP(0.1, 0.05). Axes: mass (logistic scale) against cumulative probability.]

Solution 2: Adding Inertia
• Instead of x_{n+1} ~ Gamma(c_n x_n, c_n), have, e.g., d_n ~ Gamma(c_n x_n, c_n) and x_{n+1} = (1-a_n)x_n + a_n d_n.
• Still a martingale, even for Dirichlet.
• Rate of decay controlled by the sequence a_n.
• [Figure 4: The mass of an atom against depth on an inertia-added hierarchical Beta process. The sequence θ_n is generated as: α_{n+1} = α_n + Gamma(α_n/θ_n, 5), β_{n+1} = β_n + Gamma(β_n/θ_n, 5), θ_{n+1} = α_{n+1}/(α_{n+1} + β_{n+1}).]
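The gamma pathology and the inertia fix can be checked numerically. The following is a minimal simulation sketch, not the authors' code. It assumes NumPy, and it reads Gamma(a, b) as shape a and rate b, which is the reading under which E[x_{n+1} | x_n] = x_n holds. The function names (lemma1_frequency, gamma_chain, inertia_chain) and the constant choices for c and a are hypothetical and chosen only for illustration.

```python
import numpy as np

def lemma1_frequency(c, x, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P[y <= (1/c) * 2**(-1/(c*x))], y ~ Gamma(shape=c*x, rate=c)."""
    rng = np.random.default_rng(seed)
    shape = c * x
    # NumPy parameterizes the gamma by scale, so rate c becomes scale 1/c.
    y = rng.gamma(shape, 1.0 / c, size=n_samples)
    threshold = 2.0 ** (-1.0 / shape) / c
    return float(np.mean(y <= threshold))

def gamma_chain(x0, c, depth, rng):
    """Path of x_{n+1} ~ Gamma(c * x_n, c) with a constant c (an assumption)."""
    xs = [x0]
    for _ in range(depth):
        x = xs[-1]
        # Once x underflows to exactly 0.0 the chain stays there.
        xs.append(rng.gamma(c * x, 1.0 / c) if x > 0 else 0.0)
    return np.array(xs)

def inertia_chain(x0, c, a, depth, rng):
    """Path of d_n ~ Gamma(c * x_n, c), x_{n+1} = (1 - a) * x_n + a * d_n, constant a."""
    xs = [x0]
    for _ in range(depth):
        x = xs[-1]
        d = rng.gamma(c * x, 1.0 / c) if x > 0 else 0.0
        xs.append((1.0 - a) * x + a * d)
    return np.array(xs)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    print("Lemma 1 frequency:", lemma1_frequency(c=1.0, x=0.2))
    print("Lemma 1 lower bound 1/(2e):", 1.0 / (2.0 * np.e))
    print("plain chain, final value:", gamma_chain(1.0, 1.0, 200, rng)[-1])
    print("inertia chain, final value:", inertia_chain(1.0, 1.0, 0.1, 200, rng)[-1])
```

In the inertia chain each step satisfies x_{n+1} ≥ (1-a) x_n, so the path can shrink at most geometrically, whereas the plain chain can collapse to values that floating point rounds to zero. This matches the poster's point that the rate of decay is controlled by the sequence a_n.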
