Pathological Properties of Deep Bayesian Hierarchies Jacob Steinhardt and Zoubin Ghahramani

Overview Convergence of Martingale Sequences Computing the Limit Solution 1: Pitman-Yor Processes •In hierarchical models, we need a distribution over •Consider the following sequences (thought of as •The limiting variance of the distributions in a •Pitman-Yor processes have the following the latent parameters at each node parameters on a path down a hierarchy): martingale must be 0, which implies: consistency property: if G1 | G0 ~ PYP(α,d0,G0), θn+1 θn DP(cθn),θn+1 θn BP(cθn), •θ converges to a single atom (DP and PYP) and G2 | G1 ~ PYP(αd1,d1,G1), then G2 | G0 ~ PYP •Common solution: recursively draw from a | ∼ | ∼ θn+1 θn GammaP(cθn),θn+1 θn PYP(cθn) (αd1,d0d1,G0). distribution such as a , beta | ∼ | ∼ •All masses converge to 0 or 1 (beta process) process, Pitman-Yor process, etc. •All have the property that E[θn+1 | θn] = θn. •θ converges to 0 (gamma process) •In general, Gn | G0 ~ PYP(αd1...dn,d0...dn,G0). If G ({p}) = ε, then G ({p}) is approximately •We show that for DP, BP, and GammaP, this won’t •Called the martingale property •DP, BP, and GammaP all involve draws from a 0 n 1 •Philosophically desirable because it means that gamma random variable, so we will necessarily d d work for deep hierarchies α￿ 0··· n information is preserved as we move down the run into the pathology described in Lemma 1! •But!...Pitman-Yor is okay d0 + α￿ hierarchy •See Example 3 for a martingale that can converge ￿ ￿ Pathologies of the •Theorem (Doob): All non-negative martingale to an arbitrary value in [0,1] (also used in Solution 1 sequences have a limit with 1. 2) 0.9 for Small Parameters 0.8 •For small settings of the parameters, samples from Examples of Martingale Sequences 0.7 a gamma distribution can end up very close to zero. 0.6

0.9 •Lemma 1: If y ~ Gamma(cx,c), and cx ⩽ 1, then 1 120 0.5 0.9 0.8 0.4 100 1 1 1 0.8 cx 0.7 Cumulative probability P y 2− 0.3 ≤ c ≥ 2e 0.7 ￿ ￿ 80 0.6 0.14 1 0.6 0.2

0.9 0.5 0.12 0.5 60 0.8 0.1 0.4 0.1 0.7 0.4

0.6 0 0.08 40 0.3 ï100 ï80 ï60 ï40 ï20 0 20 0.3 0.5 Mass (Logistic Scale) 0.06 0.4 0.2 0.2 20 0.04 0.3 Figure 3: The cdf of the mass placed on an atom of base mass 0.1, 0.1 0.1 0.2 0.02 for a draw from PYP(0.1,0.05). 0.1 0 0 0 0 20 40 60 80 100 120 140 160 180 200 0 10 20 30 40 50 60 70 80 90 100 0 0 0 2 4 6 8 10 12 14 16 18 20 ï20 ï15 ï10 ï5 0 5 ï20 ï15 ï10 ï5 0 5 Example 2: parameters of a hierarchical Figure 1: The pdf (left) and cdf (right) of log(Y), where Y ~ Example 1: parameters of a hierarchical Example 3: a martingale given by Gamma process. Gamma(0.2,1.0). Note the relatively large amount of Beta process. θn=αn/(αn+βn), where: Solution 2: Adding Inertia θn+1 | θn ~ Beta(50θn,50(1-θn)) xn+1 | xn ~ Gamma(xn,1) αn+1 | αn ~ αn+Gamma(αn,1), probability mass placed on values as small as exp(-20). •Instead of xn+1 ~ Gamma(cnxn,cn), have, e.g., βn+1 | βn ~ βn+Gamma(βn,1). •So, we should avoid choosing such small This construction can be used to rectify dn ~ Gamma(cnxn,cn), and xn+1 = (1-an)xn+andn. parameters. But for deep hierarchies, this turns the problems with HBPs and HDPs. •Still a martingale, even for Dirichlet out to be unavoidable! •Rate of decay controlled by the sequence an proposed model •Gamma, beta, and Dirichlet sequences all decay Proving That the Decay Rate is a Naive Solution: with Noise 0.13 towards 0 or 1 at a rate governed by a tower of Tower of Exponentials •Break martingale property and take, e.g., θn+1 ~ DP 0.12 exponentials: 1/e^(e^(e^(e^(...)))). (c[(1-ε)θn+ εµ0]), where µ0 is some global base •Theorem: If xn+1 ~ Gamma(cnxn,cn), where {cn} is M measure 0.11 bounded, then xk ⩽ (exp) (1) with probability 1-ε, •Issue: with N atoms, µ0 places mass 1/N on some Why Call This Behavior Pathological? where k = bM and b depends only on ε. 0.1 •Practically: if the parameters converge extremely •Note: (exp)M means exponentiation composed M atom, so DP has at least one parameter as small as cε/N mass rapidly, then posterior inference is extremely times 0.09 •Even more trouble with infinitely many atoms sensitive to parameter values deep in the tree, •Proof sketch: xn+1 << xn with non-negligible 0.08 which are too small to represent accurately on a probability by Lemma 1, but the martingale property •Forgets information after 1/ε steps kappa = 10.0, epsilon = 0.1 0 computer 10 0.07 together with Markov’s inequality bounds the Figure 2: The mass assigned to

ï5 10 •The difference between a parameter value of 0, probability that xn+1 is ever more than a constant an atom for a hierarchical 0.06 -1000000000 -100 Dirichlet process with noise 0 50 100 150 200 250 300 350 400 450 500 10 , and 10 matters significantly to the greater than xn. ï10 depth 10 mixed in. Here we have

conditional distribution of a parameter 3 levels up •Similar convergence properties (tower of mass Figure 4: the mass of an atom on an inertia-added hierarchical Beta ï15 10 parameters c = 10.0, ε = 0.1, and process. The sequence θn is generated as: •Philosophically: as Bayesians, we would never exponentials) for Beta and Dirichlet. µ0 a uniform distribution over ï20 αn+1=αn+Gamma(αn/θn,5) report confidences as high as exp(exp(...(1))), so 10 10 atoms. βn+1=βn+Gamma(βn/θn,5)

ï25 10 ours models should not, either. 0 50 100 150 200 250 300 350 400 450 500 θn+1=αn+1/(αn+1+βn+1) depth