Coresets for Bayesian Logistic Regression
Tamara Broderick, ITT Career Development Assistant Professor, MIT
With: Jonathan H. Huggins, Trevor Campbell

Bayesian inference
• Complex, modular; coherent uncertainties; prior info
• $p(\theta \mid y) \propto_{\theta} p(y \mid \theta)\, p(\theta)$
• MCMC: Accurate but can be slow [Bardenet, Doucet, Holmes 2015]
• (Mean-field) variational Bayes: (MF)VB
  • Fast, streaming, distributed [Broderick, Boyd, Wibisono, Wilson, Jordan 2013]
    [Figure: streaming/distributed VB results on datasets of size 3.6M and 350K]
  • Misestimation & lack of quality guarantees [MacKay 2003; Bishop 2006; Wang, Titterington 2004; Turner, Sahani 2011; Fosdick 2013; Dunson 2014; Bardenet, Doucet, Holmes 2015; Opper, Winther 2003; Giordano, Broderick, Jordan 2015]
• Our proposal: use data summarization for fast, streaming, distributed algorithms with theoretical guarantees

Data summarization
• Exponential family likelihood, with sufficient statistics:
  $$p(y_{1:N} \mid x_{1:N}, \theta) = \prod_{n=1}^{N} \exp\left[ T(y_n, x_n) \cdot \eta(\theta) \right] = \exp\left[ \left\{ \sum_{n=1}^{N} T(y_n, x_n) \right\} \cdot \eta(\theta) \right]$$
• Scalable, single-pass, streaming, distributed, complementary to MCMC
• But: Often no simple sufficient statistics
• E.g. Bayesian logistic regression; GLMs; “deeper” models
• Logistic regression likelihood:
  $$p(y_{1:N} \mid x_{1:N}, \theta) = \prod_{n=1}^{N} \frac{1}{1 + \exp(-y_n\, x_n \cdot \theta)}$$
• Our proposal: approximate sufficient statistics (see the sketch below)
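To make the contrast concrete, here is a minimal Python sketch, assuming Gaussian linear regression (unit noise) as the exponential-family example: its log-likelihood depends on the data only through a fixed-dimension summary that can be accumulated in one pass and added across shards, whereas the logistic regression log-likelihood admits no such finite-dimensional summary. Function names are illustrative.

```python
import numpy as np

def summarize(X, y):
    # One pass over the data; summaries from separate shards simply add.
    # Sufficient statistics for Gaussian linear regression (unit noise):
    # sum_n x_n x_n^T, sum_n y_n x_n, sum_n y_n^2.
    return X.T @ X, X.T @ y, float(y @ y)

def gaussian_log_lik(theta, summary):
    # Evaluate the log-likelihood at ANY theta from the summary alone,
    # up to an additive constant that does not depend on theta:
    # -0.5 * sum_n (y_n - x_n . theta)^2
    XtX, Xty, yy = summary
    return -0.5 * (yy - 2.0 * theta @ Xty + theta @ XtX @ theta)

def logistic_log_lik(theta, X, y):
    # No fixed set of statistics reproduces this for all theta:
    # each term -log(1 + exp(-y_n x_n . theta)) needs the raw data point.
    return -np.logaddexp(0.0, -y * (X @ theta)).sum()
```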

Coresets [Agarwal et al 2005, Feldman & Langberg 2011, DuMouchel et al 1999, Madigan et al 1999; Huggins, Campbell, Broderick 2016]
[Figure: illustration contrasting baseball and curling point configurations]
• Pre-process data to get a smaller, weighted data set
• Theoretical guarantees on quality
• Fast algorithms; error bounds for streaming, distributed
• Cf. data squashing, big data GP ideas, subsampling
• We develop: coresets for Bayes
• Focus on: Logistic regression

Step 1: calculate sensitivities of each datapoint [Huggins, Campbell, Broderick 2016]

Step 2: sample points proportionally to sensitivity [Huggins, Campbell, Broderick 2016]

Step 3: weight points by inverse of their sensitivity [Huggins, Campbell, Broderick 2016] (see the construction sketch below)
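Steps 1–3 amount to sensitivity-based importance sampling. A minimal Python sketch follows; the per-point sensitivities are taken as a given input here, since Huggins, Campbell & Broderick (2016) derive the actual upper bounds for logistic regression (via clustering) and this sketch does not reproduce that derivation. Names are illustrative.

```python
import numpy as np

def build_coreset(X, y, M, sensitivities, seed=0):
    # Step 1 (assumed done): `sensitivities` holds an upper bound on
    # each point's worst-case contribution to the log-likelihood.
    rng = np.random.default_rng(seed)
    probs = sensitivities / sensitivities.sum()

    # Step 2: sample M points with probability proportional to sensitivity.
    idx = rng.choice(len(y), size=M, p=probs)

    # Step 3: weight each sampled point by the inverse of its (scaled)
    # sampling probability, so the weighted coreset log-likelihood is an
    # unbiased estimate of the full-data log-likelihood.
    weights = 1.0 / (M * probs[idx])
    return X[idx], y[idx], weights
```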

Step 4: input weighted points to existing approximate posterior algorithm [Huggins, Campbell, Broderick 2016] (sketch below)
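Step 4 requires only that the downstream algorithm accept a weighted log-likelihood. A hedged sketch, reusing the coreset output above; `log_prior` is an assumed user-supplied function:

```python
import numpy as np

def weighted_log_posterior(theta, Xc, yc, w, log_prior):
    # Replace the full-data log-likelihood with its weighted coreset
    # estimate; any MCMC or VB routine can then target this function.
    # Weighted logistic regression log-likelihood, written stably:
    # sum_m w_m * log sigmoid(y_m * x_m . theta).
    z = yc * (Xc @ theta)
    log_lik = -np.sum(w * np.logaddexp(0.0, -z))
    return log_prior(theta) + log_lik
```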

Example: webspam data (350K points, 127 features)
• Full MCMC: slow, > 2 days
• Coreset MCMC: fast! < 2 hours

Theory [Huggins, Campbell, Broderick 2016]
• Finite-data theoretical guarantee
• On the log evidence (vs. posterior mean, uncertainty, etc.)
• Thm sketch (HCB). Choose $\varepsilon > 0$, $\delta \in (0, 1)$. Our algorithm runs in $O(N)$ time and creates a coreset of size $\sim \mathrm{const} \cdot \varepsilon^{-2} + \log(1/\delta)$. W.p. $1 - \delta$, it constructs a coreset with $|\ln \mathcal{E} - \ln \tilde{\mathcal{E}}| \le \varepsilon\, |\ln \mathcal{E}|$.
• Can quantify the propagation of error in streaming and parallel settings:
  1. If $D_i'$ is an $\varepsilon$-coreset for $D_i$ ($i = 1, 2$), then $D_1' \cup D_2'$ is an $\varepsilon$-coreset for $D_1 \cup D_2$.
  2. If $D'$ is an $\varepsilon$-coreset for $D$ and $D''$ is an $\varepsilon'$-coreset for $D'$, then $D''$ is an $\varepsilon''$-coreset for $D$, where $\varepsilon'' = (1 + \varepsilon)(1 + \varepsilon') - 1$.
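These two properties are what enable streaming and distributed use: per-shard coresets merge by concatenation, and repeated summarization composes with a known error. A minimal sketch (names illustrative):

```python
import numpy as np

def merge_coresets(c1, c2):
    # Property 1: the union of an eps-coreset for D1 and an eps-coreset
    # for D2 is an eps-coreset for D1 ∪ D2, so shard coresets are
    # combined by simple concatenation, weights kept as-is.
    (X1, y1, w1), (X2, y2, w2) = c1, c2
    return (np.vstack([X1, X2]),
            np.concatenate([y1, y2]),
            np.concatenate([w1, w2]))

def composed_eps(eps, eps_prime):
    # Property 2: building an eps'-coreset of an eps-coreset yields an
    # eps''-coreset of the original data.
    return (1.0 + eps) * (1.0 + eps_prime) - 1.0
```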

Polynomial approximate sufficient statistics (sketch below) [Huggins, Adams, Broderick, submitted]
• Subset yields 6M data points, 1K features
• Streaming, distributed; minimal communication
• 24 cores, < 20 sec
• Bounds on Wasserstein distance
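As a rough illustration of the idea, the following sketch approximates each logistic log-likelihood term $\phi(s) = -\log(1 + e^{-s})$, with $s = y_n\, x_n \cdot \theta$, by a degree-2 polynomial, so the summed log-likelihood depends on the data only through $(N, \sum_n y_n x_n, \sum_n x_n x_n^\top)$. The coefficients `b0, b1, b2` are assumed to come from some offline polynomial fit (e.g., a Chebyshev approximation on a range of interest); this is a sketch of the general approach, not the paper's exact construction.

```python
import numpy as np

def pass_summaries(X, y):
    # One streaming pass; shard summaries simply add. Since y_n is in
    # {-1, +1}, s_n^2 = (x_n . theta)^2, so sum_n s_n^2 needs only
    # sum_n x_n x_n^T rather than the raw data.
    return len(y), (y[:, None] * X).sum(axis=0), X.T @ X

def approx_log_lik(theta, summaries, b):
    # Degree-2 polynomial surrogate: sum_n phi(s_n) is approximately
    # b0*N + b1*(sum_n y_n x_n).theta + b2*theta^T (sum_n x_n x_n^T) theta.
    N, sum_yx, XtX = summaries
    b0, b1, b2 = b  # assumed coefficients from an offline polynomial fit
    return b0 * N + b1 * (sum_yx @ theta) + b2 * (theta @ XtX @ theta)
```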

Conclusions
• Reliable Bayesian inference at scale via data summarization
• Coresets, polynomial approximate sufficient statistics
• Streaming, distributed
• Challenges and opportunities:
  • Beyond logistic regression
  • Generalized linear models; deep models; high-dimensional models
[Huggins, Campbell, Broderick 2016; Huggins, Adams, Broderick, submitted; Bardenet, Maillard 2015; Geppert, Ickstadt, Munteanu, Quedenfeld, Sohler 2017; Ahfock, Astle, Richardson 2017]

References

• T Broderick, N Boyd, A Wibisono, AC Wilson, and MI Jordan. Streaming variational Bayes. NIPS 2013.
• T Campbell*, JH Huggins*, J How, and T Broderick. Truncated random measures. Submitted. arXiv:1603.00861.
• R Giordano, T Broderick, and MI Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. NIPS 2015.
• R Giordano, T Broderick, R Meager, JH Huggins, and MI Jordan. Fast robustness quantification with variational Bayes. ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016. arXiv:1606.07153.
• JH Huggins, T Campbell, and T Broderick. Coresets for scalable Bayesian logistic regression. NIPS 2016.
• PK Agarwal, S Har-Peled, and KR Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 2005.
• D Ahfock, WJ Astle, and S Richardson. Statistical properties of sketching algorithms. arXiv:1706.03665.
• R Bardenet, A Doucet, and C Holmes. On Markov chain Monte Carlo methods for tall data. arXiv, 2015.
• R Bardenet and O-A Maillard. A note on replacing uniform subsampling by random projections in MCMC for linear regression of tall datasets. Preprint.
• CM Bishop. Pattern Recognition and Machine Learning. 2006.
• W DuMouchel, C Volinsky, T Johnson, C Cortes, and D Pregibon. Squashing flat files flatter. SIGKDD 1999.
• D Dunson. Robust and scalable approach to Bayesian inference. Talk at ISBA 2014.
• D Feldman and M Langberg. A unified framework for approximating and clustering data. Symposium on Theory of Computing, 2011.
• B Fosdick. Modeling Heterogeneity within and between Matrices and Arrays, Chapter 4.7. PhD thesis, University of Washington, 2013.
• LN Geppert, K Ickstadt, A Munteanu, J Quedenfeld, and C Sohler. Random projections for Bayesian regression. Statistics and Computing, 2017.
• DJC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
• D Madigan, N Raghavan, W DuMouchel, M Nason, C Posse, and G Ridgeway. Likelihood-based data squashing: A modeling approach to instance construction. Data Mining and Knowledge Discovery, 2002.
• M Opper and O Winther. Variational linear response. NIPS 2003.
• RE Turner and M Sahani. Two problems with variational expectation maximisation for time-series models. In D Barber, AT Cemgil, and S Chiappa, editors, Bayesian Time Series Models, 2011.
• B Wang and M Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. AISTATS 2004.