Advances in Bayesian inference and stable optimization for large-scale machine learning problems

Francois Fagan

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2019

© 2018

Francois Fagan
All Rights Reserved

ABSTRACT

Advances in Bayesian inference and stable optimization for large-scale machine learning problems

Francois Fagan

A core task in machine learning, and the topic of this thesis, is developing faster and more accurate methods of posterior inference in probabilistic models. The thesis has two components. The first explores using deterministic methods to improve the efficiency of Markov chain Monte Carlo (MCMC) algorithms. We propose new MCMC algorithms that can use deterministic methods as a "prior" to bias MCMC proposals to be in areas of high posterior density, leading to highly efficient sampling. In Chapter 2 we develop such methods for continuous distributions, and in Chapter 3 for binary distributions. The resulting methods consistently outperform existing state-of-the-art sampling techniques, sometimes by several orders of magnitude. Chapter 4 uses similar ideas as in Chapters 2 and 3, but in the context of modeling the performance of left-handed players in one-on-one interactive sports.

The second part of this thesis explores the use of stable stochastic gradient descent (SGD) methods for computing a maximum a posteriori (MAP) estimate in large-scale machine learning problems. In Chapter 5 we propose two such methods for softmax regression. The first is an implementation of Implicit SGD (ISGD), a stable but difficult-to-implement SGD method, and the second is a new SGD method specifically designed for optimizing a double-sum formulation of the softmax. Both methods comprehensively outperform the previous state-of-the-art on seven real-world datasets. Inspired by the success of ISGD on the softmax, we investigate its application to neural networks in Chapter 6. In this chapter we present a novel layer-wise approximation of ISGD that has efficiently computable updates. Experiments show that the resulting method is more robust to high learning rates and generally outperforms standard backpropagation on a variety of tasks.

Table of Contents

List of Figures vii

List of Tables xi

1 Introduction 1

I Advances in Bayesian inference: Leveraging deterministic methods for efficient MCMC inference 5

2 Elliptical Slice Sampling with Expectation Propagation 6
2.1 Introduction ...... 6
2.2 Expectation propagation and elliptical slice sampling ...... 8
2.2.1 Elliptical slice sampling ...... 8
2.2.2 Expectation propagation ...... 10
2.2.3 Elliptical Slice Sampling with Expectation Propagation ...... 11
2.3 Recycled elliptical slice sampling ...... 13
2.4 Analytic elliptical slice sampling ...... 17
2.4.1 Analytic EPESS for TMG ...... 19
2.5 Experimental results ...... 20
2.5.1 Probit regression ...... 21
2.5.2 Truncated multivariate Gaussian ...... 22
2.5.3 Log-Gaussian Cox process ...... 24
2.6 Discussion and conclusion ...... 24

3 Annular Augmentation Sampling 26
3.1 Introduction ...... 26
3.2 Annular augmentation sampling ...... 28
3.2.1 Annular Augmentations ...... 29
3.2.2 Generic Sampler ...... 31
3.2.3 Operator 2: Rao-Blackwellized Gibbs ...... 32
3.2.4 Choice And Effect Of The Prior p̂ ...... 34
3.3 Experimental results ...... 35
3.3.1 2D Ising Model ...... 36
3.3.2 Heart Disease Risk Factors ...... 39
3.4 Conclusion ...... 41

4 The Advantage of Lefties in One-On-One Sports 42
4.1 Introduction and Related Literature ...... 42
4.2 The Latent Skill and Handedness Model ...... 48
4.2.1 Medium-Tailed Priors for the Innate Skill Distribution ...... 49
4.2.2 Large N Inference Using Only Aggregate Handedness Data ...... 51
4.2.3 Interpreting the Posterior of L ...... 53
4.3 Including Match-Play and Handedness Data ...... 56
4.3.1 The Posterior Distribution ...... 57
4.3.2 Inference Via MCMC ...... 58
4.4 Using External Rankings and Handedness Data ...... 62
4.5 Numerical Results ...... 63
4.5.1 Posterior Distribution of L ...... 64

4.5.2 On the Importance of Conditioning on Top_{n,N} ...... 64
4.5.3 Posterior of Skills With and Without the Advantage of Left-handedness ...... 66
4.6 A Dynamic Model for L ...... 69
4.7 Conclusions and Further Research ...... 73

II Advances in numerical stability for large-scale machine learning problems 76

5 Stable and scalable softmax optimization with double-sum formulations 77
5.1 Introduction ...... 77
5.2 Convex double-sum formulation ...... 80
5.2.1 Instability of ESGD ...... 81
5.2.2 Choice of double-sum formulation ...... 82
5.3 Stable and scalable SGD methods ...... 83
5.3.1 Implicit SGD ...... 83
5.3.2 U-max method ...... 86
5.4 Experiments ...... 88
5.4.1 Experimental setup ...... 88
5.4.2 Results ...... 91
5.5 Conclusion ...... 94
5.6 Extensions ...... 94

6 Implicit backpropagation 95
6.1 Introduction ...... 95
6.2 ISGD and related methods ...... 97
6.2.1 ISGD method ...... 97
6.2.2 Related methods ...... 98
6.3 Implicit Backpropagation ...... 99
6.4 Implicit Backpropagation updates for various activation functions ...... 101
6.4.1 Generic updates ...... 101
6.4.2 Relu update ...... 102
6.4.3 Arctan update ...... 104
6.4.4 Piecewise-cubic function update ...... 104
6.4.5 Relative run time difference of IB vs EB measured in flops ...... 104
6.5 Experiments ...... 106
6.6 Conclusion ...... 109

III Bibliography 111

IV Appendices 129

A Elliptical Slice Sampling with Expectation Propagation 130
A.1 Monte Carlo integration ...... 130
A.2 Markov Chain Monte Carlo (MCMC) ...... 130
A.3 Slice sampling ...... 131

B Annular Augmentation Sampling 132
B.1 Rao-Blackwellization ...... 132
B.2 2D-Ising model results ...... 132
B.2.1 LBP accuracy ...... 133
B.2.2 Benefit of Rao-Blackwellization ...... 133
B.2.3 RMSE for node and pairwise marginals ...... 133
B.3 Heart disease dataset ...... 139
B.4 CMH + LBP sampler ...... 139
B.5 Simulated data ...... 141
B.5.1 d = 10 ...... 141

C The Advantage of Lefties in One-On-One Sports 143
C.1 Proof of Proposition 1 ...... 143
C.2 Rate of Convergence for Probability of Left-Handed Top Player Given Normal Innate Skills ...... 146
C.3 Difference in Skills at Given Quantiles with Laplace Distributed Innate Skills ...... 150
C.4 Estimating the Kalman Filtering Smoothing Parameter ...... 151

D Stable and scalable softmax optimization with double-sum formulations 153
D.1 Comparison of double-sum formulations ...... 153
D.2 Proof of variable bounds and strong convexity ...... 155
D.3 Proof of convergence of U-max method ...... 158
D.4 Stochastic Composition Optimization ...... 160

D.5 Proof of general ISGD gradient bound ...... 161
D.6 Update equations for ISGD ...... 161
D.6.1 Single datapoint, single class ...... 162
D.6.2 Bound on step size ...... 167
D.6.3 Single datapoint, multiple classes ...... 168
D.6.4 Multiple datapoints, multiple classes ...... 171
D.7 Comparison of double-sum formulations ...... 175
D.8 Results over epochs ...... 177

D.9 L1 penalty ...... 179
D.10 Word2vec ...... 181

E Implicit Backpropagation 185
E.1 Derivation of generic update equations ...... 185
E.2 IB for convolutional neural networks ...... 187
E.3 Convergence theorem ...... 188
E.4 Experimental setup ...... 193
E.4.1 Datasets ...... 193
E.4.2 Loss metrics ...... 194
E.4.3 Architectures ...... 195
E.4.4 Hyperparameters and initialization details ...... 196
E.4.5 Learning rates ...... 196
E.4.6 Clipping ...... 197
E.5 Results ...... 197
E.5.1 Run times ...... 197
E.5.2 Results from MNIST and music experiments ...... 197
E.5.3 MNIST classification ...... 200
E.5.4 MNIST autoencoder ...... 203
E.5.5 JSB Chorales ...... 205
E.5.6 MuseData ...... 209
E.5.7 Nottingham ...... 211
E.5.8 Piano-midi.de ...... 213

E.5.9 UCI classification training losses ...... 218

List of Figures

1.1 Approaches to posterior inference in probabilistic models. The colored boxes and circles indicate the approaches used and developed in each chapter of the thesis. . .2

2.1 ESS ellipse shown in dashed red ...... 10
2.2 The EP approximation is in teal and an EPESS elliptical slice is in dashed red ...... 12
2.3 EPESS vs ESS: 400 samples of EPESS and ESS for a 2-d Gaussian N(0, I) truncated in a rectangular box {50 ≤ x ≤ 51, −1 ≤ y ≤ 1}. EPESS explores the parameter space effectively whereas ESS does not ...... 12
2.4 Plots of empirical results. In the top three plots all values are normalized so that EPESS(1) has value 1. In the bottom 2 plots all values are normalized so that Exact-HMC has value 1. The naming convention: EPESS(J) denotes Recycled EPESS with J points sampled per slice. EPESS(1) denotes EPESS without recycling. Ana EPESS(J) denotes Analytic EPESS with J threshold levels per iteration. Error bars in plot 3 for all algorithms are effectively zero ...... 23

3.1 Diagram of the annular augmentation. (a) The auxiliary annulus. Each threshold T_i defines a semicircle where S_i = +1, which is indicated by a darker shade. The value of S changes as θ moves around the annulus. (b) The implied path of the annulus on the binary hypercube. Hypercube faces are colored to match the annulus ...... 30
3.2 Augmentation with mean-field prior p̂. (a) Equation (3.4) in graphical form; in blue, the implied interval where s_1 = 1 along the annulus. We denote p̂_1 = p̂(S_1 = 1). (b) Stretched annular augmentation; cf. Figure 3.1a ...... 31

3.3 Bias scale c = 0.2 and the bias for each node is drawn as B_i ∼ c · Unif[−1, 1]. The target distribution here is multi-modal. As we increase strength W, AAS, AAS + RB and AAG, AAG + RB increase their outperformance over all other samplers, including those using LBP (the LBP approximation degrades for W > 0.6) ...... 37
3.4 No bias is applied to any node, making the target distribution bimodal, with the modes becoming more peaked as strength W is increased. All annular augmentation samplers very significantly outperform ...... 37
3.5 W is fixed to 0.2 and bias scale c is increased, making the target distribution unimodal. HMC and CMH find the mode quickly, as do annular augmentation samplers that leverage LBP. That group outperforms annular augmentation samplers with no access to LBP ...... 38
3.6 Absolute error in estimating the log partition function ratio. Since there is zero bias, the LBP prior yields no additional benefit and so the results for LBP-driven samplers are omitted ...... 40

4.1 The value of lim_{N→∞} p(L | n_l; n, N) as given by the r.h.s. of (4.9) for different values of n. We assume an N(0, 1) prior for p(L), a value of q = 11% and we fixed n_l/n = 25%. The dashed vertical line corresponds to L* from (4.10) and the Laplace approximations are from (4.11) ...... 53
4.2 Posteriors of the left-handed advantage, L, using the inference methods developed in Sections 4.2, 4.3 and 4.4. "Match-play" refers to the results of MCMC inference using the full match-play data, "Handedness and Rankings" refers to MCMC inference using only individual handedness data and external skill rankings, "Aggregate handedness" refers to (4.9) and "Aggregate handedness with Laplace Approximation" refers to (4.9) but where the Laplace approximation in (4.11) substitutes for the likelihood term. "Match-play without conditioning on Top_{n,N}" is discussed in Section 4.5.2 ...... 65
4.3 Posterior distribution of Federer's skill minus Nadal's skill with and without the advantage of left-handedness ...... 68
4.4 Marginal posterior distribution of Federer's and Nadal's skill, with and without the advantage of left-handedness ...... 68

4.5 Posteriors of the left-handed advantage, L_t, computed using the Kalman filter / smoother. The dashed red horizontal lines (at 11% in the upper figure and 0 in the lower figure) correspond to the level at which there is no advantage to being left-handed. The error bars in the marginal posteriors of L_t in the lower figure correspond to 1 standard deviation errors ...... 72

5.1 Training log-loss vs CPU time. CPU time is marked by the number of epochs for OVE, NCE, IS, ESGD and U-max (they all have virtually the same runtime). ISGD is run for the same CPU time as the other methods, but since it has a slightly different runtime per iteration, it will complete a different number of epochs. Thus when the x-axis is at 50, this indicates the end of ISGD's computation but not necessarily its 50th epoch ...... 92
5.2 Training log-loss on Eurlex for different learning rates. The x-axis denotes the number of epochs ...... 93

6.1 Illustration of the difference between ESGD and ISGD in optimizing f(θ) = θ²/2. The learning rate is η = 1.75 in both cases ...... 98
6.2 Relu updates for EB and IB with µ = 0. The arrow tail is the initial value of θᵀz while the head is its value after the backpropagation update. Lower values are better ...... 103
6.3 Training loss of EB and IB (lower is better). The plots display the mean performance and one standard deviation errors. The MNIST-classification plot also shows an "exact" version of ISGD with an inner gradient descent optimizer ...... 107
6.4 Training loss of EB with clipping and IB with clipping on three dataset-architecture pairs ...... 109

B.1 Estimated posterior distributions of W_ij for 1 ≤ i < j ≤ 6 on the heart disease dataset. The samples from each MH algorithm have had a fit to them for easier comparison. "True" refers to the MH sampler where the exact partition function ratio was used, whereas "CMH" and "AAS" refer to the approximate MH samplers for which the partition function ratio was approximated by CMH or AAS, respectively ...... 140

D.1 Log-loss of U-max on Eurlex for different learning rates with our proposed double-sum formulation and that of Raman et al. [2016] ...... 176
D.2 The x-axis is the number of epochs and the y-axis is the training set log-loss from (5.2) ...... 178

List of Tables

2.1 Datasets for probit experiments...... 21

4.1 Proportion of left-handers in several interactive one-on-one sports [Flatt, 2008, Loffing and Hagemann, 2016] with the relative changes in rank under the Laplace distribution for innate skills with q = 11%. Drop_alone = (n_l / (n − n_l)) · ((1 − q)/q) represents the drop in rank for a left-hander who alone gives up the advantage of being left-handed, while Drop_all = n_l / (nq) represents the drop in rank of a left-hander when all left-handers give up the advantage of being left-handed ...... 56
4.2 Probability of match-play results with and without the advantage of left-handedness. The players are ordered according to their rank from the top ranked player (Djokovic) to the lowest ranked player (Lopez). A player's rank is given by the posterior mean of his total skill, S. Each cell gives the probability of the lower ranked player beating the higher ranked player. Above the diagonal the advantage of left-handedness is included in the calculations whereas below the diagonal it is not. The left-handed players are identified in bold font together with the match-play probabilities that change when the left-handed advantage is excluded, i.e. when left- and right-handed players meet ...... 69

4.3 Changes in rank of prominent left-handed players when the left-handed advantage is excluded. The change in rank in the "Match-play" column is computed using the MCMC samples generated using the full match-play data. The change in rank in the "Handedness and Rankings" column is computed using the MCMC samples given only individual handedness data and external skill rankings. The "Aggregate handedness" column uses the BTL ranking as the baseline and the change in rank is obtained by multiplying the baseline by the rank scaling factor of 1.36 from Table 4.1 and rounding to the closest integer ...... 70

5.1 Datasets with a summary of their properties. Where the number of classes, dimension or number of examples has been altered, the original value is displayed in brackets. All of the datasets were downloaded from http://manikvarma.org/downloads/XC/XMLRepository.html, except WikiSmall which was obtained from http://lshtc.iit.demokritos.gr/ ...... 89
5.2 Tuned initial learning rates for each algorithm on each dataset. The learning rate in 10^{0,±1,±2,±3}/N with the lowest log-loss after 50 epochs using only 10% of the data is displayed. ESGD applied to AmazonCat, Wiki10 and WikiSmall suffered from overflow with a learning rate of 10^{−3}/N, but was stable with smaller learning rates (the largest learning rate for which it was stable is displayed) ...... 90
5.3 Relative log-loss of each algorithm at the end of training. The values for each dataset are normalized by dividing by the corresponding Implicit log-loss. The algorithm with the lowest log-loss for each dataset is in bold ...... 92

6.1 IB relu updates ...... 103

B.1 LBP error for different values of W relating to Figure 3.3. The bias scale c is fixed at 0.1...... 134 B.2 LBP error for different values of c relating to Figure 3.5. Strength W is fixed at 0.2. 134 B.3 RMSE of node marginals estimate for increasing values of W relating to Figure 3.4. No bias is applied to any node...... 135 B.4 Numerical values of RMSE (Node Marginals) for Figure 3.3. Bias scale c = 0.2 and we increase the strength (W)...... 135

B.5 Numerical values of RMSE (Pairwise Marginals) for Figure 3.4. Bias scale c = 0.2 and we increase the strength (W) ...... 136
B.6 Numerical values of RMSE (Node Marginals) for Figure 3.4. No bias is applied to any node and we increase the strength (W) ...... 136
B.7 Numerical values of RMSE (Pairwise Marginals) for Figure 3.4. No bias is applied to any node and we increase the strength (W) ...... 137
B.8 Numerical values of RMSE (Node Marginals) for Figure 3.5. We fix W = 0.2 and the bias scale c is increased ...... 137
B.9 Numerical values of RMSE (Pairwise Marginals) for Figure 3.5. We fix W = 0.2 and the bias scale c is increased ...... 138

D.1 Time in seconds taken to run 50 epochs. OVE/NCE/IS/ESGD/U-max with n = 1, m = 5 all have the same runtime. ISGD with n = 1, m = 1 is faster per iteration. The final column displays the ratio of OVE/.../U-max to ISGD for each dataset. . . 177

E.1 Data sources ...... 194 E.2 Data characteristics. An asterisk ∗ indicates the average length of the musical piece. 88 is the number of potential notes in each chord. For information on the UCI classification datasets, see [Klambauer et al., 2017, Appendix 4.2]...... 194 E.3 Theoretical and empirical relative run time of IB vs EB. Empirical measured on AWS p2.xlarge with our basic Pytorch implementation...... 198

Acknowledgments

Foremost, I would like to thank my supervisor Professor Garud Iyengar for guiding me through my PhD. Professor Iyengar has a seemingly boundless interest in all things. I always looked forward to our meetings and would often come away from them not only having gained greater academic understanding, but also with more knowledge (or questions!) about the world. I appreciate the freedom and support he has given me to explore my academic ideas, his generosity in offering fast and constructive feedback, and the patience he has shown me throughout the past five years.

One of the great pleasures of working in an interdisciplinary field like machine learning is the opportunity to learn from and collaborate with many people. Professors Krzysztof Choromanski, John Cunningham and Martin Haugh added an extra dimension to my work and were invaluable mentors through the PhD process. It was also a pleasure to learn and grow with Jalaj Bhandari and Hal Cooper, both fellow PhD candidates, through the projects we worked on together. I spent two of my summers at Bloomberg L.P. under Dr Kang Sun, Sergei Yurovski and Gary Kazantsev, who showed great faith in me and gave me my first exposure to industry. In particular, I am deeply grateful to Gary Kazantsev for arranging funding from Bloomberg for my PhD. My final summer was spent interning at Amazon.com where I was incredibly lucky to work with Dr Andres Ayala, Dr Michael Berry and Dr Evan Buntrock, who were all incredible mentors and good friends.

One of the most enjoyable aspects of my PhD was the excellent courses I took with my fellow PhD candidates in my first few years at Columbia. Professors Cliff Stein, Donald Goldfarb, Ward Whitt, Garud Iyengar, Martin Haugh, Daniel Hsu, David Blei, Liam Paninski, Krzysztof Choromanski, John Cunningham and Shipra Agrawal exposed me to new ideas and provided me with a strong foundation for my research. It was also a rewarding experience to TA for and get to know Professors Soulaymane Kachani, Cyrus Mohebbi, Irene Song, Yuan Zhong, Soumyadip Ghosh, Martin Haugh and Ali Hirsa.

The staff in the Columbia IEOR department are always wonderful to work with. Thank you

to Darbi Leigh Roberts, Adina Berrios Brooks, Lizbeth Morales, Samuel Lee, Simon Mak, Yosimir Acosta, Gerald Cotiangco, Kristen Maynor, Shi Yee Lee, Jenny Mak, Jaya Mohanty and Carmen Ng for making my life easy.

It was amazing to spend time with and get to know many of the other PhD candidates in the department. To keep the list short, I will just mention those who I have spent the most time with in our offices in room 321: Enrique Lelo de Larrea, Fengpei Li, Shuoguang Yang, Yuan Gao, Chaoxu Zhou, Camilo Hernandez and my desk-mate, Octavio Ruiz Lacedelli. I will miss your daily company.

I am also indebted to Professors John Cunningham, Adam Elmachtoub, Donald Goldfarb and Martin Haugh for agreeing to serve on my dissertation committee. They are all remarkable researchers for whom I have the greatest respect. I have been extremely fortunate to have had their presence in my academic life.

To Chiara


Chapter 1

Introduction

According to Box's Loop [Box, 1976, Blei, 2014], the process of building probabilistic machine learning models involves looping through three tasks: model building, inferring hidden quantities and model criticism. The hope is that after cycling through these tasks a few times a good quality model will emerge that can then be used in practice. This thesis focuses on improving the "inference" part of the loop. The goal here is to infer a posterior distribution, confidence interval or point estimate for the latent variables in a model, e.g. the parameters of a neural network.¹

Ideally one would always calculate the full posterior distribution of the latent variables. This gold standard is only achievable for some models where the posterior can be computed analytically, or for low-dimensional models where the posterior can be approximated using methods like Markov Chain Monte Carlo (MCMC). However, for more difficult problems, algorithms like MCMC can become impractically slow and other methods that are faster but return a less accurate posterior must be used instead. In Figure 1.1 we show a range of inference methods ordered by the accuracy of their posterior.

In this thesis we consider a diverse range of problems, each requiring new techniques that improve, combine or apply the approaches displayed in Figure 1.1. The thesis is divided into two parts, with the first part focused on developing efficient MCMC methods and the second part proposing new scalable algorithms for maximum a posteriori (MAP) estimation. The problems that we tackle will be fully introduced in their respective chapters. Here we give a short summary of each problem and make connections between the techniques that we use to solve them.

¹Here we are interpreting the neural network loss as the log-likelihood of a probabilistic model.

[Figure 1.1 appears here. It arranges inference approaches by posterior accuracy, from MAP estimation (point estimate), through the Laplace approximation and variational inference (parametric approximations), to MCMC and numerical integration (non-parametric approximations) and the analytic posterior (exact), and marks the approaches used in each part of the thesis: EPESS/ASS (Part 1, Ch. 2/3), Lefties (Part 1, Ch. 4), Softmax (Part 2, Ch. 5) and Neural Networks (Part 2, Ch. 6).]

Figure 1.1: Approaches to posterior inference in probabilistic models. The colored boxes and circles indicate the approaches used and developed in each chapter of the thesis.

In the first part of this thesis we explore the use of deterministic methods to help model Bayesian systems and design efficient MCMC algorithms. Chapters 2 and 3 focus on performing posterior inference on common and important distributions like the truncated multivariate Gaussian, probit regression, Ising models and Boltzmann machines. MCMC techniques remain the gold standard for approximate Bayesian posterior inference as they are guaranteed to converge to the true posterior. However, their onerous runtime and sensitivity to tuning parameters often force one to use faster, but less accurate, deterministic approximations. Our high-level insight is that these deterministic methods can extract information about the posterior that can be used to construct highly efficient MCMC algorithms.

The challenge here is to: (i) design MCMC methods that can effectively harness information from deterministic methods, and (ii) identify deterministic methods that yield the most useful information. We address (i) by developing MCMC methods that can use the deterministic approximate posterior as a "prior" to bias their proposals to be in areas of high posterior density. The appropriate choice of deterministic method in (ii) is key to making the prior strong and the likelihood weak, putting the MCMC algorithms in the regime where they are most efficient.

In Chapter 2 we develop this approach for continuous distributions, combining the deterministic approximation offered by expectation propagation with elliptical slice sampling, a state-of-the-art MCMC method. Chapter 3 extends the methodology to binary distributions. In order to harness the posterior yielded by belief propagation we invent an auxiliary variable MCMC scheme that samples from an annular augmented space, translating to a great circle path around the hypercube of the binary sample space. For both the continuous and binary cases the resulting MCMC algorithms consistently outperform the previous state-of-the-art MCMC methods, sometimes by multiple orders of magnitude.

Chapter 4 uses similar ideas, but in the context of modeling the performance of lefties in one-on-one interactive sports. Unlike the other chapters in this thesis that focus solely on inference, Chapter 4 involves all three tasks in Box's Loop: model building, inferring hidden quantities and model criticism. The major challenge in this problem is that match-play data is only available for top ranked players, a fact that we explicitly build into our Bayesian model. Our key insight is that a deterministic approximation of the latent advantage of left-handedness (our main variable of interest) can be derived which only depends on the tail-length of the skill distribution. This deterministic approximation can then be used to inform the design of the Bayesian model, define a proposal distribution for a Metropolis-Hastings sampler and enable efficient inference of the advantage of left-handedness over time using a Kalman filter. The result is the first set of Bayesian inference techniques for inferring the advantage of left-handedness in one-on-one interactive sports where only data on top players is available.

The second part of this thesis explores the use of stable SGD methods for computing the maximum a posteriori probability (MAP) of large-scale machine learning problems. The two problems we focus on are softmax optimization and neural network training. Both softmax and neural network models rely on having large amounts of data to make accurate predictions. This large amount of data results in there being a large number of terms in the MAP objective. Stochastic Gradient Descent (SGD) is the method of choice for such problems, as its run time per iteration is independent of the number of terms in the objective. However, the downside of SGD is that it can be unstable and require extensive tuning of the learning rate to ensure good performance.

Implicit SGD (ISGD) is an SGD method that is both stable and robust to the learning rate. The problem is that ISGD is typically difficult to implement as its update equation is highly non-trivial. In this work we show that it is possible to apply ISGD to the softmax and neural networks. In Chapter 5 we show that the ISGD updates for a double-sum representation of the softmax can be reduced to a univariate problem that can be solved using a bisection method. We also propose another stable SGD method specifically designed for the softmax, which we call U-max. In Chapter 6 we explore the application of ISGD to neural network training. The ISGD updates are too expensive to compute exactly for neural networks. However, with carefully constructed approximations, the updates can be simplified and solved efficiently.
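To make the contrast between explicit and implicit SGD concrete, here is a toy illustration (not the implementation used in later chapters) on the one-dimensional objective f(θ) = θ²/2 that appears in Figure 6.1; the learning rate η = 1.75 and the starting point are illustrative assumptions.

import numpy as np  # only used for clarity; plain floats would suffice

# f(theta) = theta^2 / 2, so grad f(theta) = theta.
# Explicit SGD:  theta <- theta - eta * grad f(theta)            = (1 - eta) * theta
# Implicit SGD:  theta_new = theta - eta * grad f(theta_new)  =>  theta_new = theta / (1 + eta)

def explicit_sgd_step(theta, eta):
    return theta - eta * theta        # overshoots and oscillates; diverges once eta > 2

def implicit_sgd_step(theta, eta):
    return theta / (1.0 + eta)        # contracts toward the optimum for any eta > 0

theta_exp, theta_imp, eta = 1.0, 1.0, 1.75
for t in range(5):
    theta_exp = explicit_sgd_step(theta_exp, eta)
    theta_imp = implicit_sgd_step(theta_imp, eta)
    print(t, round(theta_exp, 4), round(theta_imp, 4))

With η = 1.75 the explicit update overshoots and oscillates around the optimum, while the implicit update contracts monotonically for any positive learning rate; this stability is the property exploited in Chapters 5 and 6.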
Our proposed methods for softmax and neural network optimization are more stable, more robust and empirically converge faster than other SGD methods. The result is a more reliable set of methods for solving these large-scale problems. In summary, this thesis has two parts. The first part explores the use of deterministic “priors” to speed up MCMC and the second develops new implementations of ISGD to improve the stability of MAP estimation in large-scale problems. The result is a new set of methods which improve the accuracy, speed and reliability of machine learning inference across a range of important and widely used models including: the truncated multivariate Gaussian, probit regression, Ising models, Boltzmann machines, softmax models and neural networks.

Notation

The first part of the thesis focuses on inference in explicitly Bayesian probabilistic models. Here random variables will be denoted by capital letters (e.g. X) and their values by lowercase letters (e.g. x). The exception is Greek letters (e.g. ν), where the lower case is used for both the random variable and its value. Whether a Greek letter denotes a random variable or a value will either be clear from the context or will be specified in the text. For the second part of the thesis, where we are interested in MLE, lowercase letters will denote scalars or vectors while uppercase letters will refer to matrices.

Part I

Advances in Bayesian inference: Leveraging deterministic methods for efficient MCMC inference

Chapter 2

Elliptical Slice Sampling with Expectation Propagation

2.1 Introduction

Exact posterior inference in Bayesian models is rarely tractable, a fact which has prompted vast amounts of research into efficient approximate inference techniques. Deterministic methods such as the Laplace approximation, Variational Bayes, and Expectation Propagation offer fast and analytical posterior approximations, but introduce potentially significant bias due to their restricted form which cannot capture important characteristics of the true posterior. Markov Chain Monte Carlo (MCMC) methods represent the target posterior with samples, which while asymptotically exact, can be slow, require substantial tuning, and perform poorly when variables are highly correlated.

Conceptually, these two techniques can be combined to great benefit: if a deterministic approximation can cover the true posterior mass accurately, then a subsequent MCMC sampler should be much faster and be less susceptible to inefficiency due to correlation (if the deterministic approximation has captured this correlation). In practice, however, this is quite difficult. First, both the Laplace and Variational Bayesian approximations yield a local approximation of the posterior (in Variational Bayes this is sometimes called the exclusive property of optimizing the Kullback-Leibler divergence from the approximation to the true posterior [Minka, 2005]). While excellent in many situations, this property is inappropriate for initializing an MCMC sampler, since it will be very difficult for that sampler to explore other areas of posterior mass (e.g., other modes).

Expectation Propagation (EP, [Minka, 2001]), on the other hand, is typically derived as an inclusive approximation that, at least approximately, attempts to match the global sufficient statistics of the true posterior (most often the first and second moments, producing a Gaussian approximation). Such a choice is a superior basis for an MCMC sampler. Secondly, we require a sensible choice of MCMC sampler so as to leverage a deterministic approximation like EP. Given an unnormalized target distribution $p^*(X)$, we can write

$$p^*(X) = \hat p(X)\,\frac{p^*(X)}{\hat p(X)} \equiv \hat p(X)\,\hat{\mathcal L}(X)$$

for any $\hat p$, which allows us to treat the true posterior $p^*$ as the product of an effective prior $\hat p$ and likelihood $\hat{\mathcal L} = p^*/\hat p$. We then have freedom to choose $\hat p$, which we will set to be the deterministic (Gaussian) posterior approximation from EP. Amongst all MCMC methods, Elliptical Slice Sampling (ESS, [Murray et al., 2010]) handles the above reformulation seamlessly. ESS has become an important and generic method for posterior inference with models that have a strong Gaussian prior. It inherits the attractive properties of slice sampling generally [Neal, 2003], and notably lacks tuning parameters that are often highly burdensome in other state-of-the-art methods like Hamiltonian Monte Carlo (HMC; Neal et al. [2011]). The critical observation is that if EP provides a quality posterior approximation $\hat p \approx p^*$, the likelihood term $\hat{\mathcal L}$ will be weak, which puts ESS in the regime where it is most efficient. What results is a new MCMC sampler that combines EP and ESS, is faster than state-of-the-art samplers like HMC, and is able to explore the parameter space efficiently even in the presence of strong dependency among variables.

The outline of the chapter is as follows. In Section 2.2, we propose Expectation Propagation based Elliptical Slice Sampling (EPESS), where we justify the use of EP as the "prior" for ESS. In Section 2.3, we investigate a method to improve the overall run time of ESS by sampling multiple points each iteration. It reduces the average number of shrinkage steps, giving it a computational advantage. We call it Recycled ESS and integrate it with EPESS to further increase its efficiency. We extend our method to Analytic Elliptical Slice Sampling in Section 2.4. As the name suggests, we can analytically find the region corresponding to a slice and sample uniformly from it. In addition to decorrelating samples, it offers the computational advantage of avoiding expensive shrinkage steps. It is applicable to only a few target distributions and we illustrate it, in the context of EPESS, for the linear Truncated Multivariate Gaussian (TMG). We offer an empirical evaluation of EPESS (Section 2.5), which shows an order of magnitude improvement over state-of-the-art MCMC methods for TMG and probit models.

2.2 Expectation propagation and elliptical slice sampling

In this section we introduce our combined EP and ESS sampling method. We begin with background of the two building blocks of this method, to place them in context of current literature. Further background on Monte Carlo, MCMC and slice sampling methods is provided in Appendix A.

2.2.1 Elliptical slice sampling

There are many problems where dependency between latent variables is induced through a Gaussian prior, for example in Gaussian Processes. Elliptical Slice Sampling (ESS, [Murray et al., 2010]) is specifically designed for efficiently sampling from such distributions and is considered state-of-the-art on these problems. ESS considers posteriors of the form

$$p^*(X) \propto \mathcal N(X; 0, \Sigma) \cdot \mathcal L(X) \qquad (2.1)$$

where $\mathcal L$ is a likelihood function and $\mathcal N(0, \Sigma)$ is a multivariate Gaussian.

ESS is a variant of slice sampling [Neal, 2003] that takes advantage of the Gaussian prior to improve mixing time and eliminate parameter tuning. At the beginning of each iteration of ESS two random variables are sampled. The first is the slice height $Y = y$, which is uniformly distributed over $[0, \mathcal L(x)]$, where $X = x$ is the current sample. The second variable $\nu$ is sampled from the prior $\mathcal N(0, \Sigma)$ and, together with the current sample $x$, defines an ellipse:

$$x'(\theta) = x \cos(\theta) + \nu \sin(\theta). \qquad (2.2)$$

The next point in the Markov chain will be sampled from this ESS ellipse. To do so, a one-dimensional angle bracket $[\theta_{\min}, \theta_{\max}]$ of length $2\pi$ is proposed containing the point $\theta = 0$ (corresponding to the current point $x$). The bracket is then shrunk toward $\theta = 0$ until a point is found within the bracket that satisfies $\mathcal L(x'(\theta)) \geq y$. This point is accepted as the next point in the Markov chain. The pseudocode for ESS is given in Algorithm 1.

Algorithm 1: Elliptical Slice Sampling
Input: Log-likelihood function log L, initial point x^(0) ∈ R^d, prior N(0, Σ), number of iterations N
Output: Samples from Markov chain (x^(1), x^(2), ..., x^(N))
1  for i = 1 to N do
2      Choose ellipse: ν ∼ N(0, Σ)
3      Log-likelihood threshold: u ∼ Uniform[0, 1]; log y ← log L(x^(i−1)) + log u
4      Define initial bracket: θ_max ∼ Uniform[0, 2π]; θ_min ← θ_max − 2π
5      do
6          Draw proposal: θ ∼ Uniform[θ_min, θ_max]; x′ ← x^(i−1) cos(θ) + ν sin(θ)
7          Shrink bracket:
8          if θ < 0 then θ_min ← θ
9          else θ_max ← θ
10     while log L(x′) < log y
11     Accept point: x^(i) ← x′
12 end
13 return (x^(1), x^(2), ..., x^(N))
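For concreteness, the following is a minimal NumPy sketch of a single ESS transition (one pass of the loop in Algorithm 1). The function name ess_step and the use of a Cholesky factor of Σ to sample ν are illustrative choices, not part of the original text.

import numpy as np

def ess_step(x, log_L, chol_Sigma, rng):
    """One elliptical slice sampling transition for p*(x) ∝ N(x; 0, Σ) L(x).

    x          -- current sample, shape (d,)
    log_L      -- callable returning the log-likelihood of a point
    chol_Sigma -- Cholesky factor of the prior covariance Σ
    rng        -- numpy.random.Generator
    """
    d = x.shape[0]
    nu = chol_Sigma @ rng.standard_normal(d)          # ν ∼ N(0, Σ)
    log_y = log_L(x) + np.log(rng.uniform())          # slice height in log space
    theta_max = rng.uniform(0.0, 2.0 * np.pi)         # initial bracket of length 2π
    theta_min = theta_max - 2.0 * np.pi               # containing θ = 0
    theta = rng.uniform(theta_min, theta_max)
    while True:
        x_prop = x * np.cos(theta) + nu * np.sin(theta)
        if log_L(x_prop) >= log_y:                    # proposal lies on the slice
            return x_prop
        if theta < 0.0:                               # otherwise shrink the bracket
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

In use, ess_step would be called repeatedly, feeding each accepted point back in as the next current sample x.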

ESS is known to work well when the prior aligns with the posterior and the likelihood is weak [Murray et al., 2010, Section 2.5]. The most extreme case is when the likelihood is a constant and $p^*(X) \propto \mathcal N(X; 0, \Sigma)$. In this case the first point proposed on the ESS ellipse is always accepted, and the Markov chain mixes fast. However, when this is not the case then ESS can perform poorly, as we demonstrate below.

Figure 2.1 illustrates the problem when the prior and the posterior do not align: here we have a $\mathcal N(0, I)$ prior with a Bernoulli likelihood $\mathcal L(x) = \mathbb 1(x \in A)$ for some rectangle $A$. The posterior is a truncated Gaussian within $A$. In this example we have placed $A$ away from the origin, with the result that most of the posterior density lies vertically on the left boundary of the rectangle. Accordingly, a good sampler should be able to make large vertical moves to effectively explore the posterior mass.
Figure 2.1: ESS ellipse shown in dashed red.

As the likelihood rectangle $A$ moves further right, the posterior moves away from the prior. As a result, most of the points proposed on an ESS ellipse will not lie in $A$, so more shrinkage steps will be necessary until a point is accepted, making ESS inefficient. Moving $A$ further to the right also makes the ESS ellipse more eccentric, which prevents vertical movement, resulting in further inefficiency.

The other pathology afflicting ESS is that of strong likelihoods. This happens when $\mathcal L(x)$ is extremely large in regions of non-negligible posterior density. Once the sampler is in such a region, only with low probability will it be able to accept points proposed outside the region, hence it will get stuck. This will occur, for instance, when the prior underestimates the variance of the posterior and $\mathcal L(x)$ becomes large in the tails. We refer the reader to an extended explanation of this effect in [Nishihara et al., 2014, Sec. 3]. Indeed, this motivates our choice of EP as a prior, since Variational Bayes and Laplace approximations are known to often underestimate posterior variance whereas EP does not [Minka, 2005]. This is the same reason why using EP as a proposal distribution in an importance sampler yields a lower-variance estimate than if Variational Bayes is used [Minka, 2005, App. E].

We address both these problems by choosing an EP prior for ESS. How to incorporate EP into ESS is explained in Section 2.2.3.

2.2.2 Expectation propagation

Expectation Propagation (EP) is a method for computing a Gaussian approximation $q$ to a given distribution $p^*$ by iteratively matching local moments and then updating the global approximation via a so-called 'tilted' distribution [Minka, 2001]. At termination the distribution $q$ minimizes a global objective that approximates the Kullback-Leibler divergence $\mathrm{KL}(p^* \,\|\, q)$ [Wainwright and Jordan, 2008]. The resulting Gaussian approximation is an inclusive estimate of $p^*$ that approximately matches its zeroth, first, and second moments.

Although EP has few theoretical guarantees [Dehaene and Barthelmé, 2015], it is known to be relatively accurate for many models including the truncated multivariate Gaussian [Cunningham et al., 2011], probit and logistic regression [Nickisch and Rasmussen, 2008], log-Gaussian Cox processes [Ko and Seeger, 2015], and more [Minka, 2001]. It is also known to have superior performance compared to the Laplace approximation and Variational Bayes in terms of approximating marginal distributions accurately [Kuss and Rasmussen, 2005, Cseke and Heskes, 2011, Deisenroth and Mohamed, 2012].
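As a toy illustration of the moment-matching idea only (not the full EP algorithm, which iterates over many factors and maintains site approximations), the snippet below fits a Gaussian to a single "tilted" distribution N(x; 0, 1) · 1(x > a) by matching its first two moments. The closed-form truncated-normal moments used here are standard results; the threshold a is an arbitrary choice.

from scipy.stats import norm

def match_moments_truncated_gaussian(a):
    """Return the (mean, variance) of the Gaussian matching N(x; 0, 1) * 1(x > a)."""
    lam = norm.pdf(a) / norm.sf(a)    # hazard function of the standard normal
    mean = lam                        # E[X | X > a] for X ~ N(0, 1)
    var = 1.0 + a * lam - lam ** 2    # Var[X | X > a]
    return mean, var

print(match_moments_truncated_gaussian(1.0))   # approximately (1.525, 0.199)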

2.2.3 Elliptical Slice Sampling with Expectation Propagation

As outlined in Section 2.1, we can incorporate a posterior approximation $\hat p$ as a proposal distribution for ESS. We do so by defining:

$$p^*(X) = \hat p(X)\,\frac{p^*(X)}{\hat p(X)} = \hat p(X)\,\hat{\mathcal L}(X) \qquad (2.3)$$

where $p^*$ is the posterior distribution of interest from (2.1), $\hat p$ is our new prior and $\hat{\mathcal L} = p^*/\hat p$ is our new likelihood. As explained in Section 2.2.1, for ESS to work well, $\hat p$ should have two desirable properties: (i) it should approximate the posterior $p^*$ (the most obvious candidates for $\hat p$ include the Laplace, Variational Bayes and EP approximations); (ii) it should ensure that the new likelihood $\hat{\mathcal L} = p^*/\hat p$ is weak, in the sense described in Section 2.2.1. Using either Laplace or Variational Bayes may result in large values of $\hat{\mathcal L}$ in the tails due to variance underestimation, which could cause the sampler to get stuck. The more inclusive nature of the EP estimate, on the other hand, makes it a sensible choice to obtain a Gaussian posterior approximation $\hat p$.

The formulation in (2.3) is intimately related to importance sampling, where $p^*$ would be the target distribution and $\hat p$ the proposal distribution. The properties that make EP a good prior for ESS also make EP a good proposal distribution for importance sampling [Minka, 2005, App. E]. Indeed, the EPESS algorithm may be thought of as an MCMC version of importance sampling.

To demonstrate the power of EPESS we return to the problematic example given in Figure 2.1. Using the EP approximation we can shift our "prior" $\hat p$ to align with the posterior density on the left side of the likelihood rectangle $A$. The ESS ellipses become short and vertical, allowing ESS to mix efficiently. This is illustrated in Figure 2.2.
∗ Using the EP approximationTo demonstrate we can the shift difference our “prior” in thep ˆ to sampling align with behavior the posterior density p on 2.2 EXPECTATION PROPAGATION the left side of thebetween likelihood EPESS rectangle andA ESS,. The Figure ESS ellipses 2.3 plots become 400 samples short and vertical, allowing from both EPESS and ESS. EPESS is clearly superior Expectation Propagation (EP) is a methodESS for findingto mix efficiently. a This is illustrated in Figure 2.2. To demonstrate the difference in the and manages to explore the entire distribution whereas Gaussian approximation q to a given distribution p by it- sampling⇤ behavior betweenESS moves EPESS consistently and ESS, less. Figure 2.3 plots 400 samples from both EPESS and eratively matching local moments and then updating the global approximation via a so-called ‘tilted’ESS. distribution EPESS is clearly superior as it manages to explore the entire distribution whereas ESS does [Minka, 2001]. At termination the distributionnot. q will op- timize a global objective that approximates the Kullback- Liebler divergence KL(p⇤ q) [Wainwright and Jordan, || x 2008]. The resulting Gaussian approximation is an inclu- sive estimate of p⇤ that approximately matches its zeroth, first, and second moments. ⌫ Although EP has few theoretical guarantees [Dehaene and Barthelme,´ 2015], it is known to be accurate for many models including truncated multivariate gaussian [Cunningham et al., 2011], probit and logistic regression Figure 2.2: The EP approximation is in teal and an EPESS elliptical slice is in dashed red. [Nickisch and Rasmussen, 2008], log-Gaussian Cox pro- Figure 2: The EP approximation is in teal and an EPESS cesses [Ko and Seeger, 2015], and more [Minka, 2001]. elliptical slice is in dashed red. EPESS ESS It is also known to have superior performance compared 1 1 to the Laplace approximation and Variational Bayes in 0.8 0.8 terms of approximating marginal distributions accurately 0.6 0.6 The idea of Equation (3) is not unique to this paper. [Kuss and Rasmussen, 2005, Cseke and Heskes, 2011, 0.4 0.4 Nishihara et al. [2014] use a similar construction where Deisenroth and Mohamed, 2012]. 0.2 0.2 the Gaussian approximation is learned from samples. Al-

y 0 y 0 though this has the advantage of not relying on EP to do 2.3 ELLIPTICAL SLICE SAMPLING WITH -0.2 -0.2 moment matching, it requires parallelism and expensive EXPECTATION PROPAGATION -0.4 -0.4 moment calculations. EPESS will be simpler and more -0.6 -0.6 As outlined in Section 1, we incorporate a posterior ap- efficient when an accurate EP approximation is available. proximation pˆ as a proposal distribution for ESS. We do Braun-0.8 and Bonfrer [2011] also-0.8 have a similar method -1 -1 so by defining: where they50 use the50.5 Laplace51 approximation,50 50.5 which51 as dis- x x cussed, is a poor choice. We remark that using Power EP p⇤(x) p⇤(x)=ˆp(x) =ˆp(x) ˆFigure(x) 2.3:(3) EPESSapproximations vs ESS: 400 samples is also of a EPESS viable and choice ESS for for a a prior, 2-d Gaussian a point (0,I) truncated pˆ(x) L that we will return to in Section 6. N in a rectangular box 50 x 51, 1 y 1 . EPESS explores the parameter space effectively { ≤ ≤ − ≤ ≤ } whereas ESS does not.

The idea of (2.3) is not unique to this paper. Nishihara et al. [2014] use a similar construc- tion where the Gaussian approximation is learned from samples. Although their method has the CHAPTER 2. ELLIPTICAL SLICE SAMPLING WITH EXPECTATION PROPAGATION 13 advantage of not relying on EP to do moment matching, it requires parallelism and expensive moment calculations. EPESS will be simpler and more efficient when an accurate EP approx- imation is available. Braun and Bonfrer [2011] also have a similar method where they use the Laplace approximation, which as discussed above, is a poor choice. We remark that using Power EP approximations is also a viable choice for a prior, a point that we will return to in Section 2.6.

2.3 Recycled elliptical slice sampling

In this section we show how to sample J > 1 points each ESS iteration without a significant increase in computational complexity. This idea is inspired by the work of Nishimura and Dunson [2015] on HMC. In that work an HMC algorithm is devised which “recycles” the intermediate points as valid samples from the target distribution. We borrow the phrase “recycling” from them and call our method Recycled Elliptical Slice Sampling. Recall from Section 2.2.1 that in every ESS iteration, we propose points along an ellipse within an angle bracket, which is iteratively shrunk, until a point is accepted. In Recycled ESS, we don’t stop after accepting the first point but continue to propose points starting from the last angle bracket used. This procedure is continued until J points are accepted. One of the J accepted points is then randomly selected to propagate the Markov chain.

As we shrink the angle bracket [θmin, θmax] towards θ = 0 (corresponding to the current point), the probability of the next proposal point being accepted tends to increase. Hence the number of shrinkage steps required to accept latter points is typically smaller than that for first accepted point. Since the number of likelihood function evaluations is proportional to the number of shrinkage steps, Recycled ESS is able to sample more points with only a small increase in computational complexity, leading to improved run times per sample. This approach is formalized in Algorithm 2. Note that recycled ESS is equivalent to standard ESS if J = 1. (i) It is implied in Algorithm 2 that we treat each sample Xj as an element in a large Markov (i) (i) (i) chain with state space (X1 , ..., XJ ). We prove in Theorem 2.3.2 that each element Xj has its stationary marginal distribution as p∗. In order to do so, we first show in Lemma 2.3.1 that the transition operator of accepting the jth point is reversible.

(i−1) (i) Lemma 2.3.1. Let Tj correspond to the transition operator from X Xˆ . Then Tj is 1 → j CHAPTER 2. ELLIPTICAL SLICE SAMPLING WITH EXPECTATION PROPAGATION 14

Algorithm 2: Recycled ESS (0) Input : Log-likelihood function (log ), initial point x Rd, prior (0, Σ), number of L 1 ∈ N iterations N, number of recycled points J (1) (1) (N) (N) Output: Samples from Markov Chain ((x1 , ..., xJ ), ..., (x1 , ..., xJ )) 1 for i = 1 to N do

2 Choose ellipse: ν (0, Σ) ∼ N 3 Log-likelihood threshold: u Uniform[0, 1] ∼ (i−1) log y log (x ) + log u ← L 1 4 Define initial bracket:

θmax Uniform[0, 2π] ∼

θmin θ 2π ← −

5 for j = 1 to J do

6 Draw initial proposal:

θ Uniform [θmin, θmax] ∼ x0 x cos(θ) + ν sin(θ) ←

7 while log (x0) < log y do L 8 Shrink bracket:

9 if θ < 0 then θmin θ ← 10 else θmax θ ← 11 Draw new proposal:

θ Uniform [θmin, θmax] ∼ x0 x cos(θ) + ν sin(θ) ← 12 end (i) 13 Accept point:x ˆ x0 j ← 14 end (i) (i) (i) (i) 15 (x , ..., x ) random permutation(ˆx , ..., xˆ ) 1 J ← 1 J 16 end (1) (1) (N) (N) 17 return ((x1 , ..., xJ ), ..., (x1 , ..., xJ )) CHAPTER 2. ELLIPTICAL SLICE SAMPLING WITH EXPECTATION PROPAGATION 15 invariant to p∗.

Proof. Our proof is similar to that of the original ESS algorithm [Murray et al., 2010, Sec. 2.3].

The approach is to show that Tj is reversible, i.e.

(i−1) (i) (i−1) (i) (i−1) (i) p∗(X = x ) p(X¯ =x ˆ X = x ) = p∗(X =x ˆ ) p(X¯ = x X =x ˆ ), 1 · j | 1 j · 1 | j

∗ from which it follows that Tj is invariant to p .

Let θj,k , k = 1, 2,...Kj, be the sequence of angles sampled during Tj. The distribution of { } (i−1) ∗ the current state X = x1 with respect to p (as defined in (2.1)) multiplied by the distribution (i) of random variables Y, ν, θj,k generated to transition to X¯ =x ˆ is { } j

∗ (i−1) (i−1) p (X = x ) p(Y, ν, θj,k X = x ) 1 · { }| 1 ∗ (i−1) (i−1) (i−1) = p (X = x ) p(Y X = x ) p(ν) p( θj,k X = x , Y, ν) 1 · | 1 · · { }| 1 (i−1) (i−1) (x ; 0, Σ) (ν; 0, Σ) p( θj,k X = x , Y, ν) ∝ N 1 ·N · { }| 1

where p(Y = y | X_1^{(i-1)} = x_1) = I[0 ≤ y ≤ L(x_1)] / L(x_1). The key to proving reversibility is showing that¹

p∗(X_1^{(i-1)} = x_1) · p(Y = y, ν = ν, {θ_{j,k}} = {θ_{j,k}} | X_1^{(i-1)} = x_1)
    = p∗(X̂_j^{(i)} = x̂_j) · p(Y = y, ν = ν̂, {θ_{j,k}} = {θ̂_{j,k}} | X̂_j^{(i)} = x̂_j),   (2.4)

where

ν̂ = ν cos(θ_{j,K_j}) − x_1^{(i-1)} sin(θ_{j,K_j}),

θ̂_{j,k} = θ_{j,k} − θ_{j,K_j}   if k < K_j,
θ̂_{j,k} = −θ_{j,K_j}            if k = K_j.

The values ν̂ and θ̂_{j,k} are constructed such that

x_1^{(i-1)} cos(θ_{j,k}) + ν sin(θ_{j,k}) = x̂_j^{(i)} cos(θ̂_{j,k}) + ν̂ sin(θ̂_{j,k})

¹ We have overloaded our notation with ν and {θ_{j,k}}. In the expression ν = ν the left ν refers to the random variable and the right ν to its value. Likewise for {θ_{j,k}}. The notation was chosen to be consistent with [Murray et al., 2010].

for all k < K_j. The points proposed in the reverse direction (from x̂_j^{(i)} to x_1^{(i-1)}) are thus the same as in the forward direction (from x_1^{(i-1)} to x̂_j^{(i)}), except for when k = K_j. To prove (2.4), we first show that:

p({θ_{j,k}} = {θ_{j,k}} | X_1^{(i-1)} = x_1, Y = y, ν = ν) = p({θ_{j,k}} = {θ̂_{j,k}} | X̂_j^{(i)} = x̂_j, Y = y, ν = ν̂).   (2.5)

The argument is as follows: the probability density for the first angle θ_{j,1} is always 1/2π. The intermediate angles were drawn with probability densities 1/(θ_{j,k}^{max} − θ_{j,k}^{min}), where (θ_{j,k}^{min}, θ_{j,k}^{max}) denotes the angle bracket for θ_{j,k}. Whenever the bracket was shrunk, it was done so that x̂_j^{(i)} remained selectable. Now let us consider the reverse transitions starting from x̂_j^{(i)}. The reverse transitions make the same intermediate proposals. Since angle brackets (θ̂_{j,k}^{min}, θ̂_{j,k}^{max}) of the same size are sampled, the probabilities of drawing angles in the forward and reverse transitions are the same. Additionally, we have that

N(x_1^{(i-1)}; 0, Σ) · N(ν; 0, Σ) = N(x̂_j^{(i)}; 0, Σ) · N(ν̂; 0, Σ),   (2.6)

since after taking logs and cancelling constants in (2.6) we have

x̂_j^{(i)⊤} Σ^{-1} x̂_j^{(i)} + ν̂^⊤ Σ^{-1} ν̂
    = (x_1^{(i-1)} cos(θ_{j,K_j}) + ν sin(θ_{j,K_j}))^⊤ Σ^{-1} (x_1^{(i-1)} cos(θ_{j,K_j}) + ν sin(θ_{j,K_j}))
      + (ν cos(θ_{j,K_j}) − x_1^{(i-1)} sin(θ_{j,K_j}))^⊤ Σ^{-1} (ν cos(θ_{j,K_j}) − x_1^{(i-1)} sin(θ_{j,K_j}))
    = x_1^{(i-1)⊤} Σ^{-1} x_1^{(i-1)} + ν^⊤ Σ^{-1} ν.

Equation (2.6) combined with the result in (2.5) proves (2.4). Integrating over y, ν and {θ_{j,k}} proves reversibility and shows that T_j is invariant to p∗.

Theorem 2.3.2 easily follows:

Theorem 2.3.2. Each element in the Recycled ESS Markov chain has marginal stationary distribution p∗.

Proof. The sequence of points {X_1^{(i)}} follows a Markov chain. At each step the transition operator is uniformly sampled from the set {T_j : j = 1, ..., J}, with each T_j being invariant to p∗ (Lemma 2.3.1). Therefore X_1^{(i)} converges in distribution to X∗, where X∗ ∼ p∗. Also, at any fixed iteration i, all points in {X_j^{(i)} : j = 1, ..., J} are identically distributed. This follows from the random permutations:

p(X_j^{(i)} | (X̂_1^{(i)}, ..., X̂_J^{(i)})) = Uniform(X̂_1^{(i)}, ..., X̂_J^{(i)}) = p(X_k^{(i)} | (X̂_1^{(i)}, ..., X̂_J^{(i)})).

Since X_1^{(i)} converges in distribution to X∗, it follows that X_j^{(i)} converges in distribution to X∗ for all j.

The downside of Recycled ESS is that the later accepted points (corresponding to j ≈ J) are sampled from a very small angle bracket and so are highly correlated. On the other hand, these points only require a small number of function evaluations. Overall, the effect of recycling is a small increase in the effective number of samples, with a small increase in computational complexity. Whether or not this is beneficial is investigated empirically in Section 2.5.
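To make Algorithm 2 concrete, the following Python sketch implements a single Recycled ESS iteration under the N(0, Σ) prior. It is a minimal illustration rather than the implementation used in the experiments; the function names, the NumPy random generator, and the Cholesky-factor argument are illustrative choices.

```python
import numpy as np

def recycled_ess_iteration(x, log_likelihood, chol_sigma, J, rng):
    """One iteration of Recycled ESS (Algorithm 2).

    x              : current state x_1^{(i-1)}, shape (d,)
    log_likelihood : callable returning log L(x)
    chol_sigma     : Cholesky factor of the prior covariance Sigma
    J              : number of recycled points per ellipse
    rng            : numpy random Generator
    Returns the J accepted points in random order; the first one
    propagates the Markov chain.
    """
    d = x.shape[0]
    nu = chol_sigma @ rng.standard_normal(d)           # choose ellipse
    log_y = log_likelihood(x) + np.log(rng.uniform())  # slice threshold
    theta_max = rng.uniform(0.0, 2.0 * np.pi)          # initial bracket
    theta_min = theta_max - 2.0 * np.pi

    accepted = []
    for _ in range(J):
        theta = rng.uniform(theta_min, theta_max)
        proposal = x * np.cos(theta) + nu * np.sin(theta)
        while log_likelihood(proposal) < log_y:        # shrink bracket
            if theta < 0.0:
                theta_min = theta
            else:
                theta_max = theta
            theta = rng.uniform(theta_min, theta_max)
            proposal = x * np.cos(theta) + nu * np.sin(theta)
        accepted.append(proposal)                      # accept x_hat_j^{(i)}

    rng.shuffle(accepted)                              # random permutation
    return accepted
```

Running the function repeatedly, with the next iteration started from the first element of the returned (permuted) list, produces the Markov chain analyzed in Theorem 2.3.2.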

2.4 Analytic elliptical slice sampling

Consider the ellipse

E(x, ν) = {x′ : x′ = x cos(θ) + ν sin(θ) for some θ ∈ [0, 2π)}

as in an ESS iteration as defined by (2.2). Let S(y; E) be the slice corresponding to the acceptable points in E for a given slice height y:

S(y; E) = {x′ ∈ E : L(x′) > y}.

If we can analytically characterize S(y; E) then we only need to sample a point uniformly from the slice to propagate the Markov chain [Neal, 2003]. This has three advantages: (i) we eliminate expensive slice shrinkage steps, which reduces the computational cost of our sampler; (ii) in standard slice sampling algorithms, shrinkage steps bias the next sample to be close to the current sample, thereby introducing correlations; since we uniformly sample over S(y; E), the resulting samples are less correlated as we are not biased towards the current point; (iii) we can easily incorporate the recycling idea here, resulting in an extremely efficient algorithm, which we refer to as Analytic Elliptical Slice Sampling.

As in Recycled ESS, in Analytic ESS we sample J > 1 points from each ellipse E. To do so we first sample J different Y values, y_1, y_2, ..., y_J, which are evenly spaced in a Quasi Monte-Carlo fashion.

Algorithm 3: Analytic Slice Sampling
Input: Likelihood L̂, prior p̂, initial point x_1^{(0)}, subroutine Sample_Ellipse to sample an ellipse, subroutine Characterize_Slice to analytically characterize S(·; E), number of iterations N, number of slices per iteration J
Output: Samples from Markov chain ((x_1^{(1)}, ..., x_J^{(1)}), ..., (x_1^{(N)}, ..., x_J^{(N)}))
1  for i = 1 to N do
2      E ← Sample_Ellipse(x_1^{(i-1)}, p̂)
3      S(·; E) ← Characterize_Slice(E)
4      u ∼ Uniform[0, 1]
5      for j = 1 to J do
6          Define slice height: y_j ← (j − u)/J · L̂(x_1^{(i-1)})
7          Sample accepted point: x̂_j^{(i)} ∼ Uniform{x : x ∈ S(y_j; E)}
8      end
9      (x_1^{(i)}, ..., x_J^{(i)}) ← random_permutation(x̂_1^{(i)}, ..., x̂_J^{(i)})
10 end
11 return ((x_1^{(1)}, ..., x_J^{(1)}), ..., (x_1^{(N)}, ..., x_J^{(N)}))

Corresponding to each y_j value, we analytically solve for S(y_j; E) (which has only a small amortized computational cost). One point is then uniformly sampled from each slice S(y_j; E). The pseudocode for Analytic ESS is given in Algorithm 3, and in Theorem 2.4.1 we prove its validity.

Theorem 2.4.1. Each element in the Analytic ESS Markov chain has marginal stationary distribution p∗.

Proof. The proof follows exactly the same argument as in Theorem 2.3.2.

Unfortunately, solving for S(y; E) in closed form is not possible in general, although it can be done for the Truncated Multivariate Gaussian (TMG) as shown below.

2.4.1 Analytic EPESS for TMG

The (linear) TMG distribution is defined as:

p∗(X) ∝ N(X; 0, I) ∏_{j=1}^m 1(L_j^⊤ X ≥ c_j).

Using the EP approximation N(X; μ, Σ) and (2.3), we can rewrite the density p∗ as:

p∗(X) ∝ N(X; 0, I) ∏_{j=1}^m 1(L_j^⊤ X ≥ c_j)
      ∝ N(X; μ, Σ) · [ N(X; 0, I) / N(X; μ, Σ) ] ∏_{j=1}^m 1(L_j^⊤ X ≥ c_j)
      = N(Z; 0, Σ) · [ N(Z; −μ, I) / N(Z; 0, Σ) ] ∏_{j=1}^m 1(L_j^⊤ (Z + μ) ≥ c_j)
      ≡ N(Z; 0, Σ) · L̂(Z)
      ≡ p̃(Z),

where Z = X − μ is a transformation with identity Jacobian. We are able to apply Analytic ESS to p̃(Z) = N(Z; 0, Σ) · L̂(Z) and can then recover samples for X by reversing the transformation via X = Z + μ. A TMG slice is analytically characterized as:

S(y; E) = {θ ∈ [0, 2π) : L̂(z′(θ)) > y} = ⋂_{j=1}^m Θ_j ∩ Θ_y,

where z′(θ) = z cos(θ) + ν sin(θ) and

Θ_j = {θ ∈ [0, 2π) : L_j^⊤ z′(θ) + L_j^⊤ μ ≥ c_j},
Θ_y = {θ ∈ [0, 2π) : N(z′(θ); −μ, I) / N(z′(θ); 0, Σ) > y}.

The region Θj is the part of the ellipse that lies in the halfspace defined by Lj and cj. Since it is defined by a linear inequality of sin(θ) and cos(θ), it is easily characterized using basic trigonometry.

The resulting region may be rewritten as Θ_j = [0, 2π) − (l_j, u_j) for some 0 < l_j ≤ u_j < 2π. The intersection of all m regions can be computed as ⋂_{j=1}^m Θ_j = [0, 2π) − ⋃_{j: l_j ≠ u_j} (l_j, u_j), which takes O(m log m) time to compute.

The inequality defining Θ_y can be simplified by taking logarithms on both sides, which reduces it to the form

a_0 + a_1 cos θ + a_2 sin θ + a_3 cos θ sin θ + a_4 cos²θ > 0,   (2.7)

where a_i = a_i(μ, Σ, y, z, ν). The roots of this inequality can be obtained by solving a quartic equation. Finally, S(y; E) = ⋂_{j=1}^m Θ_j ∩ Θ_y can be computed by taking the intersection of the intervals defined by (2.7) and those defined by ⋂_{j=1}^m Θ_j = [0, 2π) − ⋃_{j: l_j ≠ u_j} (l_j, u_j).

The most expensive operations in Analytic ESS are generating E and computing ⋂_{j=1}^m Θ_j, after which sampling from S(y; E) is relatively cheap. To see this, let d denote the dimension of X and m the number of linear truncations. To sample the ellipse E we must draw ν from the Gaussian prior, which costs O(d²). Computing ⋂_{j=1}^m Θ_j involves taking m inner products of Z = z and ν with the linear truncations L_j, which can be calculated in O(md), and taking the intersection of the intervals Θ_j at a cost of O(m log m). The total computational overhead is thus O(d² + md + m log m). Once this is done, sampling from S(y; E) is cheap. It involves sampling the slice height y, solving the quartic equation from (2.7) for Θ_y, taking the intersection of Θ_y and ⋂_{j=1}^m Θ_j, and sampling uniformly from the resultant domain, all of which can be done in O(d + m) (here we have added in the expense of storing each sample at a cost of O(d)).

Since the upfront cost of generating E and computing ⋂_{j=1}^m Θ_j is O(d² + md + m log m), we can sample O(d) points per ellipse E without significantly increasing the total computational complexity. This leads to a high effective sample size relative to the computational cost. The Exact-HMC algorithm for the TMG [Pakman and Paninski, 2014] has an intimate relationship with Analytic ESS and inspired our analytic framework. We explain this connection in Section 2.5.2.
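As an illustration of the trigonometric step described above, the sketch below computes the arc Θ_j for a single linear truncation L_j^⊤ x ≥ c_j along the ellipse; the intersection of the m arcs and the quartic-root computation for Θ_y are omitted. This is a minimal sketch with illustrative function and variable names, not the implementation used in the experiments.

```python
import numpy as np

def truncation_arc(L_j, c_j, z, nu, mu):
    """Angular region Theta_j = {theta : L_j^T (z cos(theta) + nu sin(theta) + mu) >= c_j},
    returned as (centre, half_width) of a single arc on [0, 2*pi),
    or None if the whole ellipse is infeasible."""
    a = L_j @ z
    b = L_j @ nu
    g = L_j @ mu - c_j                      # constant offset of the constraint
    r = np.hypot(a, b)
    if r == 0.0:                            # constraint does not vary with theta
        return (0.0, np.pi) if g >= 0 else None
    # a cos(theta) + b sin(theta) = r cos(theta - phi)
    phi = np.arctan2(b, a)
    ratio = -g / r
    if ratio <= -1.0:                       # satisfied for every theta
        return (phi, np.pi)
    if ratio >= 1.0:                        # satisfied for no theta
        return None
    alpha = np.arccos(ratio)                # half-width of the feasible arc
    return (phi, alpha)
```

Each returned arc is centred at φ = atan2(L_j^⊤ ν, L_j^⊤ z); intersecting the m arcs reproduces the O(m log m) interval computation for ⋂_{j=1}^m Θ_j described above.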

2.5 Experimental results

In this section we compare the empirical performance of the algorithms introduced in Sections 2.2.3–2.4 to other state-of-the-art MCMC methods. Comparisons are shown for the probit regression and TMG problems, both of which are often encountered in machine learning contexts, as well as the log-Gaussian Cox process. We quantify the mixing of MCMC samplers by comparing their effective number of samples. Effective sample size is estimated using the method described in [Gelman et al., 2014], which is implemented in the MCMC Diagnostics Toolbox for Matlab [Aki Vehtari].

Table 2.1: Datasets for probit experiments.

Dataset          Dimension    Examples
Breast Cancer    31           569
Ionosphere       31           351
Sonar            61           97
Musk             165          419

We compare the results in terms of effective sample size divided by the number of density function evaluations of p∗, which is the dominant computational expense of running the samplers.

2.5.1 Probit regression

Probit regression is one of the most common problems in machine learning and is often used as a benchmark for comparing Bayesian computation methods. A nice review of state-of-the-art algorithms for probit can be found in [Chopin and Ridgway, 2015]. For our experiments, we choose 4 datasets of moderate size from the UCI repository, listed in Table 2.1 with their dimension and number of datapoints. These are the Breast Cancer [Wolberg et al., 1995], Ionosphere [Sigillito, 1989], Sonar [Son] and Musk [AI Group at Arris Pharmaceutical Corporation, 1994] datasets. As is standard, each dataset is preprocessed to have zero mean and unit variance for each regressor, a unit intercept term has been included, and the prior on each latent variable is N(0, 10).

We compare EPESS and Recycled EPESS (denoted by EPESS(J), where J is the number of recycled points per slice) against Metropolis-Hastings with an EP proposal (EPMH) and HMC using the No-U-Turn sampler as implemented in Stan [Carpenter et al., 2015]. EPMH is considered state-of-the-art for probit [Chopin and Ridgway, 2015]. The chains were initialized at the EP mean for all EPESS methods, and the Stan implementation decides on its own initialization. We use the R package of Ridgway [2016] to find the EP approximation. Its CPU time is negligible compared to the time to run the samplers.

We run 100 chains with 20,000 samples per chain. For EPMH we ran it until 20,000 unique samples (i.e., accepted proposed points) were collected, to make it comparable with the other methods. The results are shown in the top three plots of Figure 2.4. EPESS outperforms HMC and EPMH by about a factor of 5 for effective sample size relative to the number of function evaluations. As compared to EPMH, EPESS gives a slightly smaller effective number of samples but takes far fewer function evaluations. The number of function evaluations for Recycled EPESS is smaller than that of EPESS; however, the effective number of samples is also proportionately smaller, as the samples are highly correlated (this is expected: see Section 2.3). Overall, Recycled EPESS does not improve the effective sample size relative to the number of function evaluations above EPESS.
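For completeness, the probit log-likelihood that plays the role of log L in Algorithm 2 for these experiments can be written as follows. The sketch assumes labels y_n ∈ {−1, +1}, a design matrix X with the unit intercept column already appended, and uses SciPy's normal CDF; it is illustrative only, and the N(0, 10) prior on the regression weights is handled by the ESS/EPESS prior rather than by this function.

```python
import numpy as np
from scipy.stats import norm

def probit_log_likelihood(w, X, y):
    """log L(w) = sum_n log Phi(y_n * x_n^T w) for labels y_n in {-1, +1}.
    The Gaussian prior on w is kept separate, as required by ESS/EPESS."""
    return norm.logcdf(y * (X @ w)).sum()
```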

2.5.2 Truncated multivariate Gaussian

The Truncated Multivariate Gaussian (TMG) is an important distribution which commonly arises in diverse models such as probit/Tobit models, neural models [Pillow et al., 2003], the Bayesian bridge model in finance [Polson et al., 2014], the TrueSkill model for competitions [Herbrich et al., 2007], and many others. There has been some recent work, including [Lan and Shahbaba, 2015], focusing on sampling from the TMG, but the Exact-HMC algorithm [Pakman and Paninski, 2014] is considered to be state-of-the-art (see [Altmann et al., 2014] for a nice review). We treat it as the benchmark for comparisons in our experiments. The equations of motion for Exact-HMC for the TMG are the same as those of standard ESS,

x′(t) = x cos(t) + ν sin(t),

and so it suffers from the same problems as described in Section 2.2.1 and illustrated in Figure 2.1. The elliptical path enables exact calculation of where the HMC particle hits the truncations, hence the term Exact-HMC. These are the same calculations used to find the regions Θ_j in Analytic ESS, although, being an HMC method, it does not have a slice height y with the corresponding region Θ_y. It also cannot incorporate an EP prior, as this would destroy the elliptical path and render the calculations intractable. A tuning parameter T is required, and for all our experiments we have fixed T to be π/2, as recommended by Pakman and Paninski [2014]. We only show comparative results for Analytic EPESS as it is faster than EPESS since it avoids slice shrinkages. We obtain an EP approximation for the TMG using the method described in [Cunningham et al., 2011], which is fast and scales well to high dimensions. It runs in negligible CPU time compared to the running time of the different sampling algorithms.

[Figure 2.4 panels: the top three bar charts show effective samples per function evaluation, effective sample size, and number of function evaluations for EPESS(1), EPESS(2), EPESS(5), EPESS(10), EPMH and HMC on the Breast Cancer, Ionosphere, Sonar and Musk datasets; the bottom two plots show effective samples per function evaluation for Ana EPESS(5), Ana EPESS(1) and Exact HMC as a function of the shift and of the dimension.]

Figure 2.4: Plots of empirical results. In the top three plots all values are normalized so that EPESS(1) has value 1. In the bottom two plots all values are normalized so that Exact-HMC has value 1. Naming convention: EPESS(J) denotes Recycled EPESS with J points sampled per slice; EPESS(1) denotes EPESS without recycling; Ana EPESS(J) denotes Analytic EPESS with J threshold levels per iteration. Error bars in plot 3 for all algorithms are effectively zero.

We run 100 chains with 20,000 samples per chain. The fourth plot in Figure 2.4 shows the results for a 2-d standard Gaussian where the truncated region is a rectangular box: {s ≤ x ≤ s + 1, −1 ≤ y ≤ 1}. As we increase s and shift the box to the right, we see Analytic EPESS outperforming Exact-HMC by orders of magnitude. This trend carries over to higher dimensions, as shown in the fifth plot in Figure 2.4. Results for d = 500 and d = 1000 have been omitted as Exact-HMC with T = π/2 takes prohibitively long to run.
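For reference, the shifted-box truncation region used in this experiment can be expressed in the L_j^⊤ x ≥ c_j form required by Analytic EPESS and Exact-HMC; the helper below is an illustrative sketch of that conversion.

```python
import numpy as np

def shifted_box_truncations(s):
    """Linear truncations L_j^T x >= c_j for the 2-d box
    {s <= x_1 <= s + 1, -1 <= x_2 <= 1} used in the shift experiment."""
    L = np.array([[ 1.0,  0.0],   # x_1 >= s
                  [-1.0,  0.0],   # x_1 <= s + 1
                  [ 0.0,  1.0],   # x_2 >= -1
                  [ 0.0, -1.0]])  # x_2 <= 1
    c = np.array([s, -(s + 1.0), -1.0, -1.0])
    return L, c
```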

2.5.3 Log-Gaussian Cox process

We conducted experiments on a Log-Gaussian Cox Process (LGCP) applied to the coal mining disaster dataset, as set up in the original ESS paper [Murray et al., 2010]. Although a convergent EP is available for the LGCP, it is not accurate, with the EP mean substantially deviating from the true mean [Ko and Seeger, 2015]. Our experiments showed that EPESS fared no better than ESS on this problem, with the effective number of samples being about the same. This demonstrates that EPESS will only perform well when EP is accurate.

2.6 Discussion and conclusion

In this work we have shown how the ideas of ESS, EP and recycling can be combined to yield highly efficient MCMC methods. For both probit regression and the TMG, EPESS outperforms state-of-the-art samplers by an order of magnitude; in the case of the TMG, this can be multiple orders of magnitude. We investigated two different types of recycling: sampling multiple points per slice (Recycled ESS), and sampling multiple points at different slice heights from the same ellipse (Analytic ESS). The benefit of Recycled ESS is questionable, as it seems not to improve performance in probit due to having highly correlated samples. It also introduces a tuning parameter which makes the algorithm more difficult to implement. Analytic EPESS for the TMG does not have the above-mentioned issues of Recycled ESS. In this case recycling is of clear benefit, as can be seen in the experimental results of Section 2.5.2. It is here where EPESS outperforms the state-of-the-art by the largest margin. The example of the Log-Gaussian Cox process shows that EPESS will only offer an advantage over ESS when EP is accurate. This restricts the applicability of EPESS as a general method.

Improving the accuracy of EP is a subject of active research, and any developments made will be immediately inherited by EPESS. There are multiple directions for extending the ideas behind EPESS. Instead of choosing the prior that minimizes the α = 1 divergence, we could choose a prior corresponding to α > 1 by replacing EP with Power-EP [Minka, 2004]. This might make the likelihood even weaker in EPESS and further improve performance. Another direction is to adapt the idea of using a deterministic method as a "prior" for MCMC to discrete distributions. This direction is pursued in the next chapter.

Chapter 3

Annular Augmentation Sampling

3.1 Introduction

Binary probabilistic models are a fundamental and widely used framework for encoding dependence between discrete variables, with applications from physics [Wang and Landau, 2001] and computer vision [Nowozin and Lampert, 2011] to natural language processing [Johnson et al., 2007]. The exponential nature of these distributions (a d-dimensional binary model has a sample space of size 2^d) renders intractable key operations like exact marginalization and exact inference. Indeed these operations are formally hard: calculating the partition function is in general #P-complete [Chandrasekaran et al., 2008]. This intractability has prompted extensive research into approximation techniques, with deterministic approximation methods such as belief propagation [Murphy et al., 1999, Wainwright and Jordan, 2008] and its variants [Murphy, 2012] enjoying substantial impact. Weighted model counting is another promising approach that has received increasing attention [Chavira and Darwiche, 2008, Chakraborty et al., 2014]. However, Markov Chain Monte Carlo (MCMC) methods have been relatively unexplored, and in many settings basic Metropolis Hastings (MH) is still the default choice of sampler for binary probabilistic models [Zhang et al., 2012].

This scarcity of MCMC methods contrasts with the literature on continuous probabilistic models, where MCMC methods like Hamiltonian Monte Carlo (HMC) [Neal et al., 2011] and slice sampling [Neal, 2003] have enjoyed substantial success. Whether explicitly or implicitly stated, the key idea behind many of these state-of-the-art continuous samplers is to utilize a two-operator construction: at each iteration of the algorithm, first a subset of the full sample space is chosen, and second a new point is sampled from that subspace. Consider HMC, which samples a point from the Hamiltonian flow (the subset) induced by a particular sample of the momentum variable. This two-operator construction is enabled by an auxiliary variable augmentation (e.g. the momentum variable in HMC) that separates each MCMC operator. Recently this two-operator concept for HMC has been modified to sample binary distributions [Zhang et al., 2012, Pakman and Paninski, 2013], and its improved performance over MH has been demonstrated, giving promise to the notion of auxiliary augmentations for binary samplers. Another continuous example of this two-operator formulation, which inspires our present work, is Elliptical Slice Sampling (ESS) [Murray et al., 2010]: auxiliary variables are introduced into a latent Gaussian model such that an elliptical contour (the subset) is defined by the first sampling operator, and then the second operator draws the next sample point from that ellipse.

Here we leverage this central idea of auxiliary augmentation to create a two-operator MCMC sampler for binary distributions. At each iteration, we first choose a subset of 2d points (far fewer than the 2^d points of the sample space) defined over an annulus in the auxiliary variable space (or alternatively, a great circle around the hypercube corresponding to the sample space), and in the second operator we sample over that annulus. Critically, this augmentation strategy leads to an overrelaxed sampler, allowing long-range exploration of the sample space (flipping multiple dimensions of the binary vector simultaneously), unlike more basic coordinate-by-coordinate MH strategies, which require geometrically many iterations to flip many dimensions of the binary vector.
We also extend this auxiliary formulation to exploit available deterministic approximations to improve the quality of the sampler. This mirrors our approach in Chapter 2 of using the deterministic EP approximation to guide the ESS algorithm. The main difference between EPESS and the work in this chapter is that EPESS is for continuous distributions and used two well-established techniques: EP and ESS. For the binary distributions considered in this chapter, there is a well-established deterministic approximation (loopy belief propagation), but no corresponding MCMC method. Hence a new MCMC method must be invented to harness the information extracted by the deterministic approximation. We evaluate our sampler on 2D Ising models and a real-world problem of modeling heart disease risk factors, where it significantly outperforms MH and HMC techniques for binary distributions.

The outline of the chapter is as follows. In Section 3.2.1 we propose a novel annular augmentation that maps a continuous path in the auxiliary space to the discrete sample space of binary distributions. The augmentation is then extended to incorporate deterministic posterior approximations such as mean-field belief propagation. A Rao-Blackwellized Gibbs sampler that exploits this annular augmentation is developed in Section 3.2.3. It is both free of tuning parameters and straightforward to implement with generic code. Experimental results demonstrating the performance improvements achieved by annular augmentation sampling over other state-of-the-art methods are given in Section 3.3. We conclude by discussing some extensions to annular augmentation, and future directions for the class of augmented binary MCMC techniques, in Section 3.4.

3.2 Annular augmentation sampling

We are interested in sampling from a probability distribution p(S) defined over d-dimensional binary vectors S ∈ {−1, +1}^d. The density p(S) is given in terms of a function f(S) as

p(S) = f(S) / Z,

where Z is the normalizing constant. It is perhaps most straightforward to set up an augmentation by partitioning p into the product of some tractable distribution p̂ and its remainder L:

p(S) ∝ L(S) p̂(S),

where L = f/p̂ can be thought of as a likelihood with respect to the prior p̂. This is the same step as (2.3) in the EPESS chapter. Like in EPESS, we would like p̂ to capture the structure of p (else samples in the resulting MCMC algorithm will be wasted, similar to a bad importance sampler). As in other works using augmentation strategies [Pakman and Paninski, 2013, Murray et al., 2010], we will rewrite S as a deterministic function S = h(θ, T) of auxiliary random variables (θ, T) (to be defined). In our setup, h(θ, T) and the prior distribution of the auxiliary variables p̄(θ, T) will be linked to the prior p̂(S) via the relation¹

p̂(S = s) = ∫ I[h(θ, t) = s] p̄(θ = θ, T = t) dθ dt.   (3.1)

¹ Like in Lemma 2.3.1 of Chapter 2 we have overloaded the notation on θ. In the expression θ = θ the left θ refers to the random variable and the right θ to its value.

It follows that

p(S = s) ∝ L(s) p̂(S = s) = ∫ I[h(θ, t) = s] L(h(θ, t)) p̄(θ = θ, T = t) dθ dt ∝ p(h(θ, T) = s),

where the probability p(h(θ, T) = s) is taken over the posterior distribution

p(θ, T) ∝ L(h(θ, T)) p̄(θ, T).   (3.2)

Thus we can sample p(S) by sampling from p(θ, T ) and then setting S = h(θ, T ).

3.2.1 Annular Augmentations

Let us first consider the choice of a uniform prior: p̂(S) = ∏_{i=1}^d p̂(S_i) with p̂(S_i = +1) = p̂(S_i = −1) = 1/2. We can now replace S with auxiliary variables θ ∈ [0, 2π) and T ∈ [0, 2π)^d as

θ ∼ unif[0, 2π),
T_i ∼ unif[0, 2π),
S_i = h(θ, T)_i = sgn[cos(T_i − θ)].   (3.3)

This choice of prior p̄ and the map h maintains the distribution p̂(S) in the sense of (3.1), but induces conditional dependence given the auxiliary variables. Critical to this work is the observation that, conditioned on a value of T, the sample S = h(θ, T) can take on 2d possible values (not 2^d); which value S takes is determined by θ. Thus T defines a subset of the sample space over which we can sample S using θ.

Geometrically, this fact can be seen in two ways. First, Figure 3.1a shows each T_i as a red, green, or blue point, where the interval T_i ± π/2 is shown as a similarly colored bar; this bar defines the threshold for flipping the ith coordinate of S (i.e., the threshold of the sgn function in (3.3)). As θ traverses 0 → 2π, 2d bit flips alter S_0 to −S_0 and back. Alternatively, a second view of this geometric structure is to consider a great circle path around the d-dimensional hypercube, as shown in Figure 3.1b. Owing to this fundamental ring structure, we call this auxiliary variable scheme annular augmentation. We use this augmentation to create an MCMC sampler in Section 3.2.2.

Figure 3.1: Diagram of the annular augmentation. (a) The auxiliary annulus. Each threshold T_i defines a semicircle where S_i = +1, which is indicated by a darker shade. The value of S changes as θ moves around the annulus. (b) The implied path of the annulus on the binary hypercube. Hypercube faces are colored to match the annulus.

Before introducing the sampler, we introduce an important extension. In some settings we may have access to a mean-field posterior approximation p̂ (e.g., from loopy belief propagation), which should in general serve us better than the uniform p̂ just introduced. An accurate posterior approximation should "flatten" the likelihood L = f/p̂ in the spirit of [Wang and Landau, 2001], thereby improving the efficiency of our sampler. This is the same idea as in Chapter 2, where EP was used to improve the performance of ESS. Our annular augmentation generalizes nicely to this setting, by altering (3.3) to:

S_i = sgn[cos(T_i − θ) − cos(π p̂(S_i = 1))].   (3.4)

Figure 3.2a demonstrates that the proportion of the annulus where S_i = 1 becomes

[(T_i + π p̂(S_i = 1)) − (T_i − π p̂(S_i = 1))] / (2π) = p̂(S_i = 1),

resulting in the implied prior measure on S being precisely the mean-field approximation p̂. Indeed, when p̂(S_i = 1) = 1/2, (3.4) reverts back to (3.3), so the augmentations are consistent. Geometrically, this change to (3.3) induces stretched thresholds around the annulus, as shown in Figure 3.2b. We use the stretched annulus to good effect in Section 3.3.

Figure 3.2: Augmentation with mean-field prior p̂. (a) Equation (3.4) in graphical form; in blue, the implied interval where S_1 = 1 along the annulus. We denote p̂_1 = p̂(S_1 = 1). (b) Stretched annular augmentation; cf. Figure 3.1a.

3.2.2 Generic Sampler

The posterior over the auxiliary variables (θ, T) from (3.2) is

p(θ, T) ∝ L(h(θ, T)),   (3.5)

where the uniform prior p̄ on (θ, T) from (3.3) has been absorbed into the constant of proportionality. As outlined in Section 3.1, to sample from p(θ, T) we will define two MCMC operators. In Operator 1, we use T to define a subspace, and in Operator 2 we sample over θ. Let s denote the current value of S = h(θ, T) in the Markov chain.

Operator 1: Sample T subject to the constraint that s = h(θ, T) is fixed. The simplest such operator uniformly samples T_i from the domain where s_i is fixed:

T_i ∼ unif(θ + (1 − s_i)π/2 − π p̂(S_i = s_i), θ + (1 − s_i)π/2 + π p̂(S_i = s_i)).

Operator 2: Sample θ using an MCMC transition operator (e.g. Gibbs sampling).

First note that, in Operator 1, the density of the proposed point is equal to that of the current point since L(h(θ, T)) is unchanged. Thus the sample from Operator 1 is accepted with probability 1 and is a valid MCMC step (which follows from evaluating the standard MH acceptance probability). Second, Operator 2 is also a valid MCMC transition by definition. Together, this two-operator

3.2.3 Operator 2: Rao-Blackwellized Gibbs

In the previous section we proposed a generic sampler for the annular augmentation; it now remains to specify Operator 2 to sample θ conditioned on T = t. Though a number of choices are available, a particularly effective choice is Rao-Blackwellized Gibbs sampling, which we will now explain. Gibbs sampling in the annular augmentation is made possible by the fact that s = h(θ, t) is piecewise constant over the annulus (see braces in Figure 3.1a). In each of the intervals between threshold edges, constant s = h(θ, t) implies that L(h(θ, t)) is also constant. In Gibbs sampling the probability of sampling a point within an interval is proportional to the integral of the density over that interval. Here it is (proportional to) the product of the interval length and the value of L on the interval. Therefore we can Gibbs sample θ by first sampling an interval and then uniformly sampling a point within it. We further improve estimates from Gibbs sampling using Rao-Blackwellization (RB) [Douc et al., 2011] (a brief background on Rao-Blackwellization is provided in Appendix B.1). To calculate an expectation of some function g(S), the RB estimator is:

Ê[g] = (1/N) Σ_{n=1}^{N} Σ_{k=1}^{2d} r_k^n · g(s_k^n),

where r_k^n is the normalized density of the kth interval between threshold edges in the nth iteration and s_k^n is its corresponding value. Annular augmentation permits an efficient implementation of RB Gibbs sampling, which we show in pseudocode in Algorithm 4. As an implementation note, the rotational invariance of T and θ allows us to set θ = 0 at the beginning of each iteration without loss of generality, thereby eliminating θ from Algorithm 4. Hereafter we will refer to Gibbs sampling on the annular augmentation as Annular Augmentation Gibbs sampling (AAG) and its RB version as Rao-Blackwellized Annular Augmentation Gibbs sampling (AAG + RB).

Even though the AAG sampler may seem to involve more work than a traditional method such as MH, both have similar computational complexity per sample. To see this, note that each MH iteration requires evaluating f with a change in sign in one coordinate and a uniform random variable for the acceptance-rejection step. In AAG we require d uniform random variables to sample T, must sort the threshold edges, and must perform 2d evaluations of f with a change in sign in one coordinate. Although this is more expensive, using RB we get 2d samples, hence the cost per sample is similar to that of MH. For uniform priors it is possible to avoid the cost of sorting the threshold edges by constraining T to lie on a lattice, i.e. T_i ∈ {kπ/d}_{k=1}^{2d}.

Algorithm 4: Rao-Blackwellized Annular Augmentation Gibbs Sampling (AAG + RB)
Input: Unnormalized density L, prior p̂, number of iterations N, function g
Output: Monte Carlo approximation Ê[g]
1  Initialize: s ∈ {−1, +1}^d; Ê[g] ← 0
2  for n = 1, ..., N do
       /* Operator 1 */
3      for i = 1, ..., d do
4          t_i ← (s_i − 1)π/2 + π p̂(S_i = s_i) · unif(−1, +1)        // threshold value
5          e_i ← (t_i − π p̂(S_i = +1), t_i + π p̂(S_i = +1))          // threshold edges
6      end
       /* Operator 2 */
7      q ← sort_edge_indices(e_1, e_2, ..., e_d)
8      ℓ ← interval_lengths(q; e_1, e_2, ..., e_d)
9      Initialize r^n ∈ R^{2d}                                        // interval densities
10     for k = 1, ..., 2d do                                          // move around annulus
11         s_{q(k)} ← −s_{q(k)}
12         s_k^n ← s
13         r_k^n ← ℓ(k) · L(s_k^n)
14     end
15     r^n ← r^n / ||r^n||_1                                          // normalize probabilities
16     i ∼ Multinomial(r^n), s ← s_i^n                                // Gibbs sample
17     Ê[g] ← Ê[g] + (1/N) Σ_{k=1}^{2d} r_k^n · g(s_k^n)
18 end

We noted previously that the annular augmentation can be sampled with methods other than Gibbs. Indeed, in our experiments we will compare AAG to a similarly defined Annular Augmentation Slice sampler (AAS) [Neal, 2003]. Suwa and Todo [2010] recently proposed a non-reversible MCMC sampler for discrete spaces with state-of-the-art mixing times; this Suwa-Todo (ST) algorithm can also be used in Operator 2. In Section 3.3.1, we extensively explore the benefits of using different Operator 2 approaches on 2D Ising models.
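A compact Python sketch of one AAG + RB iteration is given below, with θ fixed to 0 by rotational invariance as in Algorithm 4. Rather than flipping one coordinate per edge as the pseudocode does, it evaluates the state of each of the 2d arcs directly from the threshold geometry, which is equivalent; the function names and the assumption that p̂ is passed as a vector of marginals p̂(S_i = +1) are illustrative.

```python
import numpy as np

def aag_rb_step(s, p_hat, density_L, g, rng):
    """One Rao-Blackwellized annular Gibbs iteration (cf. Algorithm 4).

    s         : current state in {-1, +1}^d
    p_hat     : vector of prior marginals p_hat(S_i = +1)
    density_L : callable returning L(s) = f(s) / p_hat(s)
    g         : function whose expectation is being estimated
    Returns (new_state, rao_blackwellized_estimate_of_g).
    """
    d = s.shape[0]
    q_i = np.where(s == 1, p_hat, 1.0 - p_hat)
    # Operator 1: resample thresholds so that h(0, T) = s.
    t = (s - 1) * np.pi / 2 + np.pi * q_i * rng.uniform(-1.0, 1.0, size=d)
    lower = np.mod(t - np.pi * p_hat, 2 * np.pi)   # edges of the arc where S_i = +1
    upper = np.mod(t + np.pi * p_hat, 2 * np.pi)

    # Operator 2: the 2d edges partition the annulus into 2d arcs with constant s.
    edges = np.sort(np.concatenate([lower, upper]))
    lengths = np.diff(np.append(edges, edges[0] + 2 * np.pi))
    midpoints = (edges + lengths / 2.0) % (2 * np.pi)

    states, weights = [], np.empty(2 * d)
    for k, theta in enumerate(midpoints):
        on_plus_arc = (np.mod(theta - lower, 2 * np.pi)
                       < np.mod(upper - lower, 2 * np.pi))
        s_k = np.where(on_plus_arc, 1, -1)
        states.append(s_k)
        weights[k] = lengths[k] * density_L(s_k)   # interval length times density
    weights /= weights.sum()

    estimate = sum(w * g(s_k) for w, s_k in zip(weights, states))   # RB estimate
    new_state = states[rng.choice(2 * d, p=weights)]                # Gibbs draw
    return new_state, estimate
```

Averaging the returned estimates over N iterations gives the Rao-Blackwellized Monte Carlo approximation Ê[g] of Algorithm 4.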

3.2.4 Choice And Effect Of The Prior pˆ

Annular augmentation requires a distribution p̂ which, for optimal performance, should approximately match the marginals of p. A natural choice is Loopy Belief Propagation (LBP), a deterministic approximation which incurs minimal computational overhead. Deterministic priors have been used to speed up samplers before, both in MCMC algorithms [De Freitas et al., 2001] and in importance sampling [Liu et al., 2015]. We used this approach in Chapter 2, where the EP prior was used to speed up ESS. Often it makes the samplers more efficient without significant extra computational cost, as we will see in Section 3.3.

The behavior of annular augmentation will depend significantly on the choice of prior p̂. Presuming that p̂ approximately matches the true marginals of p, it is instructive to consider two extreme cases that demonstrate the quality and flexibility of the annular augmentation: strongly non-uniform p̂, where p̂(S_i = 1) ≈ 1; and uniform priors, where p̂(S_i = 1) ≈ 1/2 for all i = 1, ..., d.

In the non-uniform case, each threshold is stretched to cover nearly all of the annulus, inducing a constant S around the annulus except for d tiny regions where only one coordinate is flipped. Our sampler reduces to an MH-type algorithm where only one coordinate is flipped at a time. This is appropriate and desired: nearly all of the density in p will be at S = (+1, ..., +1), with most of the remaining density residing in adjacent vectors that have only one negative-signed coordinate. AAG in such a case would quickly move to S = (+1, ..., +1), guided by the LBP prior, and then only consider flipping one coordinate at a time.

On the other hand, for uniform priors the annular augmentation will do a form of overrelaxed sampling. In overrelaxed sampling, proposed points often lie on the opposite side of a mode or mean of a distribution [Neal, 1998], which can suppress random walks. In the annular augmentation with a uniform prior, we can observe from Figure 3.1a that points on opposite sides of the annulus have opposite signs, i.e. h(θ, T) = −h(θ + π, T). Annular augmentation considers all such points and so has a similar flavor to an overrelaxed sampler. This is particularly useful for pairwise binary distributions without bias, where f(s) = exp(Σ_{i<j} W_{ij} s_i s_j) satisfies f(s) = f(−s), so that points on opposite sides of the annulus have equal density.

3.3 Experimental results

We evaluate our annular augmentation framework on Ising models and a real-world Boltzmann Machine (BM) application. Ising models have historically been an important benchmark for assessing the performance of binary samplers [Newman et al., 1999, Zhang et al., 2012, Pakman and Paninski, 2013]. We use a set-up similar to [Zhang et al., 2012] to interpolate between unimodal and more challenging multi-modal target distributions, which allows us to compare the performance of each sampler in different regimes and understand the benefits of RB and the LBP prior. BMs, also known as Markov random fields, are a fundamental paradigm in machine learning [Ackley et al., 1985, Barber, 2012], with their deep counterpart having found numerous applications [Salakhutdinov and Hinton, 2009]. Inference in fully connected BMs is extremely challenging. In our experiments we consider the heart disease dataset [Edwards and Toma, 1985], as its small size enables exact computations which can be used to measure the error of each sampler.

We compare against other general-purpose state-of-the-art binary samplers: Exact-HMC [Pakman and Paninski, 2013] and Coordinate Metropolis Hastings (CMH), an MH sampler which proposes to flip only one coordinate, s_i, per iteration. We note that for certain classes of binary distributions, specialized samplers have been developed, e.g. the Wolff algorithm [Newman et al., 1999]. As annular augmentation is a general binary sampling scheme, we do not compare to such specialized methods. To make the comparisons particularly challenging, we developed a novel method to incorporate the LBP prior in the CMH sampler and further improved its performance by using discrete event simulation. We refer to this as CMH + LBP (details in Appendix B.4). To distinguish the contributions of the annular augmentation, RB, and the LBP pseudoprior, we show results separately for the AAG sampler, its RB version (AAG + RB), and with the LBP prior (AAG + RB + LBP). As mentioned in Section 3.2.3, we also provide results for the Annular Augmentation Slice (AAS) and Suwa-Todo (AAST) sampling counterparts.

We run all samplers for a fixed number of density function (f) evaluations. This choice is sensible since the density evaluations dominate the run-time cost of each sampler, ensuring a fair comparison between the different methods. All samplers were started from the same initial point, and for the LBP approximation we used Matlab code [Schmidt, 2007]. We report results for estimation of i) node marginals, ii) pairwise marginals, and iii) the normalizing factor (partition function), where the exact values were obtained by running the junction tree algorithm [Schmidt, 2007].

3.3.1 2D Ising Model

We consider a 2D Ising model on a square lattice of size 9 × 9 with periodic boundary conditions, where p(S = s) ∝ exp(−βE[s]). Here

E[s] = −Σ_{⟨i,j⟩} W_{ij} s_i s_j − Σ_i B_i s_i

is the energy of the system, and the sum over ⟨i, j⟩ is over all adjacent nodes on the square lattice. The strength of interaction between node pairs (i, j) is denoted by W_{ij}, B_i is the bias applied to each node, and β is the inverse temperature of the system. As is often done, we fix β = 1, set W_{ij} = W for all (i, j) pairs, and apply a scaled bias B_i ∼ c · Unif[−1, 1] to each node. Each MCMC method is run 20 times for 1000 equivalent density function (f) evaluations.

In Figures 3.3, 3.4 and 3.5, we report the RMSE of node marginal and pairwise marginal estimates, each averaged over 20 runs for different settings of W (referred to as "Strength") and c (referred to as "Bias scale"). A higher value in the heatmaps (more yellow, less red) indicates a larger error. The values of strength and bias scale determine the difficulty of the problem, and demonstrate the performance of annular augmentation across a range of settings. In Figure 3.3, we fix c = 0.2 and progressively increase W. These correspond to hard cases or "frustrated systems" with the target distribution being multimodal: increasing W increases the energy barrier between modes, making the target more difficult to explore. All samplers do similarly well for easy, low-strength cases (left columns of Figure 3.3), but AAS, AAS + RB, AAG, AAG + RB and AAST, AAST + RB outperform on difficult problems (high W values, redder colors).
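For concreteness, the unnormalized log-density used by all samplers in this experiment can be computed as in the sketch below, which assumes the 9 × 9 grid is flattened row-major and that W and B are the scalar strength and per-node bias vector just described; the helper name is illustrative.

```python
import numpy as np

def ising_log_f(s, W, B, beta=1.0, side=9):
    """Unnormalized log-density log f(s) = -beta * E[s] for the periodic
    2-D Ising model, with s in {-1, +1}^(side*side) flattened row-major."""
    grid = s.reshape(side, side)
    # Each node interacts with its right and down neighbours (periodic boundaries),
    # so every adjacent pair <i, j> is counted exactly once.
    pair_sum = (grid * np.roll(grid, -1, axis=0)).sum() + \
               (grid * np.roll(grid, -1, axis=1)).sum()
    energy = -W * pair_sum - (B * s).sum()
    return -beta * energy
```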

Figure 3.3: Bias scale c = 0.2 and the bias for each node is drawn as B_i ∼ c · Unif[−1, 1]. The target distribution here is multi-modal. As we increase strength W, AAS, AAS + RB and AAG, AAG + RB increase their outperformance over all other samplers, including those using LBP (the LBP approximation degrades for W > 0.6).

Figure 3.4: No bias is applied to any node making the target distribution bimodal, with the modes becoming more peaked as strength W is increased. All annular augmentation samplers very significantly outperform. CHAPTER 3. ANNULAR AUGMENTATION SAMPLING 38

Figure 3.5: W is fixed to 0.2 and bias scale c is increased making the target distribution uni-modal. HMC and CMH find the mode quickly, as do annular augmentation samplers that leverage LBP. That group outperforms annular augmentation samplers with no access to LBP.

Also interesting to note is that, in the high-strength regime, the addition of LBP hurts performance (LBP itself, not annular augmentation, has broken down; see Appendix B.2, Table B.1 for numerical values). In Figure 3.4, no bias is applied to any node and we progressively increase W. This corresponds to a target distribution with two modes: the all-ones and all-negative-ones vectors. Samplers based on annular augmentation significantly outperform (lower error, more red) due to the overrelaxation property. For the zero-bias cases, the LBP approximation is equal to the true node marginals: p̂(Si = 1) = 0.5, which is equivalent to the uniform prior case; hence we expect annular augmentation with and without LBP to perform similarly. A significant performance gain in estimating node marginals comes from RB (Appendix, Table B.3 has numerical values). For the settings in Figures 3.3 and 3.4, the HMC, CMH and CMH + LBP samplers tend to get stuck in one mode, leading to poor performance. Finally, in Figure 3.5, we fix W = 0.2 and progressively increase c, making the target distribution unimodal. In this simpler setting, HMC and CMH perform well and are on par with AAS + RB + LBP, AAG + RB + LBP and AAST + RB + LBP. We see this performance is due in part to the accuracy of LBP in this setting (Appendix B.2, Table B.2), shown also by the underperformance of annular augmentation without LBP in this case.

The annular augmented Gibbs, slice and ST methods have similar performance in all experiments. Gibbs always performs marginally better than slice and typically outperforms ST when RB is used. This suggests that Gibbs makes better use of RB, even though its mixing time may be slower than ST [Suwa and Todo, 2010]. Regarding the size of these problems, note that annular augmentation has no issue scaling to much larger sizes; we stopped at 9 × 9 grids to be able to create the baseline with the junction tree algorithm within a day or so of computation.

3.3.2 Heart Disease Risk Factors

The heart disease dataset [Edwards and Toma, 1985] lists six binary features (risk factors) relevant for coronary heart disease in 1841 men. These factors are modeled as a fully connected BM, which defines a probability distribution over a vector of binary variables S ∈ {0, 1}^6:
\[
p(S = s \mid W, B) = \frac{1}{Z(W, B)} \exp\Big( \sum_{i<j} W_{i,j}\, s_i s_j + \sum_i B_i s_i \Big)
\]

The posterior over the parameters (W, B) given the observed data D then satisfies
\[
p(W, B \mid \mathcal{D}) \;\propto\; p(W, B)\, p(\mathcal{D} \mid W, B) \;\propto\; p(W, B)\, \frac{\exp\Big( \sum_{n}\sum_{i<j} W_{i,j}\, s_i^{(n)} s_j^{(n)} + \sum_{n,i} B_i\, s_i^{(n)} \Big)}{Z(W, B)^N}.
\]

Bayesian learning here is hard. Consider a simple Metropolis Hastings (MH) sampler where, starting from (W, B), a symmetric proposal distribution is used to propose (W′, B′), which is accepted with probability
\[
a = \min\left(1,\; \frac{p(W', B')\, p(\mathcal{D} \mid W', B')}{p(W, B)\, p(\mathcal{D} \mid W, B)}\right). \qquad (3.6)
\]
Equation (3.6) depends on the intractable partition function ratio (Z(W, B)/Z(W′, B′))^N. As in Murray and Ghahramani [2004], we can use MCMC to approximate the partition function ratio as:

\[
\frac{Z(W', B')}{Z(W, B)} = \left\langle \exp\Big( \sum_{i<j} (W'_{i,j} - W_{i,j})\, s_i s_j + \sum_i (B'_i - B_i)\, s_i \Big) \right\rangle \qquad (3.7)
\]
where ⟨·⟩ denotes E_{p(S|W,B)}(·).

[Figure 3.6 comprises two panels, "Heart disease data" and "Simulated data", showing the log-partition error for HMC, CMH, AAS + RB and AAG + RB; see the caption below.]

Figure 3.6: Absolute error in estimating the log partition function ratio. Since there is zero bias, the LBP prior yields no additional benefit and so the results for the LBP-driven samplers are omitted.

We will use different MCMC methods to draw samples from p(S | W, B) to compute the expectation in (3.7), which will then be used to compute the approximate MH acceptance probability in (3.6). Following Murray and Ghahramani [2004], bias terms are taken to be zero. We run 1000 such MH samplers, each for 100 samples. To approximate the partition function ratio in (3.7), we use 1000 samples from one of four different samplers: HMC, CMH, AAS + RB and AAG + RB. For each MH Markov chain, we compute the average absolute error in approximating the log partition function ratio log(Z(W, B)/Z(W′, B′)) over its 100 samples. Following Murray and Ghahramani [2004], to see the effect of increasing dimensionality we also simulate data for 10-dimensional risk factors, using a random W matrix, and repeat the experiment (details are given in the Appendix). Figure 3.6 plots the overall mean and standard deviation of the approximation error for both real and simulated data. The exact partition function ratio, required for computing the error metric above, was evaluated by enumeration. For the real data example, both AAS + RB and AAG + RB outperform CMH and perform on par with HMC, while for the higher-dimensional simulated data, annular augmentation outperforms both CMH and HMC. Accordingly, across both real and simulated data, annular augmentation gives the best overall performance. We note that even the small differences seen in Figure 3.6 can affect the convergence of the approximate MH sampler to the correct posterior distribution, as the approximation error in the MH acceptance probability is proportional to the approximation error in the partition function ratio taken to the Nth power.
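The use of (3.7) can be illustrated with a short Python sketch that estimates the log partition function ratio from samples of p(S | W, B). The parameters and the i.i.d. "samples" below are placeholders standing in for draws from an MCMC sampler such as AAG + RB; this is an illustration of the estimator, not the experiment code.

```python
import numpy as np

def log_partition_ratio(samples, W, B, W_new, B_new):
    """Monte Carlo estimate of log[Z(W', B') / Z(W, B)] via (3.7), using samples
    drawn (approximately) from p(S | W, B). samples has shape (T, d), entries in {0, 1}."""
    dW, dB = W_new - W, B_new - B
    # exponent per sample: sum_{i<j} dW_ij s_i s_j + sum_i dB_i s_i
    pair = np.einsum('ti,ij,tj->t', samples, np.triu(dW, k=1), samples)
    vals = pair + samples @ dB
    # log-mean-exp for numerical stability
    m = vals.max()
    return m + np.log(np.mean(np.exp(vals - m)))

# Illustrative usage with random parameters and placeholder samples.
rng = np.random.default_rng(1)
d = 6
W = np.triu(rng.normal(scale=0.1, size=(d, d)), 1)
W_new = W + np.triu(rng.normal(scale=0.01, size=(d, d)), 1)
samples = rng.integers(0, 2, size=(1000, d))
print(log_partition_ratio(samples, W, np.zeros(d), W_new, np.zeros(d)))
```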

3.4 Conclusion

We have presented a new framework for sampling from binary distributions. The annular augmentation sampler with subsequent Rao-Blackwellization has a number of desirable properties, including overrelaxation, no tuning parameters, and the ability to easily incorporate approximate posterior marginal information such as that from LBP. Taken together, these advantages lead to clear performance gains over CMH and Exact-HMC samplers across a range of settings.

In this work we only considered uniform sampling of the thresholds, Ti ∼ Unif(0, 2π); a future direction is to group thresholds together to flip clusters of correlated coordinates in the spirit of the Wolff algorithm. Furthermore, our inclusion of LBP in both annular augmentation and CMH suggests a similar extension for Exact-HMC [Pakman and Paninski, 2013].

Chapter 4

The Advantage of Lefties in One-On-One Sports

4.1 Introduction and Related Literature

In this chapter we investigate the extent to which being left-handed impacts elite performance and rankings in one-on-one interactive sports such as tennis, fencing, badminton etc. Our goal is to provide a coherent framework for measuring the benefit of being left-handed in these sports and tracking how this benefit evolves over time. We also aim to provide a framework for considering such questions as “who are the most talented players?” Of course for this latter question to be reasonable it must be the case that the lefty advantage (to the extent that it exists) can be decoupled from the notion of “talent”. Indeed it’s not at all clear that such a decoupling exists.

Causes and Extent of the Lefty Advantage

In fact much of the early research into the performance of left-handers in sports relied on the so-called "innate superiority hypothesis" (ISH), where left-handers were said to have an edge in sporting competitions due to inherent neurological advantages associated with being left-handed [Geschwind and Galaburda, 1985, Nass and Gazzaniga, 1987]. The presence of larger right hemispheric brain regions associated with visual and spatial functions, a lack of lateralization, and a larger corpus callosum [Witelson, 1985] (the brain structure involved in communication across hemispheres) were all suggested as neurological mechanisms for this edge. Applications of the ISH to sport occurred primarily in fencing [Akpinar et al., 2015, Bisiacchi et al., 1985, Taddei et al., 1991], where left-handers appeared to have advantages in attentional tasks (in terms of response to visual stimuli), though there were also proponents for this view in other sports such as tennis [Holtzen, 2000]. The idea that an innate advantage was responsible for the significant over-representation of left-handers in professional sports gradually lost momentum following the works of [Aggleton and Wood, 1990, Grouios et al., 2000, Wood and Aggleton, 1989]. These papers analyzed interactive sports such as tennis and non-interactive sports such as darts and pool. They found there was a surplus of left-handers in the interactive sports, but not generally in the non-interactive sports. (One exception is golf where Loffing and Hagemann [2016, Box 12.1] noted that the proportion of top left-handed1 golfers is higher than in the general population.) It was reasoned that any innate superiority should also bring left-handers into prominence in non-interactive sports and so alternative explanations were sought. The primary argument of [Aggleton and Wood, 1990, Wood and Aggleton, 1989] was that the prominence of left-handers in a given sport was due to the strategic advantages of being left-handed in that sport. Indeed the prevailing2 explanation today for the over-representation of left-handers in professional interactive sports is the negative frequency-dependent selection (NFDS) effect. This effect is also assumed to underlie the so-called "fighting hypothesis" [Raymond et al., 1996] which explains why there is long-lasting handedness polymorphism in humans despite the fitness costs that appear to be associated with left-handedness. The NFDS effect is best summarized as stating that right-handed players have less familiarity competing against left-handed players (because of the much smaller percentage of lefties in the population) and therefore perform relatively poorly against them as a result. Key evidence supporting this hypothesis was the demonstration of mechanisms for how NFDS effects might arise [Daems and Verfaillie, 1999, Grossman et al., 2000, Grouios et al., 2000,

1It should be noted, however, that most of these left-handed golfers play right-handed and so being left-handed and playing left-handed are not the same.

2We do note, however, that there are other hypotheses for explaining the high proportion of lefties in elite sports. They include higher testosterone levels, personality traits, psychological advantages and early childhood selection. See [Loffing and Hagemann, 2016] and the references therein.

Stone, 1999]. The difficulty of playing elite left-handed players in one-on-one interactive sports has long been recognized. For example, Breznik [2013] quotes Monica Seleš, the former women's world number one tennis player:

“It’s strange to play a lefty (most players are right-handed) because everything is opposite and it takes a while to get used to the switch. By the time I feel comfortable, the match is usually over.”

A more general overview and discussion of NFDS effects can be found in the recent book chapter of Loffing and Hagemann [2016] who also provide extensive statistics regarding the percentage of top lefties across various sports. It is also perhaps worth mentioning that the debate between the ISH and the NFDS mechanism is not quite settled and some research, e.g. Gursoy [2009] in boxing and Breznik [2013] in tennis, still argues that the ISH has a role to play. Recent analyses of combat sports (such as judo [Sterkowicz et al., 2010], mixed martial arts [Dochtermann et al., 2014], and boxing [Loffing and Hagemann, 2015]) also support the existence of NFDS effects on performance, although they suggest that alternative explanations must still be considered and that the resulting advantage is small. This agrees with [Loffing et al., 2012a] which suggests that although left-handedness provides an advantage, modern professionalism and training are acting to counter the advantage. Deliberate training was shown in [Schorer et al., 2012] to improve the performance of handball goalies against players of specific handedness, while [Ullén et al., 2016] explores the issue of deliberate training vs innate talent in depth. A recent article in The Telegraph [Liew, 2015], for example, noted how seven of the first seventeen Wimbledon champions in the open era were left-handed men while there were only two left-handers among the top 32 seeds in the 2015 tournament. Some of this variation is undoubtedly noise (see Section 4.6) but there do appear to be trends in the value of left-handedness. For example, the same Telegraph article noted that a reverse effect might be taking place in women's tennis. Specifically, the article noted that 2015 was the first time in the history of the WTA tour that there were four left-handed women among the top 10 in tennis. In most sports the lefty advantage appears to be weaker in women than in men [Loffing and Hagemann, 2016]. The issue of gender effects of handedness in professional tennis is discussed in [Breznik, 2013] where it is shown through descriptive statistics and a PageRank-style analysis that women do indeed have a smaller lefty advantage than men, although it's worth noting their data only extends to 2011. It is also suggested in [Breznik, 2013] that the lefty advantage in tennis is weaker in Grand Slams than on the ATP and Challenger tours. They conjecture that possible explanations for this are that the very best players are more able to adjust to playing lefties and they may also be in a better position to tailor their training in anticipation of playing lefties. Many other researchers have studied the extent of the lefty advantage and how it might arise. For example, Goldstein and Young [1996] determine a game theoretic evolutionarily stable strategy from payoff matrices of summarized performance, whereas Billiard et al. [2005] explicitly use frequency-dependent interactions between left- and right-handed competitors. The work of Abrams and Panaggio [2012] has some similarity to ours as they also model professionals as being the top performers from a general population skill distribution. They use differential equations to define an equilibrium of transitions between left- and right-handed populations. These papers rely then on the NFDS mechanism to generate the lefty advantage.
We note that equilibrium-style models suggest the strength of the lefty advantage might be inversely proportional to the proportion of top lefties. Such behavior is not a feature of our modeling framework but nor is it inconsistent with it as we do not model the NFDS mechanism (and resulting equilibrium). Instead our main goal is to measure the size of the lefty advantage rather than building a model that leads to this advantage. Several researchers have considered how the lefty advantage has evolved with time. In addition to their aforementioned contributions, Breznik [2013] also plots the mean rank of top lefties and righties over time in tennis and obtains broadly similar results to those obtained by our Kalman filtering approach3. Other researchers have also analyzed the proportion of top lefties in tennis over time. For example, Loffing et al. [2012b] fit linear and quadratic functions to the data and then extrapolate to draw conclusions on future trends. Their quadratic fit for the proportion of lefties in men’s tennis uses data from 1970 to 2010 and predicts a downwards trend from 1990 onwards. This is contradicted by our data from 2010-2015 in Section 4.6 which suggests that the number of top lefties may have been increasing in recent years. They also perform a separate analysis for amateur players, showing that the lefty advantage increases as the quality of players improves. It’s also worth noting that Ghirlanda et al. [2009] introduce a model suggesting the possibility of the lefty advantage remaining stable over time.

3Note that Breznik [2013] do not take into account the number of top ranked lefties or righties, only their mean rank. This is in contrast to our Kalman filter which uses as data the fraction of top players that are lefties.

Latent Ability and Competition Models

Our work builds on the extensive4 latent ability and competition models literature. The original two competition models are the Bradley-Terry-Luce (BTL) model [Bradley and Terry, 1952, Luce, 1959] and the Thurstone-Mosteller (TM) model [Mosteller, 1951, Thurstone, 1927]. BTL assumes each player i has skill Si, so that the probability of player i beating player j is a logistic function of the difference in skills. Specifically BTL assumes

\[
p(i \succ j \mid S_i, S_j) = \left(1 + e^{-(S_i - S_j)}\right)^{-1} \qquad (4.1)
\]
where i ≻ j denotes the event of i beating j. TM is defined similarly, but with the probability of i beating j being a probit function of the difference in their skills5. Given match-play data, the skill of each player may be inferred using maximum likelihood estimation (MLE) where the probability of the match-play results is assumed to satisfy

\[
p(M \mid S_1, \ldots, S_N) = \prod_{i<j} \binom{T_{ij}}{M_{ij}}\, p(i \succ j)^{M_{ij}} \cdot p(j \succ i)^{M_{ji}} \qquad (4.2)
\]

4See also the related literature on item response theory (IRT) from psychometrics and the Rasch model [Fischer and Molenaar, 2012] which is closely related to the BTL model below.

5The probit function is another name for the cumulative distribution function of the standard normal distribution.

Contributions of this Work

In this chapter we propose a Bayesian latent ability model for identifying the advantage of being left-handed in one-on-one interactive sports but with the additional complication of having a latent factor, i.e. the advantage of left-handedness, that we need to estimate. Inference is further complicated by the truncated nature of data-sets that arise from only observing data related to the top players. The resulting pattern of missing data therefore depends on the latent factor and so it is important that we model it explicitly. We show how to infer the advantage of left-handedness when only the proportion of top left-handed players is available. In this case we show that the distribution of the number of left-handed players among the top n (out of N) converges as N → ∞ to a binomial distribution with a success probability that depends on the tail-length of the innate skill distribution. Since this result would not be possible if we used short- or long-tailed skill distributions, we also argue for the use of a medium-tailed distribution such as the Laplace distribution when modeling the "innate"6 skills of players. We also use this result to develop a simple Kalman filtering model for inferring how the lefty advantage has varied through time in a given sport. Our Kalman filter / smoother enables us to smooth any spurious signals over time and should lead to a more robust inference regarding the dynamics of the lefty advantage. We also consider various extensions of our model. For example, in order to estimate the innate skills of top players we consider the case when match-play data among the top n players is available. This model is a direct generalization of the Glicko model described earlier. Unlike other models, this extension learns simultaneously from (i) the over-representation of lefties among top players and (ii) match-play results between top lefties and righties. Previously these phenomena were studied separately. We observe that including match-play data in our model makes little difference to the inference of the lefty advantage and therefore helps justify our focus on the simplified model that only considers the proportion of lefties in the top n players. This extension does help us to identify the innate skills of players, however, we acknowledge that these so-called innate skills may only be of interest to the extent that the NFDS mechanism is responsible for the lefty advantage. (To the extent that the innate superiority hypothesis holds, it's hard to disentangle the notion of

6Throughout this chapter we will use the term "innate skill" to refer to all components of a player's "skill" apart from the advantage / disadvantage associated with being left-handed. We acknowledge that the term "innate" may be quite misleading – see the discussion below – but will continue with it nonetheless for want of a better term.

innate skill or talent from the lefty advantage and using the phrase "innate skills" would be quite misleading in this case.) The remainder of this chapter is organized as follows. In Section 4.2 we describe our skill and handedness model and also develop our main theoretical results here. In Section 4.3 we introduce match-play results among top players into the model while in Section 4.4 we consider a variation where we only know the handedness and external rankings of the top players. We present numerical results in Section 4.5 using data from men's professional tennis in 2014. In Section 4.6 we propose a simple Kalman filtering model for inferring how the lefty advantage in a given sport varies through time and we conclude in Section 4.7 where possible directions for future research are also outlined. Various proofs and other technical details are deferred to the appendix.

4.2 The Latent Skill and Handedness Model

We assume there is a universe of N players and for i = 1, ..., N we model the skill, Si, of the ith player as the sum of his innate skill Gi and the lefty advantage L if he is in fact left-handed. That is, we assume

Si = Gi + HiL where Hi is the handedness indicator with Hi = 1 if the player is left-handed and Hi = 0 otherwise. The generative framework of our model is:

• Left-handed advantage: L ∼ N(0, σL²) where σL is assumed to be large and N denotes the normal distribution.

• For players i = 1, 2, ..., N

– Handedness: Hi ∼ Bernoulli(q) where q is the proportion of left-handers in the overall population.

– Innate skill: Gi ∼ G for some given distribution G
– Skill: Si = Gi + HiL.

The joint probability distribution corresponding to the generative model then satisfies

\[
p(S, L, H) = p(L) \prod_{i=1}^{N} p(H_i)\, p(S_i \mid H_i, L) \qquad (4.3)
\]
where we note again that N is the number7 of players in our population universe. We assume we know (from public results of professional competitions etc.) the identity of the top n < N players as well as the handedness, i.e. left or right, of each of them. Without loss of generality, we let these top n players have indices in {1, ..., n} and define the corresponding event

\[
\mathrm{Top}_{n,N} := \Big\{ \min_{i=1,\ldots,n} \{S_i\} \;\ge\; \max_{i=n+1,\ldots,N} \{S_i\} \Big\}. \qquad (4.4)
\]
Note, however, that even when we condition on Topn,N, the indices in {1, ..., n} are not ordered according to player ranking so, for example, it is possible that S1 < S2 or Sn < S3 etc.
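The generative model above is simple to simulate, which is useful as a sanity check. The following Python sketch (with illustrative values of N, n, q, L and a Laplace innate-skill distribution, anticipating the choice made below) draws one population and reports the fraction of left-handers among the top n players.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, q, L, b = 100_000, 150, 0.11, 0.4, 1 / np.sqrt(2)   # illustrative values

H = rng.random(N) < q                 # handedness indicators H_i ~ Bernoulli(q)
G = rng.laplace(scale=b, size=N)      # innate skills G_i drawn from a Laplace distribution
S = G + H * L                         # total skills S_i = G_i + H_i L

top = np.argsort(S)[-n:]              # indices of the top-n players by skill
print("fraction of lefties among the top n:", H[top].mean())
print("fraction of lefties in the population:", q)
```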

4.2.1 Medium-Tailed Priors for the Innate Skill Distribution

Thus far we have not specified the distribution G from which the innate skill levels are drawn in the generative model. Here we provide support for the use of medium-tailed distributions such as the Laplace distribution for modeling these skill levels. We do this by investigating the probability of top players being left-handed as the population size N becomes infinitely large. Consider then

\[
\lim_{N \to \infty} p(n_l \mid L; n, N) \qquad (4.5)
\]
where nl denotes the number of left-handers among the top n players. For the skill distribution to be plausible, the probability that top players are left-handed should be increasing in L and be consistent with what we observe in practice for a given sport. Letting Binomial(x; n, p) denote the probability of x successes in a Binomial(n, p) distribution, we have the following result.

Proposition 1. Assume that g has support R. Then

\[
\lim_{N\to\infty} p(n_l \mid L; n, N) = \mathrm{Binomial}\!\left(n_l;\, n,\, \frac{q}{q + (1-q)\,c(L)}\right) \qquad (4.6)
\]
where
\[
c(L) := \lim_{s\to\infty} \frac{g(s)}{g(s - L)}
\]
if the limit c(L) exists.

7We have in mind that N is the total number of players in the world who can play at a good amateur level or above. In tennis, for example, a good amateur level might be the level of varsity players or strong club players. N will obviously vary by sport but what we have in mind is that the player level should be good enough to take advantage of the left-handedness advantage (to the extent that it exists).

Proof. See Appendix C.1. 

The function c(L) ∈ [0, ∞] characterizes the tail length of the skill distribution [Foss et al., 2011, Sec. 2.4]. If c(L) = 0 for all L > 0, as is the case for example with the normal distribution, then g is said to be short-tailed. In contrast, long-tailed distributions, such as the t distribution, have c(L) = 1 for all L ∈ R. If a distribution is neither short- nor long-tailed, we say it is medium-tailed. If we use a short-tailed innate skill distribution and L > 0, then c(L) = 0 and we have lim_{N→∞} p(nl | L; n, N) = Binomial(nl; n, 1). That is, if there is any advantage to being left-handed and the population is sufficiently large, then the top players will all be left-handed almost surely8. This property is clearly unrealistic even for sports with a clear left-handed advantage such as fencing, since it is not uncommon to have top ranked players that are right-handed. This raises questions over the heavy use of the short-tailed normal distribution in competition models [Elo, 1978, Glickman, 1999, Herbrich et al., 2007] and suggests9 that other skill distributions may be more appropriate. As an alternative, consider a long-tailed distribution. In this case we have c(L) = 1 for all L ∈ R and p(nl | L; n, N) → Binomial(nl; n, q) as N → ∞. This too is undesirable, since the probability of a top player being left-handed does not depend on L in the limit and agrees with the probability of being left-handed in the general population. As a consequence, such a distribution would be unsatisfactory for modeling in those sports where we typically see left-handers over-represented among the top players. We therefore argue that the ideal distribution for modeling the innate skill distribution is a medium-tailed distribution such as the standard Laplace distribution which has PDF

\[
g(x) = \frac{1}{2b} \exp\left( -|x| / b \right) \qquad (4.7)
\]
where b = 1/√2 is the standard scale parameter of the distribution. It is easy to see in this case

8The most extreme example of a short-tailed distribution is if g were a delta function and all players had the same innate skill. The only way to be a top player would then be via the advantage of left-handedness.

9In defense of the normal distribution, we show in Appendix C.2 that a normally distributed G may in fact be a suitable choice if N is not too large. Though p(H1 = 1 | L, S1 ≥ max{S2:N}) → 1 as N → ∞ for normally distributed skills, the rate of convergence is only 1 − Ω(exp(−L√(log N))). This convergence is sufficiently slow for the normal skill distribution to be somewhat reasonable for moderate values of N. However, the result of Proposition 1 suggests that a normal skill distribution would be inappropriate for very large values of N.

that c(L) = exp(−L/b) and substituting this into (4.6) yields
\[
\lim_{N\to\infty} p(n_l \mid L; n, N) = \mathrm{Binomial}\!\left(n_l;\, n,\, \frac{q}{q + (1-q)\exp(-L/b)}\right) \qquad (4.8)
\]
which is much more plausible. For very small values of L, the probability of top players being left-handed is approximately q which is what we would expect given the small advantage of being left-handed. For large positive values of L we see that the probability approaches 1 and for intermediate positive values of L we see that the probability of top players being left-handed lies in the interval (q, 1). This of course is what we observe in many sports such as fencing, table-tennis, tennis etc. For this reason, we will restrict ourselves to medium-tailed distributions and specifically the Laplace10 distribution for modeling the innate skill levels in the remainder of this chapter.
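The tail classification can also be checked numerically. The sketch below (an illustration of Proposition 1, not part of the thesis code) evaluates the ratio g(s)/g(s − L) at increasingly large s for a normal, a Laplace, and a Student-t density, and then plugs the Laplace limit c(L) = exp(−L/b) into the binomial success probability of (4.6)/(4.8); the values of L, b, and q are illustrative.

```python
import numpy as np
from scipy import stats

L, b, q = 0.5, 1 / np.sqrt(2), 0.11
s = np.array([2.0, 5.0, 10.0])

for name, dist in [("normal", stats.norm),
                   ("laplace", stats.laplace(scale=b)),
                   ("student-t (df=3)", stats.t(df=3))]:
    ratio = dist.pdf(s) / dist.pdf(s - L)   # approaches c(L) as s grows
    print(name, ratio)   # normal -> 0 (short-tailed), laplace -> exp(-L/b), t -> 1 (long-tailed)

# Medium-tailed Laplace: c(L) = exp(-L/b) gives a limiting lefty probability strictly between q and 1.
c_L = np.exp(-L / b)
print("limiting P(top player is left-handed):", q / (q + (1 - q) * c_L))
```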

4.2.2 Large N Inference Using Only Aggregate Handedness Data

Following the results of Proposition 1, we assume here that we only know the number nl of the top n players who are left-handed. We shall see later that only knowing nl results in little loss of information regarding L compared to the full information case of Section 4.3 where we have knowledge of the handedness and all match-play results among the top n players. We shall make use of this observation in Section 4.6 when we build a model for inferring the dynamics of L through time series observations of nl.

10We will use b = 1/√2 w.l.o.g. since alternative values of b could be accommodated via changes in σL and σM.

Posterior of L in an Infinitely Large Population

Applying Bayes’ rule to (4.6) yields

\begin{align*}
\lim_{N\to\infty} p(L \mid n_l; n, N)
&\propto \lim_{N\to\infty} p(L)\, p(n_l \mid L; n, N) \\
&= p(L)\, \mathrm{Binomial}\!\left(n_l;\, n,\, \frac{q}{q + (1-q)\exp(-L/b)}\right) \\
&= p(L) \binom{n}{n_l} \left( \frac{q}{q + (1-q)\exp(-L/b)} \right)^{n_l} \left( 1 - \frac{q}{q + (1-q)\exp(-L/b)} \right)^{n_r} \\
&\propto p(L) \left( \frac{1}{q + (1-q)\exp(-L/b)} \right)^{n_l} \left( \frac{\exp(-L/b)}{q + (1-q)\exp(-L/b)} \right)^{n_r} \\
&= p(L) \left( \frac{\exp(n_l/n \cdot L/b)}{q \exp(L/b) + 1 - q} \right)^{n} \qquad (4.9)
\end{align*}
where nr := n − nl is the number of top right-handed players, all factors independent of L were absorbed into the constant of proportionality and the binomial distribution term on the second line is now written explicitly as a function of nl. We shall verify empirically in Section 4.5 that

(4.9) is a good approximation of p(L | nl; n, N) if the population size N is large. As the number of top players n increases while keeping nl/n fixed, the nth power in (4.9) causes the distribution to become more peaked around its mode. This effect can be seen in Figure 4.1 where we have plotted the r.h.s. of (4.9) for different values of n. If n is sufficiently large then the data will begin to overwhelm the prior on L and the posterior will become dominated by the likelihood factor, i.e. the second term on the r.h.s. of (4.9). This likelihood term achieves its maximum at

\[
L^* := b \log\left( \frac{n_l/n}{1 - n_l/n} \cdot \frac{1-q}{q} \right) \qquad (4.10)
\]
which we plot as the dashed vertical line in Figure 4.1. We can clearly see from the figure that the density becomes more peaked around L* as n increases while keeping nl/n fixed.11 The value of L* provides an easy-to-calculate point estimate of L for large values of n. The bell-shaped posteriors in Figure 4.1 suggest that we might be able to approximate the posterior of L as a Gaussian distribution. This can be achieved by first approximating (4.9) as a Gaussian distribution over L via a Laplace approximation [Barber, 2012, Sec. 28.2] to the second term on the r.h.s. of (4.9). Specifically, we set the mean of the Laplace approximation equal to the

11As a sanity check, suppose that nl/n = q. Then it follows from (4.10) that L* = 0, as expected.


Figure 4.1: The value of lim_{N→∞} p(L | nl; n, N) as given by the r.h.s. of (4.9) for different values of n. We assume an N(0, 1) prior for p(L), a value of q = 11% and we fixed nl/n = 25%. The dashed vertical line corresponds to L* from (4.10) and the Laplace approximations are from (4.11).

mode, L*, of (4.6) and then set the precision to the second derivative of the logarithm evaluated at the mode. This yields:
\[
\left( \frac{\exp(n_l/n \cdot L/b)}{q \exp(L/b) + 1 - q} \right)^{n} \;\mathrel{\overset{\sim}{\propto}}\; N\!\left(L;\, L^*,\, \frac{n}{n_r n_l}\, b^2 \right). \qquad (4.11)
\]
Note that this use of the Laplace approximation is non-standard as the left side of (4.11) is not a distribution over L, but merely a function of L. However if 0 < nl < n then the left side of (4.11) is a unimodal function of L and, up to a constant of proportionality, is well approximated by a Gaussian. We can then multiply the normal approximation in (4.11) by the other term, p(L), which is also Gaussian to construct a final Gaussian approximation for the r.h.s. of (4.9). This is demonstrated in Figure 4.1 where (4.9) is plotted for different values of n and where the likelihood factor was set12 to the exact value of (4.6) or the Laplace approximation of (4.11). It is evident from the figure that the Laplace approximation is extremely accurate. This gives us the confidence to use the Laplace approximation in Section 4.6 when we build a dynamic model for L.
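The quantities behind Figure 4.1 are easy to reproduce numerically. The sketch below (with illustrative values of n, nl, q, and the prior scale) evaluates the unnormalized limiting posterior (4.9), the point estimate L* of (4.10), and the mean and standard deviation of the Laplace approximation (4.11).

```python
import numpy as np

b, q = 1 / np.sqrt(2), 0.11
n, n_l = 150, 23            # illustrative top-n size and lefty count
n_r = n - n_l
sigma_L = 10.0              # broad N(0, sigma_L^2) prior on L

def log_posterior_unnorm(L):
    """log of the r.h.s. of (4.9): Gaussian prior times the likelihood factor to the power n."""
    log_prior = -0.5 * (L / sigma_L) ** 2
    log_lik = n * ((n_l / n) * L / b - np.log(q * np.exp(L / b) + 1 - q))
    return log_prior + log_lik

# Point estimate (4.10) and Laplace approximation (4.11): mean L*, variance n b^2 / (n_l n_r).
L_star = b * np.log((n_l / n) / (1 - n_l / n) * (1 - q) / q)
sd_laplace = np.sqrt(n * b**2 / (n_l * n_r))
print("L* =", L_star, " Laplace-approximation std =", sd_laplace)

# Grid mode of (4.9) for comparison (slightly shrunk toward 0 by the prior).
grid = np.linspace(-1.0, 2.0, 2001)
print("grid mode of (4.9):", grid[np.argmax(log_posterior_unnorm(grid))])
```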

4.2.3 Interpreting the Posterior of L

In the aggregate data regime we do not know the posterior distributions of the skills and thus cannot directly infer the effect of left-handedness on match-play outcomes. However, we can still

12In each case, we normalized so that the function integrated to 1.

infer this effect in aggregate. As we shall see, the relative ranking of left-handers is governed by the value of L. In particular, if we continue to assume the Laplace distribution for innate skills, then we show in Appendix C.3 that for any fixed and finite value of L, the difference in skills between players at quantiles λj and λi satisfies

\[
\lim_{N\to\infty} \left( S_{[N\lambda_j]} - S_{[N\lambda_i]} \right) = b \log(\lambda_i / \lambda_j) \qquad (4.12)
\]
where the convergence is in probability and we use S[Nλi] to denote the [Nλi]th order statistic13 of S1, ..., SN. Suppose now that Nλi = x and Nλj = exp(−k/b)x. Assuming (4.12) continues to hold approximately for large (but finite) values of N it immediately follows that

\[
S_{[N\lambda_j]} - S_{[N\lambda_i]} \approx k. \qquad (4.13)
\]

Consider now a left-handed player of rank x with innate skill, G. All other things being equal, if this player was instead right-handed then his skill would change from S = G + L to S = G. This corresponds to a value of k = −L and so that player's ranking would therefore change from x to exp(L/b)x which of course would correspond to an inferior ranking for positive values of L. We can therefore interpret the advantage of being left-handed as improving, i.e. lowering, one's rank by a multiplicative factor of exp(−L/b). We can use this result to infer the improvement in rank for lefties due to their left-handedness in various sports [Flatt, 2008]. We do so by substituting the fraction of top players that are left-handed, nl/n, into (4.10) to obtain L*, our point estimate of L. Following the preceding discussion, the (multiplicative) change in rank due to going from left-handed to right-handed can then be approximated by
\[
\exp(L^*/b) = \frac{n_l}{n - n_l} \cdot \frac{1-q}{q} \qquad (4.14)
\]
which again follows from (4.10). It is important to interpret (4.14) correctly. In particular, it represents the multiplicative drop in rank if a particular left-handed player were somehow to give up the advantage of being left-handed. It does not represent his drop in rank if he and all other left-handed players were to simultaneously give up the advantage of being left-handed. In this latter case, the drop in rank would not be as steep as that given in (4.14) since that player would still remain higher-ranked than the other left-handed players who were below him in the original

13There is a slight abuse of notation here since we first need to round Nλi to the nearest integer.

ranking. In fact, we can argue that the approximate absolute drop in ranking for a left-handed player of rank x when all left-handers give up the benefit of being left-handed is given by the number of right-handed players between ranks x and exp(L*/b)x

\[
(1 - n_l/n) \times \left( x \exp(L^*/b) - x \right). \qquad (4.15)
\]

We can argue for (4.15) by noting that on average a fraction nl/n of the players between ranks x and x exp(L*/b) will be left-handed and they will also fall and remain ranked below the original player when the lefty advantage is stripped away. Therefore the new rank of the left-handed player originally ranked x will be
\[
x + (1 - n_l/n) \times \left( x \exp(L^*/b) - x \right) \qquad (4.16)
\]
when all left-handed players simultaneously give up the advantage of being left-handed. Simplifying (4.16) using (4.14) yields a new ranking of

\[
\frac{n_l}{nq}\, x.
\]

We therefore refer to the r.h.s. of (4.14) and nl/(nq) as Drop_alone and Drop_all, respectively. Results for various sports are displayed in Table 4.1. For example, the table suggests that left-handed table-tennis players would see their ranking drop by a factor of approximately 2.91 if the advantage of left-handedness could somehow be stripped away from all of them. These results, while pleasing, are not very surprising. After all, a back-of-the-envelope calculation could come to a similar conclusion as follows. The proportion of top table-tennis players who are left-handed is ≈ 32% but the proportion14 of left-handers in the general population is ≈ 11%. Assuming the top left-handers are uniformly spaced among the top right-handers, we would therefore need to reduce the ranking of all left-handed table-tennis players by a factor of 32/11 ≈ 2.91 = Drop_all to ensure that the proportion of top-ranked left-handed players matches the proportion of left-handers in the general population. While these results and specifically the interpretation of exp(L*/b) described above are therefore not too surprising, we can and do interpret them as a form of model validation. In particular, they validate the choice of a medium-tailed distribution to model the innate skill level of each player.

14Note that the proportion of left-handers can vary from country to country for various reasons including cultural factors etc. It is therefore difficult to pin down exactly but a value ≈ 11% seems reasonable.

Sport             Approx % left-handed (nl/n)    Drop_alone    Drop_all
Tennis                       15%                    1.43          1.36
Badminton                    23%                    2.41          2.09
Fencing (épée)               30%                    3.47          2.73
Table-tennis                 32%                    3.81          2.91

Table 4.1: Proportion of left-handers in several interactive one-on-one sports [Flatt, 2008, Loffing and Hagemann, 2016] with the relative changes in rank under the Laplace distribution for innate

skills with q = 11%. Drop_alone = nl/(n − nl) · (1 − q)/q represents the drop in rank for a left-hander who alone gives up the advantage of being left-handed while Drop_all = nl/(nq) represents the drop in rank of a left-hander when all left-handers give up the advantage of being left-handed.
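The two drop factors in Table 4.1 follow directly from (4.14) and the expression nl/(nq). The short sketch below recomputes them from the proportions listed in the table with q = 11%.

```python
def drop_factors(frac_left, q=0.11):
    """Drop_alone = (n_l / (n - n_l)) * (1 - q) / q  and  Drop_all = (n_l / n) / q,
    written in terms of the fraction of top players who are left-handed."""
    drop_alone = frac_left / (1 - frac_left) * (1 - q) / q
    drop_all = frac_left / q
    return drop_alone, drop_all

for sport, frac in [("Tennis", 0.15), ("Badminton", 0.23),
                    ("Fencing (epee)", 0.30), ("Table-tennis", 0.32)]:
    alone, all_ = drop_factors(frac)
    print(f"{sport:15s} Drop_alone = {alone:.2f}  Drop_all = {all_:.2f}")
```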

We note from Proposition 1 and the following discussion that it would not be possible to obtain these results (in the limit as N → ∞) using either short- or long-tailed distributions to model the innate skills.

4.3 Including Match-Play and Handedness Data

Thus far we have not considered the possibility of using match-play data among the top n players to infer the value of L. In this section we extend our model in this direction so that L and the innate skills of the top n players can be inferred simultaneously. We suspect (and indeed this is confirmed in the numerical results of Section 4.5) that inclusion of the match-play data adds little information regarding the value of L over and beyond what we can already infer from the basic model of Section 4.2. However, it does in principle allow us to try and answer hypothetical questions regarding the innate skills of players and win-probabilities for players with and without the benefit of left-handedness. We therefore extend our basic model as follows:

• For each combination of players i < j

– Match-play results: Mij ∼ Binomial(Tij, p(i ≻ j | Si, Sj; σM))

where the probability that player i defeats player j is defined according to

\[
p(i \succ j \mid S_i, S_j; \sigma_M) := \frac{1}{1 + e^{-\sigma_M (S_i - S_j)}}. \qquad (4.17)
\]

In contrast with the win probability in (4.1), our win probability in (4.17) has a hyperparameter

σM that we use to adjust for the predictability of each sport. In less predictable sports, for example, weaker players will often beat stronger players and so even when Si ≫ Sj we have p(i ≻ j | Si, Sj) ≈ 1/2. On the other hand, more predictable sports would have a larger value of σM which accentuates the effects of skill disparity. Having an appropriate σM allows the model to fit the data much more accurately than if we had simply set σM = 1 as is the case in BTL. It's worth noting that instead of scaling by σM in (4.17), we could have scaled the skills themselves so that

Si = σM (Gi + HiL) and then used the win probability in (4.1). We decided against this approach in order to keep the scale of L consistent across sports.

4.3.1 The Posterior Distribution

The joint probability distribution corresponding to the extended model now satisfies

\[
p(S, L, M, H) = p(L) \prod_{i=1}^{N} p(H_i)\, p(S_i \mid H_i, L) \prod_{i<j} p(M_{ij} \mid S_i, S_j). \qquad (4.18)
\]

As before we condition on Topn,N so the posterior distribution of interest is then given by

\[
p(S_{1:n}, L \mid M_{1:n}, H_{1:n}, \mathrm{Top}_{n,N}) \qquad (4.19)
\]
where Sa:b := (Sa, Sa+1, ..., Sb), H1:n are their respective handedness indicators and M1:n denotes the match-play results among the top n ranked players. Bayes' rule therefore implies

\begin{align*}
p(S_{1:n}, L \mid M_{1:n}, H_{1:n}, \mathrm{Top}_{n,N})
&\propto p(S_{1:n}, L, M_{1:n}, H_{1:n}, \mathrm{Top}_{n,N}) \\
&= p(L)\, p(H_{1:n} \mid L)\, p(S_{1:n} \mid L, H_{1:n})\, p(\mathrm{Top}_{n,N} \mid S_{1:n}, L, H_{1:n}) \\
&\qquad \times\, p(M_{1:n} \mid S_{1:n}, L, H_{1:n}, \mathrm{Top}_{n,N}) \\
&\propto p(L) \prod_{i \le n} p(S_i \mid L, H_i)\; p(\mathrm{Top}_{n,N} \mid S_{1:n}, L) \prod_{i,j \le n} p(M_{ij} \mid S_i, S_j) \qquad (4.20)
\end{align*}

where in the final line we have simplified the conditional probabilities and dropped the p(H1:n | L) = p(H1:n) factor since it is independent of S1:n and L. This last statement follows because, as emphasized above, the first n players are not player rankings but are merely15 player indicators from the universe of N players. We know the form of p(L), p(Si | L, Hi) and p(Mij | Si, Sj) from the generative model but need to determine p(Topn,N | S1:n, L). The prior density on the skill of a player in the general population (using the generative framework with Bernoulli(q) handedness) satisfies

\[
f(S \mid L) := p(S \mid L) = \mathbb{E}_H\big[ p(S \mid H, L) \big] = q\, g(S - L) + (1 - q)\, g(S) \qquad (4.21)
\]
where we recall that g denotes the PDF of the innate skill distribution, G. Letting F(S | L) denote the CDF of the density in (4.21), it follows that

\[
p(\mathrm{Top}_{n,N} \mid S_{1:n}, L) = \prod_{i=n+1}^{N} p\big(S_i \le \min\{S_{1:n}\} \mid S_{1:n}, L\big) = F\big(\min\{S_{1:n}\} \mid L\big)^{N-n}.
\]
We can now simplify (4.20) to obtain

\[
p(S_{1:n}, L \mid M_{1:n}, H_{1:n}, \mathrm{Top}_{n,N}) \;\propto\; p(L) \prod_{i=1}^{n} g(S_i - L H_i)\; F\big(\min\{S_{1:n}\} \mid L\big)^{N-n} \prod_{i,j \le n} p(M_{ij} \mid S_i, S_j). \qquad (4.22)
\]

4.3.2 Inference Via MCMC

We use a Metropolis-Hastings (MH) algorithm to sample from the posterior in (4.22). By virtue of working with the random variables S1:n conditional on the event Topn,N, the algorithm will require taking skill samples that are far into the tails of the skill prior, G. In order to facilitate fast sampling from the correct region, we develop a tailored approach for both the initialization and the proposal distribution of the algorithm. The key to our approach is to center and de-correlate the posterior distribution, a process known as whitening [Barber, 2012, Murray and Adams, 2010]. This leads to good proposals in the MH sampler and is also useful in setting the skill scaling hyperparameter, σM, as we discuss below. More specifically, we will use a Gaussian proposal distribution with mean vector

15It is only through the act of conditioning on Topn,N that we can infer something about the rankings of these players.

equal to the current state of the Markov chain and covariance matrix, λ²Σ/n, where16 λ = 2.38 and Σ is an approximation to the covariance of the posterior distribution. In the numerical results of Section 4.5, we will run multiple chains in order to properly diagnose convergence to stationarity using the well-known Gelman-Rubin R̂ diagnostics. Towards this end, the starting points of the chains will be generated from an N(µ, γI) distribution where µ is an approximation to the mean of the posterior distribution and where γ is set sufficiently large so as to ensure the starting points are over-dispersed.
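A minimal sketch of the proposal step just described, assuming a current state x = (S1:n, L) and a covariance estimate Sigma (both placeholders here), is the following; the 2.38 scaling and division by the dimension follow the text above.

```python
import numpy as np

def propose(x, Sigma, rng, lam=2.38):
    """Random-walk proposal with covariance (lam^2 / d) * Sigma, where d = len(x)
    and Sigma approximates the posterior covariance of (S_1:n, L)."""
    d = len(x)
    return rng.multivariate_normal(mean=x, cov=(lam**2 / d) * Sigma)

# Illustrative usage with a placeholder covariance estimate.
rng = np.random.default_rng(0)
d = 5
x_proposed = propose(np.zeros(d), np.eye(d), rng)
```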

Approximating the Mean and Covariance of the Posterior

The posterior of the skills S1:n can be approximated by considering the posterior distribution of its order statistics, S[1:n] := (S[1], S[2], ..., S[n]), where S[i] is the ith largest of S1:n. If we were to reorder the player indices so that the index of each player equaled his rank, then we would have

S1:n = S[1:n]. Unfortunately we don't a priori know the ranks of the players. Indeed their ranks are uncertain and will follow a distribution over permutations of {1, ..., n} that is dependent on the data. However, it is possible to construct a ranking of players that should have a high posterior probability by running BTL on their match-play results. If we order the player indices according to their BTL ranks then we would expect

\[
p(S_{1:n} = s_{1:n}, L \mid M_{1:n}, H_{1:n}, \mathrm{Top}_{n,N}) \;\approx\; p(S_{[1:n]} = s_{1:n}, L \mid M_{[1:n]}, H_{[1:n]}) \qquad (4.23)
\]
where M[1:n], H[1:n] are the match-play results and handedness of the now ordered top n players. We can estimate the mean and covariance of the distribution given by (4.23) via Monte Carlo.

First let us apply Bayes’ rule to separate L, S[n] and S[1:n−1] and obtain

\[
p(S_{[1:n]}, L \mid M_{[1:n]}, H_{[1:n]}) = p(S_{[1:n-1]} \mid S_{[n]}, L, M_{[1:n]}, H_{[1:n]})\; p(S_{[n]} \mid L, M_{[1:n]}, H_{[1:n]})\; p(L \mid M_{[1:n]}, H_{[1:n]}). \qquad (4.24)
\]

We would like to jointly sample from this distribution by first sampling L, then sampling S[n] conditional on L, and finally sampling S[1:n−1] conditional on both L and S[n]. The empirical mean

16The value of λ = 2.38 is optimal under certain conditions; see [Roberts et al., 2001]. We also divide the covariance matrix by the dimension, n, to help counteract the curse of dimensionality which makes proposing a good point more difficult as the dimension of the state-space increases.

1. As discussed in Section 4.2.2, for large populations the posterior of L can be approximated as
\[
p(L \mid M_{[1:n]}, H_{[1:n]}) \;\mathrel{\overset{\sim}{\propto}}\; p(L) \left( \frac{\exp(n_l/n \cdot L/b)}{q \exp(L/b) + 1 - q} \right)^{n}. \qquad (4.25)
\]
We can easily simulate from the distribution on the r.h.s. of (4.25) by computing its CDF numerically and then using the inverse transform approach. It can be seen from Figure 4.2 in Section 4.5 that this approximation is very accurate for large N.

2. It is intractable to simulate S[n] directly according to the conditional distribution on the r.h.s. of (4.24). It seems reasonable to assume, however, that

\[
p(S_{[n]} \mid L, M_{[1:n]}, H_{[1:n]}) \;\approx\; p(S_{[n]} \mid L), \qquad (4.26)
\]

where we ignore the conditioning on M[1:n] and H[1:n]. As with L, we can use the inverse

transform approach to generate S[n] according to the distribution on the r.h.s. of (4.26) by noting that its density is proportional to [David and Nagaraja, 2003, p. 12]

\[
F(S_{[n]} \mid L)^{N-n}\, \big(1 - F(S_{[n]} \mid L)\big)^{n-1} f(S_{[n]} \mid L)
\]

where f and F are as defined in (4.21).

3. Finally, we can handle the conditional distribution of S[1:n−1] on the r.h.s. of (4.24) by assuming
\[
p(S_{[1:n-1]} \mid S_{[n]}, L, M_{[1:n]}, H_{[1:n]}) \;\approx\; p(S_{[1:n-1]} \mid S_{[n]}, L) \qquad (4.27)
\]

where we again ignore the conditioning on M[1:n] and H[1:n]. It is easy to simulate S[1:n−1] from the distribution on the r.h.s. of (4.27). We do this by simply generating n − 1 samples from the distribution p(S | S > S[n], L) = p(G + HL | G + HL > S[n], L) (a simple truncated distribution) and then ordering the samples.

We can run steps 1 to 3 repeatedly to generate many samples of (S[1:n], L) and then use these samples to estimate the mean, µ, and covariance matrix, Σ, of the true posterior distribution of (S1:n, L) (where we recall that the top n players have now been ordered according to their BTL ranking). As described above, the resulting Σ is used in the proposal distribution for the MH algorithm while we use µ as the mean of the over-dispersed starting points for each chain. The accuracy of the approximation is investigated empirically in Section 4.5, where it is found to be very close to the true mean and covariance of L and S1:n. We note that the approximate mean, µ, and covariance, Σ, could also be used in other MCMC algorithms such as Hamiltonian Monte Carlo [Neal et al., 2011, Sec 4.1], Elliptical Slice Sampling [Murray et al., 2010] or even in an EPESS-like method as in Chapter 2.
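Steps 1 to 3 can be sketched numerically as follows. This is a simplified illustration with illustrative values of N, n, and nl, using grid-based inverse-transform sampling and a shortcut for the truncated Laplace mixture that exploits its exponential upper tail; it is not the thesis implementation.

```python
import numpy as np

b, q, sigma_L = 1 / np.sqrt(2), 0.11, 10.0
N, n, n_l = 100_000, 150, 23          # illustrative values
rng = np.random.default_rng(0)

def laplace_cdf(x):
    return np.where(x < 0, 0.5 * np.exp(x / b), 1 - 0.5 * np.exp(-x / b))

def laplace_pdf(x):
    return np.exp(-np.abs(x) / b) / (2 * b)

def F_skill(s, L):                    # mixture CDF corresponding to (4.21)
    return q * laplace_cdf(s - L) + (1 - q) * laplace_cdf(s)

def f_skill(s, L):                    # mixture density (4.21)
    return q * laplace_pdf(s - L) + (1 - q) * laplace_pdf(s)

def grid_sample(grid, log_w):
    """Inverse-transform sample from an unnormalized log density tabulated on a grid."""
    cdf = np.cumsum(np.exp(log_w - log_w.max()))
    return grid[np.searchsorted(cdf, rng.random() * cdf[-1])]

def sample_skills_and_L():
    # Step 1: L from the approximate posterior (4.25).
    Lg = np.linspace(-1.0, 2.0, 2001)
    log_post_L = (-0.5 * (Lg / sigma_L) ** 2
                  + n * ((n_l / n) * Lg / b - np.log(q * np.exp(Lg / b) + 1 - q)))
    L = grid_sample(Lg, log_post_L)

    # Step 2: S_[n] from the order-statistic density F^(N-n) (1-F)^(n-1) f.
    sg = np.linspace(1.0, 15.0, 4001)
    F = F_skill(sg, L)
    log_dens = (N - n) * np.log(F) + (n - 1) * np.log1p(-F) + np.log(f_skill(sg, L))
    S_n = grid_sample(sg, log_dens)

    # Step 3: the remaining n-1 skills from the mixture truncated below at S_[n].
    # For large N we have S_[n] > max(0, L), so both mixture components have
    # exponential upper tails of scale b, and a truncated draw is S_[n] + Exponential(b).
    S_rest = S_n + rng.exponential(scale=b, size=n - 1)
    return np.concatenate((np.sort(S_rest)[::-1], [S_n], [L]))

# Estimate the posterior mean and covariance of (S_[1:n], L) by Monte Carlo.
draws = np.array([sample_skills_and_L() for _ in range(200)])
mu, Sigma = draws.mean(axis=0), np.cov(draws, rowvar=False)
```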

(S1:n,L) (where we recall that the top n players have now been ordered according to their BTL ranking). As described above, the resulting Σ is used in the proposal distribution for the MH algorithm while we use µ as the mean of the over-dispersed starting points for each chain. The accuracy of the approximation is empirically investigated in Section 4.5 and is found to be very close to the true mean and covariance of L and S1:n. We note that the approximate mean, µ, and covariance, Σ, could also be used in other MCMC algorithms such as Hamiltonian Monte Carlo [Neal et al., 2011, Sec 4.1], Elliptical Slice Sampling [Murray et al., 2010] or even used in an EPESS-like method like in Chapter 2.

Setting σM via an Empirical Bayesian Approach

The hyperparameter σM was introduced in (4.17) to adjust for the predictability of the sport and we need to determine an appropriate σM in order to fully specify our model. A simple way to do this is to set σM to be the maximum likelihood estimator over the match-play data where the skills are set to be S1:n = µS, the approximate posterior mean of the skills derived above. That is,

\[
\sigma_M = \operatorname*{argmax}_{\sigma > 0} \prod_{i,j \le n} p(M_{ij} \mid S_i = \mu_i, S_j = \mu_j; \sigma). \qquad (4.28)
\]
We are thus adopting an empirical Bayes approach where a point estimate of the random variables is used to set the hyperparameter, σM; see [Murphy, 2012, p. 172].
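The empirical Bayes step in (4.28) is a one-dimensional maximization of the match-play log-likelihood over σ with the skills fixed at their approximate posterior means. The sketch below illustrates it; the inputs mu, M, and T are synthetic placeholders, and the optimization on the log scale is a convenience choice, not prescribed by the thesis.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_sigma_M(mu, M, T):
    """Maximize the match-play log-likelihood (4.28) over sigma > 0.
    mu: approximate posterior mean skills (length n); M[i, j]: wins of i over j;
    T[i, j] = M[i, j] + M[j, i]: number of matches played (binomial constants dropped)."""
    diff = mu[:, None] - mu[None, :]
    def neg_log_lik(log_sigma):
        sigma = np.exp(log_sigma)                       # keep sigma positive
        log_p = -np.logaddexp(0.0, -sigma * diff)       # log of the logistic win probability (4.17)
        return -np.sum(M * log_p)
    return np.exp(minimize_scalar(neg_log_lik).x)

# Illustrative usage with synthetic skills and match counts generated at sigma = 2.
rng = np.random.default_rng(0)
n = 20
mu = np.sort(rng.laplace(scale=1 / np.sqrt(2), size=n))[::-1]
T = np.triu(rng.integers(0, 5, size=(n, n)), 1); T = T + T.T
true_p = 1.0 / (1.0 + np.exp(-2.0 * (mu[:, None] - mu[None, :])))
M = rng.binomial(np.triu(T, 1), np.triu(true_p, 1)); M = M + (np.triu(T, 1) - M).T
print("estimated sigma_M:", fit_sigma_M(mu, M, T))
```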

An alternative to the empirical Bayes approach would be to allow σM to be a random variable in the generative model and to infer its value via MCMC. Unfortunately this approach leads to complications. Recall from Section 4.2 that σM can be interpreted as scaling the left-handed advantage and innate skill distributions. If σM is allowed to be random then this effectively changes the skill distribution. For example if the skills were normally distributed conditioned on σM, and

σM were distributed as an inverse gamma distribution, then the skills would effectively have a t distribution (as the inverse gamma is a conjugate prior to the normal distribution). Since we wish to keep our skills as Laplace distributed, or more generally medium-tailed, it is simpler to fix σM as a hyperparameter.

4.4 Using External Rankings and Handedness Data

An alternative variation on our model is one where we know the individual player handedness of each of the top n players and also have an external ranking scheme of their total skills. For example, such a ranking may be available for the professional athletes in a given sport, e.g. the official world rankings maintained by the World ATP Tour for men's tennis [Stefani, 1997]. We will assume without loss of generality that the player indices are ordered as per the given rankings so that the ith ranked player has index i in our model and Si ≥ Si+1 for all i < n. We are therefore assuming that the external rankings are "correct" and so we can condition our generative model on these rankings. Specifically, we tighten the assumption made in (4.4) that players indexed 1 to n are the top n players to assume that the ith indexed player is also the ith ranked player for i = 1, ..., n. Then a similar argument to that which led to (4.22) implies that the posterior of interest satisfies

\[
p(S_{1:n}, L \mid H_{1:n},\, S_i \ge S_{i+1}\ \forall i < n,\, S_n \ge \max\{S_{n+1:N}\}) \;\propto\; p(L) \prod_{i=1}^{n} g(S_i - L H_i)\; F(S_n \mid L)^{N-n} \prod_{i<n} I[S_i \ge S_{i+1}] \qquad (4.29)
\]
where I[·] denotes the indicator function. We can sample from this posterior using Metropolis-within-Gibbs, where the conditional distribution of each skill satisfies

\[
p(S_i \mid S_{-i}, H_{1:n}, L) \;\propto\; g(S_i - L H_i)\, I[S_{i+1} \le S_i \le S_{i-1}] \qquad (4.30)
\]
for 1 < i < n and where S−i := {Sj : j ≠ i, 1 ≤ j ≤ n}. Similarly the conditional distributions of S1 and Sn satisfy

\[
p(S_1 \mid S_{-1}, H_{1:n}, L) \;\propto\; g(S_1 - L H_1)\, I[S_1 \ge S_2]
\]
and
\[
p(S_n \mid S_{-n}, H_{1:n}, L) \;\propto\; g(S_n - L H_n)\, F(S_n \mid L)^{N-n}\, I[S_{n-1} \ge S_n].
\]
Conveniently, the skills of the odd-ranked players can be updated simultaneously since they are all independent of each other conditional on the skills of the even-ranked players. Similarly, the even-ranked players can also be updated simultaneously conditional on the skills of the odd-ranked players. This makes the sampling parallelizable and efficient to implement when using Metropolis-within-Gibbs. Our algorithm therefore updates the variables in three blocks:

1. Update all even skills simultaneously.

2. Update all odd skills simultaneously.

3. Update L via a Metropolis-Hastings17 (MH) step with

\[
p(L \mid S_{1:n}, H_{1:n}) \;\propto\; p(L) \prod_{i=1}^{n} g(S_i - L H_i)\; F(S_n \mid L)^{N-n}.
\]
In the absence of match-play data, we believe this model should yield slightly more accurate inference regarding L than the base model of Section 4.2 when the left-handers are not evenly spaced among the top n players. For example, it may be the case that all left-handers in the top n are ranked below all the right-handers in the top n. While such a scenario is of course unlikely, it would suggest that the value of L is not as large as that inferred by the base model, which only considers nl and not the relative ranking of the nl players among the top n. The model here accounts for the relative ranking and as such, should yield a more accurate inference of L to the extent that the lefties are not evenly spaced among the top n players.
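The alternating block structure of the skill updates can be sketched as follows. The per-coordinate random-walk proposal, the step size, and the function names are illustrative choices (the thesis does not prescribe the within-block proposal), and the MH update of L (block 3) is omitted for brevity; the conditional targets follow (4.29) and (4.30).

```python
import numpy as np

def block_gibbs_sweep(S, L, H, N, rng, step=0.1, b=1 / np.sqrt(2), q=0.11):
    """One Metropolis-within-Gibbs sweep over the ranked skills S (in decreasing order),
    updating the two alternating blocks of players in turn, per (4.29)-(4.30)."""
    n = len(S)

    def laplace_cdf(x):
        return 0.5 * np.exp(x / b) if x < 0 else 1 - 0.5 * np.exp(-x / b)

    def log_target(i, s):
        lp = -abs(s - L * H[i]) / b                  # log g(s - L H_i), up to an additive constant
        if i == n - 1:                               # lowest-ranked player carries the F(S_n | L)^(N-n) factor
            F = q * laplace_cdf(s - L) + (1 - q) * laplace_cdf(s)
            lp += (N - n) * np.log(F)
        return lp

    for parity in (0, 1):                            # one alternating block, then the other
        for i in range(parity, n, 2):
            lo = S[i + 1] if i + 1 < n else -np.inf  # ordering constraint I[S_{i+1} <= S_i <= S_{i-1}]
            hi = S[i - 1] if i > 0 else np.inf
            prop = S[i] + step * rng.normal()
            if lo <= prop <= hi and np.log(rng.random()) < log_target(i, prop) - log_target(i, S[i]):
                S[i] = prop
    return S

# Illustrative usage on a small, already-ordered placeholder skill vector.
rng = np.random.default_rng(0)
n, N, L = 10, 10_000, 0.4
H = rng.random(n) < 0.11
S = np.sort(rng.laplace(scale=1 / np.sqrt(2), size=n) + 4.0)[::-1]
S = block_gibbs_sweep(S.copy(), L, H, N, rng)
```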

4.5 Numerical Results

We now apply our models and results to Men's ATP tennis. Specifically, we use handedness data as well as match-play results from ATP Tennis Navigator [Ten, 2004], a database that includes more than seven thousand players from 1980 until the present and hundreds of thousands of match results at various levels of professional and semi-professional tennis. We restrict ourselves to players for whom handedness data is available and who have played a minimum number of games (here, set at thirty). This last restriction is required because we run BTL as a preprocessing step in order to extract the top n = 150 players before applying our methods and because BTL can be susceptible to large errors if the graph of matches (with players as nodes, wins as directed edges) is not strongly18 connected. Using data on numbers of recreational tennis players [Ten, 2015, PAC,

17In the numerical results of Section 4.5, the MH proposal distribution for L will be a Gaussian with mean equal to the current value of L and where the variance is set during a tuning phase to obtain an acceptance probability ≈ 0.234. (This value is theoretically optimal under certain conditions; see [Roberts et al., 1997]).

18Otherwise there may be players that have played and won only one match who are impossible to meaningfully rank.

2016], we roughly estimate a universe of N = 100,000 advanced players but we note that our results were robust to the specific value of N that we chose. Specifically, we also considered N = 1 million and N = 50 million and obtained very similar results regarding L. We used data from 2014 for all of our experiments and our results were based on the model of Section 4.2 with a Laplace prior on the skills and an uninformative prior on L with σL = 10. The MCMC chains for the extended model of Section 4.3 were initialized by a random perturbation from the approximate mean as outlined in Section 4.3.2, and convergence was checked using the Gelman-Rubin diagnostic [Gelman and Rubin, 1992].

4.5.1 Posterior Distribution of L

In Figure 4.2, we display the posterior of L obtained using each of the different models of Sections 4.2 to 4.4 using data from 2014 only. We observe that these inferred posteriors are essentially identical. This is an interesting result and it suggests that for large populations there is essentially no additional information conveyed to the posterior of L by match-play data or ranked handedness if we are already given the proportion of left-handers among the top n players. The posterior of L can be interpreted in terms of a change in rank as discussed in Section 4.2.3. Since the posterior of the aggregate handedness with the Laplace approximation agrees with the posterior obtained from the full match-play data, we would argue the results of Table 4.1 are valid even in the light of the match-play data. These results suggest that being left-handed in tennis improves a player's rank by a factor of approximately 1.36 on average. Of course, these results were based on 2014 data only and as we shall see in Section 4.6, there is substantial evidence to suggest that L has varied through time. While match-play data and ranked handedness therefore provide little new information on L over and beyond knowing the proportion of left-handers in the top n players, we can use the match-play model to answer hypothetical questions regarding win probabilities when the lefty advantage is stripped away from the players' skills. We discuss such hypothetical questions in Section 4.5.3.

4.5.2 On the Importance of Conditioning on Topn,N

We now consider if it’s important to condition on Topn,N as we do in (4.22) when evaluating the lefty advantage. This question is of interest because previous papers in the literature have tried to CHAPTER 4. THE ADVANTAGE OF LEFTIES IN ONE-ON-ONE SPORTS 65


Figure 4.2: Posteriors of the left-handed advantage, L, using the inference methods developed in Sections 4.2, 4.3 and 4.4. "Match-play" refers to the results of MCMC inference using the full match-play data, "Handedness and Rankings" refers to MCMC inference using only individual handedness data and external skill rankings, "Aggregate handedness" refers to (4.9) and "Aggregate handedness with Laplace Approximation" refers to (4.9) but where the Laplace approximation in (4.11) substitutes for the likelihood term. "Match-play without conditioning on Top_{n,N}" is discussed in Section 4.5.2.

infer the lefty advantage without conditioning on the players in their data-set being the top ranked players. For example, Del Corral and Prieto-Rodríguez [2010] built a probit model for predicting match-play outcomes and their factors included a player's rank, height, age and handedness. Using data from 2005 to 2008 they did not find a consistent statistically significant advantage to being left-handed in this context. That said, their focus was not inferring the lefty advantage, but rather seeing how useful an indicator it is for predicting match-play outcomes. Moreover, it seems reasonable to assume that any lefty advantage would already be accounted for by a player's rank. In contrast, Breznik [2013] attempted to infer the lefty advantage by comparing the mean ranking of top left- and right-handed players as well as the frequency of matches won by top left- vs top right-handed players. Using data from 1968 to 2011 they found a statistically significant advantage for lefties. The analysis in these studies does not account for the fact that the players in their data-

sets were the top ranked players. In our model, this is equivalent to removing the Top_{n,N} condition from the left side of (4.22) and F^{N-n}(min{S_{1:n}} | L) from its right side.

It makes sense then to assess if there is value in conditioning on Top_{n,N} when we use match-play data (and handedness) among top players to estimate the lefty advantage. We therefore re-estimated the lefty advantage by considering the same match-play model of Section 4.3 but where we did not condition on Top_{n,N}. Inference was again performed using MCMC on the posterior of (4.22) but with the F^{N-n}(min{S_{1:n}} | L) factor ignored. Based on this model we arrive at a very different posterior for L and this may also be seen in Figure 4.2. Indeed, when we fail to condition on Top_{n,N} we obtain a posterior for L that places more probability on a lefty disadvantage than a lefty advantage and whose mode is negative. The reason for this is that in 2014 there were few very highly ranked left-handers, with Nadal being the only left-hander among the top 20. In fact, when we apply BTL to the match-play data among the top 150 players that year we find that the mean rank of top right-handers is 74.07 while the mean rank of top lefties is 83.39. A model that only considers results among the top 150 players that year therefore concludes there appears to be a disadvantage to being left-handed. In contrast, when we also condition upon Top_{n,N} (as indeed we should, since this provides additional information) we find this conclusion reversed, so that there appears to be a lefty advantage. Moreover, this reversal makes sense: with 23 players in the top 150 being lefties, the percentage of top lefties is 15.33%, which is greater than the 11% assumed for the general population. We therefore see that conditioning on Top_{n,N} can result in a significant improvement in the inference of L over and beyond just considering the match-play results among the top n players.

4.5.3 Posterior of Skills With and Without the Advantage of Left-handedness

We now consider the posterior distribution of the innate skills and how they differ (in the case of left-handers) from the posterior of the total skills. We also consider the effect of L on match-play probabilities and rankings of individual players. These results are based on the extended model of Section 4.3. In Figures 4.3 and 4.4 we demonstrate the posterior of the skills of Rafael Nadal, who plays19 left-handed, and the right-handed Roger Federer. During their careers these two players have forged perhaps the greatest rivalry in modern sport. The figures display the posterior distributions of the innate skill, G, and total skill, S = G + HL, of Nadal and compare them to the skill, S = G, of Federer, who is right-handed. Clearly the posterior skill distribution of Federer is to the right of Nadal's, and the discrepancy between the two is even larger when the advantage of left-handedness is removed. This is not surprising since Federer's ranking, according to BTL (as well as official year-end rankings), was higher than Nadal's at the end of 2014. It may be tempting to infer from these posteriors that Federer has a high probability of beating Nadal in any given match (based on 2014 form). The match-play data, however, shows that Nadal and Federer played only once in 2014, with Nadal winning in the semi-final of the Australian Open. While Nadal versus Federer is probably the most interesting match-up for tennis fans, this match-up also points to one of the weaknesses of our model. Specifically, we do not allow for player interaction effects in determining win probabilities whereas it is well known that some players match up especially well against other players. Nadal, for example, is famous for matching up particularly well against Federer and has a 23-15 career head-to-head win / loss record20 against Federer despite Federer often having a superior record (to Nadal's) against other players. For this reason (and others outlined in the introduction) we do acknowledge that inference regarding individual players should be conducted with care. Table 4.2 extends the analysis of Federer and Nadal to other top ranked players. Each cell in the table provides the probability of the lower ranked player beating the higher ranked player according to their posterior skill distributions. Above the diagonal the advantage of left-handedness is included in the calculations whereas below the diagonal it is not. The effect on the winning probability due to left-handedness can be observed by comparing the values above and below the diagonal. For example, Nadal has a 39.5% chance of beating Federer with the advantage of left-handedness included, but this drops to 31.8% when the advantage is excluded.

19It is of interest to note that although Nadal plays left-handed and has done so from a very young age, he is in fact right-handed. Therefore, to the extent that the ISH holds, Nadal may not actually be benefitting from L. In contrast, to the extent that the NFDS mechanism is responsible for the lefty advantage, Nadal should be benefitting from L.

20The 23-15 record is as of writing this article although it should be noted that Federer has won their last 5 encounters. It is also interesting to note that Nadal's advantage is explained entirely by their clay-court results where he has a 13-2 head-to-head win / loss edge.


Figure 4.3: Posterior distribution of Federer's skill minus Nadal's skill with and without the advantage of left-handedness. Figure 4.4: Marginal posterior distributions of Federer's and Nadal's skills, with and without the advantage of left-handedness.

The left-handed players are identified in bold font together with the match-play probabilities that change when the left-handed advantage is excluded, i.e. when left- and right-handed players meet. In all cases removing the advantage of left-handedness decreases the winning probability of left-handed players, although the magnitude of this effect varies on account of the non-linearity of the sigmoidal match-play probabilities in (4.17). If the advantage of left-handedness were removed then the decrease in left-handers' skills would lead to a change in their rankings. In Table 4.3, for example, we see how the ranking (as determined by posterior skill means) of the top four left-handed players changes when we remove the left-handed advantage. We also display how these rankings would change when we only use handedness data and external rankings as in Section 4.4, and when we only have aggregate handedness data of the top n players as in Section 4.2.2. We see that the changes in rankings suggested by each of the methods largely agree, although there are some minor variations. Notably, Nadal's rank does not change when we use the full match-play data-set but he does drop from 4 to 5 when we use the other inference approaches. Klizan's change in rank using the handedness and external rankings data is much smaller than for the other two methods. Overall, however, we see substantial agreement between the three approaches. This argues strongly for use of the simplest approach, i.e. the aggregate handedness approach of Section 4.2, when the only quantity of interest is the posterior distribution of L.

G_i \ S_i    Djokovic   Federer   Nishikori   Nadal   Murray   Raonic   Klizan   Lopez
Djokovic     -          38.9      31.5        29.3    21.9     20.2     8.0      7.6
Federer      38.9       -         42.0        39.5    30.6     28.5     12.0     11.5
Nishikori    31.5       42.0      -           47.4    37.8     35.5     15.9     15.2
Nadal        22.9       31.8      39.2        -       40.3     37.9     17.3     16.7
Murray       21.9       30.6      37.8        48.5    -        47.5     23.7     22.8
Raonic       20.2       28.5      35.5        46.0    47.5     -        25.6     24.7
Klizan       5.9        8.9       11.9        17.3    18.2     19.7     -        48.8
Lopez        5.6        8.5       11.4        16.7    17.5     19.0     48.8     -

Table 4.2: Probability of match-play results with and without the advantage of left-handedness. The players are ordered according to their rank from the top ranked player (Djokovic) to the lowest ranked player (Lopez). A player's rank is given by the posterior mean of his total skill, S. Each cell gives the probability of the lower ranked player beating the higher ranked player. Above the diagonal the advantage of left-handedness is included in the calculations whereas below the diagonal it is not. The left-handed players are identified in bold font together with the match-play probabilities that change when the left-handed advantage is excluded, i.e. when left- and right-handed players meet.

4.6 A Dynamic Model for L

Thus far we have only considered inference based on data collected over a single time period, but it is also interesting to investigate how the advantage of left-handedness has changed over time in a given sport. Towards this end, we assume that the advantage of left-handedness, L_t, in period t evolves as a Gaussian random walk for t = 1, ..., T. We also assume that L_t is latent and therefore unobserved. Instead, we observe the number of top players, n_t, as well as the number, n^l_t ≤ n_t, of those top players that are left-handed. The generative model is as follows:

Player              Match-play   Handedness and Rankings   Aggregate handedness

Rafael Nadal        4 → 4        4 → 5                     4 → 5
Martin Klizan       25 → 39      24 → 33                   24 → 33
Feliciano Lopez     27 → 41      27 → 37                   27 → 37
Fernando Verdasco   38 → 52      36 → 49                   36 → 49

Table 4.3: Changes in rank of prominent left-handed players when the left-handed advantage is excluded. The change in rank in the “Match-play” column is computed using the MCMC samples generated using the full match-play data. The change in rank in the “Handedness and Rankings” column is computed using the MCMC samples given only individual handedness data and external skill rankings. The “Aggregate handedness” column uses the BTL ranking as the baseline and the change in rank is obtained by multiplying the baseline by the rank scaling factor of 1.36 from Table 4.1 and rounding to the closest integer.

• Initial left-handed advantage: L_0 ∼ N(0, θ²)
• For time periods t = 1, ..., T:
  – Left-handed advantage: L_t ∼ N(L_{t−1}, σ_K²)
  – Number of top left-handers: n^l_t ∼ p(n^l_t | L_t; n_t)

where we assume θ is large to reflect initial uncertainty on L_0 and σ_K controls how smoothly L varies over time. The posterior distribution of L given the data is then given by

p(L_{0:T} | n^l_{1:T}; n_{1:T}) ∝ p(L_0) ∏_{t=1}^{T} p(L_t | L_{t−1}) p(n^l_t | L_t; n_t).    (4.31)

The main complexity in (4.31) stems from the distribution p(n^l_t | L_t; n_t). This quantity is infeasible to compute exactly since it requires marginalizing out the player skills, but as we have previously seen, it can be accurately approximated by the Laplace approximation in (4.11). Conveniently, using this approximation results in all of the factors in (4.31) being Gaussian and since

Gaussians are closed under multiplication, the posterior of L_{0:T} becomes a multivariate Gaussian.

Specifically, we have

p(L_{0:T} | n^l_{1:T}; n_{1:T}) ∝ N(L_0; 0, θ²) ∏_{t=1}^{T} N(L_t; L_{t−1}, σ_K²) N(L_t; L_t^*, \tilde{b}_r² n_t / n^l_t)
                               ∝ N(L_{0:T}; μ_{0:T}, Σ_{0:T})    (4.32)

where L_t^* is as defined in (4.10), and μ_{0:T} and Σ_{0:T} are the posterior mean and covariance matrix, respectively, of L_{0:T}. Both μ_{0:T} and Σ_{0:T} are functions of n^l_{1:T}, n_{1:T}, θ and σ_K², and can be evaluated analytically using standard Kalman filtering methods [Barber, 2012, Sec. 24]. A major advantage of the Kalman filter / smoother is that finding an appropriate smoothing parameter σ_K² via maximum likelihood estimation (MLE) is computationally tractable. The likelihood of observing the handedness data under the generative model is:

p(n^l_{1:T}; n_{1:T}, σ_K²) = ∫_{R^{T+1}} p(n^l_{1:T}, L_{0:T}; n_{1:T}, σ_K²) dL_{0:T}
    = ∫_{R^{T+1}} p(L_0) ∏_{t=1}^{T} p(L_t | L_{t−1}; σ_K²) p(n^l_t | L_t; n_t) dL_{0:T}
    ∝ ∫_{R^{T+1}} N(L_0; 0, θ²) ∏_{t=1}^{T} N(L_t; L_{t−1}, σ_K²) N(L_t; L_t^*, \tilde{b}_r² n_t / n^l_t) dL_{0:T}    (4.33)

where the constant of proportionality coming from the Laplace approximation does not depend on σ_K². Since all of the factors in (4.33) involving L are Gaussian, it is possible to analytically integrate out L, leaving a closed-form expression involving σ_K². Maximizing this expression w.r.t. σ_K² leads to an MLE for σ_K², and the calculation details are provided in Appendix C.4. This value of σ_K² can then be substituted into (4.32) to find the posterior of L.

Figure 4.5: Posteriors of the left-handed advantage, L_t, computed using the Kalman filter / smoother. The dashed red horizontal lines (at 11% in the upper figure and 0 in the lower figure) correspond to the level at which there is no advantage to being left-handed. The error bars in the marginal posteriors of L_t in the lower figure correspond to 1 standard deviation errors.

In the top panel of Figure 4.5 we plot the fraction of left-handers among the top 100 men's tennis players as a function of year from 1985 to 2016, using data from [Bačić and Gazala]. In the bottom panel we plot the inferred value of L over this time period. We note that in 2006 and 2007, the fraction of top left-handers dropped below 11%, the estimated fraction of left-handers in the general population. A naive analysis would conclude that for those years the advantage of left-handedness was negative. However, this would ignore the randomness in the fraction of top left-handers from year to year. The Kalman filter smooths over the anomalous 2006 and 2007 years and has a posterior on L with positive mean throughout 1985 to 2016. We also recall our observation from the introduction where we noted that only 2 of the top 32 seeds in Wimbledon in

2015 were left-handed. There is no inconsistency between that observation and the data from 2015 in Figure 4.5, however. While there were indeed only 2 left-handed men among the top 32 in the official year-end world rankings, there was a total of 13 left-handers among the top 100. Finally, we note that we could also have included individual player skills as latent states in our model, but this would have resulted in a much larger state space and made inference significantly more difficult.21 As we observed in Section 4.5.1, including match-play results does not change the posterior distribution of L significantly, and so we are losing very little information regarding L_t when our model and inference are based only on the observed number of top left-handed players.
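The Kalman filter / smoother recursion used for the dynamic model of Section 4.6 is short enough to sketch directly. The following is a minimal sketch, assuming the Laplace approximation in (4.11) supplies a Gaussian pseudo-observation of L_t with mean L_star[t] and variance v_obs[t] each year; the function and argument names are illustrative rather than the thesis's implementation.

```python
import numpy as np

def kalman_smooth_L(L_star, v_obs, sigma_K2, theta2=100.0):
    """Filter then smooth a Gaussian random walk L_t ~ N(L_{t-1}, sigma_K2)
    given Gaussian pseudo-observations L_star[t] ~ N(L_t, v_obs[t])."""
    T = len(L_star)
    m_f, v_f = np.zeros(T), np.zeros(T)            # filtered means and variances
    m_pred, v_pred = 0.0, theta2 + sigma_K2        # prior on L_0 propagated to t = 1
    for t in range(T):
        k = v_pred / (v_pred + v_obs[t])           # Kalman gain
        m_f[t] = m_pred + k * (L_star[t] - m_pred)
        v_f[t] = (1 - k) * v_pred
        m_pred, v_pred = m_f[t], v_f[t] + sigma_K2
    m_s, v_s = m_f.copy(), v_f.copy()              # RTS backward smoothing pass
    for t in range(T - 2, -1, -1):
        c = v_f[t] / (v_f[t] + sigma_K2)
        m_s[t] = m_f[t] + c * (m_s[t + 1] - m_f[t])
        v_s[t] = v_f[t] + c**2 * (v_s[t + 1] - (v_f[t] + sigma_K2))
    return m_s, v_s
```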

21Recent techniques have been developed for MCMC on multinomial linear dynamical systems that significantly improve the efficiency of inference of large state space models [Linderman et al., 2015]. However these methods will still be orders of magnitude slower than performing inference on the reduced state space containing only L.

4.7 Conclusions and Further Research

In this chapter we have proposed a model for identifying the advantage, L, of being left-handed in one-on-one interactive sports. We use a Bayesian latent ability framework but with the additional complication of having a latent factor, i.e. the advantage of left-handedness, that we needed to estimate. Our results argued for the use of a medium-tailed distribution such as the Laplace distribution when modeling the innate skills of players. We showed how to infer the value of L when only the proportion of top left-handed players is available. In the latter case we showed that the distribution of the number of left-handed players among the top n (out of N) converges as N → ∞ to a binomial distribution with a success probability that depends on the tail-length of the innate skill distribution. We also use this result to develop a simple dynamic model for inferring how the value of L has varied through time in a given sport. In order to estimate the innate skills of top players we also considered the case when match-play data among the top n players was available. We observed that including match-play data in our model makes little or no difference to the inference of the left-handedness advantage, but it did allow us to address hypothetical questions regarding match-play win probabilities with and without the benefit of left-handedness.

It is worth noting that our framework is somewhat coarse by necessity. In tennis, for example, there are important factors such as player ability varying across different surfaces (clay, hard court or grass) that we don't model. We also attach equal weight to all matches in our model estimation despite the fact that some matches and tournaments are clearly (much) more important than others. Moreover, and as we shall see below, we assume (for a given sport) that there is a single latent variable, L, which measures the advantage of being left-handed in that sport. We therefore assume that the total skill of each left-handed player benefits to the same extent according to the value of L. This of course would not be true in practice as it seems likely that some lefties take better advantage of being left-handed than others. Alternatively, our model assumes that all righties are disadvantaged to the same extent by being right-handed. Again, it seems far more likely that some right-handed players are more adversely affected playing lefties than other right-handed players. Finally, we don't allow for interaction effects between two players in determining the probability of one player beating the other. Again, this seems unlikely to be true in practice where some players are known to "match up well" against other players. Nonetheless, we do believe our model captures the value of being left-handed in an aggregate sense and can be reasonably interpreted in that manner. While it is tempting to use the model to answer questions such as "What is the probability that Federer would beat Nadal if Nadal was right-handed?" (and we do ask and answer such questions in Section 4.5!), we do acknowledge that the answers to such specific questions should be taken lightly for the reasons outlined above.

There are several directions of interest for future research. First, it would be interesting to apply our model to data-sets from other one-on-one sports and to estimate how L_t varies with time in these sports.
The Kalman filtering / smoothing approach developed in Section 4.6 is straightforward to implement and the data requirements are very limited, as we only need the aggregate data (n_t, n^l_t) for each time period. While trends in L_t across different sports would be interesting in their own right, the cross-sport dynamics of L_t could be used to shed light on the potential explanations behind the benefit of left-handedness. For example, there is some evidence in Figure 4.5 suggesting that the benefit of left-handedness in men's professional tennis has decreased with time. If such a trend could be linked appropriately with other developments in men's tennis, such as the superior strength and speed of the players, superior racket and string technology, time pressure etc., then it may be possible to attach more or less weight to the various hypotheses explaining the benefit of left-handedness. The recent22 work of Loffing [2017], for example, studies the link between the lefty advantage and time pressure in elite interactive sports. While these ideas are clearly in the literature already, the Kalman filtering approach provides a systematic, straightforward and consistent approach for measuring L_t. This can only aid with identifying the explanation(s) for the benefits of left-handedness and how it varies across sports and time.

It would also be of interest to consider more complex models that can also account for interactions in the skill levels between players and / or different match-play circumstances. As discussed above, examples of the latter would include distinguishing between surfaces (clay, grass etc.) and grand-slam / regular tour matches in tennis. Given the flexibility afforded by a Bayesian approach it should be straightforward to account for such features in our models. Given limited match-play data in many sports, however, it is not clear that we would be able to learn much about such features. In tennis, for example, even the very best players may only end up playing each other a couple of times a year or less. As mentioned earlier, Nadal and Federer only faced each other once in 2014.

22This paper was also discussed in a recent New York Times article [NYT] reflecting the general interest in the lefty advantage beyond academia.

It would therefore be necessary to consider data-sets spanning multiple years, in which case it would presumably also be necessary to include form as well as the general trajectory of career arcs in our models. We note that such modeling might be of more general interest and identifying the value(s) of L might not be the main interest in such a study. Continuing on from the previous point, there has been considerable interest in recent years in the so-called "interacting performances theory" [O'Donoghue, 2009]. This theory recognizes that the performance (and outcome of a performance) is determined by both the skill level or quality of an opponent as well as the specific type of an opponent. Indeed, different players are influenced by the same opponent types in different ways. Under this theory, it is important23 to be able to identify different types of players. Once these types have been identified we can then label each player as being of a specific type. It may then be possible to accommodate interaction effects between specific players (as outlined in the paragraph immediately above) by instead allowing for player-type interactions. Such a model would require considerably fewer parameters to be estimated than a model which allowed for specific player-interactions. Returning to the issue of the left-handedness advantage, we would like to adapt these models and apply them to other sports such as cricket and baseball where one-on-one situations still arise and indeed are the main aspect of the sport. It is well-known, for example, that left-handed pitchers in Major League Baseball (approx. 25%) and left-handed batsmen in elite cricket (approx. 20%) are over-represented. While the model of Section 4.2 that only uses aggregate handedness could be directly applied to these sports, it would be necessary to adapt the match-play model of Section 4.3 to handle them. This follows because the one-on-one situations that occur in these sports do not have binary outcomes like win / lose, but instead have multiple possible outcomes whose probabilities would need to be linked to the skill levels of the two participants. We hope to consider some of these alternative directions in future research.

23See for example [Cui et al., 2017a], [Cui et al., 2017b] and [O'Donoghue, 2005], all of which discuss techniques for identifying different types of tennis players.

Part II

Advances in numerical stability for large-scale machine learning problems

Chapter 5

Stable and scalable softmax optimization with double-sum formulations

5.1 Introduction

The softmax, also known as the multinomial logit model, is a fundamental and ubiquitous distribution. It has applications in many diverse fields such as economics and biomedicine [Rust and Zahorik, 1993, Kirkwood and Sterne, 2010, Gopal and Yang, 2013] and appears as a smooth and convex surrogate for the (hard) maximum loss in discrete optimization [Maddison et al., 2016] and network flows [Shahrokhi and Matula, 1990]. Another major application of the softmax is in the neural network literature. The softmax is commonly used as the final layer in multi-class neural network models [Goodfellow et al., 2016, Sec. 6.2.2.2] and can be found in most neural network based state-of-the-art image classifiers [Krizhevsky et al., 2012, Simonyan and Zisserman, 2014, Zeiler and Fergus, 2014, Szegedy et al., He et al., 2016]. The popular word2vec model can be thought of as simply a softmax with embeddings [Mikolov et al., 2013a,b]. Under the softmax model the probability that a random variable y takes on a label ℓ ∈ {1, ..., K} is given by

p(y = ℓ | x; W) = \frac{e^{x^\top w_\ell}}{\sum_{k=1}^{K} e^{x^\top w_k}},    (5.1)

where x ∈ R^D is the covariate, w_k ∈ R^D is the vector of parameters for the k-th class, and W = [w_1, w_2, ..., w_K] ∈ R^{D×K} is the parameter matrix. Given a dataset of N label-covariate pairs D = {(y_i, x_i)}_{i=1}^{N}, the ridge-regularized maximum log-likelihood problem is given by

L(W) = \sum_{i=1}^{N} \Big( x_i^\top w_{y_i} - \log\big(\sum_{k=1}^{K} e^{x_i^\top w_k}\big) \Big) - \frac{\mu}{2}\|W\|_2^2,    (5.2)

where ‖W‖_2 denotes the Frobenius norm. In this chapter we will always consider x_i to be a fixed external covariate, not an embedding as in word2vec or as an input from a lower layer as in neural networks. Extensions to word2vec and neural network models are discussed in Section 5.6 at the end of this chapter.

The core computational challenge when using a softmax model in machine learning is to efficiently compute the W that maximizes (5.2). This chapter focuses on maximizing (5.2) when N, K, D are all large. This is particularly relevant since large values for N, K, D are increasingly common in modern applications such as natural language processing and recommendation systems, where N, K, and D can each be on the order of millions or billions [Chelba et al., 2013, Partalas et al., 2015].

Standard methods for optimizing the softmax break down in the large scale setting. For example, a natural approach to maximizing (5.2) with large values for N, K and D is to use Stochastic Gradient Descent (SGD), where a mini-batch of datapoints is sampled each iteration. However

when K and D are large, the O(KD) cost of calculating the normalizing sum \sum_{k=1}^{K} e^{x_i^\top w_k} in the stochastic gradients can be prohibitively expensive. Several approximations that avoid calculating the normalizing sum have been proposed to address this difficulty. These include tree-structured methods [Bengio et al., 2003, Daume III et al., 2016, Grave et al., 2016, Jernite et al., 2016], sampling methods [Bengio and Senécal, 2008, Mnih and Teh, 2012, Ji et al., 2015, Joshi et al., 2017] and self-normalization [Andreas and Klein, 2015]. Alternative models, such as the spherical family of losses [de Brébisson and Vincent, 2015, Vincent et al., 2015] that do not require normalization, have been proposed to sidestep the issue entirely [Yen et al., 2017]. All of these approximations are computationally tractable for large N, K and D, but are unsatisfactory in that they are biased and do not converge to the optimal W* = argmax_W L(W).

There are also unbiased methods in the literature, but these tend to be slow. Mussmann and Ermon [2016] use a sampling approach that leverages maximum inner product search to calculate unbiased gradients. This provides a speedup factor of about 10 compared to calculating the exact gradients. Although this is a positive result, it is still orders of magnitude slower than the biased methods above and does not scale to the values of N, K, D that we consider. Krishnapuram et al. [2005] avoid calculating the normalizing sum by using a maximization-majorization approach based on lower-bounding the eigenvalues of the Hessian matrix. They have O(ND) runtime per iteration which is not feasible for large N and D.

Recently Raman et al. [2016] showed how to recast (5.2) as a double-sum over N and K.1 This formulation is amenable to SGD methods that sample only one datapoint and one class in each iteration, reducing the per iteration cost to O(D). The problem is that Explicit SGD (ESGD)2 is unstable when applied to this formulation in that the stochastic gradients often have high variance and a high dynamic range, leading to computational overflow errors. Raman et al. deal with this instability by calculating the normalizing sum for all datapoints at a cost of O(NKD) at the end of each epoch. Although this achieves stability, its high cost nullifies the benefit of the cheap O(D) per iteration cost and renders the method unscalable to large values of N, K, D.

Previous optimization algorithms that have been applied to the double-sum formulation have thus either been unstable or unscalable. The goal of this chapter is to develop methods for optimizing double-sum formulations of the softmax that are both stable and truly scalable. We propose two such methods. The first is an implementation of Implicit SGD (ISGD), a stochastic gradient descent method that is known to be more stable than ESGD and yet has similar convergence properties [Toulis et al., 2016]. We show that the ISGD updates for the double-sum formulation can be efficiently computed using a bisection method with tight initial bounds when only one datapoint and class is sampled per iteration. Furthermore, we guarantee the stability of ISGD by proving that its step size is asymptotically linearly bounded (unlike ESGD which is exponentially bounded). The second algorithm we propose is a new SGD method specifically designed for optimizing the double-sum formulation called U-max.

1This same idea has appeared multiple times in the literature. For example, Ruiz et al. [2018] use a similar idea for variational inference of the softmax.

2We use the term "Explicit" SGD to refer to "standard" or "vanilla" SGD, which has the update formula θ^{(t+1)} = θ^{(t)} − η_t ∇f(θ^{(t)}, ξ_t) where ∇f(θ) = E_{ξ_t}[∇f(θ, ξ_t)] and η_t is the learning rate. "Explicit" SGD is to be contrasted with "Implicit" SGD, which will be introduced shortly. The "explicit" and "implicit" nomenclature is often used in the Implicit SGD literature, e.g. [Toulis et al., 2016].

U-max is guaranteed to have bounded gradients and to converge to the optimal solution of (5.2) for all sufficiently small learning rates. It is particularly suited to situations where calculating simultaneous inner products is cheap (for example when using GPUs). We compare the performance of U-max and ISGD to the state-of-the-art methods for maximizing the softmax likelihood which cost O(D) per iteration. Both U-max and ISGD outperform all other methods. ISGD has the best performance, with an average log-loss 4.69 times lower than the previous state-of-the-art biased methods, and furthermore, is robust to the learning rate.

The outline of the chapter is as follows. In Section 5.2 we develop a double-sum formulation similar to that of Raman et al. [2016], but with smaller magnitude gradients. In Section 5.3.1 we derive an efficient implementation of ISGD, analyze its runtime and bound its step size. The U-max algorithm is proposed in Section 5.3.2. Experiments are conducted in Section 5.4, showing that both U-max and ISGD outperform the previous state-of-the-art, with ISGD having the best performance. Finally, Section 5.5 concludes, with Section 5.6 giving potential extensions to the work.
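To make the scaling issue concrete, the following is a minimal NumPy sketch of the objective in (5.2); it is an illustration only, not the code used in the experiments of Section 5.4. The single matrix product that forms the log-normalizer touches every class for every datapoint, which is exactly the O(KD) per-datapoint cost that the rest of this chapter works to avoid.

```python
import numpy as np

def softmax_log_likelihood(W, X, y, mu=0.0):
    """Ridge-regularized softmax log-likelihood L(W) from (5.2).
    X: (N, D) covariates, y: (N,) labels in {0, ..., K-1}, W: (D, K)."""
    logits = X @ W                                    # O(NKD): the bottleneck term
    log_norm = np.logaddexp.reduce(logits, axis=1)    # log sum_k exp(x_i^T w_k)
    fit = logits[np.arange(len(y)), y] - log_norm
    return fit.sum() - 0.5 * mu * (W ** 2).sum()
```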

5.2 Convex double-sum formulation

In order to have an SGD method that samples both datapoints and classes each iteration we need to represent (5.2) as a double-sum over datapoints and classes. We begin by rewriting (5.2) in a more convenient form by putting the x_i^\top w_{y_i} term inside of the logarithm,

L(W) = -\sum_{i=1}^{N} \log\Big(1 + \sum_{k \neq y_i} e^{x_i^\top (w_k - w_{y_i})}\Big) - \frac{\mu}{2}\|W\|_2^2.    (5.3)

The key to converting (5.3) into its double-sum representation is to express the negative logarithm using its convex conjugate3:

-\log(a) = \max_{v<0} \{av - (-\log(-v) - 1)\} = \max_{u} \{-u - e^{-u} a + 1\}    (5.4)

where u = -\log(-v) and the optimal value of u is u^*(a) = \log(a). Applying (5.4) to each of the logarithmic terms in (5.3) yields L(W) = -\min_{u \geq 0}\{f(u, W)\} + N where

f(u, W) = \sum_{i=1}^{N} \sum_{k \neq y_i} \Big( \frac{u_i + e^{-u_i}}{K-1} + e^{x_i^\top (w_k - w_{y_i}) - u_i} \Big) + \frac{\mu}{2}\|W\|_2^2    (5.5)

3This trick is related to the bounds given in [Gopal and Yang, 2013].

is our double-sum representation that we seek to minimize. The variable u_i can be thought of as an approximation to the log-normalizer, as its optimal solution is u_i^*(W) = \log(1 + \sum_{k \neq y_i} e^{x_i^\top (w_k - w_{y_i})}) \geq 0. In Appendix D.2 we prove that the optimal u and W are contained in a compact convex set and that f is strongly convex within this set. Thus performing projected-SGD on f is guaranteed to converge to a unique optimum with a convergence rate of O(1/T) where T is the number of iterations [Lacoste-Julien et al., 2012].
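A minimal sketch of the double-sum objective (5.5) follows; it can be used to check on small problems that \min_u f(u, W) recovers -L(W) + N, which is the identity stated above. The function and variable names are illustrative assumptions.

```python
import numpy as np

def double_sum_objective(u, W, X, y, mu=0.0):
    """f(u, W) from (5.5): sum over datapoints i and classes k != y_i."""
    N, K = len(y), W.shape[1]
    logits = X @ W                                     # x_i^T w_k
    diff = logits - logits[np.arange(N), y][:, None]   # x_i^T (w_k - w_{y_i})
    total = 0.0
    for i in range(N):
        for k in range(K):
            if k == y[i]:
                continue
            total += (u[i] + np.exp(-u[i])) / (K - 1) + np.exp(diff[i, k] - u[i])
    return total + 0.5 * mu * (W ** 2).sum()
```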

5.2.1 Instability of ESGD

The challenge in optimizing f using SGD is that the gradients can have very large magnitudes.

Observe that f = E_{ik}[f_{ik}] where i ∼ unif({1, ..., N}), k ∼ unif({1, ..., K} − {y_i}) and

f_{ik}(u, W) = N\Big(u_i + e^{-u_i} + (K-1) e^{x_i^\top (w_k - w_{y_i}) - u_i}\Big) + \frac{\mu}{2}\big(\beta_{y_i}\|w_{y_i}\|_2^2 + \beta_k\|w_k\|_2^2\big),    (5.6)

where β_j = N / (n_j + (N − n_j)/(K − 1)) is the inverse of the probability of class j being sampled either through i or k, and n_j = |{i : y_i = j}|. It follows that if datapoint i and class k are sampled then the stochastic gradients with respect to w_k and w_{y_i} are:

\nabla_{w_k} f_{ik} = N(K-1) e^{x_i^\top (w_k - w_{y_i}) - u_i} x_i + \mu \beta_k w_k
\nabla_{w_{y_i}} f_{ik} = -N(K-1) e^{x_i^\top (w_k - w_{y_i}) - u_i} x_i + \mu \beta_{y_i} w_{y_i}.    (5.7)

If u_i is at its optimal value u_i^*(W) = \log(1 + \sum_{k \neq y_i} e^{x_i^\top (w_k - w_{y_i})}) then e^{x_i^\top (w_k - w_{y_i}) - u_i} \leq 1 and the magnitude of the N(K-1) e^{x_i^\top (w_k - w_{y_i}) - u_i} x_i terms in the gradient is bounded by N(K-1)\|x_i\|_2. However, if u_i \ll x_i^\top (w_k - w_{y_i}), then e^{x_i^\top (w_k - w_{y_i}) - u_i} \gg 1 and the magnitude of the gradients can become extremely large. Extremely large gradients lead to two major problems: (i) they could lead to overflow errors and cause the algorithm to crash, (ii) they result in the stochastic gradient having high variance, which leads to slow convergence.4 In Section 5.4 we show that these problems occur in practice and make ESGD both an unreliable and inefficient method.5
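The instability is easy to reproduce numerically. Below is a minimal sketch of the sampled gradients in (5.7); the names and the toy inputs are illustrative. When u_i lags far behind x_i^T(w_k − w_{y_i}) the exponential factor overflows, which is precisely failure mode (i) above.

```python
import numpy as np

def esgd_gradients(u_i, x_i, w_k, w_yi, N, K, beta_k, beta_yi, mu=0.0):
    """Stochastic gradients (5.7) for one sampled datapoint i and class k."""
    scale = N * (K - 1) * np.exp(x_i @ (w_k - w_yi) - u_i)  # blows up if u_i is too small
    grad_wk = scale * x_i + mu * beta_k * w_k
    grad_wyi = -scale * x_i + mu * beta_yi * w_yi
    return grad_wk, grad_wyi

# With u_i far below x_i^T (w_k - w_{y_i}) the exponential overflows to inf:
x = np.ones(5)
print(esgd_gradients(u_i=0.0, x_i=x, w_k=200 * x, w_yi=0 * x, N=10**5, K=10**3,
                     beta_k=1.0, beta_yi=1.0))   # RuntimeWarning: overflow in exp
```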

4The convergence rate of SGD is inversely proportional to the second moment of its gradients [Lacoste-Julien et al., 2012].

5The same problems arise if we approach optimizing (5.3) via stochastic composition optimization [Wang et al., 2016]. As is shown in Appendix D.4, stochastic composition optimization yields near-identical expressions for the stochastic gradients as in (5.7) and thus has the same stability issues.

The biased sampled softmax optimizers in the literature [Bengio and Senécal, 2008, Mnih and Teh, 2012, Ji et al., 2015, Joshi et al., 2017] do not have the issue of large magnitude gradients. These methods can all be thought of as having the same update equations as in (5.7), but instead of treating u_i as a variable to be optimized over, they estimate u_i^*(W) afresh each iteration using sampled classes. For example, in One-Vs-Each u_i^*(W) is approximated by \log(1 + e^{x_i^\top (w_k - w_{y_i})}) [Titsias, 2016]. Their gradients are all bounded since, by construction, their sampled u_i satisfy u_i > x_i^\top (w_k - w_{y_i}). Although these methods have bounded gradients, they are biased and hence their iterates do not converge to the optimal W*.

The goal of this chapter is to design reliable and efficient SGD algorithms for optimizing the double-sum formulation in (5.5). We propose two such methods: ISGD (Section 5.3.1) and U-max (Section 5.3.2). But before we introduce these methods we should establish that (5.5) is a good choice for the double-sum formulation.

5.2.2 Choice of double-sum formulation

The double-sum in (5.5) is different to that of Raman et al. [2016]. Their formulation can be derived by applying the convex conjugate substitution to (5.2) instead of (5.3). Their resulting equations are

L(W) = -\min_{\bar u}\Big\{ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K-1} \sum_{k \neq y_i} \bar f_{ik}(\bar u, W) \Big\} + N

where

\bar f_{ik}(\bar u, W) = N\Big(\bar u_i - x_i^\top w_{y_i} + e^{x_i^\top w_{y_i} - \bar u_i} + (K-1) e^{x_i^\top w_k - \bar u_i}\Big) + \frac{\mu}{2}\big(\beta_{y_i}\|w_{y_i}\|_2^2 + \beta_k\|w_k\|_2^2\big)    (5.8)

and the optimal solution for \bar u_i is \bar u_i^*(W) = \log(\sum_{k=1}^{K} e^{x_i^\top w_k}). The only difference between the double-sum formulations in (5.6) and (5.8) is the reparameterization \bar u_i = u_i + x_i^\top w_{y_i}. Although either double-sum formulation can be used as a basis for SGD, our formulation in (5.6) tends to have smaller magnitude stochastic gradients, and hence faster convergence. To see this at a high level, note that typically x_i^\top w_{y_i} = \max_k\{x_i^\top w_k\}. It follows that the \bar u_i, x_i^\top w_{y_i} and e^{x_i^\top w_{y_i} - \bar u_i} terms are typically of the greatest magnitude in (5.8). Although at optimality these terms should roughly cancel, this will not be the case during the early stages of optimization, leading to stochastic gradients of large magnitude. In contrast, the function f_{ik} in (5.6) only has x_i^\top w_{y_i} appearing as a negative exponent, and so if x_i^\top w_{y_i} is large then the magnitude of the stochastic gradients will be small. A more rigorous version of this argument is presented in Appendix D.1. In Appendix D.7 we also present numerical results confirming that in practice our double-sum formulation converges to log-losses two to three times lower than when using Raman et al.'s formulation.

5.3 Stable and scalable SGD methods

5.3.1 Implicit SGD

One method that can solve the large gradient problem from Section 5.2.1 is Implicit SGD (ISGD) [Bertsekas, 2011, Ryu and Boyd, 2014, Toulis and Airoldi, 2015a, Toulis et al., 2016].6 ISGD has a very similar form to ESGD and uses the update equation

θ^{(t+1)} = θ^{(t)} - η_t \nabla f(θ^{(t+1)}, ξ_t),    (5.9)

where θ^{(t)} is the value of the t-th iterate, f is the function we seek to minimize and ξ_t is a random variable controlling the stochastic gradient such that ∇f(θ) = E_{ξ_t}[∇f(θ, ξ_t)]. In our case θ = (u, W) and ξ_t = (i_t, k_t) with ∇f(θ^{(t+1)}, ξ_t) = ∇f_{i_t,k_t}(u^{(t+1)}, W^{(t+1)}), with f_{ik} being defined by (5.6). Note that the ISGD update in (5.9) has ∇f(θ^{(t+1)}, ξ_t) appearing on the right side of the equation, whereas ESGD has ∇f(θ^{(t)}, ξ_t) on the right. Since θ^{(t+1)} appears on both the left and right side of the ISGD update equation, its value must be solved for and may not be available in closed form. Although ISGD has similar convergence rates to ESGD, it has other properties that can make it preferable over ESGD. It is more robust to the learning rate [Toulis et al., 2016], which is important since a good value for the learning rate is never known a priori, and is provably more stable [Ryu and Boyd, 2014, Section 5]. A longer discussion of ISGD's properties and related methods is given in Section 6.2 of Chapter 6. For now we will focus on one property of ISGD that is of particular interest to our problem — that it has smaller step sizes.
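The difference between the two updates is easiest to see on a toy problem where the implicit equation can be solved in closed form. The sketch below uses f(θ, ξ) = (θ − ξ)²/2, for which (5.9) gives θ^{(t+1)} = (θ^{(t)} + ηξ)/(1 + η); this is an illustration only, not the update used for the softmax problem.

```python
def esgd_step(theta, xi, eta):
    # explicit: gradient evaluated at the current iterate
    return theta - eta * (theta - xi)

def isgd_step(theta, xi, eta):
    # implicit: solve theta_new = theta - eta * (theta_new - xi) for theta_new
    return (theta + eta * xi) / (1.0 + eta)

theta_e = theta_i = 10.0
for xi in [0.0, 1.0, -1.0, 0.5]:
    theta_e, theta_i = esgd_step(theta_e, xi, eta=5.0), isgd_step(theta_i, xi, eta=5.0)
print(theta_e, theta_i)   # ESGD diverges at this large learning rate, ISGD stays stable
```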

Proposition 2. Let {θ^{(t)} : t > 0} denote the iterates generated when ISGD is used to optimize f(θ) = E_ξ[f(θ, ξ)], where f(θ, ξ) is m-strongly convex for all ξ. Then

\|\nabla f(θ^{(t+1)}, ξ_t)\|_2 \leq \|\nabla f(θ^{(t)}, ξ_t)\|_2 - m\|θ^{(t+1)} - θ^{(t)}\|_2.

6ISGD comes in various forms under different names: ISGD [Toulis and Airoldi, 2015b, Toulis, 2016, Toulis et al., 2016, 2017], minibatch-prox [Wang et al., 2017], stochastic proximal iteration / point algorithm [Ryu and Boyd, 2014, Patrascu and Necoara, 2017] and incremental proximal / gradient method [Bertsekas, 2011, Defazio, 2016].

Hence, the ISGD step size is smaller than the corresponding ESGD step size at θ^{(t)}.

The proof of Proposition 2, as well as all other propositions and theorems in this chapter, is in Appendix D. The bound in Proposition 2 can be tightened for our particular problem. Unlike ESGD, whose step size magnitude is exponential in x_i^\top (w_k - w_{y_i}) - u_i as shown in (5.7), for ISGD the step size is asymptotically linear in x_i^\top (w_k - w_{y_i}) - u_i. This effectively guarantees that ISGD cannot suffer from computational overflow.

Proposition 3. The magnitude of the step size in W, when ISGD is applied to the double-sum formulation in (5.5) and in each iteration only one datapoint i and one class k ≠ y_i is sampled, is O\big(x_i^\top\big(\frac{w_k}{1+\eta\mu\beta_k} - \frac{w_{y_i}}{1+\eta\mu\beta_{y_i}}\big) - u_i\big).

The major difficulty in applying ISGD is that in each iteration one has to compute a solution to (5.9) [Ryu and Boyd, 2014, Section 6]. The tractability of this procedure is problem dependent. We show that computing a solution to (5.9) is indeed tractable for the problem considered in this chapter. The details are laid out in full in Appendix D.6.

Proposition 4. The ISGD update for the double-sum formulation in (5.5), when in each iteration n datapoints and m classes are sampled, can be computed to within ε accuracy in a runtime of O(n^3 m \log(ε^{-1}) + nmD).

In Proposition 4 the \log(ε^{-1}) factor comes from applying a first order method to solve the strongly convex Implicit SGD update equation. It may be the case that performing this optimization is more expensive than the O(nmD) cost of computing the x_i^\top w_k inner products, and so each iteration of ISGD may be significantly slower than that of ESGD. If so, ESGD could exhibit faster convergence by wall-clock time than ISGD even if ISGD converges faster per epoch, making ESGD the preferable method. Fortunately, in certain cases we can improve the runtime of solving the ISGD update to a level where it is comparable to ESGD's. If n = 1 and we just sample one datapoint per iteration then it is possible to reduce the update to solving just a univariate strongly convex optimization problem over u_i (see Appendix D.6.3 for details). Furthermore, if m = 1 and only one class is sampled per iteration then we can derive upper and lower bounds on u_i. The optimization problem can then be solved using a bisection method, with an explicit upper bound on its cost. The pseudocode of this method is presented in Algorithm 5.

Algorithm 5: Implicit SGD with one datapoint and class sampled each iteration

Input: Data D = {(y_i, x_i)}_{i=1}^{N}, number of iterations T, learning rate η_t, regularization constants μ and β_j = N / (n_j + (N − n_j)/(K − 1)), γ from (D.6), principal Lambert-W function P, initial u, W.
Output: W

1  for t = 1 to T do
2      Sample datapoint and class: i ∼ unif({1, ..., N}), k ∼ unif({1, ..., K} − {y_i})
3      Calculate difference: Δ = x_i^⊤(w_k/(1 + η_t μ β_k) − w_{y_i}/(1 + η_t μ β_{y_i}))
4      Calculate gradient of ISGD update at u_i:
           g ← 2η_t N − 2η_t N e^{−u_i} − 2γ_i P(η_t N(K − 1) γ_i^{−1} e^{Δ − u_i})
5      Calculate lower and upper bounds on u_i:
           (b_l, b_u) ← (u_i, u_i + P(η_t N e^{η_t N − u_i}(1 + (K − 1)e^{Δ})) − η_t N)                    if g < 0
           (b_l, b_u) ← (u_i + P(η_t N e^{η_t N − u_i}(1 + (K − 1)e^{Δ − η_t N γ_i^{−1}})) − η_t N, u_i)    if g > 0
           (b_l, b_u) ← (u_i, u_i)                                                                          if g = 0
6      Optimize u_i using Brent's method with bounds b_l, b_u and gradient function
           g(u) ≡ 2η_t N − 2η_t N e^{−u} + 2(u − u_i) − 2γ_i P(η_t N(K − 1) γ_i^{−1} e^{Δ − u})
           u_i ← Brents_method(b_l, b_u, g(u))
7      Update w:
           w_k ← w_k/(1 + η_t μ β_k) − [γ_i P(η_t N(K − 1) γ_i^{−1} e^{Δ − u_i}) / (1 + η_t μ β_k)] x_i
           w_{y_i} ← w_{y_i}/(1 + η_t μ β_{y_i}) + [γ_i P(η_t N(K − 1) γ_i^{−1} e^{Δ − u_i}) / (1 + η_t μ β_{y_i})] x_i
8  end
9  return W

Proposition 5. Consider applying ISGD to the double-sum formulation in (5.5) where in each iteration only one datapoint i and one class k ≠ y_i is sampled with the learning rate η. The ISGD iterate θ^{(t+1)} can be computed to within ε accuracy with only two D-dimensional vector inner products and at most \log_2(ε^{-1}) + \log_2\big(\big|x_i^\top\big(\frac{w_k}{1+\eta\mu\beta_k} - \frac{w_{y_i}}{1+\eta\mu\beta_{y_i}}\big) - u_i\big| + 2\eta N\|x_i\|_2^2 + \log(2K)\big) bisection-method function evaluations.

For any reasonably large dimension D, the cost of the two D-dimensional vector inner-products will outweigh the cost of the bisection, and ISGD with n = m = 1 will have roughly the same speed per iteration as ESGD with n = m = 1. In our experiments fewer than 8 bisections were needed on average for ISGD to converge each iteration, and thus the difference in run time between ISGD and ESGD was small. However, if calculating inner products is relatively cheap (for example if D is small or GPUs are used), then ISGD might be noticeably slower than ESGD even if n = m = 1. The U-max algorithm, presented next, is stable in the same way ISGD is but has virtually the same runtime as ESGD. This makes U-max an ideal choice when inner products are cheap.
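The root-finding step at the core of Algorithm 5 is standard one-dimensional bracketing. The following is a minimal sketch using SciPy's Brent solver; the cubic g shown here is only a stand-in for the actual monotone gradient function g(u) built in step 6 from η_t, N, K, γ_i, Δ and u_i, and the bounds play the role of (b_l, b_u) from step 5.

```python
from scipy.optimize import brentq

# Stand-in for the strongly increasing gradient function g(u) of the ISGD update.
def g(u):
    return u**3 + u - 2.0     # monotone, so a sign change brackets the unique root

b_l, b_u = 0.0, 2.0           # bracketing bounds, as produced in step 5 of Algorithm 5
u_new = brentq(g, b_l, b_u, xtol=1e-10)
print(u_new)                  # ~1.0, the unique root of g on [b_l, b_u]
```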

5.3.2 U-max method

As explained in Section 5.2.1, ESGD has large gradients when u_i \ll x_i^\top (w_k - w_{y_i}). This can only occur when u_i is less than its optimum value for the current W, since u_i^*(W) = \log(1 + \sum_{j \neq y_i} e^{x_i^\top (w_j - w_{y_i})}) \geq x_i^\top (w_k - w_{y_i}). A simple remedy to deal with the large gradient problem is to set u_i = \log(1 + e^{x_i^\top (w_k - w_{y_i})}) whenever u_i \ll x_i^\top (w_k - w_{y_i}). Since \log(1 + e^{x_i^\top (w_k - w_{y_i})}) > x_i^\top (w_k - w_{y_i}), this guarantees that u_i > x_i^\top (w_k - w_{y_i}) and so the gradients will be bounded.7 It also brings u_i closer to its optimal value for the current W and thereby decreases the objective f(u, W). If m classes are sampled per iteration then a similar remedy applies, setting u_i = \log(1 + \sum_{j=1}^{m} e^{x_i^\top (w_{k_j} - w_{y_i})}) whenever u_i \ll \log(1 + \sum_{j=1}^{m} e^{x_i^\top (w_{k_j} - w_{y_i})}).

This is exactly the idea behind the U-max algorithm — see Algorithm 6 for its pseudocode. U-max is the same as ESGD except for two modifications: (i) u_i is set equal to \log(1 + \sum_{j=1}^{m} e^{x_i^\top (w_{k_j} - w_{y_i})}) whenever u_i \leq \log(1 + \sum_{j=1}^{m} e^{x_i^\top (w_{k_j} - w_{y_i})}) - δ for some threshold δ > 0, (ii) u_i is projected onto [0, B_u] and W onto \{W : \|W\|_2^2 \leq B_W^2\} at the end of each iteration, where B_W^2 = 2N\log(K)/\mu and

7Since u_i < x_i^\top (w_k - w_{y_i}) < \log(1 + e^{x_i^\top (w_k - w_{y_i})}) < \log(1 + \sum_{j \neq y_i} e^{x_i^\top (w_j - w_{y_i})}) = u_i^*(W).

Algorithm 6: U-max for a single datapoint and multiple classes sampled per iteration

Input: Data D = {(y_i, x_i)}_{i=1}^{N}, number of classes to sample each iteration m, threshold δ > 0, number of iterations T, learning rate η_t, regularization constants μ and β_j = N / (n_j + (N − n_j)(1 − (1 − (K − 1)^{−1})^m)), initial u, W.
Output: W

1  for t = 1 to T do
2      Sample datapoint and classes: i ∼ unif({1, ..., N}), k_j ∼ unif({1, ..., K} − {y_i}) for j = 1, ..., m (with replacement)
3      Calculate differences: Δ_{k_j} = x_i^⊤(w_{k_j} − w_{y_i}) for j = 1, ..., m
4      Potentially increase u_i:
           u_i ← log(1 + Σ_{j=1}^{m} e^{Δ_{k_j}})   if u_i < log(1 + Σ_{j=1}^{m} e^{Δ_{k_j}}) − δ, otherwise u_i is unchanged
5      Take SGD step:
           w_{k_j} ← w_{k_j} − η_t N((K − 1)/m) e^{Δ_{k_j} − u_i} x_i − η_t μ β_{k_j} w_{k_j}   for j = 1, ..., m
           w_{y_i} ← w_{y_i} + η_t N((K − 1)/m) Σ_{j=1}^{m} e^{Δ_{k_j} − u_i} x_i − η_t μ β_{y_i} w_{y_i}
           u_i ← u_i − η_t N (1 − e^{−u_i} − ((K − 1)/m) Σ_{j=1}^{m} e^{Δ_{k_j} − u_i})
6  end
7  return W

B_u = \log(1 + (K − 1)e^{\max_i\{\|x_i\|_2\} B_W}). See Appendix D.2 for more details on B_u and B_W.

Proposition 6. Suppose B_f^2 \geq \max_{ik} \|\nabla f_{ik}(u, W)\|_2^2 for all \|W\|_2^2 \leq B_W^2 and 0 \leq u \leq B_u. Then for an appropriately decreasing learning rate sequence satisfying \eta_t \leq \delta^2/(4B_f^2), the U-max algorithm converges to the optimum of (5.2) at a rate of O(1/T).

U-max directly resolves the problem of extremely large gradients. Modification (i) ensures that x_i^\top (w_k - w_{y_i}) - u_i \leq δ (otherwise u_i would be increased to \log(1 + \sum_{j=1}^{m} e^{x_i^\top (w_{k_j} - w_{y_i})})), and so the magnitude of the U-max gradients is bounded above by N(K − 1)e^δ \|x_i\|_2.

In U-max there is a trade-off between the gradient magnitude and the learning rate that is controlled by δ. For Proposition 6 to apply we require that the learning rate η_t \leq δ²/(4B_f²). A small δ yields small-magnitude gradients, which makes convergence fast, but necessitates a small η_t, which makes convergence slow.

If n datapoints and m classes are sampled each iteration, then the U-max runtime is dominated by the O(nmD) cost of the vector inner-product calculations. This is the same runtime as ESGD as well as the state-of-the-art biased methods such as Noise Contrastive Estimation [Mnih and Teh, 2012], Importance Sampling [Bengio and Senécal, 2008] and One-Vs-Each [Titsias, 2016].
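The following is a minimal sketch of a single U-max update for one sampled datapoint and one sampled class (m = 1), following Algorithm 6 but omitting the projection step of modification (ii); the function and variable names are illustrative assumptions.

```python
import numpy as np

def umax_step(u_i, w_k, w_yi, x_i, N, K, eta, delta, mu=0.0, beta_k=1.0, beta_yi=1.0):
    """One U-max update with a single sampled class (m = 1)."""
    diff = x_i @ (w_k - w_yi)
    bound = np.log1p(np.exp(diff))          # log(1 + e^{x_i^T (w_k - w_{y_i})})
    if u_i < bound - delta:                 # modification (i): lift u_i when it lags
        u_i = bound
    scale = eta * N * (K - 1) * np.exp(diff - u_i)   # now at most eta*N*(K-1)*e^delta
    w_k_new = w_k - scale * x_i - eta * mu * beta_k * w_k
    w_yi_new = w_yi + scale * x_i - eta * mu * beta_yi * w_yi
    u_i_new = u_i - eta * N * (1 - np.exp(-u_i) - (K - 1) * np.exp(diff - u_i))
    return u_i_new, w_k_new, w_yi_new
```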

5.4 Experiments

Our main set of experiments compares the performance of U-max and ISGD to the state-of-the-art algorithms over seven real world datasets. We begin by specifying the experimental setup and then move onto the results.

5.4.1 Experimental setup

Algorithms. We compared our algorithms to the state-of-the-art for optimizing large scale softmax problems which have runtime O(D) per iteration. Algorithms that have O(NKD) run time per epoch were not considered, as they would not even have finished one epoch by the time the other algorithms had already converged. This criterion excludes naive SGD, Raman et al.'s DSMLR algorithm and Mussmann and Ermon's8 approach from Section 5.1, but does include Noise Contrastive Estimation (NCE) [Mnih and Teh, 2012], Importance Sampling (IS) [Bengio and Senécal, 2008] and One-Vs-Each (OVE) [Titsias, 2016]. Note that the included methods are all biased and will not converge to the optimal softmax MLE, but, perhaps, something close to it. For these algorithms we set n = 100, m = 5, which are standard settings.9 For ISGD we chose to implement the version in Proposition 5 which has n = 1, m = 1 and used Brent's method as our bisection method solver. The ridge regularization μ was set to zero and the classes and datapoints were sampled uniformly for all algorithms. For U-max and ESGD we set n = 1, m = 5, and for U-max the threshold parameter δ = 1. For both methods we also experimented with m = 1 but obtained significantly better performance with m = 5. The probable reason is that having a larger m value decreases the variance of the gradients, making the algorithms more stable with higher learning rates and thereby improving convergence.

8Mussmann and Ermon state that their method is just 10 times faster than calculating the exact gradient and so is also O(NKD).

Table 5.1: Datasets with a summary of their properties. Where the number of classes, dimension or number of examples has been altered, the original value is displayed in brackets. All of the datasets were downloaded from http://manikvarma.org/downloads/XC/XMLRepository.html, except WikiSmall which was obtained from http://lshtc.iit.demokritos.gr/.

Data set     Classes            Dimension             Examples
MNIST        10                 780                   60,000
Bibtex       147 (159)          1,836                 4,880
Delicious    350 (983)          500                   12,920
Eurlex       838 (3,993)        5,000                 15,539
AmazonCat    2,709 (2,919)      10,000 (203,882)      100,000 (1,186,239)
Wiki10       4,021 (30,938)     10,000 (101,938)      14,146
WikiSmall    18,207 (28,955)    10,000 (2,085,164)    90,737 (342,664)

Data. We tested our algorithms on seven real world large scale datasets: MNIST, Bibtex, Delicious, Eurlex, AmazonCat-13K, Wiki10, and WikiSmall, the properties of which are summarized in

9We also experimented with setting n = 1, m = 5 in these methods and there was virtually no difference in performance except the runtime was slower. Varying the value of m between 1 and 10 also had a negligible effect on the results.

Table 5.2: Tuned initial learning rates for each algorithm on each dataset. The learning rate in 10^{0,±1,±2,±3}/N with the lowest log-loss after 50 epochs using only 10% of the data is displayed. ESGD applied to AmazonCat, Wiki10 and WikiSmall suffered from overflow with a learning rate of 10^{-3}/N, but was stable with smaller learning rates (the largest learning rate for which it was stable is displayed).

Data set     OVE        NCE      IS       Explicit   U-max      Implicit
MNIST        10^1       10^1     10^1     10^{-2}    10^1       10^{-1}
Bibtex       10^2       10^2     10^2     10^{-2}    10^{-1}    10^1
Delicious    10^1       10^3     10^3     10^{-3}    10^{-2}    10^{-2}
Eurlex       10^{-1}    10^2     10^2     10^{-3}    10^{-1}    10^1
AmazonCat    10^1       10^3     10^3     10^{-5}    10^{-2}    10^{-3}
Wiki10       10^{-2}    10^3     10^2     10^{-4}    10^{-2}    10^0
WikiSmall    10^3       10^3     10^3     10^{-4}    10^{-3}    10^{-3}

Table 5.1. Most of the datasets are multi-label and, as is standard practice [Titsias, 2016], we took the first label as being the true label and discarded the remaining labels. To make the computation more manageable, we truncated the number of features to be at most 10,000 and the training set size to be at most 100,000. If a datapoint had no non-zero features as a result of the dimension truncation, then it was discarded. The features of each dataset were normalized to have unit L2 norm. All of the datasets were pre-separated into training and test sets. We only focus on the performance of the algorithms on the training set, as the goal in this chapter is to investigate how best to optimize the softmax likelihood, which is given over the training set.
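The preprocessing just described is simple enough to sketch. The following is a minimal sketch under the assumption that the raw data is available as a dense feature matrix and a list of label lists; the function name and argument formats are illustrative, not the thesis's actual pipeline.

```python
import numpy as np

def preprocess(X, Y, max_features=10_000, max_examples=100_000):
    """X: (N, D) feature matrix, Y: list of label lists (multi-label)."""
    y = np.array([labels[0] for labels in Y])          # keep only the first label
    X = X[:max_examples, :max_features]                # truncate dimension and size
    y = y[:max_examples]
    keep = np.linalg.norm(X, axis=1) > 0               # drop rows left with no features
    X, y = X[keep], y[keep]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit L2 norm per datapoint
    return X, y
```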

Learning rate and epochs. For all of the algorithms we used the same learning rate schedule of decreasing the learning rate by a factor of 0.9 each epoch. However, we allowed the initial learning rate of each algorithm to be tuned for each dataset. We did so by running them on 10% of the training data with initial learning rates η = 10^{0,±1,±2,±3}/N.10 The initial learning rate with the

10The learning rates are divided by N to counter the stochastic gradient being proportional to N and thereby make the step size independent of N.

best performance after 50 epochs was then used when the algorithm was applied to the full dataset. The tuned learning rates are presented in Table 5.2. ESGD required a very small learning rate, otherwise it suffered from overflow. On average the tuned ESGD learning rate was 3,019 times smaller than ISGD's and 319 times smaller than U-max's.

For our final results we first ran OVE, NCE, IS, ESGD and U-max for 50 epochs on each dataset (they all have virtually the same runtime). We recorded how long this took and then ran ISGD for this same length of time. This makes for a fair comparison between the algorithms in terms of CPU time. To make the results as fair as possible, all algorithms were coded in a common NumPy framework. We also have results comparing the algorithms when ISGD is run for exactly 50 epochs, which show very similar performance (see Appendix D.8).
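For concreteness, the schedule and tuning grid above amount to the following minimal sketch; the Eurlex training size is used as an example value of N.

```python
def learning_rate(eta0, epoch, decay=0.9):
    # initial rate eta0 is tuned on 10% of the data, then decayed by 0.9 each epoch
    return eta0 * decay ** epoch

N = 15539                                          # e.g. Eurlex training set size
grid = [10.0 ** p / N for p in (-3, -2, -1, 0, 1, 2, 3)]   # candidate initial rates
```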

5.4.2 Results

Our main results compare the performance of ISGD and U-max to their state-of-the-art competitors NCE, IS and OVE, in terms of log-loss over CPU time. Plots of the algorithms' performance are displayed in Figure 5.1. Table 5.3 presents the relative performance of the methods at the end of training over all datasets. The ISGD method has the best performance on all datasets but one. After just one epoch its performance is better than the performance of all of the state-of-the-art biased methods after 50 epochs. Not only does it converge faster in the first few epochs, it also converges to the optimal MLE (unlike the biased methods that prematurely plateau). On average after 50 epochs ISGD's log-loss is at least a factor of 4.69 times lower than that of the biased methods. U-max has the next best average log-loss. It is the only algorithm to outperform ISGD on a dataset (AmazonCat). ESGD's performance is better than the previous state-of-the-art but is generally worse than U-max. The difference in performance between ESGD and U-max can largely be explained by ESGD requiring a smaller learning rate to avoid computational overflow.

[Figure 5.1: seven panels (MNIST, Bibtex, Delicious, Eurlex, AmazonCat, Wiki10, Wiki-small) plotting training log-loss against epochs/CPU time.]

Figure 5.1: Training log-loss vs CPU time. CPU time is marked by the number of epochs for OVE, NCE, IS, ESGD and U-max (they all have virtually the same runtime). ISGD is run for the same CPU time as the other methods, but since it has a slightly different runtime per iteration, it will complete a different number of epochs. Thus when the x-axis is at 50, this indicates the end of the ISGD’s computation but not necessarily its 50th epoch.

Table 5.3: Relative log-loss of each algorithm at the end of training. The values for each dataset are normalized by dividing by the corresponding Implicit log-loss. The algorithm with the lowest log-loss for each dataset is in bold.

Data set     OVE     NCE     IS      Explicit  U-max   Implicit
MNIST        5.25    5.55    5.26    1.31      1.40    1.00
Bibtex      12.65   12.65   12.48    6.61      4.25    1.00
Delicious    1.77    1.78    1.76    1.16      1.03    1.00
Eurlex       6.48    6.39    6.38    3.59      4.34    1.00
AmazonCat    2.01    2.03    2.00    1.39      0.93    1.00
Wiki10       3.68    3.72    3.64    3.13      1.24    1.00
WikiSmall    1.33    1.33    1.33    1.13      1.01    1.00
Average      4.74    4.78    4.69    2.62      2.03    1.00

[Figure 5.2: six panels (OVE, NCE, IS, Explicit, U-max, Implicit) plotting training log-loss on Eurlex against epochs, one curve per learning rate.]

Figure 5.2: Training log-loss on Eurlex for different learning rates. The x-axis denotes the number of epochs.

Sensitivity. In a separate set of experiments we explore the sensitivity of each method to the learning rate. We ran each method on the Eurlex dataset with learning rates η = 10^{0,±1,±2,−3}/N for 50 epochs. The results are displayed in Figure 5.2. ISGD has the best performance for all learning rate settings. This is consistent with theoretical results proving that ISGD is robust to the learning rate [Ryu and Boyd, 2014, Toulis and Airoldi, 2015a]. In fact, ISGD's worst performance is still better than the best performance of all the other algorithms. The OVE, NCE and IS methods are very robust to the learning rate, which is perhaps why they have been so popular in the past. U-max and ESGD are less stable. For learning rates η = 10^{1}/N and 10^{2}/N the U-max log-loss is extremely large. This can be explained by Proposition 6, which does not guarantee convergence if the learning rate is too high. Explicit SGD is even less stable and suffered from computational overflow for all learning rates except 10^{−3}/N.

Summary. ISGD has clearly the best performance out of all of the algorithms considered and is robust to the learning rate. U-max has the second best performance but can diverge if the learning rate is set too high. Compared to U-max, ESGD has worse performance and is more sensitive to the learning rate. The state-of-the-art biased methods are robust, but do not achieve the same level of performance as the other methods since their bias causes their performance to plateau early on.

5.5 Conclusion

In this chapter we propose two stable, scalable and unbiased algorithms for optimizing the softmax likelihood: ISGD and U-max. Both of these methods address the inherent instability of ESGD applied to double-sum formulations of the softmax. ISGD can be efficiently implemented, is robust to the learning rate and comprehensively outperforms the previous state-of-the-art on seven real world large scale datasets. One limitation of ISGD is that it is relatively slow when multiple datapoints are sampled each iteration or when multiple inner products can be computed efficiently in parallel (e.g., on GPUs). U-max should be the method of choice in such settings.

5.6 Extensions

In this chapter we have only applied U-max and ISGD to the simple softmax. A natural question is whether they can be extended to more complex functions. One of the simplest such extensions is to introduce an L1 penalty. U-max trivially generalizes to this case. The ISGD update equation can also be solved, but takes a factor of log(D) more work than for the L2 penalty (see Appendix D.9 for details). This extra work may be worthwhile if the induced sparsity makes the $x_i^\top w_j$ inner products cheaper to compute or decreases the required computational storage space for W.

A more ambitious extension is towards word2vec type models where $x_i$ is not fixed, but is an embedding to be optimized over [Mikolov et al., 2013a]. Again, the U-max algorithm easily extends, but the ISGD updates become significantly harder. The ISGD update reduces to a univariate optimization problem, but is now potentially non-convex, and it is expensive to compute the gradients required for optimization (see Appendix D.10 for details). The difficulty of optimizing the reduced univariate problem makes one consider alternative strategies for optimizing the word2vec ISGD update. Newton's method is one possibility and coordinate descent is another. Both of these methods are iterative. For ISGD to be practical, such optimizers must be fast and can only be run for a small number of (inner) iterations, perhaps just one. We leave exploring the application of such methods for word2vec ISGD to future work. Instead we turn to the more general and challenging task of neural network training. In the next chapter we will consider an approximate coordinate descent approach for optimizing the neural network ISGD update equations.

Chapter 6

Implicit backpropagation

6.1 Introduction

Despite decades of research, most neural networks are still optimized using minor variations on the backpropagation method proposed by Rumelhart, Hinton and Williams in 1986 [Rumelhart et al., 1986]. Since backpropagation is a stochastic first order method, its run time per iteration is independent of the number of training datapoints. It is this key property that makes it able to ingest the vast quantities of data required to train neural networks on complex tasks like speech and image recognition. A serious limitation of backpropagation being a first order method is its inability to use higher order information. This leads to multiple problems: the need to visit similar datapoints multiple times in order to converge to a good solution, instability due to "exploding" gradients [Pascanu et al., 2013, Sec. 3], and high sensitivity to the learning rate [Goodfellow et al., 2016, Sec. 11.4.1]. A number of different approaches have been suggested to deal with these problems. Adaptive learning rate methods, like Adam [Kingma and Ba, 2014] and Adagrad [Duchi et al., 2011], estimate appropriate per-parameter learning rates; and momentum accelerates backpropagation in a common direction of descent [Zhang et al., 2017]. Gradient clipping is a heuristic which "clips" the gradient magnitude at a pre-specified threshold and has been shown to help deal with exploding gradients [Pascanu et al., 2013, Sec. 3]. Although these approaches partially address the problems of backpropagation, neural network training remains unstable and highly sensitive to the learning rate [Goodfellow et al., 2016, Sec. 11].

The key research question here is how to add higher order information to stabilize backpropagation while keeping the per iteration run time independent of the number of datapoints. A technique that has recently emerged to address this same question in the context of convex optimization is Implicit Stochastic Gradient Descent (ISGD). As discussed in Section 5.3.1 of Chapter 5, ISGD is known to be robust to the learning rate and unconditionally stable for convex optimization problems [Toulis and Airoldi, 2015b, Sec. 3.1] [Ryu and Boyd, 2014, Sec. 5]. A natural question is whether ISGD can be used to improve the stability of neural network optimization. In this chapter, we show how ISGD can be applied to neural network training. To the best of our knowledge this is the first time ISGD has been applied to this problem.1 The main challenge in applying ISGD is solving its implicit update equations. This step is difficult even for most convex optimization problems. We leverage the special structure of neural networks by constructing a novel layer-wise approximation for the ISGD updates. The resulting algorithm, Implicit Backpropagation (IB), is a good trade-off: it has almost the same run time as the standard "Explicit" Backpropagation (EB), and yet enjoys many of the desirable features of exact ISGD. IB is compatible with many activation functions such as the relu, arctan, hardtanh and smoothstep; however, in its present form, it cannot be applied to convolutional layers. It is possible to use IB for some layers and EB for the other layers; thus, IB is partially applicable to virtually all neural network architectures. Our numerical experiments demonstrate that IB is stable for much higher learning rates as compared to EB on classification, autoencoder and music prediction tasks. In all of these examples the learning rate at which IB begins to diverge is 20%-200% higher than for EB. IB performs particularly well for RNNs, where exploding gradients are most troublesome. We note that for small-scale classification tasks EB and IB have similar performance. We also investigate IB's compatibility with clipping. We find that IB outperforms EB with clipping on RNNs, where clipping is most commonly used, and that clipping benefits both IB and EB for classification and autoencoding tasks. Overall, IB is clearly beneficial for RNN training and shows promise for classification and autoencoder feedforward neural networks. We believe that more refined implementations of ISGD for neural networks than IB are likely to lead to even better results; this is a topic for future research.

1[Toulis et al., 2015] recently remarked that ISGD hasn't yet been applied to neural networks and that how to do so is an open research question.

The rest of this chapter is structured as follows. Section 6.2 reviews the literature on ISGD and related methods. Section 6.3 develops IB as approximate ISGD, with Section 6.4 deriving IB updates for multiple activation functions. The empirical performance of IB is investigated in Section 6.5 and the chapter concludes with mentions of further work in Section 6.6.

6.2 ISGD and related methods

6.2.1 ISGD method

The standard objective in neural networks, like in most machine learning models, is the ridge-regularized loss
$$\ell(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta) + \frac{\mu}{2}\|\theta\|_2^2,$$
where $\ell_i(\theta) = \ell_\theta(x_i, y_i)$ is the loss associated with the $i$-th datapoint $(x_i, y_i)$ and $\theta$ comprises the weight and bias parameters in the neural network. The method ISGD uses to minimize $\ell(\theta)$ is similar to that of standard "Explicit" SGD (ESGD). In each iteration of ESGD, we first sample a random datapoint $i$ and then update the parameters as $\theta^{(t+1)} = \theta^{(t)} - \eta_t(\nabla_\theta \ell_i(\theta^{(t)}) + \mu\theta^{(t)})$, where $\eta_t$ is the learning rate at time $t$. ISGD also samples a random datapoint $i$ but employs the update $\theta^{(t+1)} = \theta^{(t)} - \eta_t(\nabla_\theta \ell_i(\theta^{(t+1)}) + \mu\theta^{(t+1)})$, or equivalently,

$$\theta^{(t+1)} = \operatorname*{argmin}_{\theta}\Big\{2\eta_t\Big(\ell_i(\theta) + \frac{\mu}{2}\|\theta\|_2^2\Big) + \|\theta - \theta^{(t)}\|_2^2\Big\}. \qquad (6.1)$$

The main motivation of ISGD over ESGD is its robustness to learning rates, numerical stability and transient convergence behavior [Bertsekas, 2011, Ryu and Boyd, 2014, Patrascu and Necoara, 2017]. The increased robustness of ISGD over ESGD can be illustrated with a simple quadratic loss $\ell(\theta) = \frac{1}{2}\|\theta\|_2^2$, as displayed in Figure 6.1. Here the ISGD step $\theta^{(t+1)} = \theta^{(t)}/(1 + \eta_t)$ is stable for any learning rate, whereas the ESGD step $\theta^{(t+1)} = \theta^{(t)}(1 - \eta_t)$ diverges when $\eta_t > 2$.

Since ISGD becomes equivalent to ESGD when the learning rate is small, there is no difference in their asymptotic convergence rate for decreasing learning rates. However, it is often the case that in the initial iterations, when the learning rate is still large, ISGD outperforms ESGD. The main drawback of ISGD is that the implicit update (6.1) can be expensive to compute, whereas the update for ESGD is usually trivial. If this update is expensive, then ESGD may converge faster than ISGD in terms of wall clock time, even if ISGD converges faster per epoch.
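The quadratic example above can be reproduced in a few lines; this minimal sketch simply iterates the two closed-form updates and shows that ISGD contracts for any positive learning rate while ESGD diverges once the learning rate exceeds 2.

import numpy as np

def esgd_quadratic(theta0, eta, steps):
    # ESGD on f(theta) = theta^2 / 2: theta <- theta * (1 - eta).
    theta = np.empty(steps + 1)
    theta[0] = theta0
    for t in range(steps):
        theta[t + 1] = theta[t] * (1.0 - eta)
    return theta

def isgd_quadratic(theta0, eta, steps):
    # ISGD on f(theta) = theta^2 / 2: theta <- theta / (1 + eta).
    theta = np.empty(steps + 1)
    theta[0] = theta0
    for t in range(steps):
        theta[t + 1] = theta[t] / (1.0 + eta)
    return theta

print(esgd_quadratic(2.0, 2.5, 5))  # |theta| grows: ESGD diverges for eta > 2
print(isgd_quadratic(2.0, 2.5, 5))  # |theta| shrinks for any eta > 0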

[Figure 6.1: two panels, ESGD iterates (left) and ISGD iterates (right), plotting f(θ) = θ²/2 against θ.]

Figure 6.1: Illustration of the difference between ESGD and ISGD in optimizing f(θ) = θ2/2. The learning rate is η = 1.75 in both cases.

Thus, in order for ISGD to be effective, one needs to be able to solve the update (6.1) efficiently. Indeed, the focus of this chapter is to develop a methodology for efficiently approximating the ISGD update for neural networks.

6.2.2 Related methods

ISGD has been successfully applied to several machine learning tasks. Cheng et al. [2007] applied ISGD to learning online kernels and He [2014] to SVMs, while Kulis and Bartlett [2010] consider a range of problems including online metric learning. ISGD has also been used to improve the stability of temporal difference algorithms in reinforcement learning [Tamar et al., 2014, Iwaki and Asada, 2017]. For more recent advances in ISGD see [Bertsekas, 2015, Toulis et al., 2016, Lin et al., 2017, Paquette et al., 2017, Wang et al., 2017]. Although ISGD has never been applied to neural networks, closely related methods have been investigated. ISGD may be viewed as a trust-region method, where $1/2\eta$ is the optimal dual variable in the Lagrangian,

$$\operatorname*{argmin}_{\theta}\Big\{\ell_i(\theta) + \frac{\mu}{2}\|\theta\|_2^2 \,:\, \|\theta - \theta^{(t)}\|_2^2 \le r\Big\} = \operatorname*{argmin}_{\theta}\Big\{\ell_i(\theta) + \frac{\mu}{2}\|\theta\|_2^2 + \frac{1}{2\eta(r)}\|\theta - \theta^{(t)}\|_2^2\Big\}.$$

Trust-region methods for optimizing neural networks have been effectively used to stabilize policy optimization in reinforcement learning [Schulman et al., 2015, Wu et al., 2017]. Clipping, which truncates the gradient at a pre-specified threshold, may also be viewed as a computationally efficient approximate trust-region method [Pascanu et al., 2013]. It was explicitly designed to address the exploding gradient problem and achieves this by truncating the exploded step. In the experiments section we will investigate the difference in the effect of IB and clipping. Note that trust-region optimization and clipping both require an extra hyperparameter (the trust region radius), whereas ISGD does not. An example of a non-trust-region method for optimizing neural networks using higher order information is Hessian-Free optimization [Martens, 2010, Martens and Sutskever, 2011]. These methods directly estimate second order information of a neural network. They have been shown to make training more stable and require fewer epochs for convergence. However they come at a much higher per-iteration cost than first order methods, which offsets this benefit [Bengio et al., 2013].

6.3 Implicit Backpropagation

In this section we develop Implicit Backpropagation (IB) as an approximate ISGD implementation for neural networks. It retains many of the desirable characteristics of ISGD, while being virtually as fast as the standard "Explicit" Backpropagation (EB). Consider a $d$-layered neural network $f_\theta(x) = f^{(d)}_{\theta_d} \circ f^{(d-1)}_{\theta_{d-1}} \circ \dots \circ f^{(1)}_{\theta_1}(x)$, where $f^{(k)}_{\theta_k} : \mathbb{R}^{D_k} \to \mathbb{R}^{D_{k+1}}$ represents the $k$-th layer with parameters $\theta_k \in \mathbb{R}^{D_{k+1}\times(1+D_k)}$, $\theta = (\theta_1, \dots, \theta_d)$, $x \in \mathbb{R}^{D_1}$ is the input to the neural network, and $\circ$ denotes composition. Let the loss associated with a datapoint $(x, y)$ be $\ell_\theta(x, y) = \ell(y, \cdot) \circ f_\theta(x)$ for some $\ell(y, \cdot) : \mathbb{R}^{D_{d+1}} \to \mathbb{R}$. Later in this section, we will want to extract the effect of the $k$-th layer on the loss. To this end, we can rewrite the loss as $\ell_\theta(x, y) = \ell^{(d:k+1)}_{\theta_{d:k+1},y} \circ f^{(k)}_{\theta_k} \circ f^{(k-1:1)}_{\theta_{k-1:1}}(x)$, where $f^{(i:j)}_{\theta_{i:j}} = f^{(i)}_{\theta_i} \circ f^{(i-1)}_{\theta_{i-1}} \circ \dots \circ f^{(j)}_{\theta_j} : \mathbb{R}^{D_j} \to \mathbb{R}^{D_{i+1}}$ and $\ell^{(d:j)}_{\theta_{d:j},y} = \ell(y, \cdot) \circ f^{(d:j)}_{\theta_{d:j}} : \mathbb{R}^{D_j} \to \mathbb{R}$.

The complexity of computing the ISGD update depends on the functional form of the loss $\ell_\theta(x, y)$. Although it is possible in some cases to compute the ISGD update explicitly, this is not the case for neural networks. Even computing the solution numerically is very expensive. Hence, it is necessary to approximate the ISGD update in order for it to be computationally tractable. We introduce the following two approximations in IB:

(a) We update parameters layer by layer. When updating the parameter $\theta_k$ associated with layer $k$, all the parameters $\theta_{-k} = \{\theta_i : i \neq k\}$ corresponding to the other layers are kept fixed. Under this approximation the loss when updating the $k$-th layer is

$$\ell_k(\theta_k; x, y, \theta^{(t)}_{-k}) := \ell^{(d:k+1)}_{\theta^{(t)}_{d:k+1},y} \circ f^{(k)}_{\theta_k} \circ f^{(k-1:1)}_{\theta^{(t)}_{k-1:1}}(x). \qquad (6.2)$$

(b) Building on (a), we linearize the higher layers $\ell^{(d:k+1)}_{\theta^{(t)}_{d:k+1},y}$, but keep the layer being updated $f^{(k)}_{\theta_k}$ as non-linear. The loss from (6.2) reduces to2

$$\tilde{\ell}_k(\theta_k; x, y, \theta^{(t)}_{-k}) := \ell_{\theta^{(t)}}(x, y) + \nabla\ell^{(d:k+1)\,\top}_{\theta^{(t)}_{d:k+1},y}\Big(f^{(k)}_{\theta_k} \circ f^{(k-1:1)}_{\theta^{(t)}_{k-1:1}}(x) - f^{(k:1)}_{\theta^{(t)}_{k:1}}(x)\Big). \qquad (6.3)$$

This approximation can be validated via a Taylor series expansion where the error in (6.3) compared to (6.2) is $O(\|\theta_k - \theta^{(t)}_k\|_2^2)$.

The IB approximation to the ISGD update is, thus, given by

$$\theta^{(t+1)}_k = \operatorname*{argmin}_{\theta_k}\Big\{2\eta\Big(\tilde{\ell}_k(\theta_k; x, y, \theta^{(t)}_{-k}) + \frac{\mu}{2}\|\theta_k\|_2^2\Big) + \|\theta_k - \theta^{(t)}_k\|_2^2\Big\}. \qquad (6.4)$$

In Appendix E.3 we present a simple theorem, which leverages the fact that IB converges to EB in the limit of small learning rates, to show that IB converges to a stationary point of $\ell(\theta)$ for appropriately decaying learning rates. In the next section we show that the IB update can be efficiently computed for a variety of activation functions. The IB approximations thus make ISGD practically feasible to implement. However, the approximation is not without drawbacks. The layer-by-layer update from (a) removes all higher order information along directions perpendicular to the parameter space of the layer being updated, and the linearization in (b) loses information about the non-linearity in the higher layers. The hope is that IB retains enough of the beneficial properties of ISGD to have noticeable benefits over EB. In our experiments we show that this is indeed the case. Our IB formulation is, to our knowledge, novel. The most similar update in the literature is for composite convex optimization with just two layers, where the lower, not higher, layer is linearized [Duchi and Ruan, 2017].

2Note that the derivative $\nabla\ell^{(d:k+1)}_{\theta^{(t)}_{d:k+1},y}$ is taken with respect to its argument $f^{(k:1)}_{\theta^{(t)}_{k:1}}(x) \in \mathbb{R}^{D_{k+1}}$, not $\theta_{d:k+1}$.

6.4 Implicit Backpropagation updates for various activation functions

Since neural networks are applied to extremely large datasets, it is important that the IB updates can be computed efficiently. In this section we show that the IB update (6.4) can be greatly simplified, resulting in fast analytical updates for activation functions such as the relu and arctan. For those activation functions that do not have an analytical IB update, IB can easily be applied on a piecewise-cubic approximation of the activation function. This makes IB practically applicable to virtually any element-wise acting activation function. Although it is possible to apply IB to layers with weight sharing or non-element-wise acting activation functions, the updates tend to be complex and expensive to compute. For example, the IB update for a convolutional layer with max-pooling and a relu activation function involves solving a quadratic program with binary variables (see Appendix E.2 for the derivation). Thus, we will only focus on updates for activation functions that are applied element-wise and have no shared weights.

6.4.1 Generic updates

Here we derive the IB updates for a generic layer $k$ with an element-wise acting activation function $\sigma$. Let the parameters in the $k$-th layer be $\theta_k = (W_k, B_k) \in \mathbb{R}^{D_{k+1}\times(1+D_k)}$, where $W_k \in \mathbb{R}^{D_{k+1}\times D_k}$ is the weight matrix and $B_k \in \mathbb{R}^{D_{k+1}}$ is the bias. We'll use the shorthand notation $z^{ki} = (f^{(k-1:1)}_{\theta^{(t)}_{k-1:1}}(x_i), 1) \in \mathbb{R}^{1+D_k}$ for the input to the $k$-th layer and $b^{ki} = \nabla\ell^{(d:k+1)}_{\theta^{(t)}_{d:k+1},y} \in \mathbb{R}^{D_{k+1}}$ for the backpropagated gradient. The output of the $k$-th layer is thus $\sigma(\theta_k z^{ki})$, where $\sigma$ is applied element-wise and $\theta_k z^{ki}$ is a matrix-vector product. Using this notation the IB update from (6.4) becomes

$$\theta^{(t+1)}_k = \operatorname*{argmin}_{\theta_k}\Big\{2\eta_t\, b^{ki\top}\sigma(\theta_k z^{ki}) + \eta_t\mu\|\theta_k\|_2^2 + \|\theta_k - \theta^{(t)}_k\|_2^2\Big\}, \qquad (6.5)$$

where we have dropped the terms $\ell_{\theta^{(t)}}(x, y)$ and $b^{ki\top} f^{(k:1)}_{\theta^{(t)}_{k:1}}(x)$ from (6.3) as they are constant with respect to $\theta_k$. Now that we have written the IB update in more convenient notation, we can begin to simplify it. Due to the fact that $\sigma$ is applied element-wise, (6.5) breaks up into $D_{k+1}$ separate optimization problems, one for each output node $j \in \{1, \dots, D_{k+1}\}$:

$$\theta^{(t+1)}_{kj} = \operatorname*{argmin}_{\theta_{kj}}\Big\{2\eta_t\, b^{ki}_j\, \sigma(\theta_{kj}^\top z^{ki}) + \eta_t\mu\|\theta_{kj}\|_2^2 + \|\theta_{kj} - \theta^{(t)}_{kj}\|_2^2\Big\}, \qquad (6.6)$$

where $\theta_{kj} = (W_{kj}, B_{kj}) \in \mathbb{R}^{1+D_k}$ are the parameters corresponding to the $j$-th output node. Using simple calculus (see Appendix E.1) we can write the solution for $\theta^{(t+1)}_{kj}$ as

$$\theta^{(t+1)}_{kj} = \frac{\theta^{(t)}_{kj}}{1 + \eta_t\mu} - \eta_t\,\alpha^{ki}_j\, z^{ki}, \qquad (6.7)$$

where $\alpha^{ki}_j \in \mathbb{R}$ is the solution to the one-dimensional optimization problem

$$\alpha^{ki}_j = \operatorname*{argmin}_{\alpha}\Bigg\{ b^{ki}_j \cdot \sigma\Big(\frac{\theta^{(t)\top}_{kj} z^{ki}}{1 + \eta_t\mu} - \alpha\,\eta_t\,\|z^{ki}\|_2^2\Big) + \eta_t(1 + \eta_t\mu)\,\|z^{ki}\|_2^2\,\frac{\alpha^2}{2}\Bigg\}. \qquad (6.8)$$

To connect the IB update to EB, observe that if we do a first order Taylor expansion of (6.7) and (6.8) in $\eta_t$ we recover the EB update:

$$\theta^{(t+1)}_{kj} = \theta^{(t)}_{kj}(1 - \eta_t\mu) - \eta_t\,\sigma'\big(\theta^{(t)\top}_{kj} z^{ki}\big)\, b^{ki}_j\, z^{ki} + O(\eta_t^2),$$

where $\sigma'$ denotes the derivative of $\sigma$. Thus we can think of IB as a higher order update than EB.

In summary, the original $D_{k+1} \times D_k$ dimensional IB update from (6.5) has been reduced to $D_{k+1}$ separate one-dimensional optimization problems of the form (6.8). The difficulty of solving (6.8) depends on the activation function $\sigma$. Since (6.8) is a one-dimensional problem, an optimal $\alpha$ can always be computed numerically using the bisection method, although this may be slow. Fortunately there are certain important activation functions for which $\alpha$ can be computed analytically. In the subsections below we derive analytical updates for $\alpha$ when $\sigma$ is the relu and arctan functions, as well as a general formula for piecewise-cubic functions. Before proceeding to these updates, we can observe directly from (6.7) and (6.8) that IB will be robust to high learning rates. Unlike EB, in which the step size increases linearly with the learning rate, IB has a bounded step size even for infinite learning rates. As the learning rate increases, (6.7) becomes

$$\theta^{(t+1)}_{kj} \xrightarrow{\ \eta_t \to \infty\ } \operatorname*{argmin}_{\beta}\Big\{ b^{ki}_j \cdot \sigma\big(\beta\,\|z^{ki}\|_2^2\big) + \mu\,\|z^{ki}\|_2^2\,\frac{\beta^2}{2}\Big\}\cdot z^{ki}.$$

This update is finite as long as $\mu > 0$ and $\sigma$ is asymptotically linear. For some activation functions, like the hardtanh, the IB update is finite even when $\eta_t \to \infty$, $\mu \to 0$ and $|b^{ki}_j| \to \infty$.
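As an illustration of the reduction above, the following sketch solves the one-dimensional problem (6.8) numerically and then applies the parameter update (6.7). It uses SciPy's bounded scalar minimizer in place of the bisection method mentioned above, and assumes the search interval is wide enough to contain the minimizer; for non-convex activations such as the arctan it returns a local minimizer only.

import numpy as np
from scipy.optimize import minimize_scalar

def ib_node_update(theta_kj, z, b_j, sigma, eta, mu, alpha_bound=1e3):
    # IB update (6.7) for a single output node with element-wise activation sigma.
    z_sq = float(np.dot(z, z))
    c = float(np.dot(theta_kj, z)) / (1.0 + eta * mu)

    # One-dimensional subproblem (6.8) in alpha.
    def objective(alpha):
        return (b_j * sigma(c - alpha * eta * z_sq)
                + eta * (1.0 + eta * mu) * z_sq * alpha ** 2 / 2.0)

    alpha = minimize_scalar(objective, bounds=(-alpha_bound, alpha_bound),
                            method="bounded").x
    # Parameter update (6.7).
    return theta_kj / (1.0 + eta * mu) - eta * alpha * z

# Example with an arctan node: z is the layer input with a 1 appended for the bias.
theta = np.array([0.5, -0.2, 0.1])
z = np.array([1.0, 2.0, 1.0])
print(ib_node_update(theta, z, b_j=0.3, sigma=np.arctan, eta=0.5, mu=1e-4))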

6.4.2 Relu update

Here we give the solution to $\alpha$ from (6.8) for the relu activation function, $\sigma(x) = \max\{x, 0\}$. We will drop the super- and sub-scripts from (6.8) for notational convenience. When sign(b) = +1 there are three cases and when sign(b) = −1 there are two cases for the solution to $\alpha$. The updates are given in Table 6.1.

Table 6.1: IB relu updates

$$\text{sign}(b) = +1: \quad \alpha = \begin{cases} 0 & \text{if } \theta^\top z \le 0 \\ \dfrac{\theta^\top z}{(1+\eta\mu)\,\eta\|z\|_2^2} & \text{if } 0 < \theta^\top z \le \eta\|z\|_2^2\, b \\ \dfrac{b}{1+\eta\mu} & \text{if } \theta^\top z > \eta\|z\|_2^2\, b \end{cases}$$

$$\text{sign}(b) = -1: \quad \alpha = \begin{cases} 0 & \text{if } \theta^\top z \le \tfrac{1}{2}\eta\|z\|_2^2\, b \\ \dfrac{b}{1+\eta\mu} & \text{if } \theta^\top z > \tfrac{1}{2}\eta\|z\|_2^2\, b \end{cases}$$

[Figure 6.2: two panels, "Relu update with sign(b) = +1" and "Relu update with sign(b) = −1", plotting relu(θ⊤z) against θ⊤z, with arrows marking the EB and IB updates.]

Figure 6.2: Relu updates for EB and IB with µ = 0. The arrow tail is the initial value of θ⊤z while the head is its value after the backpropagation update. Lower values are better.

The difference between the EB and IB updates is illustrated in Figure 6.2. When sign(b) = +1 and θ⊤z is on the slope but close to the hinge, the EB step overshoots the hinge point, making a far larger step than is necessary to reduce the relu to 0. IB, on the other hand, stops at the hinge. The IB update is better from two perspectives. First, it is able to improve the loss on the given datapoint just as much as EB, but without taking as large a step. Assuming that the current value of θ is close to a minimizer of the average loss of the other datapoints, an unnecessarily large step will likely take θ away from its (locally) optimal value. An example of where this property might be particularly important is for the "cliff" or "exploding gradient" problem in RNNs [Pascanu et al., 2013]. And second, the IB step size is a continuous function of θ, unlike in EB where the step size has a discontinuity at the origin. This should make the IB update more robust to perturbations in the data. When sign(b) = −1 and θ⊤z is on the flat, EB isn't able to descend as the relu has "saturated"

(i.e. is flat). IB, on the other hand, can look past the hinge and is still able to descend down the slope, thereby decreasing the loss. IB thus partially solves the saturating gradient problem [Pascanu et al., 2013].
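The closed-form relu case is simple enough to state in code. The sketch below implements Table 6.1 and plugs the resulting α into the generic update (6.7); indices are dropped as in the table, and it is meant only to make the case analysis concrete.

import numpy as np

def relu_alpha(theta, z, b, eta, mu):
    # Closed-form solution of (6.8) for sigma = relu (Table 6.1).
    z_sq = float(np.dot(z, z))
    tz = float(np.dot(theta, z))
    if b > 0:
        if tz <= 0:
            return 0.0
        if tz <= eta * z_sq * b:
            return tz / ((1.0 + eta * mu) * eta * z_sq)
        return b / (1.0 + eta * mu)
    # sign(b) = -1: only two cases, with the threshold at half the step.
    return 0.0 if tz <= 0.5 * eta * z_sq * b else b / (1.0 + eta * mu)

def ib_relu_node_update(theta, z, b, eta, mu):
    # Generic IB update (6.7) with the relu-specific alpha.
    return theta / (1.0 + eta * mu) - eta * relu_alpha(theta, z, b, eta, mu) * z

In the middle case with sign(b) = +1, substituting this α into (6.7) drives θ⊤z exactly to 0, which is the "stops at the hinge" behaviour described above.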

6.4.3 Arctan update

Although the IB update is not analytically available for all sigmoidal activation functions, it is available when σ is the arctan. For the arctan the value of α becomes the root of a cubic equation which can be solved for analytically. Since the arctan function is non-convex there may be up to three real solutions for α. Under the axiom that smaller step sizes that achieve the same decrease in the objective are better (as argued in Section 6.4.2), we always choose the value of α closest to zero.

6.4.4 Piecewise-cubic function update

Many activation functions are piecewise-cubic, including the hardtanh and smoothstep. Furthermore, all common activation functions can be approximated arbitrarily well with a piecewise-cubic. Being able to do IB updates for piecewise-cubic functions thus extends its applicability to virtually all activation functions. Let $\sigma(x) = \sum_{m=1}^{M} \mathbb{I}[B_m \le x < B_{m+1}] \cdot \sigma_m(x)$, where $\sigma_m(x)$ is a cubic function and the $B_m \in [-\infty, \infty]$ define the bounds of each piece. The optimal value of $\alpha$ for each $m$ can be found by evaluating

$\sigma_m$ at its boundaries and stationary points (which may be found by solving a quadratic equation). The value of $\alpha$ can then be solved for by iterating over $m = 1, \dots, M$ and taking the minimum over all the pieces. Since there are $M$ pieces, the time to calculate $\alpha$ scales as $O(M)$.

6.4.5 Relative run time difference of IB vs EB measured in flops

A crucial aspect of any neural network algorithm is not only its convergence rate per epoch, but also its run time. Since IB does extra calculations, it is slower than EB. Here we show that the difference in floating point operations (flops) between EB and IB is typically small, on the order of 30% or less. For any given layer, let the input dimension be denoted as $n$ and the output dimension as $m$. The run time of both EB and IB is dominated by three operations costing $nm$ flops: multiplying the weight matrix and input in the forward propagation, multiplying the backpropagated gradient and input for the weight matrix gradient, and multiplying the backpropagated gradient and weight matrix for the backpropagated gradient in the lower layer. IB has two extra calculations as compared to EB: calculating $\|z^{ki}\|_2^2$, costing $2n$ flops, and calculating $\alpha^{ki}_j$ a total of $m$ times, once for each output node. Denoting the number of flops to calculate each $\alpha^{ki}_j$ with activation function $\sigma$ as $c_\sigma$, the relative increase in run time of IB over EB is upper bounded by $(2n + c_\sigma m)/(3nm)$.

The relative run time increase of IB over EB depends on the values of $n$, $m$ and $c_\sigma$. When $\sigma$ is the relu, then $c_\sigma$ is small, no more than 10 flops; whereas when $\sigma$ is the arctan, $c_\sigma$ is larger, costing just less than 100 flops.3 Taking these flop values as upper bounds, if $n = m = 100$ then the relative run time increase is upper bounded by 4% for relu and 34% for arctan. These bounds diminish as $n$ and $m$ grow. If $n = m = 1000$, then the bounds become just 0.4% for relu and 3.4% for arctan. Thus, IB's run time is virtually the same as EB's for large neural network tasks. If $n$ and $m$ are small then the arctan IB update might be too slow relative to EB for the IB update to be worthwhile. In this case simpler sigmoidal activation functions, such as the hardtanh or smoothstep, may be preferable for IB. The hardtanh has been used before in neural networks, mainly in the context of binarized networks [Courbariaux et al., 2015]. It has the form

$$\sigma(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$$

for which $c_\sigma$ is no more than 15 flops (using the piecewise-cubic function update from Section 6.4.4). The smoothstep is like the hardtanh but is both continuous and has continuous first derivatives,

$$\sigma(x) = \begin{cases} -1 & \text{if } x < -1 \\ \tfrac{3}{2}x - \tfrac{1}{2}x^3 & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1, \end{cases}$$

with $c_\sigma$ being no more than 25 flops. The relative increase in run time of IB over EB for the hardtanh and smoothstep is about the same as for the relu. This makes the IB update with the hardtanh or smoothstep practical even if $n$ and $m$ are small. Since the hardtanh and smoothstep saturate (i.e. have zero gradient) for $|x| > 1$, they also have the very desirable property of having bounded IB updates even if $\eta \to \infty$, $b \to \infty$ and $\mu \to 0$.

3The number of flops for arctan was counted using Cardano's method for optimizing the cubic equation.
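The run-time bound and the percentages quoted above are easy to check directly; the short sketch below evaluates the bound with the $c_\sigma$ estimates given in the text and also defines the two saturating activations (a minimal illustration, not a benchmarked flop count).

def ib_overhead_bound(n, m, c_sigma):
    # Upper bound on the relative flop increase of IB over EB: (2n + c_sigma*m) / (3nm).
    return (2 * n + c_sigma * m) / (3.0 * n * m)

# Reproduces the figures quoted above (c_sigma ~ 10 for relu, ~100 for arctan).
print(ib_overhead_bound(100, 100, 10))     # 0.04  -> 4%   for relu
print(ib_overhead_bound(100, 100, 100))    # 0.34  -> 34%  for arctan
print(ib_overhead_bound(1000, 1000, 10))   # 0.004 -> 0.4%
print(ib_overhead_bound(1000, 1000, 100))  # 0.034 -> 3.4%

def hardtanh(x):
    # -1 for x < -1, identity on [-1, 1], 1 for x > 1.
    return max(-1.0, min(1.0, x))

def smoothstep(x):
    # Like the hardtanh but with continuous first derivatives at the kinks.
    xc = max(-1.0, min(1.0, x))
    return 1.5 * xc - 0.5 * xc ** 3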

6.5 Experiments

This section details the results of three sets of experiments in which the robustness of IB to the learning rate is compared to that of EB.4 Since IB is equivalent to EB when the learning rate is small, we expect little difference between the methods in the limit of small learning rates. However, we expect that IB will be more stable and have lower loss than EB for larger learning rates.

Classification, autoencoding and music prediction tasks. For the first set of experiments, we applied IB and EB to three different but common machine learning tasks. The first task was image classification on the MNIST dataset [LeCun, 1998] with an architecture consisting of two convolutional layers, an arctan layer and a relu layer. The second task also uses the MNIST dataset, but for an 8 layer relu autoencoding architecture. The third task involves music prediction on four music datasets, JSB Chorales, MuseData, Nottingham and Piano-midi.de [Boulanger-Lewandowski et al., 2012], for which a simple RNN architecture is used with an arctan activation function. For each dataset-architecture pair we investigated the performance of EB and IB over a range of learning rates where EB performs well (see Appendix E.4.5 for more details on how these learning rates were chosen). Since IB and EB have similar asymptotic convergence rates, the difference between the methods will be most evident in the initial epochs of training. In our experiments we only focus on the performance after the first epoch of training for the MNIST datasets, and after the fifth epoch for the music datasets.5 Five random seeds are used for each configuration in order to understand the variance of the performance of the methods.6 Figure 6.3 displays the results for the experiments. EB and IB have near-identical performance when the learning rate is small. However, as the learning rate increases, the performance of EB deteriorates far more quickly as compared to IB. Over the six datasets, the learning rate at which

4A more extensive description of the experimental setup and results are given in Appendices E.4 and E.5.

5The music datasets have fewer training examples and so more epochs are needed to see convergence.

6The same seeds are used for both EB and IB. Five seeds are used for all experiments, except MNIST-classification where twenty seeds are used.

[Figure 6.3: six panels (MNIST-classification, JSB Chorales-RNN, MuseData-RNN, MNIST-autoencoder, Nottingham-RNN, Piano-midi.de-RNN) plotting training loss against learning rate for explicit and implicit backpropagation, with exact ISGD also shown for MNIST-classification.]

Figure 6.3: Training loss of EB and IB (lower is better). The plots display the mean performance and one standard deviation errors. The MNIST-classification plot also shows an “exact” version of ISGD with an inner gradient descent optimizer.

IB starts to diverge is at least 20% higher than that of EB.7 The benefit of IB over EB is most noticeable for the music prediction problems where for high learning rates IB has much better mean performance and lower variance than EB. A potential explanation for this behaviour is that IB is better able to deal with exploding gradients, which are more prevalent in RNN training.

Exact ISGD. For MNIST-classification we also investigate the potential performance of exact ISGD. Instead of using our IB approximation for the ISGD update, we directly optimize (6.1) using gradient descent. For each ISGD update we take a total of 100 gradient descent steps of (6.1) at a learning rate 10 times smaller than the “outer” learning rate ηt. It is evident from Figure 6.3 that this method achieves the best performance and is remarkably robust to the learning rate. Since exact ISGD uses 100 extra gradient descent steps per iteration, it is 100 times slower than the other methods, and is thus impractically slow. However, its impressive performance indicates that ISGD-based methods have great potential for neural networks.

7The threshold for divergence that we use is when the mean loss exceeds the average of EB's minimum and maximum mean losses measured on that dataset over the learning rates that were tested.

Run times. According to the bounds derived in Section 6.4.5, the run time of IB should be no more than 12% longer per epoch than EB on any of our experiments. With our basic PyTorch implementation IB took between 16% and 216% longer in practice, depending on the architecture used (see Appendix E.5.1 for more details). With a more careful implementation of IB we expect these run times to decrease towards the levels indicated by the bounds. Using activation functions with more efficient IB updates, like the smoothstep instead of the arctan, would further reduce the run time.

UCI datasets. Our second set of experiments is on 121 classification datasets from the UCI database [Fernández-Delgado et al., 2014]. We consider a 4 layer feedforward neural network run for 10 epochs on each dataset. In contrast to the above experiments, we use the same coarse grid of 10 learning rates between 0.001 and 50 for all datasets. For each algorithm and dataset the best performing learning rate was selected based on its performance on the training set. The neural network trained with this learning rate was then applied to the test set. Overall we found IB to have a 0.13% higher average accuracy on the test set. The similarity in performance of IB and EB is likely due to the small size of the datasets (some datasets have as few as 4 features) and the relatively shallow architecture making the network relatively easy to train, as well as the coarseness of the learning rate grid.

Clipping. In our final set of experiments we investigated the effect of clipping on IB and EB. Both IB and clipping can be interpreted as approximate trust-region methods. Consequently, we expect IB to be less influenced by clipping than EB. This was indeed observed in our experiments. A total of 9 experiments were run with different clipping thresholds applied to RNNs on the music datasets (see Appendix E.4 for details). Clipping improved EB's performance for higher learning rates in 7 out of the 9 experiments, whereas IB's performance was only improved in 2. IB without clipping had an equal or lower loss than EB with clipping for all learning rates in all experiments except for one (Piano-midi.de with a clipping threshold of 0.1). This suggests that IB is a more effective method for training RNNs than EB with clipping. The effect of clipping on IB and EB applied to MNIST-classification and MNIST-autoencoder is more complicated. In both cases clipping enabled IB and EB to have lower losses for higher learning rates. For MNIST-classification it is still the case that IB has uniformly superior performance to

[Figure 6.4: three panels, MNIST-classification (clip = 1.0), MNIST-autoencoder (clip = 1.0) and Nottingham-RNN (clip = 8.0), plotting training loss against learning rate for explicit and implicit backpropagation with clipping, with exact ISGD also shown for MNIST-classification.]

Figure 6.4: Training loss of EB with clipping and IB with clipping on three dataset-architecture pairs.

EB, but for MNIST-autoencoder this is reversed. It is not surprising that EB with clipping may outperform IB with clipping. If the clipping threshold is small enough then the clipping-induced trust region will be smaller than that induced by IB. This makes EB with clipping and IB with clipping act the same for large gradients; however, below the clipping threshold EB's unclipped steps may be able to make more progress than IB's dampened steps. See Figure 6.4 for plots of EB and IB's performance with clipping.

Summary. We ran a total of 17 experiments on the MNIST and music datasets8. IB outperformed EB in 15 out of these. On the UCI datasets IB had slightly better performance than EB on average. The advantage of IB is most pronounced for RNNs, where for large learning rates IB has much lower losses and consistently outperforms EB with clipping. Although IB takes slightly longer to train, this is offset by its ability to get good performance with higher learning rates, which should enable it to get away with less hyperparameter tuning.

6.6 Conclusion

In this chapter we developed the first method for applying ISGD to neural networks. We showed that, through careful approximations, ISGD can be made to run nearly as quickly as standard backpropagation while still retaining the property of being more robust to high learning rates. The resulting method, which we call Implicit Backpropagation, consistently matches or outperforms standard backpropagation on image recognition, autoencoding and music prediction tasks; and is particularly effective for robust RNN training. The success of IB demonstrates the potential of ISGD methods to improve neural network training. It may be the case that there are better ways to approximate ISGD than IB, which could produce even better results. For example, the techniques behind Hessian-Free methods could be used to make a quadratic approximation of the higher layers in IB (as opposed to the linear approximation currently used); or a second order approximation could be made directly to the ISGD formulation in (6.1). Developing and testing such methods is a ripe area for future research. The IB method, as presented in this chapter, also has potential for future research. Extending its updates to other activation functions and applying it to other architectures and datasets will lead to a better understanding of IB's properties and the conditions under which IB performs best.

8MNIST-classification with and without clipping, MNIST-autoencoder with and without clipping, the four music datasets without clipping and nine experiments on the music datasets with clipping.

Part III

Bibliography

Bibliography

Do lefties have an advantage in sports? It depends. https://www.nytimes.com/2017/11/21/science/lefties-sports-advantage.html. Accessed: 2017-11-24.

UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.

Tennis navigator. http://www.tennisnavigator.com/, 2004. Accessed: 2015-02-03.

About tennis Europe. http://www.tenniseurope.org/page.aspx?id=12173, 2015. Accessed: 2017-01-01.

The physical activity council’s annual study tracking sports, fitness, and recreation participation in the US. Report, The Physical Activity Council, 2016.

Daniel M Abrams and Mark J Panaggio. A model balancing cooperation and competition can explain our right-handed world and the dominance of left-handed athletes. Journal of the Royal Society Interface, 2012. ISSN 1742-5689.

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.

John P Aggleton and Charles J Wood. Is there a left-handed advantage in “ballistic” sports? International Journal of Sport Psychology, 1990. ISSN 0047-0767.

AI Group at Arris Pharmaceutical Corporation. UCI machine learning repository, 1994.

Aki Vehtari. MCMC diagnostics for Matlab.

Selcuk Akpinar, Robert L Sainburg, Sadettin Kirazci, and Andrzej Przybyla. Motor asymmetry in elite fencers. Journal of motor behavior, 47(4):302–311, 2015.

Yoann Altmann, Steve McLaughlin, and Nicolas Dobigeon. Sampling from a multivariate Gaussian distribution truncated on a simplex: A review. In Statistical Signal Processing (SSP), 2014 IEEE Workshop on, pages 113–116. IEEE, 2014.

Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In HLT- NAACL, pages 244–249, 2015.

Boris Bačić and Ali H Gazala. Left-handed representation in top 100 male professional tennis players: Multi-disciplinary perspectives. http://tmg.aut.ac.nz/tmnz2016/papers/Boris2016.pdf. Accessed: 2017-06-15.

David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.

Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

Yoshua Bengio, Jean-Sébastien Senécal, et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, 2003.

Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8624–8628. IEEE, 2013.

Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163, 2011.

Dimitri P Bertsekas. Incremental aggregated proximal and augmented Lagrangian algorithms. arXiv preprint arXiv:1509.09257, 2015.

Sylvain Billiard, Charlotte Faurie, and Michel Raymond. Maintenance of handedness polymorphism in humans: A frequency-dependent selection model. Journal of Theoretical Biology, 235(1):85–93, 2005. ISSN 0022-5193.

Patrizia S Bisiacchi, H Ripoll, J F Stein, P Simonet, and G Azemar. Left-handedness in fencers: An attentional advantage? Perceptual and motor skills, 61(2):507–513, 1985. ISSN 0031-5125.

David M Blei. Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1:203–232, 2014.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

George EP Box. Science and statistics. Journal of the American Statistical Association, 71(356): 791–799, 1976.

Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4):324–345, December 1952. ISSN 0006-3444. doi: 10.2307/2334029. URL http://www.jstor.org/stable/2334029.

Michael Braun and Andre Bonfrer. Scalable inference of customer similarities from interactions data using Dirichlet processes. Marketing Science, 30(3):513–531, 2011.

Kristijan Breznik. On the gender effects of handedness in professional tennis. Journal of sports science & medicine, 12(2):346, 2013.

Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2015.

Supratik Chakraborty, Daniel J Fremont, Kuldeep S Meel, Sanjit A Seshia, and Moshe Y Vardi. Distribution-aware sampling and weighted model counting for SAT. arXiv preprint arXiv:1404.2984, 2014.

Venkat Chandrasekaran, Nathan Srebro, and Prahladh Harsha. Complexity of inference in graphical models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 70–78. AUAI Press, 2008.

Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6):772–799, 2008.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv:1312.3005, 2013.

Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, and Svn Vishwanathan. Implicit online learning with kernels. In Advances in neural information processing systems, pages 249–256, 2007.

Nicolas Chopin and James Ridgway. Leave Pima Indians alone: Binary regression as a benchmark for Bayesian computation. arXiv preprint arXiv:1506.08640, 2015.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.

Botond Cseke and Tom Heskes. Approximate marginals in latent Gaussian models. The Journal of Machine Learning Research, 12:417–454, 2011.

Yixiong Cui, Miguel-Ángel Gómez, Bruno Gonçalves, Hongyou Liu, and Jaime Sampaio. Effects of experience and relative quality in tennis match performance during four grand slams. International Journal of Performance Analysis in Sport, 17(5):783–801, 2017a.

Yixiong Cui, Miguel-Ángel Gómez, Bruno Gonçalves, and Jaime Sampaio. Identifying different tennis player types: An exploratory approach to interpret performance based on player features. In Complex Systems in Sport, International Congress Linking Theory and Practice, page 97, 2017b.

John P Cunningham, Philipp Hennig, and Simon Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv preprint arXiv:1111.6832, 2011.

Anja Daems and Karl Verfaillie. Viewpoint-dependent priming effects in the perception of human actions and body postures. Visual cognition, 6(6):665–693, 1999. ISSN 1350-6285.

Hal Daume III, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic time one-against-some. arXiv:1606.04988, 2016.

Herbert A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, Hoboken, N.J, 3rd edition, August 2003. ISBN 9780471389262.

Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. arXiv:1511.05042, 2015.

Nando De Freitas, Pedro Højen-Sørensen, Michael I Jordan, and Stuart Russell. Variational MCMC. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 120–127. Morgan Kaufmann Publishers Inc., 2001.

Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pages 676–684, 2016.

Guillaume P Dehaene and Simon Barthelmé. Bounding errors of expectation-propagation. In Advances in Neural Information Processing Systems, pages 244–252, 2015.

Marc Deisenroth and Shakir Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems, pages 2609–2617, 2012.

Julio Del Corral and Juan Prieto-Rodríguez. Are differences in ranks good predictors for grand slam tennis matches? International Journal of Forecasting, 26(3):551–563, 2010.

Ned A Dochtermann, CM Gienger, and Shane Zappettini. Born to win? Maybe, but perhaps only against inferior competition. Animal Behaviour, 96:e1–e3, 2014. ISSN 0003-3472.

Randal Douc, Christian P Robert, et al. A vanilla Rao–Blackwellization of Metropolis–Hastings algorithms. The Annals of Statistics, 39(1):261–277, 2011.

John Duchi and Feng Ruan. Stochastic methods for composite optimization problems. arXiv preprint arXiv:1703.08570, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

David Edwards and Havranek Toma. A fast procedure for model search in multidimensional contingency tables. Biometrika, 72(2):339–351, 1985.

A. E. Elo. The rating of chessplayers, past and present. Arco Pub., 1978.

Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, 2014.

Gerhard H Fischer and Ivo W Molenaar. Rasch models: Foundations, recent developments, and applications. Springer Science & Business Media, 2012.

Adrian E. Flatt. Is being left-handed a handicap? The short and useless answer is “yes and no.”. Proceedings of Baylor University Medical Center, 21(3):304–307, 2008.

Sergey Foss, Dmitry Korshunov, Stan Zachary, et al. An introduction to heavy-tailed and subexpo- nential distributions, volume 6. Springer, 2011.

Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences. Statistical science, pages 457–472, 1992.

Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Taylor & Francis, 2014.

Norman Geschwind and Albert M Galaburda. Cerebral lateralization: Biological mechanisms, associations, and pathology: I. A hypothesis and a program for research. Archives of neurology, 42(5):428–459, 1985. ISSN 0003-9942.

Stefano Ghirlanda, Elisa Frasnelli, and Giorgio Vallortigara. Intraspecific competition and coordination in the evolution of lateralization. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1519):861–866, 2009.

Mark E. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48:377–394, 1999.

Stephen R Goldstein and Charlotte A Young. “Evolutionary” stable strategy of handedness in major league baseball. Journal of Comparative Psychology, 110(2):164, 1996. ISSN 1939-2087.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In International Conference on Machine Learning, pages 289–297, 2013.

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for GPUs. arXiv:1609.04309, 2016.

Emily Grossman, M Donnelly, R Price, D Pickens, V Morgan, G Neighbor, and R Blake. Brain areas involved in perception of biological motion. Journal of cognitive neuroscience, 12(5):711–720, 2000.

George Grouios, Haralambos Tsorbatzoudis, Konstantinos Alexandris, and Vassilis Barkoukis. Do left-handed competitors have an innate superiority in sports? Perceptual and motor skills, 90(3 suppl):1273–1282, 2000. ISSN 0031-5125.

Recep Gursoy. Effects of left-or right-hand preference on the success of boxers in Turkey. British Journal of Sports Medicine, 43(2):142–144, 2009.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

Kun He. Stochastic functional descent for learning Support Vector Machines. PhD thesis, 2014.

Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill™: A Bayesian skill rating system. In Advances in neural information processing systems, pages 569–576, 2007.

David W Holtzen. Handedness and professional tennis. International Journal of Neuroscience, 105 (1-4):101–119, 2000. ISSN 0020-7454.

Ryo Iwaki and Minoru Asada. Implicit incremental natural actor critic. In International Conference on Neural Information Processing, pages 749–758. Springer, 2017.

Yacine Jernite, Anna Choromanska, David Sontag, and Yann LeCun. Simultaneous learning of trees and representations for extreme classification, with application to language modeling. arXiv:1610.04658, 2016.

Shihao Ji, SVN Vishwanathan, Nadathur Satish, Michael J Anderson, and Pradeep Dubey. Blackout: Speeding up recurrent neural network language models with very large vocabularies. arXiv:1511.06909, 2015.

Mark Johnson, Thomas L Griffiths, and Sharon Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. 2007.

Bikash Joshi, Massih-Reza Amini, Ioannis Partalas, Franck Iutzeler, and Yury Maximov. Aggressive sampling for multi-class to binary reduction with applications to text classification. arXiv:1701.06511, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Betty R Kirkwood and Jonathan AC Sterne. Essential medical statistics. John Wiley & Sons, 2010.

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.

Young Jun Ko and Matthias Seeger. Expectation propagation for rectified linear poisson regression. In Proceedings of the Seventh Asian Conference on Machine Learning, number EPFL-CONF-214372, 2015.

Balaji Krishnapuram, Lawrence Carin, Mario AT Figueiredo, and Alexander J Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE transactions on pattern analysis and machine intelligence, 27(6):957–968, 2005.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

Brian Kulis and Peter L Bartlett. Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 575–582. Citeseer, 2010.

Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. The Journal of Machine Learning Research, 6:1679–1704, 2005.

Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.

Shiwei Lan and Babak Shahbaba. Sampling constrained probability distributions using spherical augmentation. arXiv preprint arXiv:1506.05936, 2015.

Yann LeCun. The MNIST database of handwritten digits. 1998. http://yann.lecun.com/exdb/mnist/.

Jonathan Liew. Wimbledon 2015: Once they were great - but where have all the lefties gone? The Telegraph, June 2015.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for first-order convex optimization: From theory to practice. arXiv preprint arXiv:1712.05654, 2017.

Scott Linderman, Matthew Johnson, and Ryan P Adams. Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems, pages 3456–3464, 2015.

Qiang Liu, John W Fisher III, and Alexander T Ihler. Probabilistic variational bounds for graphical models. In Advances in Neural Information Processing Systems, pages 1432–1440, 2015.

Florian Loffing. Left-handedness and time pressure in elite interactive ball games. Biology Letters, 13(11), 2017. ISSN 1744-9561. doi: 10.1098/rsbl.2017.0446.

Florian Loffing and Norbert Hagemann. Pushing through evolution? Incidence and fight records of left-oriented fighters in professional boxing history. Laterality: Asymmetries of Body, Brain and Cognition, 20(3):270–286, 2015. ISSN 1357-650X.

Florian Loffing and Norbert Hagemann. Chapter 12 - performance differences between left- and right-sided athletes in one-on-one interactive sports. In Florian Loffing, Norbert Hagemann, Bernd Strauss, and Clare MacMahon, editors, Laterality in Sports, pages 249–277. Academic Press, San Diego, 2016. ISBN 978-0-12-801426-4.

Florian Loffing, Norbert Hagemann, and Bernd Strauss. Left-handedness in professional and amateur tennis. PLoS One, 7(11), 2012a. ISSN 1932-6203.

Florian Loffing, Norbert Hagemann, and Bernd Strauss. Left-handedness in professional and amateur tennis. PLoS One, 7(11):e49325, 2012b.

Duncan R. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, New York, 1959. ISBN 9780486441368.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.

James Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.

James Martens and Ilya Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer, 2011.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013b.

Thomas Minka. Power EP. Technical report, Microsoft Research, Cambridge, 2004.

Thomas Minka. Divergence measures and message passing. Technical report, 2005.

Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.

Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv:1206.6426, 2012.

Frederick Mosteller. Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations. Psychometrika, 16(1):3–9, 1951. ISSN 0033-3123. doi: 10.1007/BF02313422. URL http://dx.doi.org/10.1007/BF02313422.

Kevin P Murphy. Machine learning: A probabilistic perspective. MIT press, 2012.

Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 467–475. Morgan Kaufmann Publishers Inc., 1999.

Iain Murray and Ryan P Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.

Iain Murray and Zoubin Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 392–399. AUAI Press, 2004.

Iain Murray, Ryan P Adams, and David Mackay. Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics, pages 541–548, 2010.

Stephen Mussmann and Stefano Ermon. Learning and inference via maximum inner product search. In International Conference on Machine Learning, pages 2587–2596, 2016.

Ruth D Nass and Michael S Gazzaniga. Cerebral lateralization and specialization in human central nervous system. Comprehensive Physiology, 5, 1987. ISSN 0470650710.

Radford Neal. Slice sampling. Annals of Statistics, 31:705–741, 2003.

Radford M Neal. Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. In Learning in graphical models, pages 205–228. Springer, 1998.

Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Mark EJ Newman, Gerard T Barkema, and MEJ Newman. Monte Carlo methods in statistical physics, volume 13. Clarendon Press Oxford, 1999.

Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(10), 2008.

Robert Nishihara, Iain Murray, and Ryan P. Adams. Parallel MCMC with generalized elliptical slice sampling. J. Mach. Learn. Res., 15(1):2087–2112, January 2014. ISSN 1532-4435.

Akihiko Nishimura and David Dunson. Recycling intermediate steps to improve Hamiltonian Monte Carlo. arXiv preprint arXiv:1511.06925, 2015.

Sebastian Nowozin and Christoph H Lampert. Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4):3–4, 2011.

Peter O’Donoghue. Normative profiles of sports performance. International Journal of Performance Analysis in Sport, 5(1):104–119, 2005.

Peter O’Donoghue. Interacting performances theory. International Journal of Performance Analysis in Sport, 9(1):26–46, 2009.

Ari Pakman and Liam Paninski. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems, pages 2490–2498, 2013.

Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.

Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for gradient-based non-convex optimization. arXiv preprint arXiv:1703.10993, 2017.

Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Galinari. LSHTC: A benchmark for large-scale text classification. arXiv:1503.08581, 2015.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

Andrei Patrascu and Ion Necoara. Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. arXiv preprint arXiv:1706.06297, 2017.

Jonathan W Pillow, Liam Paninski, and Eero P Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural model. In NIPS, pages 1311–1318. Citeseer, 2003.

Nicholas G Polson, James G Scott, and Jesse Windle. The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733, 2014.

Parameswaran Raman, Shin Matsushima, Xinhua Zhang, Hyokun Yun, and SVN Vishwanathan. DS-MLR: Exploiting double separability for scaling up distributed multinomial logistic regression. arXiv:1604.04706, 2016.

Michel Raymond, Dominique Pontier, Anne-Beatrice Dufour, and Anders Pape Moller. Frequency-dependent maintenance of left handedness in humans. Proceedings of the Royal Society of London B: Biological Sciences, 263(1377):1627–1633, 1996. ISSN 0962-8452.

James Ridgway. EPGLM: Gaussian Approximation of Bayesian Binary Regression Models, 2016. R package version 1.1.1.

Gareth O Roberts, Andrew Gelman, Walter R Gilks, et al. Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability, 7(1):110–120, 1997.

Gareth O Roberts, Jeffrey S Rosenthal, et al. Optimal scaling for various Metropolis-Hastings algorithms. Statistical science, 16(4):351–367, 2001.

Francisco JR Ruiz, Michalis K Titsias, Adji B Dieng, and David M Blei. Augment and reduce: Stochastic inference for large categorical distributions. arXiv preprint arXiv:1802.04220, 2018.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

Roland T Rust and Anthony J Zahorik. Customer satisfaction, customer retention, and market share. Journal of retailing, 69(2):193–215, 1993.

Ernest K Ryu and Stephen Boyd. Stochastic proximal iteration: A non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014.

Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In AISTATS, volume 1, page 3, 2009.

Mark Schmidt. UGM: A Matlab toolbox for probabilistic undirected graphical models, 2007.

Jörg Schorer, Florian Loffing, Norbert Hagemann, and Joseph Baker. Human handedness in interactive situations: Negative perceptual frequency effects can be reversed! Journal of sports sciences, 30(5):507–513, 2012.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

Farhad Shahrokhi and David W Matula. The maximum concurrent flow problem. Journal of the ACM (JACM), 37(2):318–334, 1990.

Vince Sigillito. UCI machine learning repository. University of California, Irvine, School of Infor- mation and Computer Sciences, 1989.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Raymond T Stefani. Survey of the major world sports rating systems. Journal of Applied Statistics, 24(6):635–646, 1997.

Stanislaw Sterkowicz, Grzegorz Lech, and Jan Blecharz. Effects of laterality on the technical/tactical behavior in view of the results of Judo fights. Archives of Budo, 6(4):173–177, 2010. ISSN 1643-8698.

James V Stone. Object recognition: View-specificity and motion-specificity. Vision research, 39 (24):4032–4044, 1999. ISSN 0042-6989.

Hidemaro Suwa and Synge Todo. Markov chain Monte Carlo method without detailed balance. Physical review letters, 105(12):120603, 2010.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions.

Francesco Taddei, Maria Pia Viggiano, and Luciano Mecacci. Pattern reversal visual evoked potentials in fencers. International journal of psychophysiology, 11(3):257–260, 1991. ISSN 0167-8760.

Aviv Tamar, Panos Toulis, Shie Mannor, and Edoardo M Airoldi. Implicit temporal differences. arXiv preprint arXiv:1412.6734, 2014.

L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927. ISSN 1939-1471(Electronic);0033-295X(Print). doi: 10.1037/h0070288.

Michalis K Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Advances In Neural Information Processing Systems, pages 4161–4169, 2016.

Panagiotis Toulis. Implicit methods for iterative estimation with large data sets. PhD thesis, 2016.

Panos Toulis and Edoardo M Airoldi. Implicit stochastic approximation. arXiv:1510.00967, 2015a.

Panos Toulis and Edoardo M Airoldi. Scalable estimation strategies based on stochastic approximations: classical results and new insights. Statistics and computing, 25(4):781–795, 2015b.

Panos Toulis, Thibaut Horel, and Edoardo M Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. arXiv preprint arXiv:1510.00967, 2015.

Panos Toulis, Dustin Tran, and Edo Airoldi. Towards stability and optimality in stochastic gradient descent. In Artificial Intelligence and Statistics, pages 1290–1298, 2016.

Panos Toulis, Edoardo M Airoldi, et al. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.

Fredrik Ullén, David Zachary Hambrick, and Miriam Anna Mosing. Rethinking expertise: A multifactorial gene–environment interaction model of expert performance. Psychological bulletin, 142(4):427, 2016.

Pascal Vincent, Alexandre de Brébisson, and Xavier Bouthillier. Efficient exact gradient update for training deep networks with very large sparse targets. In Advances in Neural Information Processing Systems, pages 1108–1116, 2015.

Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Fugao Wang and DP Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. Physical review letters, 86(10):2050, 2001.

Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. arXiv preprint arXiv:1702.06269, 2017.

Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, pages 1714–1722, 2016.

Sandra F Witelson. The brain connection: The corpus callosum is larger in left-handers. Science, 229(4714):665–668, 1985. ISSN 0036-8075.

William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 1995.

CJ Wood and JP Aggleton. Handedness in 'fast ball' sports: Do left-handers have an innate advantage? British Journal of Psychology, 80(2):227–240, 1989. ISSN 2044-8295.

Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems, pages 5285–5294, 2017.

Ian EH Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit Dhillon, and Eric Xing. Ppdsparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 545–553. ACM, 2017.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.

Jian Zhang, Ioannis Mitliagkas, and Christopher Ré. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.

Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles A Sutton. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 3194–3202, 2012.

Part IV

Appendices

Appendix A

Elliptical Slice Sampling with Expectation Propagation

Here we present a brief introduction to Monte Carlo, Markov Chain Monte Carlo (MCMC) and slice sampling methods.

A.1 Monte Carlo integration

Monte Carlo methods are a set of sampling based approaches that estimate integrals of the form

$$I = \int g(x)\, p(x)\, dx = E_p[g(X)] \approx \frac{1}{n}\sum_{i=1}^n g(x_i),$$
where $p$ is a probability distribution and $x_1, x_2, \ldots, x_n$ are $n$ samples from $p$. The $x_i$ variables may be dependent, but should be identically distributed. The accuracy of the approximation improves as $n$ increases.
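As a small illustration (not part of the thesis experiments; the choices of $g$ and $p$ below are arbitrary), the following Python sketch estimates $E_p[g(X)]$ for $g(x) = x^2$ with $p$ a standard normal, whose true value is 1:

import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_estimate(g, sampler, n):
    """Estimate E_p[g(X)] by averaging g over n samples drawn from p."""
    samples = sampler(n)
    return np.mean(g(samples))

# g(x) = x^2 with p the standard normal, so E_p[g(X)] = 1.
for n in [100, 10_000, 1_000_000]:
    est = monte_carlo_estimate(lambda x: x**2, lambda m: rng.standard_normal(m), n)
    print(n, est)

The printed estimates approach 1 as $n$ grows, illustrating the convergence described above.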

A.2 Markov Chain Monte Carlo (MCMC)

Certain probability distributions are difficult to sample from, but their density at any given point can be easily evaluated up to a multiplicative constant. This occurs most often when the normalizing constant of the distribution is unknown, i.e. $p(x) = f(x)/Z$ where $f$ is known but $Z = \int f(x)\, dx$ is not. It is possible to sample from such distributions using a Markov chain. Starting from any point $x$ in the distribution's domain, a Markov chain Monte Carlo algorithm proposes a sequence of points which are probabilistically either accepted or rejected according to a specific criterion$^1$ such that the empirical distribution of the accepted points converges to the distribution $p$. These points can then be used for Monte Carlo integration.

A.3 Slice sampling

Slice sampling is a type of MCMC method. Let $p(x) = f(x)/Z$ be the distribution we would like to sample from where $f$ is known but $Z$ is not. Also let $x$ be the current point in the Markov chain. Slice sampling generates the next point $x'$ in the Markov chain using the following steps:

1. Sample a slice height y uniformly at random from the interval (0, f(x)],

2. Sample $x'$ uniformly at random from the set $f^{-1}[y, +\infty)$.

Step 2 is often too hard to do in practice. Instead the next point $x' \in f^{-1}[y, +\infty)$ may be generated via a sequence of expanding then contracting intervals containing the current point $x$; a minimal sketch of this procedure follows below. For more details, see [Neal, 2003].
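The following Python sketch implements these two steps for a univariate unnormalized density, using the stepping-out and shrinkage procedure of [Neal, 2003] for Step 2. It is an illustrative sketch only; the target $f$ below (an unnormalized standard normal) and the initial interval width are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

def slice_sample_step(f, x, w=1.0):
    """One slice-sampling transition for an unnormalized density f."""
    y = rng.uniform(0.0, f(x))              # Step 1: sample the slice height
    # Step 2: step out to find an interval containing the slice, then shrink.
    left = x - rng.uniform(0.0, w)
    right = left + w
    while f(left) > y:
        left -= w
    while f(right) > y:
        right += w
    while True:
        x_new = rng.uniform(left, right)
        if f(x_new) > y:
            return x_new
        # Shrink the interval towards the current point and try again.
        if x_new < x:
            left = x_new
        else:
            right = x_new

f = lambda x: np.exp(-0.5 * x**2)           # unnormalized standard normal
chain = [0.0]
for _ in range(5000):
    chain.append(slice_sample_step(f, chain[-1]))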

$^1$The probability of acceptance is typically proportional to the density at the proposed point and inversely proportional to the transition probability to that point.

Appendix B

Annular Augmentation Sampling

B.1 Rao-Blackwellization

Here we provide a brief description of Rao-Blackwellization. Let us assume that we are interested in calculating an expectation $I = E[Y]$. We may calculate an estimate of this expectation by taking $n > 0$ samples $y_1, y_2, \ldots, y_n$ of $Y$ and approximating $I \approx \frac{1}{n}\sum_{i=1}^n y_i$. Let $X$ denote any random variable. Rao-Blackwellization is based on the observation that

$$I = E[Y] = E[E[Y \mid X]].$$

The Rao-Blackwell estimate of $I$ is obtained by taking $n > 0$ samples $x_1, x_2, \ldots, x_n$ of $X$ and approximating $I \approx \frac{1}{n}\sum_{i=1}^n E[Y \mid X = x_i]$. This estimate can be shown to have lower variance than that obtained using samples of $Y$.
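A small simulation illustrating the variance reduction (purely illustrative; the model $X \sim N(0,1)$, $Y \mid X \sim N(X, 1)$ is chosen only because $E[Y \mid X] = X$ is available in closed form):

import numpy as np

rng = np.random.default_rng(0)
n = 1000

def estimates():
    x = rng.standard_normal(n)
    y = x + rng.standard_normal(n)
    crude = np.mean(y)              # average of samples of Y
    rao_blackwell = np.mean(x)      # average of E[Y | X = x_i]
    return crude, rao_blackwell

reps = np.array([estimates() for _ in range(2000)])
print("crude estimator variance:        ", reps[:, 0].var())
print("Rao-Blackwell estimator variance:", reps[:, 1].var())

Here the Rao-Blackwell estimator has roughly half the variance of the crude estimator, consistent with the statement above.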

B.2 2D-Ising model results

The subsections below provide numerical values for the 2D-Ising model experiments and give more insight into the performance of the algorithms. First the effects of LBP (Appendix B.2.1) and Rao-Blackwellization (Appendix B.2.2) are analyzed, after which Root Mean Squared Errors (RMSEs) are provided for the node and pairwise marginals for each of the experiments (Appendix B.2.3).

B.2.1 LBP accuracy

In Tables B.1 and B.2 below we provide the absolute error between the LBP marginals and the true marginals (as computed by the junction tree algorithm) for different parameter settings. Table B.1 relates to Figure 3.3, where the target distribution has multiple modes. The LBP approximation clearly starts to degrade as we increase W, which hurts the +LBP samplers, leaving them with performance equal to that of HMC and CMH (see Figure 3.3). Table B.2 relates to Figure 3.5, where the target distribution is unimodal. As the bias increases, the target distribution becomes more peaked and the LBP approximation remains consistently accurate, more so at higher bias values. As a result the +LBP samplers achieve the best level of performance (see Figure 3.5).

B.2.2 Benefit of Rao-Blackwellization

Table B.3 relates to Figure 3.4. Here we show the RMSE of the node marginal estimates for AAS and AAS + RB to highlight the benefit of Rao-Blackwellization. The target distribution in this case is bimodal, with the two modes corresponding to $S = \pm 1$ (all coordinates $+1$ or all coordinates $-1$). When we are at one of the modes, say $S = 1$, the opposite point $S = -1$ also lies on the annulus (by the over-relaxed property of the annular augmentation framework). With Rao-Blackwellization we are able to take advantage of this fact, thereby leading to a more accurate estimate of the node marginals.

B.2.3 RMSE for node and pairwise marginals

Here we provide the numerical values used to generate the heat maps in Figures 3.3, 3.4, and 3.5.

Table B.1: LBP error for different values of W relating to Figure 3.3. The bias scale c is fixed at 0.1.

Strength (W)   LBP approximation error
0.1   0.0002
0.2   0.0035
0.3   0.0182
0.4   0.0610
0.5   0.1783
0.6   0.8895
0.7   9.3581
0.8   25.5220

Table B.2: LBP error for different values of c relating to Figure 3.5. Strength W is fixed at 0.2.

Bias scale (c)   LBP approximation error (×10⁻³)
0.2   3.44
0.4   5.60
0.6   5.77
0.8   5.10
1.0   4.88
1.5   1.99
2.0   0.55
3.0   0.10
4.0   0.05

Table B.3: RMSE of node marginals estimate for increasing values of W relating to Figure 3.4. No bias is applied to any node.

Strength (W)   AAS (×10⁻²)   AAS + RB (×10⁻²)
0.25   1.31   0.03
0.5    1.18   0.03
0.75   1.19   0.03
1.0    1.13   0.03
2.0    1.29   0.03
3.0    1.08   0.03
4.0    1.01   0.03
5.0    1.23   0.03

Table B.4: Numerical values of RMSE (Node Marginals) for Figure 3.3. Bias scale c = 0.2 and we increase the strength (W).

Algorithm W = 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

HMC               0.0211 0.0220 0.0513 0.1498 0.2907 0.4515 0.4117 0.4749
CMH               0.0134 0.0188 0.0475 0.1583 0.3990 0.4670 0.3992 0.4752
CMH + LBP         0.0132 0.0168 0.0466 0.1576 0.3965 0.4527 0.3992 0.4752
AAS               0.0267 0.0416 0.0761 0.1495 0.1585 0.2238 0.2564 0.0558
AAS + RB          0.0214 0.0385 0.0741 0.1490 0.1587 0.2243 0.2563 0.0547
AAS + RB + LBP    0.0097 0.0221 0.0664 0.1418 0.2358 0.3200 0.3884 0.4741
AAG               0.0225 0.0352 0.0635 0.1566 0.1393 0.2694 0.3017 0.0597
AAG + RB          0.0169 0.0313 0.0609 0.1558 0.1419 0.2692 0.3042 0.0581
AAG + RB + LBP    0.0081 0.0166 0.0600 0.1580 0.1661 0.3559 0.3962 0.4752
AAST              0.0407 0.0359 0.0675 0.1205 0.1605 0.3081 0.3128 0.0598
AAST + RB         0.0404 0.0352 0.0671 0.1201 0.1637 0.3073 0.3128 0.0594
AAST + RB + LBP   0.0285 0.0223 0.0536 0.1766 0.1817 0.3921 0.3971 0.4746

Table B.5: Numerical values of RMSE (Pairwise Marginals) for Figure 3.4. Bias scale c = 0.2 and we increase the strength (W).

Algorithm W = 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

HMC               0.0211 0.0220 0.0513 0.1498 0.2907 0.4515 0.4117 0.4749
CMH               0.0134 0.0188 0.0475 0.1583 0.3990 0.4670 0.3992 0.4752
CMH + LBP         0.0132 0.0168 0.0466 0.1576 0.3965 0.4527 0.3992 0.4752
AAS               0.0267 0.0416 0.0761 0.1495 0.1585 0.2238 0.2564 0.0558
AAS + RB          0.0214 0.0385 0.0741 0.1490 0.1587 0.2243 0.2563 0.0547
AAS + RB + LBP    0.0097 0.0221 0.0664 0.1418 0.2358 0.3200 0.3884 0.4741
AAG               0.0225 0.0352 0.0635 0.1566 0.1393 0.2694 0.3017 0.0597
AAG + RB          0.0169 0.0313 0.0609 0.1558 0.1419 0.2692 0.3042 0.0581
AAG + RB + LBP    0.0081 0.0166 0.0600 0.1580 0.1661 0.3559 0.3962 0.4752
AAST              0.0407 0.0359 0.0675 0.1205 0.1605 0.3081 0.3128 0.0598
AAST + RB         0.0404 0.0352 0.0671 0.1201 0.1637 0.3073 0.3128 0.0594
AAST + RB + LBP   0.0285 0.0223 0.0536 0.1766 0.1817 0.3921 0.3971 0.4746

Table B.6: Numerical values of RMSE (Node Marginals) for Figure 3.4. No bias is applied to any node and we increase the strength (W).

Algorithm W = 0.25 0.5 0.75 1 2 3 4 5

HMC               0.0744 0.1183 0.2173 0.4947 0.8978 1.1969 1.0837 1.1509
CMH               0.0590 0.1148 0.2131 0.5072 1.0911 1.2287 1.0594 1.1517
CMH + LBP         0.0576 0.1123 0.2124 0.5030 1.0848 1.2015 1.0595 1.1519
AAS               0.0913 0.1595 0.2757 0.4734 0.5462 0.6680 0.7240 0.3864
AAS + RB          0.0776 0.1520 0.2711 0.4721 0.5464 0.6687 0.7237 0.3853
AAS + RB + LBP    0.0614 0.1273 0.2590 0.4652 0.6926 0.8719 1.0223 1.1478
AAG               0.0818 0.1434 0.2465 0.5005 0.5497 0.7834 0.8410 0.3986
AAG + RB          0.0676 0.1350 0.2416 0.4988 0.5544 0.7830 0.8460 0.3978
AAG + RB + LBP    0.0585 0.1212 0.2435 0.4970 0.5961 0.9621 1.0493 1.1518
AAST              0.1240 0.1453 0.2545 0.4267 0.5883 0.8768 0.8699 0.3966
AAST + RB         0.1231 0.1436 0.2532 0.4247 0.5915 0.8749 0.8698 0.3964
AAST + RB + LBP   0.0598 0.1217 0.2286 0.5576 0.6239 1.0547 1.0522 1.1495

Table B.7: Numerical values of RMSE (Pairwise Marginals) for Figure 3.4. No bias is applied to any node and we increase the strength (W).

Algorithm W = 0.25 0.5 0.75 1 2 3 4 5

HMC               0.1524 1.0189 1.2243 1.0591 0.9981 1.0291 1.0194 1.0089
CMH               0.1515 1.0731 1.2237 1.0542 0.9979 1.0045 1.0506 1.0296
CMH + LBP         0.1514 1.1662 1.2215 1.0465 0.9997 1.0505 1.0402 1.0300
AAS               0.1735 0.4281 0.3581 0.2211 0.3481 0.2603 0.2992 0.2797
AAS + RB          0.1695 0.4272 0.3574 0.2193 0.3466 0.2590 0.2984 0.2781
AAS + RB + LBP    0.1739 0.4158 0.3191 0.2608 0.2632 0.2902 0.3504 0.3653
AAG               0.1682 0.4943 0.3828 0.2943 0.2102 0.2206 0.2687 0.2803
AAG + RB          0.1647 0.4938 0.3823 0.2934 0.2061 0.2196 0.2660 0.2792
AAG + RB + LBP    0.1659 0.4812 0.3778 0.1813 0.2565 0.2687 0.2659 0.3266
AAST              0.1674 0.4869 0.3788 0.1151 0.2658 0.1427 0.2347 0.2527
AAST + RB         0.1701 0.5010 0.3947 0.1566 0.1647 0.2753 0.2224 0.2688
AAST + RB + LBP   0.1666 0.4624 0.3915 0.1518 0.2993 0.2452 0.2323 0.2302

Table B.8: Numerical values of RMSE (Node Marginals) for Figure 3.5. We fix W = 0.2 and the bias scale c is increased.

Algorithm c = 0.2 0.4 0.6 0.8 1 1.5 2 3 4

HMC               0.0262 0.0360 0.0381 0.0444 0.0444 0.0432 0.0442 0.0502 0.0351
CMH               0.0233 0.0347 0.0351 0.0430 0.0437 0.0413 0.0421 0.0489 0.0327
CMH + LBP         0.0217 0.0332 0.0333 0.0421 0.0423 0.0403 0.0418 0.0482 0.0318
AAS               0.0462 0.0681 0.0754 0.0766 0.0801 0.0789 0.0806 0.0819 0.0637
AAS + RB          0.0431 0.0658 0.0731 0.0749 0.0785 0.0777 0.0797 0.0811 0.0630
AAS + RB + LBP    0.0244 0.0419 0.0416 0.0525 0.0476 0.0423 0.0418 0.0479 0.0319
AAG               0.0406 0.0492 0.0666 0.0694 0.0677 0.0626 0.0688 0.0774 0.0645
AAG + RB          0.0386 0.0471 0.0641 0.0674 0.0660 0.0616 0.0678 0.0767 0.0636
AAG + RB + LBP    0.0229 0.0365 0.0381 0.0461 0.0463 0.0410 0.0423 0.0483 0.0316
AAST              0.0312 0.0573 0.0589 0.0619 0.0702 0.0732 0.0572 0.0659 0.0555
AAST + RB         0.0445 0.0597 0.0675 0.0649 0.0786 0.0782 0.0499 0.0586 0.0430
AAST + RB + LBP   0.0230 0.0462 0.0463 0.0459 0.0442 0.0513 0.0324 0.0505 0.0337

Table B.9: Numerical values of RMSE (Pairwise Marginals) for Figure 3.5. We fix W = 0.2 and the bias scale c is increased.

Algorithm c = 0.2 0.4 0.6 0.8 1 1.5 2 3 4

HMC               0.1229 0.1327 0.1313 0.1430 0.1429 0.1297 0.1313 0.1477 0.1037
CMH               0.1187 0.1300 0.1226 0.1384 0.1399 0.1238 0.1237 0.1427 0.0962
CMH + LBP         0.1165 0.1269 0.1188 0.1363 0.1365 0.1206 0.1224 0.1403 0.0934
AAS               0.1635 0.2085 0.2226 0.2269 0.2346 0.2326 0.2380 0.2406 0.1882
AAS + RB          0.1554 0.2015 0.2159 0.2213 0.2294 0.2285 0.2349 0.2383 0.1862
AAS + RB + LBP    0.1278 0.1503 0.1392 0.1648 0.1505 0.1272 0.1227 0.1397 0.0934
AAG               0.1498 0.1781 0.1902 0.2001 0.2155 0.2045 0.2116 0.2186 0.1724
AAG + RB          0.1425 0.1711 0.1833 0.1941 0.2099 0.2003 0.2081 0.2161 0.1703
AAG + RB + LBP    0.1251 0.1375 0.1327 0.1498 0.1482 0.1234 0.1239 0.1406 0.0930
AAST              0.1479 0.1814 0.1892 0.2080 0.1980 0.2102 0.1772 0.1940 0.1527
AAST + RB         0.1563 0.1852 0.1936 0.1918 0.2283 0.2297 0.1474 0.1774 0.1250
AAST + RB + LBP   0.1225 0.1379 0.1411 0.1466 0.1354 0.1409 0.0977 0.1227 0.0937

B.3 Heart disease dataset

Here we take samples of $W$ from the 1000 Metropolis-Hastings (MH) Markov chains described in Section 3.3.2 and fit a normal distribution to them. The results are shown in Figure B.1 for three cases: "True" refers to the MH sampler where the exact partition function ratio was used, and the other two refer to the approximate MH samplers for which the partition function ratio was approximated by CMH or AAS. As shown in Figure 3.6, AAS approximated the partition function ratio more accurately than CMH, and this effect can also be seen in Figure B.1: the posterior distribution of $W$ estimated using CMH deviates noticeably from the true posterior. The results for HMC and AAG were almost identical to AAS and we therefore omit those plots.

B.4 CMH + LBP sampler

Here we provide details of the CMH + LBP sampler as introduced in Section 3.3, and for which the results are shown in Figures 3.3, 3.4 and 3.5. Let $\hat p$ be the LBP "prior" of the target distribution $p$ as defined in Section 3.2. In regular CMH one coordinate is chosen uniformly, flipped and accepted with the usual Metropolis-Hastings (MH) acceptance criterion. It is likely that this procedure could be improved using the information in $\hat p$. A simple idea is to use $\hat p$ as a proposal distribution for CMH. Letting $s^{(t)}$ be the current point in the Markov chain, CMH + LBP works as follows:

1. Sample a coordinate $i$ uniformly from $\{1, \ldots, d\}$.

2. With probability $\hat p(S_i = -s_i)$, propose flipping coordinate $i$.

3. If a flip is proposed, accept it with MH probability $\alpha = \min\left\{1, \frac{p(S_i=-s_i,\, S_{-i}=s_{-i})\,\hat p(S_i=s_i)}{p(S_i=s_i,\, S_{-i}=s_{-i})\,\hat p(S_i=-s_i)}\right\}$.

Here $S_{-i} = \{S_j : j \ne i\}$. It is easy to verify the reversibility of the above algorithm. Although the method is correct, its naive implementation will be wasteful and slow. As an illustration, consider a 2D-Ising model with a large positive bias applied to all nodes. As explained in Section 3.3.1, the target distribution in this case is unimodal and highly peaked, with the mode corresponding to $S_i = 1$ for all $i$. Now suppose that we are at the mode, i.e. $s_i^{(t)} = 1$ for all $i$. The value of $\hat p(S_i = -s_i)$ will be small for all coordinates $i$ and so it will take many samples before a flip is proposed, wasting valuable computation.


Figure B.1: Estimated posterior distributions of $W_{ij}$ for $1 \le i < j \le 6$ on the heart disease dataset. The samples from each MH algorithm have had a normal distribution fit to them for easier comparison. "True" refers to the MH sampler where the exact partition function ratio was used, whereas "CMH" and "AAS" refer to the approximate MH samplers for which the partition function ratio was approximated by CMH or AAS, respectively.

To get around this problem we can use discrete event simulation. First note that the probability of a flip being proposed in any given iteration is

$$\alpha(s) = \frac{1}{d}\sum_{j=1}^d \hat p(S_j = -s_j).$$
Therefore the number of iterations until a flip is proposed is a geometric random variable with success probability $\alpha(s)$. The conditional probability that coordinate $i$ is proposed to flip, given

that a flip is proposed, is $\frac{\hat p(S_i=-s_i)}{\sum_{j=1}^d \hat p(S_j=-s_j)}$. The CMH + LBP sampler with discrete event simulation can be summarized as:

1. Compute $\alpha(s^{(t)}) = \frac{1}{d}\sum_{i=1}^d \hat p(S_i = -s_i^{(t)})$.

2. Sample a flipping time $k \sim \text{Geometric}(\alpha(s^{(t)}))$. For the next $k$ iterations we stay at the current point, i.e. $s^{(t)} = s^{(t+1)} = \ldots = s^{(t+k)}$.

3. Propose flipping coordinate $i$ with probability $\frac{\hat p(S_i=-s_i)}{\sum_{j=1}^d \hat p(S_j=-s_j)}$.

4. Accept the flip with probability $\min\left\{1, \frac{p(S_i=-s_i,\, S_{-i}=s_{-i})\,\hat p(S_i=s_i)}{p(S_i=s_i,\, S_{-i}=s_{-i})\,\hat p(S_i=-s_i)}\right\}$.

This saves computational effort as now we only consider iterations where a flip is actually proposed. It is this implementation of CMH + LBP that is used in the experiments; a minimal sketch is given below.
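The Python sketch below illustrates this discrete-event implementation for a state $s \in \{-1,+1\}^d$. It is not the thesis code: the unnormalized target log-probability log_p and the LBP marginal function p_hat_flip(i, s) (returning $\hat p(S_i = -s_i)$) are placeholders to be supplied by the surrounding model, and the iteration bookkeeping is simplified.

import numpy as np

rng = np.random.default_rng(0)

def cmh_lbp_sample(log_p, p_hat_flip, s0, n_iters):
    """CMH + LBP with discrete event simulation (a sketch).

    log_p(s): unnormalized log-probability of the target p.
    p_hat_flip(i, s): LBP "prior" probability p_hat(S_i = -s_i).
    Returns the chain of visited states, one entry per iteration.
    """
    s = np.array(s0, dtype=int)
    d = len(s)
    chain, t = [], 0
    while t < n_iters:
        flip_probs = np.array([p_hat_flip(i, s) for i in range(d)])
        alpha = flip_probs.mean()                 # prob. a flip is proposed per iteration
        k = rng.geometric(alpha)                  # iterations until a flip is proposed
        stay = min(k, n_iters - t)
        chain.extend([s.copy()] * stay)           # chain stays at s while no flip is proposed
        t += stay
        if t >= n_iters:
            break
        i = rng.choice(d, p=flip_probs / flip_probs.sum())   # coordinate proposed to flip
        s_prop = s.copy()
        s_prop[i] = -s_prop[i]
        # MH acceptance with the LBP proposal correction p_hat(S_i=s_i)/p_hat(S_i=-s_i).
        log_ratio = (log_p(s_prop) - log_p(s)
                     + np.log(1.0 - flip_probs[i]) - np.log(flip_probs[i]))
        if np.log(rng.uniform()) < log_ratio:
            s = s_prop
        chain.append(s.copy())
        t += 1
    return chain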

B.5 Simulated data

B.5.1 d = 10

Here we give details of the simulated data used to generate the results in Figure 3.6. Simulated data was generated using known parameter values $(\hat W, \hat B)$ in a fully connected BM. The connection matrix $\hat W$ was drawn randomly from a unit Gaussian, i.e. $\hat W_{i,j} \sim N(0, 1)$, and a very small bias $\hat B_i \sim \text{Unif}[-0.1, 0.1]$ was applied to each node $i$. A data set $\mathcal{D} = \{s^{(j)}\}_{j=1}^N$ of 400 binary vectors was generated with the likelihood
$$p(S = s \mid \hat W, \hat B) = \frac{1}{Z(\hat W, \hat B)}\exp\Big(\sum_{i<j}\hat W_{i,j}s_is_j + \sum_i \hat B_is_i\Big). \qquad \text{(B.1)}$$

Since the model is only 10-dimensional, we can sample from the likelihood in (B.1) exactly; a minimal sketch of this by enumeration is given below. This simulated data is then modeled as the output of a fully connected BM, where an uninformative $N(0, 5)$ prior was placed on the parameters $(W, b)$.
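With only $2^{10} = 1024$ states, (B.1) can be normalized by enumeration and sampled exactly. The following Python sketch illustrates the idea (variable names are illustrative, not the thesis code):

import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n_data = 10, 400

# Ground-truth parameters: unit-Gaussian connections and small uniform biases.
W_hat = np.triu(rng.standard_normal((d, d)), 1)        # keep only i < j terms
B_hat = rng.uniform(-0.1, 0.1, size=d)

# Enumerate all 2^d states in {-1,+1}^d and their unnormalized log-probabilities.
states = np.array(list(itertools.product([-1, 1], repeat=d)))
log_w = np.einsum('ij,si,sj->s', W_hat, states, states) + states @ B_hat
probs = np.exp(log_w - log_w.max())
probs /= probs.sum()                                    # normalization plays the role of Z

# Draw the dataset of 400 binary vectors exactly from (B.1).
data = states[rng.choice(len(states), size=n_data, p=probs)]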

Appendix C

The Advantage of Lefties in One-On-One Sports

C.1 Proof of Proposition 1

Proof. We begin by observing that the exchangeability of players implies

$$p(n_l \mid L; n, N) = p(n_l \mid \mathrm{Top}_{n,N}, L; n). \qquad \text{(C.1)}$$

Integrating (C.1) over the handedness of the top n players yields

$$\lim_{N\to\infty} p(n_l \mid L; n, N) = \lim_{N\to\infty} \sum_{h_{1:n}\in\{0,1\}^n} p(n_l, H_{1:n} = h_{1:n} \mid \mathrm{Top}_{n,N}, L; n)$$
$$= \lim_{N\to\infty} \sum_{h_{1:n}\in\{0,1\}^n :\, \sum_{i=1}^n h_i = n_l} p(H_{1:n} = h_{1:n} \mid \mathrm{Top}_{n,N}, L; n). \qquad \text{(C.2)}$$
We can expand each term in the summation by conditioning on the top players' skills,

$$\lim_{N\to\infty} p(H_{1:n} = h_{1:n} \mid \mathrm{Top}_{n,N}, L; n) = \lim_{N\to\infty} \int_{\mathbb{R}^n} p(H_{1:n} = h_{1:n} \mid S_{1:n}, \mathrm{Top}_{n,N}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n}$$
$$= \lim_{N\to\infty} \int_{\mathbb{R}^n} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n}. \qquad \text{(C.3)}$$
Most of the proof will focus on showing that the r.h.s. of (C.3) equals $\alpha(L)^{n_l}(1-\alpha(L))^{n_r}$ where $\alpha(L) := q/(q + (1-q)c(L))$, $n_r := n - n_l$ is the number of top right-handers and $c(L)$ is the tail-length of the innate skill distribution as defined in the statement of the proposition. We will write $S_{[i]}$ for the $i$th order statistic of the skills $S_{1:N}$ and write $H_{[i]}$ for the corresponding induced order statistic. Given that the innate skills have support $\mathbb{R}$ we have $F(k \mid L) < 1$ for all $k \in \mathbb{R}$, where $F$ denotes the conditional CDF of a player's total skill given $L$. Note that for any values of $k$, $L$ and $\epsilon > 0$, we can find $N_{k,L,\epsilon} \in \mathbb{N}$ such that $N^n F(k \mid L)^{N-n} < \epsilon$ for all $N \ge N_{k,L,\epsilon}$. It therefore follows that for such $k$, $L$, $\epsilon$ and $N$ we have

$$\int_{S_{[n]}\le k} p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} = \int_{S_{[n]}\le k} \binom{N}{n} F(S_{[n]} \mid L)^{N-n} \prod_{i=1}^n f(S_i \mid L)\, dS_{1:n} \le \binom{N}{n} F(k \mid L)^{N-n} \int_{\mathbb{R}^n} \prod_{i=1}^n f(S_i \mid L)\, dS_{1:n} \le N^n F(k \mid L)^{N-n} < \epsilon \qquad \text{(C.4)}$$
where $f$ denotes the PDF of $F$. The conditional distribution of player handedness in (C.3) factorizes as
$$p(H_{1:n} = h_{1:n} \mid S_{1:n}, L) = \prod_{i=1}^n p(H_i = h_i \mid S_i, L) \qquad \text{(C.5)}$$
and consider now the $i$th term in this product. We have

$$p(H_i = 1 \mid S_i = s, L) = \frac{p(S_i = s \mid H_i = 1, L)\, p(H_i = 1 \mid L)}{p(S_i = s \mid L)} = \frac{g(s - L)\, q}{q\, g(s - L) + (1-q)\, g(s)} = \frac{q}{q + (1-q)\frac{g(s)}{g(s-L)}}. \qquad \text{(C.6)}$$
Assuming $c(L) := \lim_{s\to\infty} \frac{g(s)}{g(s-L)}$ exists as stated in the proposition, we can take limits across (C.6) to obtain

$$\lim_{s\to\infty} p(H_i = 1 \mid S_i = s, L) = \frac{q}{q + (1-q)c(L)},$$
which we recognize as $\alpha(L)$ as defined above. A similar argument for the case $h_i = 0$ yields $\lim_{s\to\infty} p(H_i = 0 \mid S_i = s, L) = 1 - \alpha(L)$. Since the limit of a finite product of functions, each having a finite limit, is equal to the product of the limits, it therefore follows that for all $\epsilon > 0$

there exists a $k \in \mathbb{R}$ such that if $s_i > k$ for $i = 1, \ldots, n$ then
$$\epsilon > \Big| p(H_{1:n} = h_{1:n} \mid S_{1:n} = s_{1:n}, L) - \prod_{i=1}^n \alpha(L)^{h_i}(1-\alpha(L))^{1-h_i} \Big| = \Big| p(H_{1:n} = h_{1:n} \mid S_{1:n} = s_{1:n}, L) - \alpha(L)^{n_l}(1-\alpha(L))^{n_r} \Big|. \qquad \text{(C.7)}$$

We are now in a position to prove that

$$\lim_{N\to\infty} \int_{\mathbb{R}^n} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} = \alpha(L)^{n_l}(1-\alpha(L))^{n_r}. \qquad \text{(C.8)}$$

For any $\epsilon > 0$ and all $N > N_{k_{\epsilon/3}, L, \epsilon/3}$ we have
$$\Big| \alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \int_{\mathbb{R}^n} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \Big|$$
$$\le \Big| \alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \int_{S_{[n]}\ge k_{\epsilon/3}} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \Big|$$
$$+ \int_{S_{[n]}\le k_{\epsilon/3}} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n}. \qquad \text{(C.9)}$$
Observe that

$$\int_{S_{[n]}\ge k_{\epsilon/3}} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \le \Big(\alpha(L)^{n_l}(1-\alpha(L))^{n_r} + \frac{\epsilon}{3}\Big)\int_{S_{[n]}\ge k_{\epsilon/3}} p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \le \alpha(L)^{n_l}(1-\alpha(L))^{n_r} + \frac{\epsilon}{3} \qquad \text{(C.10)}$$
where the first inequality follows from (C.7). Similarly, its minimum value is bounded by

$$\int_{S_{[n]}\ge k_{\epsilon/3}} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \ge \Big(\alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \frac{\epsilon}{3}\Big)\int_{S_{[n]}\ge k_{\epsilon/3}} p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n}$$
$$= \Big(\alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \frac{\epsilon}{3}\Big)\Big(1 - \int_{S_{[n]}\le k_{\epsilon/3}} p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n}\Big) \ge \Big(\alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \frac{\epsilon}{3}\Big)\Big(1 - \frac{\epsilon}{3}\Big) \ge \alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \frac{\epsilon}{3} - \frac{\epsilon}{3} \ge \alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \frac{2\epsilon}{3} \qquad \text{(C.11)}$$
where the first inequality follows from (C.7) and the second inequality follows from (C.4). Combining the upper and lower bounds of (C.10) and (C.11) implies that the first term on the r.h.s. of (C.9) is bounded above by $2\epsilon/3$. The second term on the r.h.s. of (C.9) satisfies

$$\int_{S_{[n]}\le k_{\epsilon/3}} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \le \int_{S_{[n]}\le k_{\epsilon/3}} p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \le \epsilon/3.$$
We can therefore rewrite the bound in (C.9) as
$$\Big| \alpha(L)^{n_l}(1-\alpha(L))^{n_r} - \int_{\mathbb{R}^n} p(H_{1:n} = h_{1:n} \mid S_{1:n}, L)\, p(S_{1:n} \mid \mathrm{Top}_{n,N}, L)\, dS_{1:n} \Big| \le \frac{2\epsilon}{3} + \frac{\epsilon}{3} = \epsilon,$$
completing the proof of (C.8). Substituting (C.8) into (C.3) and (C.3) into (C.2) yields

$$\lim_{N\to\infty} p(n_l \mid L; n, N) = \sum_{h_{1:n}\in\{0,1\}^n :\, \sum_{i=1}^n h_i = n_l} \alpha(L)^{n_l}(1-\alpha(L))^{n_r} = \binom{n}{n_l}\alpha(L)^{n_l}(1-\alpha(L))^{n_r},$$
so the limiting distribution is Binomial$(n_l; \alpha(L), n)$ as desired.

C.2 Rate of Convergence for Probability of Left-Handed Top Player Given Normal Innate Skills

Throughout this subsection we use $\phi(\cdot)$ and $\Phi(\cdot)$ to denote the PDF and CDF, respectively, of a standard normal random variable. We also assume that the advantage of left-handedness, $L$, is a strictly positive and known constant. The notation $S_{[1]}$ is used to indicate the maximum of $N$ IID variables $S_1, \ldots, S_N$ with $H_{[1]}$ being the induced order statistic corresponding to $S_{[1]}$ (see [David and Nagaraja, 2003, Ch. 6.8] for more background on induced order statistics).

Proposition 7. If the innate skills are IID standard normal and the advantage of left-handedness, $L$, is strictly positive, then $p(H_{[1]} = 0 \mid L) = \Omega(\exp(-L\sqrt{\log N}))$.

Proof. Since $L > 0$ is assumed known we will typically not condition on it explicitly in the arguments below. This means the expectations that appear below are never over $L$. We begin by observing that

$$p(H_{[1]} = 0 \mid L) = E_{S_{[1]}}[p(H_{[1]} = 0 \mid S_{[1]})] = E\left[\frac{\phi(S_{[1]})(1-q)}{q\phi(S_{[1]}-L) + (1-q)\phi(S_{[1]})}\right] = E\left[\frac{1}{\exp(S_{[1]}L)\exp(-L^2/2)\frac{q}{1-q} + 1}\right] \qquad \text{(C.12)}$$
where the second equality follows from precisely the same argument that we used to derive (C.7).

Lemma C.2.1 below implies that we can replace $S_{[1]}$ in (C.12) with $X_{[1]} + L$, where $X_{[1]}$ is the maximum of $N$ IID standard normal random variables, and with the equality replaced by a greater-than-or-equal-to inequality. We therefore obtain
$$p(H_{[1]} = 0 \mid L) \ge E\left[\frac{1}{\exp((X_{[1]}+L)L)\exp(-L^2/2)\frac{q}{1-q} + 1}\right] = E\left[\frac{1}{\exp(LX_{[1]})/(2c_1) + 1}\right] \qquad \text{(C.13)}$$
where $c_1 := \frac{1-q}{2q}\exp(-L^2/2) > 0$. A little algebra shows that the denominator in (C.13) is less than or equal to $\exp(LX_{[1]})/c_1$ if $X_{[1]} \ge c_2 := \ln(2c_1)/L$ and it is less than 2 otherwise. Thus we have
$$p(H_{[1]} = 0 \mid L) \ge E\left[\frac{c_1}{\exp(LX_{[1]})} I_{\{X_{[1]}\ge c_2\}} + \frac{1}{2} I_{\{X_{[1]}< c_2\}}\right].$$

Dropping the second term inside the expectation,
$$p(H_{[1]} = 0 \mid L) \ge c_1 E[\exp(-LX_{[1]}) I_{\{X_{[1]}\ge c_2\}}] \ge c_1 E[\exp(-LX_{[1]})] - c_1 c_3 N\Phi(c_2)^{N-1} \ge c_1\exp(-L\,E[X_{[1]}]) - c_1 c_3 N\Phi(c_2)^{N-1}$$
$$\ge c_1\exp\big({-L}\sqrt{2\log N}\big) - c_1 c_3 N\Phi(c_2)^{N-1} = \Omega\big(\exp(-L\sqrt{\log N})\big)$$
where the second inequality follows from Lemma C.2.2, the third inequality follows from Jensen's inequality, the fourth follows from Lemma C.2.3 (a standard result that bounds the maximum of IID standard normal random variables) and $c_3 := E[\exp(-LX) I_{\{X\le c_2\}}]$.

IID standard normal random variables) and c3 := E[exp( LX)I ]. − {X≤c2} APPENDIX C. THE ADVANTAGE OF LEFTIES IN ONE-ON-ONE SPORTS 148

As stated earlier, in each of the following Lemmas it is assumed that X1,...,XN are IID standard normal and X[1] is their maximum.

Lemma C.2.1. Define the function

$$d(s) := \Big(\exp(sL)\exp(-L^2/2)\frac{q}{1-q} + 1\Big)^{-1}.$$
If the advantage of left-handedness, $L$, is strictly positive then $E[d(S_{[1]})] \ge E[d(X_{[1]} + L)]$.

Proof. We first recall that the CDF of the total skill, $S$, is given by $F(s) = q\Phi(s - L) + (1-q)\Phi(s)$ and that $F(s) \ge \Phi(s - L)$ for all $s$ if $L > 0$. We then obtain

$$E[d(S_{[1]})] - E[d(X_{[1]} + L)] = \int_{-\infty}^{\infty} d(s)\, N F(s)^{N-1} f(s)\, ds - \int_{-\infty}^{\infty} d(x+L)\, N\Phi(x)^{N-1}\phi(x)\, dx$$
$$= \int_{-\infty}^{\infty} d(s)\, N F(s)^{N-1} f(s)\, ds - \int_{-\infty}^{\infty} d(s)\, N\Phi(s-L)^{N-1}\phi(s-L)\, ds = \int_{-\infty}^{\infty} d(s)\Big[N F(s)^{N-1} f(s) - N\Phi(s-L)^{N-1}\phi(s-L)\Big] ds$$
$$= \Big[d(s)\big(F(s)^N - \Phi(s-L)^N\big)\Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \frac{\partial d(s)}{\partial s}\big(F(s)^N - \Phi(s-L)^N\big)\, ds > 0,$$
with the boundary term equal to zero, $\partial d(s)/\partial s < 0$ and $F(s)^N - \Phi(s-L)^N > 0$,

where the second to last line follows from integration by parts, and the last line follows because for any $L > 0$ we have (i) $d(s)$ is a strictly monotonically decreasing function of $s$ and (ii) $F(s)^N > \Phi(s-L)^N$.

Lemma C.2.2. For any constants $L$ and $c$, we have

$$E[\exp(-LX_{[1]}) I_{\{X_{[1]}\ge c\}}] \ge E[\exp(-LX_{[1]})] - aN\Phi(c)^{N-1}$$
where $a := E[\exp(-LX) I_{\{X\le c\}}]$ and $X$ is a standard normal random variable.

Proof. We first note that $\Phi(x) \le \Phi(c)$ for all $x < c$, from which it immediately follows that

$$\Phi(x)^{N-1} N\phi(x)\exp(-Lx) \le \Phi(c)^{N-1} N\phi(x)\exp(-Lx). \qquad \text{(C.15)}$$

Integrating both sides of (C.15) w.r.t. $x$ from $-\infty$ to $c$ yields
$$\int_{-\infty}^{c} \Phi(x)^{N-1} N\phi(x)\exp(-Lx)\, dx \le \int_{-\infty}^{c} \Phi(c)^{N-1} N\phi(x)\exp(-Lx)\, dx,$$
from which we obtain

$$E[\exp(-LX_{[1]}) I_{\{X_{[1]}\le c\}}] \le N\Phi(c)^{N-1} E[\exp(-LX) I_{\{X\le c\}}] \qquad \text{(C.16)}$$
where $X$ is a standard normal random variable. The statement of the Lemma now follows by noting

$$E[\exp(-LX_{[1]}) I_{\{X_{[1]}\ge c\}}] = E[\exp(-LX_{[1]})] - E[\exp(-LX_{[1]}) I_{\{X_{[1]}< c\}}]$$
and applying (C.16) to bound the second term.

The following Lemma is well known but we include it here for the sake of completeness.

Lemma C.2.3. $E[X_{[1]}] \le \sqrt{2\log N}$.

Proof. The proof follows from a simple application of Jensen's inequality and the fact that the sum of non-negative variables is larger than their maximum. For any constant $\beta \in \mathbb{R}$ we have

$$E[X_{[1]}] = \frac{1}{\beta}\log\exp\big(E[\beta X_{[1]}]\big) \le \frac{1}{\beta}\log E[\exp(\beta X_{[1]})] \le \frac{1}{\beta}\log E\Big[\sum_{i=1}^N \exp(\beta X_i)\Big] = \frac{1}{\beta}\log\big(N E[\exp(\beta X)]\big) = \frac{1}{\beta}\log\big(N\exp(\beta^2/2)\big) \le \sqrt{2\log N},$$
where we chose $\beta = \sqrt{2\log N}$ to obtain the final inequality.

C.3 Difference in Skills at Given Quantiles with Laplace Distributed Innate Skills

Proposition 8. If the innate skills are IID Laplace distributed with mean 0 and scale parameter $b > 0$, then for any fixed finite $L$ and sufficiently small quantiles $\lambda_i, \lambda_j \in (0, 1)$
$$\lim_{N\to\infty}\big(S_{[N\lambda_j]} - S_{[N\lambda_i]}\big) = b\log(\lambda_i/\lambda_j)$$
where the convergence is in probability and we use $S_{[N\lambda_i]}$ to denote the $[N\lambda_i]$th order statistic$^1$ of the skills $S_1, \ldots, S_N$.

Proof. Since the innate skills are Laplace distributed with mean 0 and scale $b > 0$ they have the CDF
$$G(S) = \begin{cases} \frac{1}{2}\exp(S/b) & \text{if } S < 0 \\ 1 - \frac{1}{2}\exp(-S/b) & \text{if } S \ge 0 \end{cases} \qquad \text{(C.17)}$$
while the total skill distribution for someone from the mixed population of left- and right-handers has CDF $F(S \mid L) = qG(S - L) + (1-q)G(S)$. For skills $S \ge \max\{L, 0\}$ it therefore follows from (C.17) that

$$F(S \mid L) = q\Big(1 - \frac{1}{2}\exp(-(S-L)/b)\Big) + (1-q)\Big(1 - \frac{1}{2}\exp(-S/b)\Big) = 1 - \frac{1}{2}\exp(-S/b)\big[q\exp(L/b) + 1 - q\big]. \qquad \text{(C.18)}$$

Since the Laplace distribution has domain $\mathbb{R}$, for any fixed finite $L$ we have $\lim_{\lambda\downarrow 0} F^{-1}(1-\lambda \mid L) = \infty$. Thus for sufficiently small quantiles $\lambda \in (0, 1)$ we have $F^{-1}(1-\lambda \mid L) > \max\{L, 0\}$. Consider such a value of $\lambda$. Since $F(F^{-1}(1-\lambda \mid L) \mid L) = 1 - \lambda$ it follows that

$$\lambda = 1 - F(F^{-1}(1-\lambda \mid L) \mid L) = \frac{1}{2}\exp\big({-F^{-1}(1-\lambda \mid L)}/b\big)\big[q\exp(L/b) + 1 - q\big] \qquad \text{(C.19)}$$
where the second equality follows from (C.18) with $S = F^{-1}(1-\lambda \mid L)$. Simplifying (C.19) now yields
$$F^{-1}(1-\lambda \mid L) = -b\log(\lambda) + b\log\big([q\exp(L/b) + 1 - q]/2\big). \qquad \text{(C.20)}$$

$^1$Here $[\cdot]$ denotes rounding to the nearest integer.

From [David and Nagaraja, 2003, pp. 288] we know that

$$\lim_{N\to\infty}\big(S_{[N\lambda_j]} - S_{[N\lambda_i]}\big) = F^{-1}(1-\lambda_j \mid L) - F^{-1}(1-\lambda_i \mid L) \qquad \text{(C.21)}$$
where convergence is understood to be in probability. Substituting (C.20) into (C.21) then yields

(for $\lambda_i$ and $\lambda_j$ sufficiently small)
$$\lim_{N\to\infty}\big(S_{[N\lambda_j]} - S_{[N\lambda_i]}\big) = -b\log(\lambda_j) + b\log\big([q\exp(L/b) + 1 - q]/2\big) - \Big({-b}\log(\lambda_i) + b\log\big([q\exp(L/b) + 1 - q]/2\big)\Big) = b\log(\lambda_i/\lambda_j)$$
as claimed.

C.4 Estimating the Kalman Filtering Smoothing Parameter

Here we explain how to compute the MLE for $\sigma_K^2$ as discussed in Section 4.6, where we developed a Kalman filter / smoother for estimating the lefty advantage $L_t$ through time. The likelihood of the observed handedness data over the time interval $t = 1, \ldots, T$ is given in (4.33) and satisfies

$$p(n^l_{1:T}; n_{1:T}, \sigma_K^2) \propto \int_{\mathbb{R}^{T+1}} N(L_0; 0, \theta^2)\prod_{t=1}^T N(L_t; L_{t-1}, \sigma_K^2)\, N(L_t; L_t^*, \sigma_t^2)\, dL_{0:T}$$
where $\sigma_t^2 := b^2\frac{n_t}{n^r_t n^l_t}$. To perform MLE over $\sigma_K$ we first simplify the likelihood, keeping only factors involving $\sigma_K$. We therefore obtain

$$p(n^l_{1:T}; n_{1:T}, \sigma_K^2) \propto \int_{\mathbb{R}^{T+1}} \frac{1}{\sqrt{2\pi}\theta}\exp\Big({-\frac{L_0^2}{2\theta^2}}\Big)\prod_{t=1}^T \frac{1}{\sqrt{2\pi}\sigma_K}\exp\Big({-\frac{(L_t - L_{t-1})^2}{2\sigma_K^2}}\Big)\frac{1}{\sqrt{2\pi}\sigma_t}\exp\Big({-\frac{(L_t - L_t^*)^2}{2\sigma_t^2}}\Big)\, dL_{0:T}$$
$$\propto \sigma_K^{-T}\int_{\mathbb{R}^{T+1}} \exp\Big({-\frac{L_0^2}{2\theta^2} - \sum_{t=1}^T\Big[\frac{(L_t - L_{t-1})^2}{2\sigma_K^2} + \frac{(L_t - L_t^*)^2}{2\sigma_t^2}\Big]}\Big)\, dL_{0:T}$$
$$\propto \sigma_K^{-T}\int_{\mathbb{R}^{T+1}} \exp\Big({-\frac{1}{2}\Big[\frac{L_0^2}{\theta^2} + \sum_{t=1}^T\Big(\frac{L_t^2 - 2L_tL_{t-1} + L_{t-1}^2}{\sigma_K^2} + \frac{L_t^2 - 2L_tL_t^*}{\sigma_t^2}\Big)\Big]}\Big)\, dL_{0:T}$$
$$= \sigma_K^{-T}\int_{\mathbb{R}^{T+1}} \exp\Big({-\frac{1}{2}\Big[L_0^2(\theta^{-2} + \sigma_K^{-2}) - L_T^2\sigma_K^{-2} + \sum_{t=1}^T\big(L_t^2(2\sigma_K^{-2} + \sigma_t^{-2}) - 2\sigma_K^{-2}L_tL_{t-1} - 2\sigma_t^{-2}L_tL_t^*\big)\Big]}\Big)\, dL_{0:T}$$
$$= \sigma_K^{-T}\int_{\mathbb{R}^{T+1}} \exp\Big({-\frac{1}{2}L_{0:T}^\top\Sigma^{-1}L_{0:T} + v^\top L_{0:T}}\Big)\, dL_{0:T} \qquad \text{(C.22)}$$
where $\Sigma^{-1} \in \mathbb{R}^{(T+1)\times(T+1)}$ is a symmetric tri-diagonal positive semi-definite matrix with entries

$$\Sigma^{-1}_{0,0} = \theta^{-2} + \sigma_K^{-2}, \qquad \Sigma^{-1}_{t,t} = 2\sigma_K^{-2} + \sigma_t^{-2} \ \text{ for } 0 < t < T, \qquad \Sigma^{-1}_{T,T} = \sigma_K^{-2} + \sigma_T^{-2}, \qquad \Sigma^{-1}_{t,t+1} = \Sigma^{-1}_{t+1,t} = -\sigma_K^{-2} \ \text{ for } 0 \le t < T,$$

and $v \in \mathbb{R}^{T+1}$ is a vector with entries $v_0 = 0$ and $v_t = L_t^*\sigma_t^{-2}$ for $t = 1, \ldots, T$. We can integrate out $L_{0:T}$ by appropriately normalizing the Gaussian exponential in (C.22). In particular, we have
$$p(n^l_{1:T}; n_{1:T}, \sigma_K^2) \propto \sigma_K^{-T}\int_{\mathbb{R}^{T+1}} \exp\Big({-\frac{1}{2}(L_{0:T} - \Sigma v)^\top\Sigma^{-1}(L_{0:T} - \Sigma v) + \frac{1}{2}v^\top\Sigma v}\Big)\, dL_{0:T} \propto \sigma_K^{-T}\exp\Big(\frac{1}{2}v^\top\Sigma v\Big)\sqrt{|\Sigma|}. \qquad \text{(C.23)}$$
The expression in (C.23) can then be evaluated numerically for any value of $\sigma_K$. (Note that $\Sigma$, and therefore the $|\Sigma|$ term in (C.23), depends on $\sigma_K$, so an explicit solution for the MLE of $\sigma_K$ is unlikely to be available.)

Appendix D

Stable and scalable softmax optimization with double-sum formulations

D.1 Comparison of double-sum formulations

In Section 5.2.2 our double-sum formulation was compared to that of Raman et al. [2016]. It was noted that the formulations only differ by a reparameterization $\bar u_i = u_i + x_i^\top w_{y_i}$, and an intuitive argument was given as to why our formulation leads to smaller magnitude gradients. Here we flesh out that argument and also explore different reparameterizations. Let us introduce the set of parameterizations $v_i = \log\big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\big) + \alpha x_i^\top w_{y_i}$ where

$\alpha \in \mathbb{R}$. Our $u_i$ corresponds to $v_i$ with $\alpha = 0$ while the $\bar u_i$ of Raman et al. [2016] corresponds to $v_i$ with $\alpha = 1$. The question is, what is the optimal $\alpha$? The double-sum objective can be written in terms of $v$ as

$$f(v, W) = \frac{1}{N}\frac{1}{K-1}\sum_{i=1}^{N}\sum_{k\ne y_i} f_{ik}(v, W) \qquad \text{(D.1)}$$
where

$$f_{ik}(v, W) = N\Big(v_i - \alpha x_i^\top w_{y_i} + e^{\alpha x_i^\top w_{y_i} - v_i} + (K-1)e^{x_i^\top(w_k - (1-\alpha)w_{y_i}) - v_i}\Big)$$
and for notational simplicity we have set the ridge-regularization parameter $\mu = 0$. The stochastic gradients are

$$\nabla_{w_k} f_{ik} = N(K-1)e^{x_i^\top(w_k - (1-\alpha)w_{y_i}) - v_i}\, x_i$$
$$\nabla_{w_{y_i}} f_{ik} = N\Big({-\alpha} + \alpha e^{\alpha x_i^\top w_{y_i} - v_i} - (1-\alpha)(K-1)e^{x_i^\top(w_k - (1-\alpha)w_{y_i}) - v_i}\Big)\, x_i$$
$$\nabla_{v_i} f_{ik} = N\Big(1 - e^{\alpha x_i^\top w_{y_i} - v_i} - (K-1)e^{x_i^\top(w_k - (1-\alpha)w_{y_i}) - v_i}\Big).$$
If $v$ were at its optimum value $v_i^*(W) = \log\big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\big) + \alpha x_i^\top w_{y_i}$ then the gradients would always be bounded and hence have low variance. However if an SGD method is run on (D.1) then, at least early on, $v_i$ would typically differ from $v_i^*(W)$. We decompose this difference as follows. Let $\tilde W$ be the value of $W$ from the previous time $W$ was updated using datapoint $i$, which is also the last time that $v_i$ was updated. We denote the change in $w_j$ along direction $x_i$ as $\delta_j = x_i^\top(w_j - \tilde w_j)$ and decompose $v_i = \tilde u_i + \alpha x_i^\top\tilde w_{y_i} + \tilde\epsilon_i$ where $\tilde u_i = \log\big(1 + \sum_{k\ne y_i} e^{x_i^\top(\tilde w_k - \tilde w_{y_i})}\big)$ and $\tilde\epsilon_i = v_i - v_i^*(\tilde W)$. The gradients become

$$\nabla_{w_k} f_{ik} = N(K-1)\, e^{\delta_k - (1-\alpha)\delta_{y_i}}\cdot e^{x_i^\top(\tilde w_k - \tilde w_{y_i}) - \tilde u_i - \tilde\epsilon_i}\, x_i$$
$$\nabla_{w_{y_i}} f_{ik} = N\Big({-\alpha} + \alpha e^{\alpha\delta_{y_i}}\cdot e^{-\tilde u_i - \tilde\epsilon_i} - (1-\alpha)e^{\delta_k - (1-\alpha)\delta_{y_i}}(K-1)e^{x_i^\top(\tilde w_k - \tilde w_{y_i}) - \tilde u_i - \tilde\epsilon_i}\Big)\, x_i$$
$$\nabla_{v_i} f_{ik} = N\Big(1 - e^{\alpha\delta_{y_i}}\cdot e^{-\tilde u_i - \tilde\epsilon_i} - e^{\delta_k - (1-\alpha)\delta_{y_i}}(K-1)e^{x_i^\top(\tilde w_k - \tilde w_{y_i}) - \tilde u_i - \tilde\epsilon_i}\Big).$$
The goal is for the variance of these stochastic gradients to be as small as possible. This may be achieved by setting $\alpha$ to decrease the effect of the noise coming from $\delta$. Let us focus on $\nabla_{w_{y_i}} f_{ik}$ and for the purposes of analysis make the simplifying assumption that $\delta_k, \delta_{y_i}$ are independent random variables that are sufficiently small so that we can use the Taylor expansion $e^{a\delta_k + b\delta_{y_i}} \approx 1 + a\delta_k + b\delta_{y_i}$. The variance of the first term in $\nabla_{w_{y_i}} f_{ik}$ is 0,$^1$ while the variance of the second term is proportional to $\alpha^4\,\mathrm{Var}(\delta_{y_i})$ and that of the third term to $\big((1-\alpha)^2\,\mathrm{Var}(\delta_k) + (1-\alpha)^4\,\mathrm{Var}(\delta_{y_i})\big)\cdot(K-1)^2 e^{2x_i^\top(\tilde w_k - \tilde w_{y_i})}$. Clearly there is a tradeoff between the variance in the second and third terms: if $\alpha \approx 0$ then the second term will have small variance but the third term will have high variance, and if $\alpha \approx 1$ then it is the other way around. The tradeoff is weighted by the $(K-1)^2 e^{2x_i^\top(\tilde w_k - \tilde w_{y_i})}$ factor in the third term. If $x_i^\top\tilde w_{y_i} < x_i^\top\tilde w_k + \log(K-1)$, which will occur most often when the labels are noisy, then the variance of $\nabla_{w_{y_i}} f_{ik}$ will be lowest when $\alpha \approx 1$. However if the labels are not noisy and

$^1$We have dropped the common $e^{-\tilde u_i - \tilde\epsilon_i} x_i$ factor.

$x_i^\top\tilde w_{y_i} > x_i^\top\tilde w_k + \log(K-1)$ then it is best that $\alpha \approx 0$. A similar analysis applies for the variance of $\nabla_{w_k} f_{ik}$ and $\nabla_{v_i} f_{ik}$. The optimal value of $\alpha$ clearly depends on the noisiness of the data, with noisier data having a higher optimal value of $\alpha$. In our experiments we found that our formulation with $\alpha = 0$ allowed the U-max algorithm to converge to a log-loss that was on average 2 to 3 times lower than when Raman et al.'s formulation with $\alpha = 1$ was used (see Appendix D.7). A future line of work would be to develop methods to learn the optimal $\alpha$, perhaps dynamically per datapoint.

D.2 Proof of variable bounds and strong convexity

In this section we will establish that the optimal values of u and W are bounded. Next, we show that within these bounds the objective f from (5.6) is strongly convex and its gradients are bounded.

Lemma D.2.1 (Raman et al. [2016]). The optimal value of $W$ is bounded as $\|W^*\|_2^2 \le B_W^2$ where $B_W^2 = \frac{2}{\mu}N\log(K)$.

Proof.

$$-N\log(K) = L(0) \le L(W^*) \le -\frac{\mu}{2}\|W^*\|_2^2.$$

Rearranging gives the desired result.

Lemma D.2.2. The optimal value of $u_i$ is bounded as $0 \le u_i^* \le B_u$ where $B_u = \log(1 + (K-1)e^{2B_xB_W})$ and $B_x = \max_i\{\|x_i\|_2\}$.

Proof. For the upper bound

$$u_i^* = \log\Big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\Big) \le \log\Big(1 + \sum_{k\ne y_i} e^{\|x_i\|_2(\|w_k\|_2 + \|w_{y_i}\|_2)}\Big) \le \log\Big(1 + \sum_{k\ne y_i} e^{2B_xB_W}\Big) = \log\big(1 + (K-1)e^{2B_xB_W}\big).$$

For the lower bound

$$u_i^* = \log\Big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\Big) \ge \log(1) = 0.$$

Definition D.2.3. Let $\Theta := \{(u, W) : \|W\|_2^2 \le B_W^2,\ 0 \le u_i \le B_u\}$. By Lemmas D.2.1 and D.2.2 we have that $(u^*, W^*) \in \Theta$.

Lemma D.2.4. The function $f(u, W)$ is strongly convex with convexity constant greater than or equal to $\min\{\exp(-B_u), \mu\}$ for $(u, W) \in \Theta$.

Proof. Let us rewrite $f$ as

$$f(u, W) = \sum_{i=1}^N\Big(u_i + e^{-u_i} + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i}) - u_i}\Big) + \frac{\mu}{2}\|W\|_2^2 = \sum_{i=1}^N\Big(e_i^\top\psi + e^{-e_i^\top\psi} + \sum_{k\ne y_i} e^{b_{ik}^\top\psi}\Big) + \frac{\mu}{2}\|M\psi\|_2^2$$
where $\psi = (u^\top, w_1^\top, \ldots, w_K^\top)^\top \in \mathbb{R}^{N+KD}$ and $e_i$ is the $i$th canonical basis vector, with $b_{ik}$ and $M$ being appropriately defined$^2$. The Hessian of $f$ is

$$\nabla^2 f(\psi) = \sum_{i=1}^N\Big(e^{-e_i^\top\psi}\, e_ie_i^\top + \sum_{k\ne y_i} e^{b_{ik}^\top\psi}\, b_{ik}b_{ik}^\top\Big) + \mu\cdot\mathrm{diag}\{0_N, 1_{KD}\}$$
where $0_N$ is an $N$-dimensional vector of zeros and $1_{KD}$ is a $KD$-dimensional vector of ones. It follows that

$$\nabla^2 f(\psi) \succeq I\cdot\min\Big\{\min_{0\le u_i\le B_u}\{e^{-u_i}\},\ \mu\Big\} = I\cdot\min\{\exp(-B_u), \mu\} \succ 0.$$

$^2$Note that for notational convenience we have flattened $(u, W) \in \mathbb{R}^N \times \mathbb{R}^{K\times D}$ into the vector $\psi = (u^\top, w_1^\top, \ldots, w_K^\top)^\top \in \mathbb{R}^{N+KD}$.

Lemma D.2.5. The 2-norm of both the gradient of $f$ and each stochastic gradient $\nabla f_{ik}$ is bounded by
$$B_f = N\max\{1, e^{B_u} - 1\} + 2\big(Ne^{B_u}B_x + \mu\max_k\{\beta_k\}B_W\big)$$
for all $(u, W) \in \Theta$.

Proof. By Jensen's inequality

$$\max_{(u,W)\in\Theta}\|\nabla f(u, W)\|_2 = \max_{(u,W)\in\Theta}\|E_{ik}\nabla f_{ik}(u, W)\|_2 \le \max_{(u,W)\in\Theta} E_{ik}\|\nabla f_{ik}(u, W)\|_2 \le \max_{(u,W)\in\Theta}\max_{ik}\|\nabla f_{ik}(u, W)\|_2.$$

Using the results from Lemmas D.2.1 and D.2.2 and the definition of fik from (5.6) we have that

$$\|\nabla_{u_i} f_{ik}(u, W)\|_2 = N\big|1 - e^{-u_i} - (K-1)e^{x_i^\top(w_k - w_{y_i}) - u_i}\big| = N\big|1 - e^{-u_i}\big(1 + (K-1)e^{x_i^\top(w_k - w_{y_i})}\big)\big|$$
$$\le N\max\Big\{1 - e^{-u_i}\big(1 + (K-1)e^{x_i^\top(w_k - w_{y_i})}\big),\ \big(1 + (K-1)e^{\|x_i\|_2(\|w_k\|_2 + \|w_{y_i}\|_2)}\big) - 1\Big\} \le N\max\{1, e^{B_u} - 1\}.$$

For $j$ indexing either the sampled class $k \ne y_i$ or the true label $y_i$ we have
$$\|\nabla_{w_j} f_{ik}(u, W)\|_2 = \big\|{\pm}N(K-1)e^{x_i^\top(w_k - w_{y_i}) - u_i}\, x_i + \mu\beta_j w_j\big\|_2 \le N(K-1)e^{\|x_i\|_2(\|w_k\|_2 + \|w_{y_i}\|_2)}\|x_i\|_2 + \mu\beta_j\|w_j\|_2 \le Ne^{B_u}B_x + \mu\max_k\{\beta_k\}B_W.$$

Letting

$$B_f = N\max\{1, e^{B_u} - 1\} + 2\big(Ne^{B_u}B_x + \mu\max_k\{\beta_k\}B_W\big)$$
we have

$$\|\nabla f_{ik}(u, W)\|_2 \le \|\nabla_{u_i} f_{ik}(u, W)\|_2 + \|\nabla_{w_k} f_{ik}(u, W)\|_2 + \|\nabla_{w_{y_i}} f_{ik}(u, W)\|_2 \le B_f.$$

In conclusion:

$$\max_{(u,W)\in\Theta}\|\nabla f(u, W)\|_2 \le \max_{(u,W)\in\Theta}\max_{ik}\|\nabla f_{ik}(u, W)\|_2 \le B_f.$$

D.3 Proof of convergence of U-max method

In this section we will prove the claim made in Proposition 6, that U-max converges to the softmax optimum. We need the following lemma to establish the final result.

Lemma D.3.1. Fix $\delta > 0$ and suppose that $u_i \le \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big) - \delta$. Then setting $u_i = \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big)$ decreases $f(u, W)$ by at least $\delta^2/2$.

Proof. As in Lemma D.2.4, let $\theta = (u^\top, w_1^\top, \ldots, w_K^\top)^\top \in \mathbb{R}^{N+KD}$. Then increasing $u_i$ to $\log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big)$ is equivalent to setting $\theta = \theta + \Delta e_i$ where $e_i$ is the $i$th canonical basis vector and $\Delta = \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big) - u_i \ge \delta$. By a second order Taylor series expansion
$$f(\theta) - f(\theta + \Delta e_i) \ge -\nabla f(\theta + \Delta e_i)^\top e_i\Delta + \frac{\Delta^2}{2}\, e_i^\top\nabla^2 f(\theta + \lambda\Delta e_i)\, e_i \qquad \text{(D.2)}$$

for some $\lambda \in [0, 1]$. Since the optimal value of $u_i$ for a given value of $W$ is $u_i^*(W) = \log\big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\big) \ge \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big)$, we must have $\nabla f(\theta + \Delta e_i)^\top e_i \le 0$. From Lemma D.2.4 we also know that

$$e_i^\top\nabla^2 f(\theta + \lambda\Delta e_i)\, e_i = \exp\big({-(u_i + \lambda\Delta)}\big) + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i}) - (u_i + \lambda\Delta)} = \exp(-\lambda\Delta)\exp(-u_i)\Big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\Big)$$
$$= \exp(-\lambda\Delta)\exp\Big({-\Big(\log\Big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\Big) - \Delta\Big)}\Big)\Big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\Big)$$

since $u_i = \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big) - \Delta$. Continuing,
$$e_i^\top\nabla^2 f(\theta + \lambda\Delta e_i)\, e_i = \exp(\Delta - \lambda\Delta)\,\frac{1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}}{1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}} \ge \exp(\Delta - \lambda\Delta) \ge \exp(\Delta - \Delta) = 1.$$

Putting in bounds for the gradient and Hessian terms in (D.2),
$$f(\theta) - f(\theta + \Delta e_i) \ge -\nabla f(\theta + \Delta e_i)^\top e_i\Delta + \frac{\Delta^2}{2}\, e_i^\top\nabla^2 f(\theta + \lambda\Delta e_i)\, e_i \ge 0 + \frac{\Delta^2}{2} \ge \frac{\delta^2}{2}.$$

Now we are in a position to prove Proposition 6.

Proof of Proposition 6. Let $\theta^{(t)} = (u^{(t)}, W^{(t)}) \in \Theta$ denote the value of the $t$-th iterate. Let $P$ denote the projection of $\theta$ onto $\Theta$. Since $f$ is convex and $\Theta$ is a convex set containing the optimum, we have $f(P(\theta)) \le f(\theta)$ for any $\theta$. Finally let $\pi_i^{(\delta)}(\theta)$ denote the operation of setting $u_i = \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big)$ if $u_i \le \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big) - \delta$.

If datapoint index $i$ and class index $k \ne y_i$ are sampled for the stochastic gradient then
$$f(\theta^{(t+1)}) = f\big(P(\pi_i(\theta^{(t)}) - \eta_t\nabla f_{ik}(\pi_i(\theta^{(t)})))\big) \le f\big(\pi_i(\theta^{(t)}) - \eta_t\nabla f_{ik}(\pi_i(\theta^{(t)}))\big).$$
If $u_i \le \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big) - \delta$ then
$$f(\theta^{(t+1)}) \le f(\pi_i(\theta^{(t)})) + \max_{\theta\in\Theta}\|\eta_t\nabla f_{ik}(\pi_i(\theta))\|_2\,\max_{\theta\in\Theta}\|\nabla f(\theta)\|_2 \le f(\pi_i(\theta^{(t)})) + \eta_tB_f^2 \le f(\theta^{(t)}) - \delta^2/2 + \eta_tB_f^2$$
$$\le f\big(\theta^{(t)} - \eta_t\nabla f_{ik}(\theta^{(t)})\big) - \delta^2/2 + 2\eta_tB_f^2 \le f\big(\theta^{(t)} - \eta_t\nabla f_{ik}(\theta^{(t)})\big),$$

since $\eta_t \le \delta^2/(4B_f^2)$ by assumption. Alternatively, if $u_i \ge \log\big(1 + \sum_{j=1}^m e^{x_i^\top(w_{k_j} - w_{y_i})}\big) - \delta$ then $\theta^{(t)} = \pi_i(\theta^{(t)})$ and
$$f(\theta^{(t+1)}) \le f\big(\theta^{(t)} - \eta_t\nabla f_{ik}(\theta^{(t)})\big).$$

Either way $f(\theta^{(t+1)}) \le f\big(\theta^{(t)} - \eta_t\nabla f_{ik}(\theta^{(t)})\big)$. Taking expectations with respect to $i$, $k$,

$$E_{ik}[f(\theta^{(t+1)})] \le E_{ik}\big[f\big(\theta^{(t)} - \eta_t\nabla f_{ik}(\theta^{(t)})\big)\big].$$

This shows that U-max’s decrease in objective each iteration is at least as much as ESGD’s in expectation. From here the convergence rate of O(1/T ) for U-max can be easily proven using the fact that ESGD converges at a rate of O(1/T ) for strongly convex functions. See for example [Bottou et al., 2016, Theorem 4.7].

D.4 Stochastic Composition Optimization

Here we show that stochastic composition optimization has the same issue of large gradients as the double-sum formulation from (5.5) has. We can write the equation for $L(W)$ from (5.3) as
$$L(W) = -\sum_{i=1}^N \log\Big(1 + \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}\Big) = E_i[h_i(E_k[g_k(W)])]$$

where $i \sim \mathrm{unif}(\{1, \ldots, N\})$, $k \sim \mathrm{unif}(\{1, \ldots, K\})$, $h_i(v) \in \mathbb{R}$, $g_k(W) \in \mathbb{R}^N$,

$$h_i(v) = -N\log(1 + e_i^\top v), \qquad [g_k(W)]_i = \begin{cases} Ke^{x_i^\top(w_k - w_{y_i})} & \text{if } k \ne y_i \\ 0 & \text{otherwise,} \end{cases}$$

and we set $\mu = 0$ for notational simplicity. In stochastic composition optimization $e_i^\top v = v_i \in \mathbb{R}$ is a variable that is explicitly kept track of with $v_i \approx E_k[g_k(W)]_i = \sum_{k\ne y_i} e^{x_i^\top(w_k - w_{y_i})}$. Clearly $v_i$ in stochastic composition optimization has a similar role as $u_i$ has in our formulation for $f$ in (5.5). Their optimal values are related through $u_i^* = \log(1 + v_i^*)$.

If i and k = yi are sampled in stochastic composition optimization then the updates are of the 6 form

x>(z −z ) e i k yi wyi = wyi + ηtNK xi 1 + vi x>(z −z ) e i k yi wk = wk ηtNK xi, − 1 + vi where zk is a smoothed value of wk [Wang et al., 2016]. These updates have the same numerical x>z instability issues as ESGD has with f from (5.5): we require that 0 e i k 1 for the gradients ≤ 1+vi ≤ x>z to be bounded, but it is possible (and practically not unlikely) that e i k 1. 1+vi 

D.5 Proof of general ISGD gradient bound

Proof of Proposition 2. Let f(θ, ξ) be m-strongly convex in θ for all ξ. The ESGD step size (t) th is ηt f(θ , ξt) 2 where ηt is the learning rate for the t iteration. The ISGD step size is k∇ k (t+1) (t+1) (t+1) (t) (t+1) (t+1) ηt f(θ , ξt) 2 where θ satisfies θ = θ ηt f(θ , ξt). Rearranging, f(θ , ξt) = k∇ k − ∇ ∇ (t) (t+1) (t+1) > (t) (t+1) (t+1) (t) (θ θ )/ηt and so it must be the case that f(θ , ξt) (θ θ ) = f(θ , ξt) 2 θ − ∇ − k∇ k k − (t+1) θ 2. k Our desired result follows:

(t) > (t) (t+1) (t) f(θ , ξt) (θ θ ) f(θ , ξt) 2 ∇ (t) (t+1)− k∇ k ≥ θ θ 2 (kt+1) − > (t)k (t+1) (t) (t+1) 2 f(θ , ξt) (θ θ ) + m θ θ 2 ∇ (t−) (t+1) k − k ≥ θ θ 2 (t+1) k (t) − (t+1)k (t) (t+1) 2 f(θ , ξt) 2 θ θ 2 + m θ θ 2 = k∇ k k (t−) (t+1)k k − k θ θ 2 k − k (t+1) (t) (t+1) = f(θ , ξt) 2 + m θ θ 2 k∇ k k − k where the first inequality is by Cauchy-Schwarz and the second inequality by strong convexity.

D.6 Update equations for ISGD

In this section we will derive the updates for ISGD. We will first consider the simplest case where only one datapoint and a single class is sampled in each iteration. Later we will derive the up- APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 162 dates for when multiple classes are sampled, and finally when both multiple classes and multiple datapoints are sampled.

D.6.1 Single datapoint, single class

From (5.6) it follows that the stochastic gradient corresponding to a single datapoint and single sampled class is given by

> −ui x (w −wy )−ui µ 2 2 fik(u, W ) = N(ui + e + (K 1)e i k i ) + (βy wy + βk wk ). − 2 i k i k2 k k2

The ISGD update corresponds to computing the optimal solution of

2 ˜ 2 min 2ηfik(u, W ) + u u˜ 2 + W W 2 , u,W k − k k − k n o where η is the learning rate and the tilde refers to the value of the old iterate [Toulis et al., 2016, Eq.

6]. Since fik is only a function of ui, wk, wy , it follows that the optimal wj =w ˜j for all j / k, yi i ∈ { } and uj =u ˜j for all j = i. Thus the optimization problem reduces to 6

2 2 2 min 2ηfik(ui, wk, wyi ) + (ui u˜i) + wyi w˜yi 2 + wk w˜k 2 ui,wk,wyi − k − k k − k

 > −ui xi (wk−wyi )−ui 2 2 = min 2ηN ui + e + (K 1)e + ηµ(βyi wyi 2 + βk wk 2) ui,wk,wyi − k k k k    2 2 2 + (ui u˜i) + wy w˜y + wk w˜k . (D.3) − k i − i k2 k − k2 

Solving for wk, wyi with auxiliary variable b

x>(w −w )−u Much of the difficulty in optimizing this equation comes from the interaction between the e i k yi i term and the 2 terms. To isolate this interaction we introduce an auxiliary variable b = k · k2 > x (wk wy ) and rewrite the optimization problem as i − i

−u b−u 2 min 2ηN ui + e i + (K 1)e i + (ui u˜i) (D.4) ui,b − −    2 2 2 2 > + min ηµ(βyi wyi 2 + βk wk 2) + wyi w˜yi 2 + wk w˜k 2 : b = xi (wk wyi ) . wk,wyi k k k k k − k k − k − n o  APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 163

The inner optimization problem over wk, wyi is a quadratic program with a linear constraint. Taking the dual and solving yields

> w˜k w˜yi xi 1+ηµβ 1+ηµβ b w˜k k − yi − wk = γi xi 1 + ηµβk −  1 + ηµβk  w˜ x> w˜k yi b w˜y i 1+ηµβk 1+ηµβy w = i + γ − i − x (D.5) yi i   i 1 + ηµβyi 1 + ηµβyi where 1 γi = 2 −1 −1 . (D.6) xi ((1 + ηµβk) + (1 + ηµβy ) ) k k2 i

Substituting in the solution for wk, wyi and dropping constant terms, the optimization problem from (D.4) reduces to

2 −ui b−ui 2 > w˜k w˜yi min 2ηN(ui +e +(K 1)e )+(ui u˜i) + b xi γi . (D.7) ui,b − − − 1 + ηµβ − 1 + ηµβ    k yi  

We’ll approach this optimization problem by first solving for b as a function of ui and then optimizing over ui. Once the optimal value of ui has been found, we can calculate the corresponding optimal value of b. Finally, substituting b into (D.5) will give us our updated value of W .

Solving for b

We solve for b by setting its derivative equal to zero in (D.7),

b−u > w˜k w˜yi 0 = 2ηN(K 1)e i + 2 b x γi. − − i 1 + ηµβ − 1 + ηµβ   k yi  Letting w˜ w˜y a = x> k i b (D.8) i 1 + ηµβ − 1 + ηµβ −  k yi  and using simple algebra yields

 ˜  > w˜k wyi xi 1+ − 1+ −ui aea = ηN(K 1)γ−1e ηµβk ηµβyi . (D.9) − i

The solution for a can be written in terms of the principle branch of the Lambert-W function, P ,

 ˜  > w˜k wyi xi 1+ − 1+ −ui −1 ηµβk ηµβyi a(ui) = P ηN(K 1)γi e 0. (D.10) − ! ≥

> w˜k w˜yi The optimal value of b given ui is therefore b(ui) = xi 1+ηµβ 1+ηµβ a(ui). k − yi −   APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 164

Bisection method for ui

> w˜k w˜yi Substituting b(ui) = xi 1+ηµβ 1+ηµβ a(ui) into (D.7), we now only need minimize over ui: k − yi −    ˜  > w˜k wyi x − −a(ui)−ui −u i 1+ηµβk 1+ηµβy 2 2 min 2ηN(ui + e i + (K 1)e i ) + (ui u˜i) + a(ui) γi (D.11) ui − −   −u 2 2 = min 2ηNui + 2ηNe i + 2γia(ui) + (ui u˜i) + a(ui) γi ui −   −u 2 = min 2ηNui + 2ηNe i + (ui u˜i) + γia(ui)(2 + a(ui)) (D.12) ui −   where we used that e−P (z) = P (z)/z to derive the equality

 ˜  > w˜k wyi x − −a(ui)−ui i 1+ηµβk 1+ηµβy ηN(K 1)e i = γia(ui). −

The derivative in (D.12) with respect to ui is

−u 2 ∂u 2ηNui + 2ηNe i + (ui u˜i) + γia(ui)(2 + a(ui)) i −   −u = 2ηN 2ηNe i + 2(ui u˜i) + 2γi(1 + a(ui))∂u a(ui) − − i −u = 2ηN 2ηNe i + 2(ui u˜i) 2γia(ui) (D.13) − − −

P (z) a(ui) where we used the fact that ∂zP (z) = to work out that ∂u a(ui) = . z(1+P (z)) i − 1+a(ui) We can solve for ui using a bisection method. Below we show how to calculate the initial lower and upper bounds of the bisection interval and prove that the size of the interval is bounded (which ensures fast convergence). The initial lower and upper bounds we use depends on the derivative 0 in (D.13) at ui =u ˜i. In deriving the bounds we will use ui to denote the optimal value of ui and a0 to denote the optimal value of a in the ISGD update.

0 Case: ui > u˜i

0 If the derivative in (D.13) at ui =u ˜i is negative then ui must be lower bounded byu ˜i. An upper 0 bound on ui can be derived from (D.11):

 ˜  > w˜k wyi 0 xi 1+ − 1+ −a −ui 0 −ui ηµβk ηµβyi 2 ui = argmin 2ηN(ui + e + (K 1)e ) + (ui u˜i) u − − i    ˜  > w˜k wyi x − −ui −u i 1+ηµβk 1+ηµβy 2 argmin 2ηN(ui + e i + (K 1)e i ) + (ui u˜i) (D.14) ≤ u − − i    ˜ w˜  x> wk − yi ηN−u˜ i 1+ηµβk 1+ηµβy =u ˜i + P (ηNe i (1 + (K 1)e i )) ηN − − APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 165

0 where in the inequality we set a = 0, since the minimal value of ui is monotonically decreasing in a0 and a0 0. This upper bound on u0 should be used in the bisection method. However for ease ≥ i 2 of analysis we can weaken the bound by removing the (ui u˜i) term in (D.14) −  ˜  > w˜k wyi xi 1+ − 1+ 0 −ui ηµβk ηµβyi −ui ui argmin 2ηN(ui + e + (K 1)e e ) ≤ u − i    ˜  > w˜k wyi xi 1+ − 1+ = log(1 + (K 1)e ηµβk ηµβyi ) −  ˜  > w˜k wyi xi 1+ − 1+ log(2 max 1, (K 1)e ηµβk ηµβyi ) ≤ { − }  ˜  > w˜k wyi xi 1+ − 1+ = max log(2), log(2(K 1)e ηµβk ηµβyi ) { − } w˜ w˜y max log(2), log(2K) + x> k i ≤ { i 1 + ηµβ − 1 + ηµβ }  k yi  w˜ w˜y log(2K) + max 0, x> k i . ≤ { i 1 + ηµβ − 1 + ηµβ }  k yi 

0 > w˜k w˜yi Thusu ˜i ui log(2K) + max 0, xi 1+ηµβ 1+ηµβ . The length of the interval between the ≤ ≤ { k − yi } upper and lower bound is upper bounded as: 

> w˜k w˜yi log(2K) + max 0, xi ( u˜i { 1 + ηµβk − 1 + ηµβyi } −

> w˜k w˜yi = log(2K) + max 0 u˜i, x u˜i { − i 1 + ηµβ − 1 + ηµβ − }  k yi  > w˜k w˜yi log(2K) + max 0, x u˜i ≤ { | i 1 + ηµβ − 1 + ηµβ − |}  k yi  > w˜k w˜yi = log(2K) + x u˜i | i 1 + ηµβ − 1 + ηµβ − |  k yi  where we used the fact thatu ˜i 0. ≥ APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 166

0 Case: ui < u˜i

0 Now let us consider if the derivative in (D.13) is positive at ui =u ˜i. Then ui is upper bounded by 0 u˜i. We can lower bound ui by:

 ˜  > w˜k wyi xi 1+ − 1+ 0 0 −ui ηµβk ηµβyi −a −ui 2 ui = argmin 2ηN(ui + e + (K 1)e e ) + (ui u˜i) (D.15) u − − i    ˜  > w˜k wyi x − 0 −u i 1+ηµβk 1+ηµβy −a −u argmin ui + e i + (K 1)e i e i ≥ u − i   w˜ w˜y = log(1 + (K 1) exp(x> k i a0)) − i 1 + ηµβ − 1 + ηµβ −  k yi  w˜ w˜y log((K 1) exp(x> k i a0)) ≥ − i 1 + ηµβ − 1 + ηµβ −  k yi  w˜ w˜y = log(K 1) + x> k i a0 (D.16) − i 1 + ηµβ − 1 + ηµβ −  k yi  2 where the first inequality comes dropping the (ui u˜i) term and the second inequality is from the − monotonicity of the log function. Recall from (D.9) that

 ˜  > w˜k wyi 0 0 xi 1+ − 1+ −ui a0ea = ηN(K 1)γ−1e ηµβk ηµβyi . − i

We can upper bound a0 using the lower bound on u0:

0 0 a0 = e−a a0ea

 ˜  > w˜k wyi 0 0 xi 1+ − 1+ −ui = e−a ηN(K 1)γ−1e ηµβk ηµβyi − i  ˜   ˜  > w˜k wyi > w˜k wyi 0 0 xi 1+ − 1+ −(log(K−1)+xi 1+ − 1+ −a ) e−a ηN(K 1)γ−1e ηµβk ηµβyi ηµβk ηµβyi ≤ − i −1 = ηNγi . (D.17)

0 0 Substituting this upper bound for a back into (D.15) and solving yields a lower bound on ui,

 ˜ w˜  x> wk − yi −ηNγ−1 0 ηN−u˜ i 1+ηµβk 1+ηµβy i u u˜i + P (ηNe i (1 + (K 1)e i )) ηN. i ≥ − −

Again this bound should be used in the bisection method, but for ease of analysis we can weaken the bound by instead substituting the bound for a0 into (D.16) which yields:

w˜ w˜y u0 log(K 1) + x> k i ηNγ−1. i ≥ − i 1 + ηµβ − 1 + ηµβ − i  k yi  APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 167

> w˜k w˜yi −1 0 Thus log(K 1) + xi 1+ηµβ 1+ηµβ ηNγi ui u˜i. The size of the bisection method − k − yi − ≤ ≤ w˜  > w˜k yi −1 interval is upper bounded byu ˜i xi 1+ηµβ 1+ηµβ + ηNγi log(K 1). − k − yi − − In summary, for both signs of the derivative in (D.13) at ui =u ˜i we are able to lower and upper bound the optimal value of ui such that interval between the bounds is at most u˜i | − > w˜k w˜yi −1 xi 1+ηµβ 1+ηµβ + ηNγi + log(2K). This allows us to perform the bisection method k − yi | w˜   −1 > w˜k yi where, for  > 0 level accuracy, we require only log2( ) + log2( u˜i xi 1+ηµβ 1+ηµβ + | − k − yi | −1   ηNγi + log(2K)) function evaluations. In practice we use Brent’s method as the optimization routine, which is faster than the simple bisection method.

D.6.2 Bound on step size

Here we will prove that the step size magnitude of ISGD with a single datapoint and sampled class

> w˜k w˜yi with respect to w is bounded as O(xi 1+ηµβ 1+ηµβ u˜i). We will do so by considering the k − yi − 0 0  0  two cases ui > u˜i and ui < u˜i separately, where ui denotes the optimal value of ui in the ISGD update andu ˜i is its value at the previous iterate.

0 Case: ui > u˜i

Let a0 denote the optimal value of a in the ISGD update. From (D.10)

 ˜  > w˜k wyi 0 xi 1+ − 1+ −ui 0 −1 ηµβk ηµβyi a = P ηN(K 1)γi e − !  ˜  > w˜k wyi xi 1+ − 1+ −u˜i −1 ηµβk ηµβyi P ηN(K 1)γi e ≤ − !

0 where ui is replace byu ˜i and we have used the monotonicity of the Lambert-W function P . Now using the fact that P (z) = O(log(z)),

0 > w˜k w˜yi −1 a = O(x u˜i + log(ηN(K 1)γ )) i 1 + ηµβ − 1 + ηµβ − − i  k yi  > w˜k w˜yi = O(x u˜i). i 1 + ηµβ − 1 + ηµβ −  k yi 

0 Case: ui < u˜i

0 0 0 −1 If u < u˜i then we can lower bound a from (D.17) as a ηNγ . i ≤ i APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 168

Combining cases

Putting together the two cases,

0 > w˜k w˜yi −1 a = O(max x u˜i, ηNγ ) { i 1 + ηµβ − 1 + ηµβ − i }  k yi  > w˜k w˜yi = O(x u˜i). i 1 + ηµβ − 1 + ηµβ −  k yi  From (D.5) and (D.8) we know that the step size magnitude is proportional to a0. Thus the step

> w˜k w˜yi size magnitude is also O(xi 1+ηµβ 1+ηµβ u˜i). k − yi −   D.6.3 Single datapoint, multiple classes.

m Consider the case where only one datapoint i, but multiple classes kj : ky = yi are sampled { 6 }j=1 each iteration. Like in Appendix D.6.1, we will be able to reduce the implicit update to a one- dimensional strongly-convex optimization problem. The resulting problem may be solved using any standard convex optimization method, such as Newton’s method. We do not derive upper and lower bounds for a bisection method as we did in Appendix D.6.1.

Let us rewrite the double-sum formulation from (5.5) as f(u, W ) = Ei,Ci [fi,Ci (u, W )] where i is a uniformly sampled datapoint, Ci is a set of m uniformly sampled classes from 1, ..., K yi { } − { } (without replacement). Here

> −ui x (w −wy )−ui µ 2 fi,C (u, W ) = N(ui + e + α e i k i ) + βk wk , i 2 k k2 k∈C Xi k∈CXi∪{yi} where

−1 β = P (k Ci yi ) k ∈ ∪ { }

= P (k = yi) + P (k Ci k = yi)P (k = yi) ∈ | 6 6 −1 nk + α (N nk) = − , N −1 m 1 nk = i : yi = k, i = 1, ..., N and α = P (k Ci k = yi) = 1 (1 ). Following the |{ }| ∈ | 6 − j=1 − K−j same style derivation as in Appendix D.6.1, the ISGD update problemQ is

> µ −ui x (wk−wy )−ui 2 min 2η N(ui + e + α e i i ) + βk wk 2 u , {w }  2 k k  i k k∈Ci∪{yi} k∈C Xi k∈CXi∪{yi}  2 2  + (ui u˜i) + wk w˜k . (D.18) − k − k2 k∈CXi∪{yi} APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 169

The goal is to simplify this multi-variate minimization problem into a one-dimensional strongly > convex minimization problem. The first trick we will use is to reparametrize ui = vi x wy for − i i x>(w −w )−u x>w −v some vi R. This changes the e i k yi i term to e i k i , decoupling wk and wy , which will ∈ i make the optimization easier. The problem becomes:

> > µ > xi wyi −vi xi wk−vi 2 min 2η N(vi xi wyi + e + α e ) + βk wk 2 v , {w }  − 2 k k  i k k∈Ci∪{yi} k∈C Xi k∈CXi∪{yi}  > 2 2  + (vi x wy u˜i) + wk w˜k . (D.19) − i i − k − k2 k∈CXi∪{yi} > Since vi = ui +xi wyi is a linear transformation and (D.18) is strongly convex, (D.19) is also jointly strongly convex in vi and wk . Bringing the w minimizations inside yields { }k∈Ci∪{yi}

min 2ηNvi vi > > xi wyi −vi 2 > 2 2 + min 2ηNxi wyi + 2ηNe + ηµβyi wyi 2 + (vi xi wyi u˜i) + wyi w˜yi 2 wyi − k k − − k − k n > o x wk−vi 2 2 + min 2ηNαe i + ηµβk wk 2 + wk w˜k 2 . (D.20) wk k k k − k k∈C Xi n o In Appendix D.6.1 we were able to reduce the dimensionality of the problem by introducing an auxiliary variable b to separate the exponential terms from the norm terms. We will do a similar thing here. Let us first focus on the inner minimization for k Ci. ∈

> x wk−vi 2 2 min 2ηNαe i + ηµβk wk 2 + wk w˜k 2 wk k k k − k n o b−vi 2 2 > = min 2ηNαe + min ηµβk wk 2 + wk w˜k 2 : b = xi wk b wk k k k − k  n o b−vi 2 2 > = min 2ηNαe + max min ηµβk wk 2 + wk w˜k 2 + 2λ(b xi wk) b λ∈R wk k k k − k −  n o where we have taken the Lagrangian in the final line. The solution for wk in terms of λ is

w˜k λ wk = + xi. 1 + ηµβk 1 + ηµβk

w˜k xi Thus the optimal wk must satisfy wk = 1+ηµβ ak kx k2 for some ak R. It can similarly be k − i 2 ∈ w˜yi xi shown that wyi = 1+ηµβ + ayi kx k2 for some ayi R. Substituting this into (D.20) and dropping yi i 2 ∈ APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 170 constant terms yields

> xi w˜yi 2 min 2vi ηN u˜i + vi vi − 1 + ηµβ −  yi  > ˜ xi wyi > a 1+ −vi xi w˜yi yi ηµβyi + min 2e ηNe + 2ayi vi ηN + +u ˜i ay − − 1 + ηµβ i  !  yi  2 −2 + a (1 + xi (1 + ηµβy )) yi k k2 i  x>w˜ i k −v −ak 1+ηµβ i 2 −2 + min 2e ηNαe k + ak( xi (1 + ηµβk)) . (D.21) ak k k k∈C ( ! ) Xi Using the same techniques as in Appendix D.6.1 we can analytically solve for the a values:

2 > 2 2 ηN xi 2 xi w˜yi xi 2/(1 + ηµβyi ) + (vi u˜i) xi 2 ayi (vi) = k k − k k 2 − k k P (σ(vi)) 1 + ηµβy + xi − i k k2 x>w˜ i k −v η 2 1+ηµβ i ak(vi) = P xi 2Nαe k . 1 + ηµβk k k !

2 > 2 ηNkxik2 xi w˜yi −vi(1+ηµβyi )+(ηN−u˜i)kxik2 where σ(vi) = 2 exp 2 . Substituting the solutions for a 1+ηµβyi +kxik2 1+ηµβyi +kxik2 into (D.21) yields  

> xi w˜yi 2 min 2vi 1 + ηN u˜i) + vi vi − 1 + ηµβ −  yi  > xi w˜yi −2 2ay (vi) vi + ηN u˜i + 1 + xi (1 + ηµβy ) − i − 1 + ηµβ − k k2 i  yi  2 −2 + ay (vi) (1 + xi (1 + ηµβy )) i k k2 i −2 + ak(vi)(1 + ak(vi)) xi (1 + ηµβk). (D.22) k k k∈C Xi

This is a one-dimensional strongly convex minimization problem in vi. The optimal vi can be solved for using any standard convex optimization method, such as Newton’s method. Each iteration in 2 such a method will take O(m) since it is necessary to calculate ak(vi), ∂vi ak(vi) and ∂vi ak(vi) for all k Ci yi . The first derivatives are easily calculated, ∈ ∪ { } 2 xi 2 1 + ηµβyi P (σ(vi)) ∂vi ayi (vi) = k k 2 + 2 1 + ηµβy + xi 1 + ηµβy + xi 1 + P (σ(vi)) i k k2 i k k2 ak(vi) ∂vi ak(vi) = , −1 + ak(vi) APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 171 as are the second derivatives,

2 2 1 + ηµβyi P (σ(vi)) ∂ ay (vi) = vi i − 1 + ηµβ + x 2 (1 + P (σ(v )))3  yi i 2  i 2 k k 2 ak(vi) ∂vi ak(vi) = 3 . (1 + ak(vi))

First or second order methods can therefore easily be implemented to optimize vi.

D.6.4 Multiple datapoints, multiple classes

Consider the case where n datapoints and m classes are sampled each iteration. Using similar methods to Appendix D.6.3, we will reduce the implicit update to an n dimensional strongly convex optimization problem.

Let us rewrite the double-sum formulation from (5.5) as f(u, W ) = EI,C [fI,C (u, W )] where I is a set of n datapoints uniformly sampled from 1, ..., N (without replacement), C is a set of m uniformly sampled classes from 1, ..., K (without replacement). The sampled function is of the form

> −ui x (w −wy )−ui µ 2 fI,C (u, W ) = αn(ui + e ) + αm I[k = yi]e i k i + βk wk , 6 2 k k2 i∈I k∈C ! X X k∈C∪Xi∈I {yi} where −1 n−1 −1 1 αn = P (i I) = 1 (1 ) ∈  − − N j  j=0 Y − −1 −1  −1 αm = P (i I, k C) = P (i I) P (k C) ∈ ∈ ∈ ∈ −1 βk = P (k C i∈I yi ) = P (k C) + P (k i∈I yi ) ∈ ∪ { } ∈ ∈ ∪ { }  −1 P (k C)P (k i∈I yi ) − ∈ ∈ ∪ { }  m−1 1 P (k C)−1 = 1 (1 ) ∈ − − K j j=0 Y − n−1 i : yi = k P (k i∈I yi ) = 1 1 |{ }| . ∈ ∪ { } − − N j j=0 Y  −  APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 172

It will be useful to group the classes that appear in i∈I yi and those that only appear in C: ∪ { }

fI,C (u, W )

> −ui x (w −wy )−ui µ 2 = I[k = yi]αn(ui + e ) + I[k = yi, k C]αme i k i + βk wk 6 ∈ 2 k k2 i∈I k∈∪Xi∈I {yi} X   > x (w −wy )−ui µ 2 + αme i k i + βk wk . 2 k k2 i∈I k∈C−∪Xi∈I {yi} X The ISGD update is

> −u x (w −wy )−u min 2η I[k = yi]αn(ui + e i ) + I[k = yi, k C]αme i k i i {ui}i∈I 6 ∈ {w }  k∈∪i∈I {yi} i∈I k k∈C∪i∈I {yi} X X   µ 2 + βk wk 2 k k2 > x (w −wy )−ui µ 2 + αme i k i + βk wk 2 k k2 i∈I k∈C−∪Xi∈I {yi} X  2 2 + (ui u˜i) + wk w˜k . − k − k2 i∈I X k∈C∪Xi∈I {yi} > Like in Appendix D.6.3, the first step to simplifying this equation is to reparametrize ui = vi x wy − i i for some vi R and to bring the wk minimizations inside: ∈

min 2ηαnvi {v } i i∈I i∈I X > > x wk−vi > 2 + min I[k = yi] 2ηαn( xi wk + e i ) + (vi xi wk u˜i) wk − − − i∈I k∈∪Xi∈I {yi}  X    > x w −vi 2 2 + I[k = yi, k C]2ηαme i k + ηµβk wk + wk w˜k 6 ∈ k k2 k − k2   > x wk−vi 2 2 + min 2ηαme i + ηµβk wk 2 + wk w˜k 2 . (D.23) wk k k k − k i∈I k∈C−∪Xi∈I {yi}  X  As done in Appendix D.6.3, the inner minimizations can be solved analytically by introducing > constrained auxiliary variables bki = xi wk and optimizing the dual. We’ll do this separately for k i∈I yi and k C i∈I yi . ∈ ∪ { } ∈ − ∪ { } APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 173

Optimizing over datapoint labels k i∈I yi ∈ ∪ { }

> > x wk−vi > 2 min I[k = yi] 2ηαn( xi wk + e i ) + (vi xi wk u˜i) wk − − − i∈I X    > x w −vi 2 2 + I[k = yi, k C]2ηαme i k + ηµβk wk + wk w˜k 6 ∈ k k2 k − k2  b −vi 2 b −vi = min I[k = yi] 2ηαn( bki + e ki ) + (vi bki u˜i) + I[k = yi, k C]2ηαme ki bki − − − 6 ∈ i∈I X     2 2 > + min ηµβk wk 2 + wk w˜k 2 : bki = xi wk for all i I . wk k k k − k ∈ n o Focusing on the minimization over wk:

2 2 > min ηµβk wk 2 + wk w˜k 2 : bki = xi wk for all i I wk k k k − k ∈ n o 2 2 > = max min ηµβk wk 2 + wk w˜k 2 + 2 λki(bki xi wk). λki wk k k k − k − i∈I X The solution for wk in terms of λki is

w˜k + λkixi w = i∈I . k 1 + ηµβ P k Dropping constant terms, the dual becomes

2 > i∈I λkixi 2 xi w˜k max k k + 2 λki(bki ) λki − 1 + ηµβk − 1 + ηµβk P i∈I X > > > XI w˜k = max λk Qkλk + 2λk bk λki − − 1 + ηµβ  k  > > > XI w˜k −1 XI w˜k = bk Q bk − 1 + ηµβ k − 1 + ηµβ  k   k  > 2 XI w˜k = bk , − 1 + ηµβk Q−1 k

> > xi xj D×n −1 XI w˜k where Qk,ij = and XI = (xi)i∈I R and the optimal λ = Q bk . Now we 1+ηµβk ∈ k − 1+ηµβk can solve for bk,  

b −vi 2 b −vi min I[k = yi] 2ηαn( bki + e ki ) + (vi bki u˜i) + I[k = yi, k C]2ηαme ki bki − − − 6 ∈ i∈I X   > 2 XI w˜k + bk . − 1 + ηµβk Q−1 k

APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 174

n Setting to zero the derivative with respect to bk R and dividing by 2: ∈

b −vI 0 = I[k = yI ] ηαn( 1 + e k ) + bk +u ˜I vI ◦ − −   X>w˜ b −vI −1 I k + I[k = yI , k C] ηαme k + Q bk 6 ∈ ◦ k − 1 + ηµβ  k  b = diag(a)e k + Akbk hk (D.24) − where denotes the element-wise product, diag(a) is a diagonal matrix, 1 denotes the vectors of ◦ n all ones, vI = (vi)i∈I R , likewise foru ˜I and yI , and ∈

−vI −vI ak = I[k = yI ] ηαne + I[k = yI , k C] ηαme ◦ 6 ∈ ◦ −1 Ak = diag(I[k = yI ]) + Qk > −1 XI w˜k hk = I[k = yI ] (ηαn1 u˜I + vI ) + Qk . ◦ − 1 + ηµβk −1 −1 Multiplying (D.24) on the left by A , letting zk = A hk bk and multiplying on the right by k k − diag(ezk ) yields

z −1 A−1h zk e k = A (ak e k k ). ◦ k ◦

The solution for zk is

−1 A−1h zk = P (A (ak e k k )) k ◦ where P is the principle branch of the Lambert-W function applied component-wise. The corre- sponding solution for bk is

−1 −1 A−1h bk(vI ) = A hk P (A (ak e k k )) (D.25) k − k ◦ where bk is a function of vI since ak and hk are functions of vI .

For pure class labels k C i∈I yi ∈ − ∪ { } The procedure for pure class labels is nearly identical as for the datapoint labels. The optimal value of wk is

w˜k + λkixi w = i∈I k 1 + ηµβ P k > −1 XI w˜k where λ = Q bk and k − 1+ηµβk   > X>w˜ I k −v XI w˜k 1+ηµβ I bk(vI ) = P ηαnQke k . (D.26) 1 + ηµβk − ! APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 175

Final optimization problem

Substituting the optimal values of bk into (D.23) yields the final optimization problem

min 2ηαnvi (D.27) {v } i i∈I i∈I X b (vI )−vi 2 + I[k = yi] 2ηαn( bki(vI ) + e ki ) + (vi bki(vI ) u˜i) − − − i∈I k∈∪Xi∈I {yi} X   b (v )−v + I[k = yi, k C]2ηαme ki I i 6 ∈ > 2 XI w˜k + bk(vI ) − 1 + ηµβk Q−1 k  > ˜  > XI wk x w˜ −vI i k 1+ηµβk 2 −P ηαnQke  −vi X>w˜ 1+ηµβk I k −v 1+ηµβ I + 2ηαme i + P ηαnQke k . k∈C−∪ {y } i∈I ! Q−1 i∈I i k X X n where bk(vI ) is from (D.25). This is a strongly convex optimization problem in vI R . Using ∈ first order methods, it can be solved to  > 0 accuracy in O(log(−1)) iterations [Nesterov, 2013, Ch 2.1]. The cost of optimizing (D.27) using a first order method breaks down as follows. There is an > 3 3 O(nmD) cost of taking the xi w˜k inner products and an O(n ) cost of inverting all Qk matrices. These computations only have to be done once per ISGD update, as they are independent of vI . There are O(nm) terms in (D.27) and taking the gradient of each term is dominated by the O(n2) cost of matrix-vector multiplications. This brings the overall cost of optimizing (D.27) to be O(n3m log(−1) + nmD)

Note that we have assumed that Qk is invertible. If it is not, then a similar method to the one presented above can be developed where a basis of xi i∈I is used. { }

D.7 Comparison of double-sum formulations

In Section 5.2.2 and Appendix D.1 we claimed that our double-sum formulation lead to faster convergence than that of Raman et al. [2016]. Here we present numerical results confirming this. Figure D.1 illustrates the performance on the Eurlex dataset of U-max using the proposed double-sum in (5.6) compared to U-max using the double-sum of Raman et al. [2016] in (5.8).

3 1 > > −1 Note that Qk = X XI and so we only have to invert X XI once to calculate all Q matrices. 1+ηµβk I I k APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 176

Our proposed double-sum outperforms for all4 learning rates η = 100,±1,±2,±3,±4/N, with its 50th- epoch log-loss being 3.08 times lower on average. As a second point of comparison we also ran ISGD on the Bibtex dataset with Raman et al.’s double-sum formulation and learning rates η = 100,±1,±2,±3,±4/N. We found that its log-loss at the end of 50 epochs was on average 1.96 times higher than on our double-sum formulation.

Double-sum formulations

Learning rates 101 Log-likelihood

0 10 20 30 40 50 Epochs

Figure D.1: Log-loss of U-max on Eurlex for different learning rates with our proposed double-sum formulation and that of Raman et al. [2016].

4The learning rates η = 101,2,3,4/N are not displayed in the Figure D.1 for visualization purposes. They have similar behavior as η = 1.0/N. APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 177

D.8 Results over epochs

In the main part of the paper we included plots of the performance of each algorithm vs CPU time. Here we give the runtime of each method and plot the performance over epochs opposed to CPU time. In Table E.3 we give the time in seconds to run 50 epochs for each algorithm on each dataset. Plots with respect to epochs are included in Figure D.2.

Table D.1: Time in seconds taken to run 50 epochs. OVE/NCE/IS/ESGD/U-max with n = 1, m = 5 all have the same runtime. ISGD with n = 1, m = 1 is faster per iteration. The final column displays the ratio of OVE/.../U-max to ISGD for each dataset.

Data set ISGD OVE/NCE/IS/ESGD/U-max Ratio

MNIST 1283 2494 1.94 Bibtex 144 197 1.37 Delicious 287 325 1.13 Eurlex 427 903 2.12 AmazonCat 24392 42816 1.76 Wiki10 783 1223 1.56 WikiSmall 6407 8470 1.32

Average - - 1.60 APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 178

MNIST Bibtex Delicious

5.5 2.0 4

5.0 1.5 3 4.5

2 1.0 4.0

1 0.5 3.5

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

Eurlex AmazonCat Wiki10 Wiki-small

8 6 7 9.5 7 5 9.0 6 6 4 5 8.5 5 3 4 8.0

2 4 3 7.5 2 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

Figure D.2: The x-axis is the number of epochs and the y-axis is the training set log-loss from (5.2). APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 179

D.9 L1 penalty

The double-sum softmax ISGD update from (D.3) with an L1 penalty is

> −u x (w −wy )−u min 2ηN(ui + e i + (K 1)e i k i i ) ui,wk,wyi −

+ 2ηµ(βy wy 1 + βk wk 1) i k i k k k 2 2 2 + (ui u˜i) + wy w˜y + wk w˜k . − k i − i k2 k − k2

x>(w −w )−u This problem is hard to solve directly, due to the combination of e i k yi i and the L1 penalties. > To move x (wk wy ) ui from the exponent, we use the convex conjugate of the exponential: i − i −

ez = max az + a a log(a) a≥0 − to write the ISGD update as

−ui > max min 2ηN(ui + e + (K 1)[a(xi (wk wyi ) ui) + a a log(a)]) a≥0 ui,wk,wyi − − − −

+ 2ηµ(βy wy 1 + βk wk 1) i k i k k k 2 2 2 + (ui u˜i) + wy w˜y + wk w˜k − k i − i k2 k − k2 where we swapped the max and the min using Sion’s condition. Moving the minimizations inside

max 2ηN(K 1)(a a log(a)) (D.28) a≥0 − − −u 2 + min 2ηN(ui + e i (K 1)aui) + (ui u˜i) ui − − −  > 2 + min 2ηN(K 1)axi wyi + 2ηµβyi wyi 1 + wyi w˜yi 2 wyi − − k k k − k n o > 2 + min 2ηN(K 1)axi wk + 2ηµβk wk 1 + wk w˜k 2 . wk − k k k − k n o Note that this is a concave function of a, since a a log(a) is concave and the minimizations are − convex functions where a only appears as a linear term. We can solve for each minimization in closed form.

Minimize ui

−u 2 min 2ηN(ui + e i (K 1)aui) + (ui u˜i) ui − − − APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 180 has the solution

ηN−ηN(K−1)a−u˜ ui =u ˜i ηN + ηN(K 1)a + P (ηNe i ) − − where P is the principle branch of the Lambert-W function.

Minimize wyi and wk

The solutions for wyi and wk are

wy = σηµβ (w ˜y ηN(K 1)axi) i yi i − −

wk = σηµβ (w ˜k + ηN(K 1)axi) k − where σc(x) is the double-hinge function with threshold c applied element-wise,

x c if x > c  − σc(x) = x + c if x < c  −  0 otherwise.    Maximize a 

All that is left is to maximize a. The function (D.28) for a is easily differentiable except at its hinge points. The function has 4D hinge points (wk and wyi each have D components, each of which have two hinges). To take the gradient of a one must first determine which inter-hinge-point interval the current value of a lies in, which naively costs O(D). Since this must be done for each gradient calculation, this cost can become very expensive. Fortunately, the cost of optimizing a can be decreased via the following process:

1. Calculate the location of all 4D hinge points in time O(D).

2. Sort the hinge points in O(D log(D)).

3. Do bisection optimization on the hinge points in O(log(D)). That is, evaluate a subgradient of (D.28) with a at the outer hinges, then bisect inwards.

4. If the hinge point is the optimum, return it. Else optimize over the interval between the hinge points containing the optimum using Brent’s method. APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 181

The total work is O(D log(D)) for steps 1-4 and O(log(−1)) for step 5 where the optimal value of a is found to within error  > 0.

Comment on runtime

Compared to the L2 penalty, the L1 penalty increases the cost per iteration by a factor of log(D), from O(D) to O(D log(D)). If xi is s sparse then there are only 4s hinge points. This decreases − the runtime of L2 penalized ISGD to O(s) and L1 penalized ISGD to O(s log(s)). Note that in many datasets even though D is large, s is small (e.g. AmazonCat).

D.10 Word2vec

Following the same arguments as in Appendix D.6.1 we can write the word2vec double-sum formu- lation ISGD update as

> −u x (w −wy )−u min 2ηN(ui + e i + (K 1)e i k i i ) ui,xi,wk,wyi − 2 2 2 + ηµαi xi + ηµβk wk + ηµβy wy k k2 k k2 i k i k2 2 2 2 2 + (ui u˜i) + xi x˜i + wk w˜k + wy w˜y . − k − k2 k − k2 k i − i k2

> Like in Appendix D.6.1 we introduce an auxiliary variable z x (wk wy ) and the update becomes ≥ i − i

−u z−u 2 min min 2ηN(ui + e i + (K 1)e i ) + (ui u˜i) z ui − −   2 2 2 + min ηµαi xi 2 + ηµβk wk 2 + ηµβyi wyi 2) xi,wk,wyi k k k k k k

 2 2 2 > + xi x˜i + wk w˜k + wy w˜y : z x (wk wy ) k − k2 k − k2 k i − i k2 ≥ i − i 

First we will analytically minimize over ui, then over xi, wk, wyi , leaving just z to be optimized over numerically.

Minimization over ui

Let us first focus on

−u z−u 2 min 2ηN(ui + e i + (K 1)e i ) + (ui u˜i) . (D.29) ui − −  APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 182

This is a strongly convex optimization problem. Setting the derivative equal to zero yields

ui(z) =u ˜i ηN + v(z) − where v(z) = P (ηNeηN−u˜i (1 + (K 1)ez)) and P denotes the principle branch of the Lambert-W − function. The value of (D.29) at the minimal ui is

−u (z) z−u (z) 2 2 2ηN(ui(z) + e i + (K 1)e i ) + (ui(z) u˜i) = (v(z) + 1) + const. − −

Minimization over xi, wk, wyi

Now lets move onto minimizing over xi, wk, wyi . We do so by solving the dual.

2 2 2 min ηµαi xi 2 + ηµβk wk 2 + ηµβyi wyi 2) xi,wk,wyi k k k k k k  2 2 2 > + xi x˜i + wk w˜k + wy w˜y : z x (wk wy ) k − k2 k − k2 k i − i k2 ≥ i − i 2 2 2 max min ηµαi xi 2 + ηµβk wk 2 + ηµβyi wyi 2) λ≥0 xi,wk,wyi k k k k k k  2 2 2 > + xi x˜i + wk w˜k + wy w˜y + 2λ(x (wk wy ) z) k − k2 k − k2 k i − i k2 i − i − 2 2 max 2λz + min ηµαi xi 2 + xi x˜i 2 λ≥0 − xi k k k − k  2 2 > + min ηµβk wk 2 + wk w˜k 2 + 2λxi wk wk k k k − k  2 2 > + min ηµβyi wyi 2 + wyi w˜yi 2 2λxi wyi . wy k k k − k − i  

The optimal values for wk and wyi are

w˜k λxi wk(λ, z) = − 1 + ηµβk

w˜yi + λxi wyi (λ, z) = . 1 + ηµβyi

Substituting in these values, the optimization over xi becomes

2 > 2 > 2 2 2 xi 2 xi w˜k 2 xi 2 xi w˜yi min ηµαi xi 2 + xi x˜i 2 λ k k + 2λ λ k k 2λ xi k k k − k − 1 + ηµβ 1 + ηµβ − 1 + ηµβ − 1 + ηµβ  k k yi yi  2 2 2 2 > = min xi 2(ηµαi λ γ ) + xi x˜i 2 2λxi b (D.30) xi k k − k − k −   APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 183 where

1 1 γ = + s1 + ηµβk 1 + ηµβyi w˜y w˜ b = i k . 1 + ηµβyi − 1 + ηµβk

2 2 Note that for xi to have a minimal value it must be the case that 1+ηµαi λ γ > 0, or equivalently − −1 λ < γ √1 + ηµαi. The optimal xi satisfies

x˜i + λb xi(λ, z) = 2 2 . 1 + ηµαi λ γ − Substituting this value into (D.30) yields

2 2 2 2 2 > x˜i + λb 2 min xi 2(ηµαi λ γ ) + xi x˜i 2 2λxi b = k k 2 2 + const. xi k k − k − k − −1 + ηµαi λ γ   − Maximization over λ

The maximization problem over λ becomes

2 x˜i + λb 2 max√ 2λz k k 2 2 0≤λ<γ−1 1+ηµα − − 1 + ηµαi λ γ i − Setting the derivative with respect to λ equal to zero yields a quartic equation

0 =λ4(γ4z)

2 2 > 2 + λ (2γ x˜ b 2(1 + ηαi)γ z) i − 1 2 2 2 + λ ( b (1 + ηµαi) + γ x˜i ) k k2 k k2 2 > + ((1 + ηµαi) z + (1 + ηµαi)˜xi b), (D.31) for which the solutions can be found analytically. The minimal value of λ will be at one of the four roots or λ = 0 as imposed by the boundary condition. Denote the optimal value of λ as λ(z).

Minimization over z

The final optimization is over z:

2 2 x˜i + λ(z)b 2 min(v(z) + 1) 2λ(z)z k k 2 2 . z − − 1 + ηµαi λ(z) γ − APPENDIX D. STABLE AND SCALABLE SOFTMAX OPTIMIZATION WITH DOUBLE-SUM FORMULATIONS 184

This is potentially a non-convex optimization problem. We can find a root using a gradient descent method. To do so, we need to be able to take gradients. The derivative of v(z) is

v(z) (K 1)ez ∂zv(z) = − . 1 + v(z) 1 + (K 1)ez − To find the derivative of λ(z) we differentiate (D.31) by z which, after some rearranging, yields:

2 2 2 (1 + ηµαi λ(z) γ ) ∂zλ(z) = 2 > 2 2 − 2 2 2 2 . −4λ(z)γ x˜ b + b (1 + ηµαi) + γ x˜i 4λ(z)γ z(1 + ηαi γ λ(z) ) i k k2 k k2 − − This is the derivative unless λ = 0 for which 0 is a subgradient. The cost of calculating a gradient with respect to z is therefore the cost of evaluating the Lambert-W function once, solving a quartic equation and substituting in the results. APPENDIX E. IMPLICIT BACKPROPAGATION 185

Appendix E

Implicit Backpropagation

E.1 Derivation of generic update equations

In this section we will derive the generic IB update equations, starting from equation (6.5) and ending at equation (6.8). For notational simplicity we will drop superscripts and subscripts where ˜ (t) they are clear from the context. Let a tilde denote the current iterate, i.e. θk = θk . With this notation, the IB update from (6.5) becomes

> 2 ˜ 2 argmin 2ηb σ(θz) + ηµ θ 2 + θ θ 2 (E.1) θ k k k − k n o Dk+1 > 2 2 = argmin 2ηbjσ(θ z) + ηµ θj + θj θ˜j  j k k2 k − k2 θ j=1  X  th th where θj is the j row of θ corresponding to the j node in the layer. The minimization splits into separate minimization problems for each j:

> 2 ˜ 2 argmin 2ηbjσ(θj z) + ηµ θj 2 + θj θj 2 . (E.2) θj k k k − k n o 1+D Since θj R k this is a (1 + Dk)-dimensional problem. However, we will be able to reduce it ∈ > to just a one-dimensional problem. We begin by introducing an auxiliary variable qj = θj z and rewriting (E.2) as

2 ˜ 2 > min 2ηbjσ(qj) + min ηµ θj 2 + θj θj 2 : qj = θj z . (E.3) qj θj { k k k − k }   We will first solve the inner minimization over θj as a function of qj, and then solve the outer minimization over qj. APPENDIX E. IMPLICIT BACKPROPAGATION 186

Inner minimization

The inner minimization can be solved by taking the dual:

2 ˜ 2 > min ηµ θj 2 + θj θj 2 : qj = θj z θj { k k k − k } 2 ˜ 2 > = max min ηµ θj 2 + θj θj 2 + 2λj(qj θj z) λj ∈R θj { k k k − k − } 2 ˜ 2 > = max 2λjqj + min ηµ θj 2 + θj θj 2 2λjθj z . (E.4) λj ∈R θj { k k k − k − }   The solution for θj is θ˜j + λjz θj = . (E.5) 1 + ηµ Substituting (E.5) into (E.4) and simplifying yields

1 2 2 ˜> ˜ 2 min λj z 2 + 2λj(θj z (1 + ηµ)qj) ηµ θj 2 . − 1 + ηµ λj ∈R k k − − k k n o This is a quadratic in λj, which is easily minimized. The value for λj at the minimum is, ˜> (1 + ηµ)qj θj z λj = − . (E.6) z 2 k k2 Substituting (E.6) into (E.4) yields the minimal value of the inner minimization problem as a function of qj, ˜> 2 1 + ηµ θj z ηµ 2 qj + θ˜j . (E.7) z 2 − 1 + ηµ 1 + ηµk k2 k k2 ! Outer minimization

ηµ 2 Replacing the inner minimization with (E.7), dropping the constant θ˜j term and dividing 1+ηµ k k2 everything by 2η, (E.3) becomes

˜> 2 1 + ηµ θj z argmin bjσ(qj) + 2 qj . qj  2η z 2 − 1 + ηµ!   k k  Reparameterizing qj as   ˜> 1 θj z αj = qj , (E.8) η z 2 1 + ηµ − k k2 ! we arrive at our simplified update from (6.8):

˜> 2 θj z 2 2 α αj = argmin bj σ α η z 2 + η(1 + ηµ) z 2 . α ( · 1 + ηµ − · k k ! k k 2 ) APPENDIX E. IMPLICIT BACKPROPAGATION 187

Once we have solved for αj we can recover the optimal θj by using (E.8) to find qj, (E.6) to find

λj and (E.5) to find θj. The resulting formula is

θ˜j θj = ηαjz, 1 + ηµ − as was stated in (6.7).

E.2 IB for convolutional neural networks

Here we consider applying IB to a Convolutional Neural Network (CNNs). As each filter is applied independently in a CNN, the IB updates decouple into separate updates for each filter. Since a filter uses shared weights, we cannot use the generic update equations in Section 6.4.1. Instead we have to derive the updates starting from (E.1)

> 2 ˜ 2 argmin 2ηb σ(Zθ) + ηµ θ 2 + θ θ 2. (E.9) θ k k k − k

Here Z is a matrix where each row corresponds to one patch over which the convolution vector 1 θ is multiplied. The activation function σ has components [σ(x)]m = max Bmx where Bm is a { } pooling matrix with elements in 0, 1 . We will assume that Bm has a row of all zeros so that the { } > > max-pooling effectively includes a relu non-linearity, since max (B , 0) x = max relu(Bmx) . { m } { }

1Note that Z in general will have repeated entries and may have a column of ones appended to it to account for a bias. APPENDIX E. IMPLICIT BACKPROPAGATION 188

We can expand (E.9) into a quadratic program:

M 2 2 argmin 2η bm max BmZθ + ηµ θ + θ θ˜ { } k k2 k − k2 θ m=1 X M 2 2 = argmin 2η bm am + ηµ θ + θ θ˜ | | k k2 k − k2 a,θ m=1 X

max BmZθ if bm 0 s.t. am { } ≥ ≥  min BmZθ if bm < 0  {− } M  2 2 = argmin 2η bm am + ηµ θ + θ θ˜ | | k k2 k − k2 a,y,θ m=1 X

BmZjθ if bm 0 s.t. am ≥ for all j ≥   BmZjθ M(1 ymj) if bm < 0 − − −

ymj= 1 j X ymj 0, 1 , ∈ { } where M > 0 is a large constant. This is clearly an expensive problem to solve each iteration. Note that when the convolution does not involve max-pooling, but just shared weights, the problem becomes a quadratic program with continuous variables and no constraints, which can be solved analytically. On the other hand if the convolution had max-pooling, but no shared weights, then the generic IB updates from (6.7) would apply. Thus the difficulty in solving the IB convolutional update comes from doing the max-pooling and weight sharing in the same layer.

E.3 Convergence theorem

In this section we present a simple theorem that leverages the fact that IB converges to EB in the limit of small learning rates, to show that IB converges to a stationary point of the loss function for appropriately decaying learning rates. First, we introduce some useful notation, and then we state the conditions under which the convergence result holds.

(t) Notation. Let gi(θ ; ηt) denote the gradient used to take a step in IB when datapoint i is (t+1) (t) (t) sampled with learning rate η, i.e. θ = θ ηtgi(θ ; ηt). Let ` be the loss function from − APPENDIX E. IMPLICIT BACKPROPAGATION 189

Sections 6.2 and 6.3. Define the level set

2 C = θ : θ 2 `(0) k k2 ≤ µ   and the restarting function θ if θ C ∈ R(θ) =  0 otherwise.

The set C depends on the value of `(0). Whenθ = 0 the output of the neural network is independent 1 N 1 N of its input and so `(0) = N i=1 `(yi, f0(xi)) = N i=1 `(yi, f0(0)) can be quickly calculated. Finally define the extended level-setP P

2 C¯(η) = θ : θ 2 `(0) + η max gi(θ; η) 2 k k ≤ µ · i,θ∈C{k k }  r  to contain all points that can be reached from C in one IB iteration (without restarting). We assume the following conditions.

1 N µ 2 (t) Assumption 1. The objective function `(θ) = `θ(xi, yi) + θ , IB gradients gi(θ ; η) N i=1 2 k k2 ∞ and learning rate sequence ηt satisfy the following:P { }t=1

(a) The loss at each datapoint is non-negative, i.e. `θ(x, y) 0 for all x, y, θ. ≥ (b) The gradient function is Lipschitz continuous with a Lipschitz constant 0 < L(η) < that is ∞ monotonically decreasing in η. That is

`(θ) `(θ¯) 2 L(η) θ θ¯ 2, k∇ − ∇ k ≤ k − k

for all θ, θ¯ C¯(η) and L(η) L(¯η) if η η¯. { } ⊂ ≤ ≤ ˜ (t) (c) The gradients of the stochastic functions `k(θk; x, y, θ−k) in (6.3) are Lipschitz continuous with a Lipschitz constant 0 < L˜(η) < that is monotonically decreasing in η. That is ∞

(t) (t) `˜k(θk; x, y, θ ) `˜k(θ¯k; x, y, θ ) 2 L˜(η) θk θ¯k 2, k∇ −k − ∇ −k k ≤ k − k

for all k 1, ..., d , (x, y) , θ(t) C and θ, θ¯ C¯(η); and L˜(η) L˜(¯η) if η η¯. ∈ { } ∈ D ∈ { } ⊂ ≤ ≤ ∞ (d) The learning rate sequence is monotonically decreasing with η1L˜(η1) < 1, ηt = and t=1 ∞ ∞ η2 < . P t=1 t ∞ P APPENDIX E. IMPLICIT BACKPROPAGATION 190

Assumption (a) is effectively equivalent to the loss being lower bounded by a deterministic 0 constant `θ(x, y) B, as one can always define an equivalent loss ` (x, y) = `θ(x, y) B 0. ≥ θ − ≥ Most standard loss functions, such as the square loss, quantile loss, logistic loss and multinomial loss, satisfy assumption (a). Assumptions (b) and (c) will be valid for any neural network whose activation functions have Lipschitz gradients. Assumption (d) is standard in SGD proofs (except for the η1L˜(η1) < 1 assumption which is particular to us). Let “restarting IB” refer to IB where the restarting operator R is applied each iteration. We now state the IB convergence theorem:

Theorem E.3.1. Under Assumptions 1, restarting IB converges to a stationary point in the sense that T (t) 2 ηtE[ `(θ ) ] lim t=1 k∇ k2 = 0. T →∞ T P t=1 ηt The proof of Theorem E.3.1 builds on the following intermediate results. P Lemma E.3.2. The restarting operator R does not increase the loss `.

Proof. If θ C then `(R(θ)) = `(θ) and the loss stays the same, otherwise ∈ µ `(R(θ)) = `(0) < θ 2 `(θ), 2 k k2 ≤ where the first inequality is by the definition of C and the second is from the assumption that

`θ(x, y) 0. ≥

Lemma E.3.3. The gradient of the loss function at any point in C is bounded by B`(η) = 2 L(η) `(0) + `(0) 2. µ k∇ k q Proof. By the triangle inequality and Lipschitz assumption

`(θ) 2 = `(θ) `(0) + `(0) 2 k∇ k k∇ − ∇ ∇ k

`(θ) `(0) 2 + `(0) 2 ≤ k∇ − ∇ k k∇ k

L(η) θ 0 2 + `(0) 2 ≤ k − k k∇ k 2 L(η) `(0) + `(0) 2 ≤ µ k∇ k r = B`(η) for all θ C. ∈ APPENDIX E. IMPLICIT BACKPROPAGATION 191

> > Lemma E.3.4. If x y 2 z then for any v we have y v x v z v 2. k − k ≤ ≥ − k k Proof.

y>v min s>v : x s 2 z2 ≥ s { k − k2 ≤ } > 2 2 = max min s v + λ( x s 2 z ) λ≥0 s { k − k − } > = x v z v 2 − k k where the final line follows from basic algebraic and calculus.

Lemma E.3.5. The 2-norm difference between the EB and IB gradients at θ(t) with learning rate (t) ηt is bounded by ηtL˜(ηt) gi(θ ; ηt) 2. k k

th Proof. Let `ik(θ) denote the components of `i(θ) corresponding to the parameters of the k ∇ ∇ layer θk, i.e. `i(θ) = ( `i1(θ), ..., `id(θ)). By construction ∇ ∇ ∇ (t) ˜ (t) `ik(θ ) = θk `k(θk; x, y, θ−k) θ =θ(t) ∇ ∇ k k (t) ˜ (t) gi(θ ; ηt) = θk `k(θk; x, y, θ−k) θ =θ(t+1) ∇ k k where, in a slight abuse of notation, θ(t+1) refers to the value of the next IB iterate before the application of the restarting operator. By the Lipschitz assumption on `˜k we have

d (t) (t) 2 (t) (t) 2 `i(θ ) gi(θ ; ηt) = `ik(θ ) gik(θ ; ηt) k∇ − k2 k∇ − k2 k=1 X d ˜ (t) ˜ (t) 2 = θk `k(θk; x, y, θ−k) θ =θ(t) θk `k(θk; x, y, θ−k) θ =θ(t+1) 2 k∇ k k − ∇ k k k k=1 X d 2 (t) (t+1) 2 L˜(ηt) θ θ ≤ k k − k k2 k=1 X 2 (t) (t+1) 2 = L˜(ηt) θ θ k − k2 2 (t) 2 = L˜(ηt) ηtgi(θ ; ηt) k − k2 2 2 (t) 2 = η L˜(ηt) gi(θ ; ηt) . t k k2 APPENDIX E. IMPLICIT BACKPROPAGATION 192

Lemma E.3.6. The 2-norm of the IB gradient gi(θ; ηt) at any point in C is bounded by Bg(ηt) = B`(ηt) . 1−ηtL˜(ηt)

Proof. By Lemmas E.3.3, E.3.5 and the triangle inequality,

gi(θ; ηt) 2 = gi(θ; ηt) `i(θ) + `i(θ) 2 k k k − ∇ ∇ k

gi(θ; ηt) `i(θ) 2 + `i(θ) 2 ≤ k − ∇ k k∇ k

ηtL˜(ηt) gi(θ; ηt) 2 + B`(ηt). (E.10) ≤ k k

Note that η1L˜(η1) < 1 by assumption. Since ηt η1 and L˜(η) is monotonically decreasing in η, we ≤ have that ηtL˜(ηt) < 1 for all t 1. Thus subtracting ηtL˜(ηt) gi(θ; ηt) 2 from both sides of (E.10) ≥ k k and dividing by 1 ηtL˜(ηt) > 0 yields the desired result. − Proof of Theorem E.3.1. We upper bound the loss of restarting-IB as

(t+1) (t) (t) E[`(θ )] = E[`(R(θ ηtgi(θ ; ηt)))] − (t) (t) E[`(θ ηtgi(θ ; ηt))] ≤ − (t) (t) > (t) 1 2 (t) 2 E[`(θ ) ηtgi(θ ; ηt) `(θ ) + η L(η) gi(θ ; ηt) ] ≤ − ∇ 2 t k k2 (t) (t) > (t) 1 2 (t) 2 = E[`(θ )] ηtE[gi(θ ; ηt) `(θ )] + η L(η)E[ gi(θ ; ηt) ]. (E.11) − ∇ 2 t k k2

(t) > (t) Let’s focus on the second term in (E.11), E[gi(θ ; ηt) `(θ )]. By Lemmas E.3.4 and E.3.5 we ∇ have that

(t) > (t) (t) > (t) (t) (t) gi(θ ; ηt) `(θ ) `i(θ ) `(θ ) ηtL˜(η) gi(θ ; ηt) 2 `(θ ) 2 ∇ ≥ ∇ ∇ − k k k∇ k and so

(t) > (t) (t) 2 (t) (t) E[gi(θ ; ηt) `(θ )] `(θ ) ηtL˜(η)E[ gi(θ ; ηt) 2] `(θ ) 2 ∇ ≥ k∇ k2 − k k k∇ k (t) 2 `(θ ) ηtL˜(η)Bg(ηt)B`(ηt) ≥ k∇ k2 − where B`(ηt) is as defined in Lemma E.3.3 and Bg(ηt) in Lemma E.3.6. Moving onto the third term in (E.11), we have (t) 2 2 E[ gi(θ ; ηt) ] B (ηt) k k2 ≤ g APPENDIX E. IMPLICIT BACKPROPAGATION 193 by Lemma E.3.6. Putting all the terms in (E.11) together

(t+1) (t) (t) 2 1 2 2 E[`(θ )] E[`(θ )] ηt( `(θ ) ηtL˜(ηt)Bg(ηt)B`(ηt)) + η L(ηt)B (ηt) ≤ − k∇ k2 − 2 t g (t) (t) 2 2 1 2 = E[`(θ )] ηt `(θ ) + η (L˜(ηt)Bg(ηt)B`(ηt) + L(ηt)B (ηt)) − k∇ k2 t 2 g (t) (t) 2 2 1 2 E[`(θ )] ηt `(θ ) + η (L˜(η1)Bg(η1)B`(η1) + L(η1)Bg(η1) ) ≤ − k∇ k2 t 2 where we have used the assumption that L and L˜ are monotonically decreasing in η (that B` and

Bg are also monotonically decreasing in η follows from the assumptions on L and L˜). Using a telescoping sum and rearranging yields

T (t) 2 (1) (T +1) ηtE[ `(θ ) ] E[`(θ )] E[`(θ )] k∇ k2 ≤ − t=1 X T 1 2 2 + (L˜(η1)Bg(η1)B (η1) + L(η1)Bg(η1) ) η ` 2 t t=1 X T (1) 1 2 2 E[`(θ )] + (L˜(η1)Bg(η1)B`(η1) + L(η1)Bg(η1) ) η (E.12) ≤ 2 t t=1 X where the second inequality follows from the assumption that `(θ) 0. Both of the terms in (E.12) ≥ are deterministically bounded for all T and so it must be the case that ∞ (t) 2 ηtE[ `(θ ) ] < . k∇ k2 ∞ t=1 X ∞ Finally, by the assumption that ηt = we have t=1 ∞ T (t) 2 P ηtE[ `(θ ) ] lim t=1 k∇ k2 = 0. T →∞ T P t=1 ηt P

E.4 Experimental setup

In this section we describe the datasets, hyperparameters, run times and results of the experiments from Section 6.5 in greater detail.

E.4.1 Datasets

The experiments use three types of dataset: MNIST for image classification and autoencoding, 4 polyphonic music dataset for RNN prediction and 121 UCI datasets for classification. These are APPENDIX E. IMPLICIT BACKPROPAGATION 194 standard benchmark datasets that are often used in the literature: MNIST is arguably the most used dataset in machine learning [LeCun, 1998], the polyphonic music datasets are a standard for testing real world RNNs [Martens and Sutskever, 2011, Boulanger-Lewandowski et al., 2012, Pascanu et al., 2013] and the UCI datasets are an established benchmark for comparing the performance of classification algorithms [Fern´andez-Delgadoet al., 2014, Klambauer et al., 2017]. The sources of the datasets are given in Table E.1 along with a basic description of their characteristics in Table E.2. The MNIST and UCI datasets were pre-split into training and test sets. For the music datasets we used a random 80%-20% train-test split.

Table E.1: Data sources

Dataset Source url

MNIST http://yann.lecun.com/exdb/mnist/

Polyphonic music http://www-etud.iro.umontreal.ca/~boulanni/icml2012. UCI classification https://github.com/bioinf-jku/SNNs.

Table E.2: Data characteristics. An asterisk ∗ indicates the average length of the musical piece. 88 is the number of potential notes in each chord. For information on the UCI classification datasets, see [Klambauer et al., 2017, Appendix 4.2].

Dataset Train Test Dimension Classes

MNIST 60,000 10,000 28x28 10 JSB Chorales 229 (60∗) 77 (61∗) 88 88 MuseData 524 (468∗) 124 (519∗) 88 88 Nottingham 694 (254∗) 170 (262∗) 88 88 Piano-midi.de 87 (873∗) 25 (761∗) 88 88

E.4.2 Loss metrics

For all of the experiments we used standard loss metrics. For image classification we used the cross- entropy and for autoencoders the mean squared error. As suggested by Boulanger-Lewandowski APPENDIX E. IMPLICIT BACKPROPAGATION 195 et al. [2012] on their website http://www-etud.iro.umontreal.ca/~boulanni/icml2012, we used the expected frame level accuracy as the loss for the music prediction problems. That is, at each time step the RNN outputs a vector in [0, 1]88 representing the probability of each of the 88 notes in the next chord being played or not. This naturally defines a likelihood of the next chord. We use the log of this likelihood as our loss. Also as suggested by Boulanger-Lewandowski et al. [2012], the loss for each piece is divided by its length, so that the loss is of similar magnitude for all pieces.

E.4.3 Architectures

We endeavored to use standard architectures so that the experiments would be a fair representation how the algorithms might perform in practical applications. The architecture for each type of dataset is given below.

MNIST classification. The architecture is virtually identical to that given in the MNIST clas- sification Pytorch tutorials available at https://github.com/pytorch/examples/blob/master/ mnist/main.py. The input is a 28 28 dimensional image. We first apply two 2d-convolutions × with a relu activation function and max-pooling. The convolution kernel size is 5 with 10 filters for the first convolution and 20 filters for the second, and the max-pooling kernel size is 2. After that we apply one arctan layer with input size 320 and output size 50, followed by dropout at a rate of 0.5 and a relu layer with output size 10. This is fed into the softmax function to get probabilities over the 10 classes. Since IB is not applicable to convolutional layers, for the IB implementation we use EB updates for the convolutional layers and IB for the arctan and relu layers. For the EB implementation, we use EB updates for all of the layers.

MNIST autoencoder. The autoencoder architecture just involves relus and has a 784:500:300:100:30:100:300:500:784 structure. This is similar to, but deeper than, the structure used in [Martens, 2010].

Music prediction. The simple RNN architecture used is identical to that in [Pascanu et al., 2013], except for the fact that we use an arctan activation function instead of tanh (since IB is compatible with arctan but not with tanh). 300 hidden units are used. APPENDIX E. IMPLICIT BACKPROPAGATION 196

UCI datasets. The architecture used for the UCI datasets consists of three arctan layers followed by a linear layer and the softmax. The number of nodes in the arctan layers are equal to the number of features in the dataset, with the final linear layer having input size equal to the number of features and output size equal to the number of classes. This architecture is based on [Klambauer et al., 2017], who also use the same UCI datasets with a similar architecture.

E.4.4 Hyperparameters and initialization details

The hyperparameters and parameter initializations schemes used are as follows:

Batch size: 100 (except for RNNs which had a batch-size of 1) • Dropout: None (unless otherwise stated) • Momentum: None • Ridge-regularization (µ): 0 • Weight matrix initialization: Each element is sampled independently from • unif 6 , 6 , where n = input size, m = output size to layer. This follows advice − n+m n+m of [Goodfellow q etq al., 2016, p. 299]

Bias initialization: 0. Again, this follows advice of [Goodfellow et al., 2016, p. 299] •

E.4.5 Learning rates

The process used to decide on the learning rates for the MNIST and music experiments is as follows. First a very coarse grid of learning rates was tested using EB to ascertain the range of learning rates where EB performed reasonably well. We then constructed a finer grid of learning rates around where EB performed well. The finer grid was constructed so that EB demonstrated in a U-shape of losses, with lower and higher learning rates having higher losses than learning rates in the middle, as in [Goodfellow et al., 2016, Fig 11.1, p. 425]. It was this finer grid that was used to generate the final results. Note that at no stage was IB involved in constructing the grid and thus, if anything, the grid is biased in favor of EB. For the UCI datasets the following set of learning rates were used: 0.001, 0.01, 0.1, 0.5, 1.0, 3.0, 5.0, 10.0, 30.0, 50.0. This is quite a coarse grid over a very large range of values. APPENDIX E. IMPLICIT BACKPROPAGATION 197

E.4.6 Clipping

Clipping was not used except for those experiments where the effect of clipping was explicitly being investigated. When clipping was used, the gradients were clipping according to their norm (opposed to each component being clipped separately). For EB we applied clipping in the usual way: first calculating the gradient, clipping it and then taking the step using the clipped gradient. To implement clipping for IB we used an alternative definition of the IB step from (6.1): θ(t+1) = (t) (t+1) (t+1) θ ηt( θ`i(θ ) + µθ ), where we highlight that the gradient is evaluated at the next, not − ∇ current, value of θ. The IB gradient can be inferred using

(t+1) (t+1) (t) (t+1) θ`i(θ ) + µθ = (θ θ )/ηt (E.13) ∇ − where θ(t+1) is calculated using (6.4). When applying clipping to IB we first calculate θ(t+1) using the IB update, infer the IB gradient using (E.13), clip it, and finally take a step equal to the learning rate multiplied by the clipped gradient.

E.5 Results

Here we will give the full results of all of the experiments. We begin with giving the run times of each experiment after which we present the performance on the MNIST and music datasets. Finally we give results on the UCI datasets.

E.5.1 Run times

In Section 6.4.5 upper bounds on the relative increase in run time for IB as compared to EB were derived. The bounds for each experiment are displayed in Table E.3 along with the empirically measured relative run times of our basic Pytorch implementation. The Pytorch run times are higher than in the theoretical upper bounds. This shows that IB could be more efficiently implemented.

E.5.2 Results from MNIST and music experiments

In this section the plots for all of the experiments are presented. For each experiment we have multiple plots: a line plot showing the mean performance along with 1 standard deviation error bars; a scatter plot showing the performance for each random seed; and, where applicable, line and APPENDIX E. IMPLICIT BACKPROPAGATION 198

Dataset Theoretical upper bound Empirical

MNIST classification 6.27% 16.82% MNIST autoencoder 0.60% 99.68% JSB Chorals 11.33% 152.51% MuseData 11.33% 207.48% Nottingham 11.33% 215.85% Piano-midi.de 11.33% 213.98% UCI classification sum - 58.99%

Table E.3: Theoretical and empirical relative run time of IB vs EB. Empirical measured on AWS p2.xlarge with our basic Pytorch implementation. scatter plots showing the difference between the experiment with and without clipping. In general 5 seeds are used per learning rate, although for MNIST-classification we use 20 seeds. The experiments are presented in the following order

1. MNIST classification without clipping

2. MNIST classification with clipping threshold = 1.0

3. MNIST autoencoder without clipping

4. MNIST autoencoder with clipping threshold = 1.0

5. JSB Chorales without clipping

6. JSB Chorales with clipping threshold = 8.0

7. JSB Chorales with clipping threshold = 1.0

8. JSB Chorales with clipping threshold = 0.1

9. MuseData without clipping

10. MuseData with clipping threshold = 8.0

11. Nottingham without clipping APPENDIX E. IMPLICIT BACKPROPAGATION 199

12. Nottingham with clipping threshold = 8.0

13. Piano-midi.de without clipping

14. Piano-midi.de with clipping threshold = 8.0

15. Piano-midi.de with clipping threshold = 1.0

16. Piano-midi.de with clipping threshold = 0.1

17. Piano-midi.de with clipping threshold = 0.01

For the MNIST experiments we only used a clipping threshold of 1.0. For the music datasets we first use a clipping threshold of 8.0 as suggested by the authors of [Pascanu et al., 2013]. As this clipping threshold didn’t much affect the performance of either EB or IB, we also considered lower clipping thresholds. It is evident from the plots without clipping that on Piano-midi.de the algorithms are less well converged than the other datasets (which is probably due to Piano-midi.de having only 87 training datapoints). It was therefore of interest to see if further clipping could help stabilize the algorithms on Piano-midi.de. We can see from the random seed scatter plots that the effect of clipping helped a little with stabilization, but not to the extent that the results were significantly better. As JSB Chorales has the second fewest number of datapoints, we also tried lower clipping thresholds on it, and found it to often hurt the performance of EB as much as it helped (depending on the random seed). APPENDIX E. IMPLICIT BACKPROPAGATION 200

E.5.3 MNIST classification

[Figures: MNIST classification, no clipping. Line plots (mean with 1 standard deviation error bars) and per-seed scatter plots of training set loss, test set loss, training set accuracy and test set accuracy against learning rate, comparing explicit, implicit and exact ISGD.]

[Figures: MNIST classification, clip = 1.0. Line and scatter plots of training set loss, test set loss, training set accuracy and test set accuracy against learning rate, comparing explicit, implicit and exact ISGD.]

E.5.4 MNIST autoencoder

[Figures: MNIST autoencoder, no clipping. Line and scatter plots of training set loss and test set loss against learning rate, comparing explicit and implicit.]

[Figures: MNIST autoencoder, clip = 1.0. Line and scatter plots of training set loss and test set loss against learning rate, comparing explicit and implicit.]

E.5.5 JSB Chorales

[Figures: JSB Chorales, no clipping. Line and scatter plots of training set loss and test set loss against learning rate, comparing explicit and implicit.]

[Figures: JSB Chorales, clip = 8.0. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

[Figures: JSB Chorales, clip = 1.0. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

[Figures: JSB Chorales, clip = 0.1. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

E.5.6 MuseData

[Figures: MuseData, no clipping. Line and scatter plots of training set loss and test set loss against learning rate, comparing explicit and implicit.]

[Figures: MuseData, clip = 8.0. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

E.5.7 Nottingham

[Figures: Nottingham, no clipping. Line and scatter plots of training set loss and test set loss against learning rate, comparing explicit and implicit.]

[Figures: Nottingham, clip = 8.0. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

E.5.8 Piano-midi.de

[Figures: Piano-midi.de, no clipping. Line and scatter plots of training set loss and test set loss against learning rate, comparing explicit and implicit.]

[Figures: Piano-midi.de, clip = 8.0. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

[Figures: Piano-midi.de, clip = 1.0. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

[Figures: Piano-midi.de, clip = 0.1. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

[Figures: Piano-midi.de, clip = 0.01. Line and scatter plots of training set loss and test set loss against learning rate for explicit and implicit, together with the corresponding differences from the no-clipping runs.]

E.5.9 UCI classification training losses

Here we display the training loss of IB and EB on each of the 121 UCI datasets at the end of the 10th epoch. We only ran each experiment with one random seed. The experiments are divided into large (> 1000 datapoints) and small (≤ 1000 datapoints) datasets. The performance of EB and IB is very similar for the smaller learning rates, and this is still the case at the optimal learning rate for each dataset. However, once either EB or IB moves one learning rate higher, the loss tends to explode: on average it is 892 times higher than the optimal loss. This is typical behavior for learning rate sensitivity, as is shown in [Goodfellow et al., 2016, Fig 11.1, p. 425]. The fact that there is such a large explosion in the loss indicates that the learning rate grid is too coarse to pick up the differences in behavior at learning rates slightly larger than the optimal learning rate. Hence it is difficult to systematically distinguish between the performance of EB and IB.

Learning rate Dataset (large) SGD type 0.001 0.01 0.1 0.5 1.0 3.0 5.0 10.0 30.0 50.0

1.11 1.10 0.83 0.77 0.76 12.30 6.13 25.20 57.09 212.61 explicit abalone 1.11 1.10 0.83 0.77 0.77 2.65 3.18 22.95 98.31 84.63 implicit 0.55 0.35 0.32 0.32 0.31 0.36 1.52 2.97 17.99 27.19 explicit adult 0.55 0.35 0.32 0.32 0.32 0.34 1.15 4.31 12.22 20.27 implicit 0.60 0.36 0.25 0.27 0.30 0.52 1.29 14.74 27.23 113.73 explicit bank 0.60 0.36 0.25 0.27 0.29 0.35 1.46 21.44 65.24 19.08 implicit 1.18 1.01 0.85 0.50 0.39 0.59 5.28 4.55 53.97 106.93 explicit car 1.18 1.01 0.85 0.50 0.47 2.40 2.12 10.38 213.52 122.84 implicit 2.30 2.18 1.24 0.68 0.68 4.36 6.71 13.91 59.09 156.53 explicit cardiotocography-10clases 2.30 2.18 1.24 0.69 0.68 5.01 21.19 12.75 119.45 113.76 implicit 1.09 0.74 0.30 0.21 0.21 0.67 5.59 19.91 23.95 173.83 explicit cardiotocography-3clases 1.09 0.74 0.30 0.21 0.21 3.26 6.42 25.49 17.43 156.47 implicit 2.81 2.45 1.96 1.70 1.75 2.13 2.22 10.26 39.35 144.99 explicit chess-krvk 2.81 2.45 1.97 1.72 1.75 1.94 3.92 72.35 55.65 136.72 implicit 0.69 0.68 0.08 0.04 0.02 1.38 0.77 12.53 156.29 467.88 explicit chess-krvkp 0.69 0.68 0.08 0.07 0.07 0.33 2.80 12.91 26.88 56.18 implicit 0.56 0.54 0.48 0.40 0.38 2.36 4.59 9.25 34.85 59.43 explicit connect-4 0.56 0.54 0.50 0.44 0.46 3.23 4.64 11.31 33.66 52.39 implicit 1.08 1.07 1.07 0.99 0.96 1.00 5.34 17.32 149.70 133.28 explicit contrac 1.08 1.07 1.07 0.99 0.96 0.99 14.08 11.72 60.05 181.86 implicit 0.70 0.69 0.69 0.69 0.69 84.43 38.06 943.16 3267.16 1005.30 explicit hill-valley 0.70 0.69 0.69 0.69 0.70 168.00 180.79 907.54 2014.49 90.33 implicit 1.95 1.94 0.95 0.20 0.12 9.13 11.97 13.45 95.62 72.58 explicit image-segmentation 1.95 1.94 0.95 0.22 0.15 2.77 2.10 13.21 277.97 131.41 implicit 2.32 2.32 2.30 1.73 1.25 1.63 4.25 7.59 109.59 209.45 explicit led-display 2.32 2.32 2.30 1.73 1.26 3.13 5.43 30.01 93.78 240.85 implicit 3.27 3.21 1.31 0.61 0.64 0.83 2.47 5.98 35.33 60.50 explicit letter 3.27 3.21 1.31 0.59 0.68 0.84 1.85 5.92 27.24 39.06 implicit 0.65 0.53 0.47 0.41 0.41 0.46 3.59 24.93 40.47 57.64 explicit magic 0.65 0.53 0.48 0.42 0.43 0.53 5.43 5.41 41.77 146.46 implicit 0.37 0.29 0.23 0.21 0.22 1.40 5.60 18.28 60.37 190.87 explicit miniboone 0.37 0.29 0.22 0.22 0.22 1.12 2.71 3.39 20.09 32.84 implicit 1.05 0.98 0.36 0.31 0.25 5.32 9.87 30.26 146.88 227.42 explicit molec-biol-splice 1.05 0.98 0.36 0.34 0.33 27.64 8.85 29.27 149.89 188.89 implicit 0.69 0.36 0.01 0.00 0.00 0.00 0.06 1.96 157.31 87.08 explicit mushroom 0.69 0.36 0.01 0.00 0.00 0.00 0.05 4.86 11.92 42.45 implicit 0.49 0.19 0.08 0.03 2.75 34.14 112.76 228.01 436.03 288.58 explicit musk-2 0.49 0.19 0.08 0.03 2.46 63.96 109.08 36.49 263.51 105.52 implicit 1.47 1.21 0.51 0.12 0.14 0.31 0.87 6.99 63.94 195.34 explicit nursery 1.47 1.21 0.52 0.18 0.12 0.19 1.36 33.21 129.24 514.70 implicit 0.65 0.63 0.61 0.69 0.73 18.79 19.89 82.75 767.96 917.21 explicit oocytes-merluccius-nucleus-4d 0.65 0.63 0.61 0.68 0.72 30.91 91.02 108.44 291.86 181.39 implicit 1.04 0.88 0.35 0.22 0.20 1.46 3.09 34.59 255.92 35.96 explicit oocytes-merluccius-states-2f 1.04 0.88 0.35 0.22 0.21 3.73 7.49 42.19 53.97 86.39 implicit APPENDIX E. IMPLICIT BACKPROPAGATION 220

Learning rate Dataset (large) SGD type 0.001 0.01 0.1 0.5 1.0 3.0 5.0 10.0 30.0 50.0

2.27 1.55 0.10 0.02 0.01 0.23 1.59 10.73 51.20 121.22 explicit optical 2.27 1.55 0.10 0.03 0.01 0.22 0.42 3.97 8.87 15.60 implicit 0.51 0.18 0.09 0.09 0.10 9.91 11.00 8.83 115.42 30.89 explicit ozone 0.51 0.18 0.09 0.09 0.10 2.46 2.67 57.39 13.99 79.68 implicit 1.30 0.47 0.25 0.16 0.13 0.13 1.36 6.97 7.46 23.52 explicit page-blocks 1.30 0.47 0.25 0.16 0.15 0.13 2.79 24.66 43.27 111.66 implicit 2.30 2.22 0.28 0.07 0.06 0.10 0.95 4.35 20.55 66.20 explicit pendigits 2.30 2.22 0.28 0.05 0.05 0.18 0.26 0.72 12.03 32.69 implicit 4.60 4.59 3.91 1.39 0.82 6.24 16.50 52.11 152.54 299.85 explicit plant-margin 4.60 4.59 3.91 1.39 0.86 21.08 28.08 46.19 117.16 208.09 implicit 4.60 4.58 3.95 3.00 2.89 23.00 42.98 79.10 293.86 496.63 explicit plant-shape 4.60 4.58 3.95 3.00 2.87 19.54 43.80 85.42 234.65 572.32 implicit 4.61 4.59 4.17 1.20 0.55 0.34 16.13 39.06 123.57 193.97 explicit plant-texture 4.61 4.59 4.17 1.20 0.57 0.84 11.28 17.27 68.12 208.77 implicit 0.69 0.69 0.46 0.19 0.24 3.51 4.33 11.12 33.76 68.65 explicit ringnorm 0.69 0.69 0.47 0.22 0.21 2.64 5.10 53.34 29.11 72.79 implicit 2.18 1.04 0.08 0.01 0.02 5.35 21.05 92.46 978.63 1565.21 explicit semeion 2.18 1.04 0.08 0.01 0.00 90.42 1.71 7.28 149.68 559.76 implicit 0.67 0.36 0.18 0.16 0.17 8.17 2.06 20.93 67.37 102.59 explicit spambase 0.67 0.36 0.18 0.16 0.17 9.62 3.12 11.46 80.37 161.66 implicit 0.73 0.65 0.59 0.46 0.47 0.54 19.04 28.39 223.84 279.24 explicit statlog-german-credit 0.73 0.65 0.59 0.46 0.47 0.55 12.99 32.00 36.53 281.25 implicit 1.95 1.94 0.98 0.20 0.13 2.63 10.29 16.16 104.27 86.13 explicit statlog-image 1.95 1.94 0.98 0.20 0.18 3.61 7.49 62.98 65.65 62.29 implicit 1.73 0.90 0.37 0.32 0.30 3.21 7.15 25.85 51.35 53.25 explicit statlog-landsat 1.73 0.90 0.37 0.33 0.33 5.02 8.10 6.22 34.73 126.06 implicit 0.80 0.18 0.03 0.04 0.01 0.02 0.54 0.71 40.30 4.51 explicit statlog-shuttle 0.80 0.18 0.03 0.02 0.02 0.02 0.12 0.26 7.60 13.13 implicit 1.91 1.80 1.22 0.72 0.69 3.44 14.42 12.85 99.23 87.18 explicit steel-plates 1.91 1.80 1.22 0.73 0.70 68.19 11.35 169.10 136.68 340.84 implicit 0.68 0.30 0.13 0.07 0.06 0.06 1.37 3.02 8.48 52.41 explicit thyroid 0.68 0.30 0.13 0.07 0.07 0.09 2.71 6.23 33.58 51.21 implicit 0.64 0.63 0.63 0.52 0.52 0.59 0.98 10.04 61.18 87.35 explicit titanic 0.64 0.63 0.63 0.52 0.53 0.61 4.44 14.18 81.66 58.33 implicit 0.69 0.55 0.06 0.06 0.06 0.07 0.59 5.49 23.61 26.69 explicit twonorm 0.69 0.55 0.06 0.06 0.07 0.08 2.58 3.33 4.87 21.40 implicit 1.38 1.18 0.67 0.33 0.34 1.56 6.02 19.54 63.01 149.71 explicit wall-following 1.38 1.18 0.68 0.34 0.32 3.83 5.84 11.73 33.44 367.02 implicit 1.09 0.89 0.28 0.28 0.29 1.43 5.87 16.95 43.49 37.55 explicit waveform 1.09 0.89 0.28 0.28 0.29 1.81 6.47 9.13 32.50 30.61 implicit 1.09 0.82 0.29 0.33 0.29 10.64 21.44 41.78 107.99 249.30 explicit waveform-noise 1.09 0.82 0.29 0.33 0.31 7.92 9.56 10.69 87.91 35.45 implicit 1.78 1.55 1.20 0.96 0.95 1.01 4.72 16.11 72.08 206.43 explicit wine-quality-red 1.78 1.55 1.20 0.97 0.96 0.99 7.27 24.77 71.79 183.35 implicit 1.96 1.42 1.18 1.12 1.14 1.36 5.40 28.62 95.23 209.29 explicit wine-quality-white 1.96 1.42 1.18 1.12 1.15 1.30 22.65 22.89 267.55 226.40 implicit 2.29 2.07 1.73 1.21 1.21 2.11 5.53 16.11 40.73 139.51 explicit yeast 2.29 2.07 1.73 1.21 1.29 2.06 18.38 30.82 100.81 318.28 implicit APPENDIX E. IMPLICIT BACKPROPAGATION 221

Learning rate Dataset (small) SGD type 0.001 0.01 0.1 0.5 1.0 3.0 5.0 10.0 30.0 50.0

0.69 0.69 0.69 0.69 0.67 0.45 9.87 26.80 58.27 30.87 explicit acute-inflammation 0.69 0.69 0.69 0.69 0.67 0.45 28.65 19.13 11.30 7.00 implicit 0.68 0.68 0.66 0.65 0.64 0.02 12.88 0.73 7.07 28.32 explicit acute-nephritis 0.68 0.68 0.66 0.65 0.64 0.02 9.84 0.00 13.72 21.91 implicit 1.52 1.20 0.73 0.49 0.57 2.24 10.88 38.62 183.52 209.44 explicit annealing 1.52 1.20 0.73 0.55 0.58 12.99 18.48 21.23 180.15 250.37 implicit 2.46 1.92 0.63 0.25 24.92 57.87 162.63 256.21 507.11 1280.03 explicit arrhythmia 2.46 1.92 0.63 0.23 30.57 48.03 111.98 329.23 1401.02 2270.51 implicit 2.89 2.85 2.55 0.93 0.42 28.32 27.10 57.28 165.67 270.77 explicit audiology-std 2.89 2.85 2.55 0.93 0.42 49.83 75.19 136.42 247.31 433.71 implicit 1.13 1.07 0.92 0.91 0.91 0.32 2.48 7.48 75.97 111.11 explicit balance-scale 1.13 1.07 0.92 0.91 0.91 0.32 19.65 30.05 151.05 267.63 implicit 0.70 0.69 0.68 0.68 0.68 0.68 6.96 43.68 25.14 88.85 explicit balloons 0.70 0.69 0.68 0.68 0.68 0.68 19.71 17.06 33.36 256.16 implicit 0.69 0.62 0.54 0.54 0.54 0.54 0.54 17.12 67.33 85.15 explicit blood 0.69 0.62 0.54 0.54 0.54 0.54 0.54 28.88 49.44 143.05 implicit 0.71 0.68 0.63 0.62 0.59 0.61 20.42 33.63 38.61 97.66 explicit breast-cancer 0.71 0.68 0.63 0.62 0.59 0.61 9.89 48.75 21.05 177.86 implicit 0.70 0.67 0.64 0.10 0.09 0.11 2.13 3.60 3.59 16.08 explicit breast-cancer-wisc 0.70 0.67 0.64 0.10 0.09 0.11 3.99 2.78 36.46 43.01 implicit 0.71 0.67 0.11 0.05 0.05 2.33 1.64 3.06 51.97 1399.89 explicit breast-cancer-wisc-diag 0.71 0.67 0.11 0.05 0.05 6.14 6.91 15.54 44.17 95.69 implicit 0.69 0.65 0.53 0.45 0.37 0.53 125.74 28.90 97.62 72.58 explicit breast-cancer-wisc-prog 0.69 0.65 0.53 0.45 0.37 0.49 25.71 36.27 54.02 1589.74 implicit 1.78 1.78 1.78 1.77 1.75 1.18 6.23 35.95 69.66 510.77 explicit breast-tissue 1.78 1.78 1.78 1.77 1.75 1.18 6.97 28.55 38.37 179.07 implicit 0.71 0.69 0.67 0.66 0.63 0.68 13.56 69.97 139.57 503.67 explicit congressional-voting 0.71 0.69 0.67 0.66 0.63 0.68 39.21 52.27 166.97 265.89 implicit 0.69 0.69 0.59 0.30 0.32 75.14 93.29 204.05 363.93 1474.36 explicit conn-bench-sonar-mines-rocks 0.69 0.69 0.59 0.30 0.33 33.36 51.40 62.84 447.59 387.23 implicit 2.41 2.41 2.40 1.33 1.02 1.00 4.54 13.65 105.52 270.46 explicit conn-bench-vowel-deterding 2.41 2.41 2.40 1.33 1.05 1.20 10.81 30.59 244.61 175.47 implicit 0.69 0.69 0.67 0.33 0.33 16.51 7.90 30.36 46.97 58.09 explicit credit-approval 0.69 0.69 0.67 0.34 0.33 1.66 8.54 42.96 31.86 35.54 implicit 0.71 0.68 0.64 0.49 0.50 5.17 20.72 75.39 299.83 175.94 explicit cylinder-bands 0.71 0.68 0.64 0.50 0.52 31.31 71.16 57.63 68.50 386.99 implicit 1.81 1.78 1.32 0.23 0.09 5.17 1.43 28.02 62.58 374.63 explicit dermatology 1.81 1.78 1.32 0.23 0.08 4.62 17.01 26.38 53.13 170.26 implicit 0.67 0.65 0.59 0.25 0.19 0.19 12.02 73.34 95.15 136.16 explicit echocardiogram 0.67 0.65 0.59 0.25 0.19 0.21 11.35 9.94 80.01 34.53 implicit 2.11 2.06 1.79 1.19 0.96 0.97 23.58 13.47 90.18 77.99 explicit ecoli 2.11 2.06 1.79 1.19 0.96 1.02 9.14 38.29 107.86 64.26 implicit 1.15 1.10 1.01 0.43 0.43 5.66 9.66 16.76 63.23 115.98 explicit energy-y1 1.15 1.10 1.01 0.45 0.49 4.34 1.61 17.47 44.28 59.35 implicit 1.18 1.11 1.02 0.40 0.39 0.44 1.36 5.68 14.01 38.33 explicit energy-y2 1.18 1.11 1.02 0.41 0.47 0.94 1.09 8.01 12.52 48.55 implicit 0.72 0.69 0.48 0.35 0.35 0.35 0.35 6.90 34.50 131.21 explicit fertility 0.72 0.69 0.48 0.35 0.35 0.35 0.35 62.82 11.22 17.92 implicit 2.09 2.07 1.92 1.62 1.40 3.40 57.37 38.55 70.61 328.58 explicit flags 2.09 2.07 1.92 1.62 1.40 19.65 38.18 103.61 150.33 631.23 
implicit 1.76 1.74 1.60 1.50 1.50 1.57 39.15 41.59 255.73 224.50 explicit glass 1.76 1.74 1.60 1.50 1.50 1.59 11.74 27.31 140.43 378.32 implicit 0.62 0.60 0.58 0.58 0.58 0.59 2.22 9.43 59.62 151.06 explicit haberman-survival 0.62 0.60 0.58 0.57 0.58 0.59 2.16 11.38 35.12 166.79 implicit 1.14 1.12 1.04 1.02 1.02 1.02 7.22 10.04 104.41 229.25 explicit hayes-roth 1.14 1.12 1.04 1.02 1.02 1.02 8.16 26.48 31.05 127.38 implicit APPENDIX E. IMPLICIT BACKPROPAGATION 222

Learning rate Dataset (small) SGD type 0.001 0.01 0.1 0.5 1.0 3.0 5.0 10.0 30.0 50.0

1.64 1.58 1.30 0.91 0.90 0.90 7.38 16.45 108.54 99.98 explicit heart-cleveland 1.64 1.58 1.30 0.91 0.90 0.91 12.00 18.19 143.55 81.04 implicit 0.66 0.66 0.66 0.31 0.30 0.39 20.01 39.19 49.91 151.12 explicit heart-hungarian 0.66 0.66 0.66 0.31 0.30 0.37 18.31 5.64 211.14 81.71 implicit 1.65 1.63 1.51 1.35 1.33 1.32 42.48 47.01 296.09 295.82 explicit heart-switzerland 1.65 1.63 1.51 1.35 1.33 1.32 32.38 135.06 444.06 270.40 implicit 1.65 1.63 1.55 1.50 1.48 1.39 24.42 59.53 434.70 186.13 explicit heart-va 1.65 1.63 1.55 1.50 1.48 1.38 48.08 62.68 171.65 209.72 implicit 0.72 0.68 0.53 0.39 0.24 0.32 21.86 29.30 37.44 118.80 explicit hepatitis 0.72 0.68 0.53 0.39 0.24 0.35 32.01 19.00 41.54 111.05 implicit 0.70 0.68 0.65 0.34 0.34 0.32 24.44 60.72 42.39 335.00 explicit horse-colic 0.70 0.68 0.65 0.34 0.34 0.33 8.99 118.06 347.70 60.39 implicit 0.70 0.66 0.60 0.59 0.55 0.53 16.40 55.36 45.47 299.05 explicit ilpd-indian-liver 0.70 0.66 0.60 0.59 0.55 0.57 11.63 85.04 125.33 158.17 implicit 0.70 0.68 0.61 0.20 0.20 6.67 49.14 21.57 101.66 233.83 explicit ionosphere 0.70 0.68 0.61 0.21 0.22 34.03 9.97 64.38 768.47 572.70 implicit 1.11 1.11 1.10 1.09 1.09 0.90 7.90 14.65 14.59 257.50 explicit iris 1.11 1.11 1.10 1.09 1.09 0.77 9.25 26.47 147.58 175.54 implicit 1.09 1.07 0.98 0.91 0.91 0.91 10.79 13.42 38.90 30.30 explicit lenses 1.09 1.07 0.98 0.91 0.91 0.91 9.38 16.82 39.64 214.31 implicit 2.72 2.68 2.09 0.93 0.72 26.84 19.45 79.46 196.99 349.34 explicit libras 2.72 2.68 2.09 0.94 0.76 28.88 85.69 101.16 474.27 1377.74 implicit 2.19 1.78 0.43 0.25 11.66 14.48 53.18 222.26 605.63 449.56 explicit low-res-spect 2.19 1.78 0.43 0.26 16.57 46.29 109.37 82.74 94.69 259.85 implicit 1.11 1.10 1.04 0.27 0.03 28.79 87.95 149.18 1071.40 298.13 explicit lung-cancer 1.11 1.10 1.04 0.27 0.03 95.18 87.43 284.11 543.03 359.73 implicit 1.39 1.33 1.00 0.49 0.33 13.29 15.73 5.16 112.88 181.66 explicit lymphography 1.39 1.33 1.00 0.49 0.34 6.11 14.23 10.87 34.13 261.76 implicit 0.71 0.70 0.69 0.44 0.42 0.46 12.34 4.14 145.04 36.88 explicit mammographic 0.71 0.70 0.69 0.44 0.42 0.44 10.77 9.93 116.31 244.93 implicit 0.69 0.69 0.65 0.11 0.01 36.10 27.78 109.47 258.44 974.85 explicit molec-biol-promoter 0.69 0.69 0.65 0.11 0.01 97.50 201.21 551.19 642.16 863.06 implicit 0.69 0.69 0.69 0.69 0.62 0.71 17.77 17.11 89.50 138.78 explicit monks-1 0.69 0.69 0.69 0.69 0.62 0.71 15.65 14.64 52.53 179.91 implicit 0.68 0.66 0.65 0.65 0.65 0.65 4.11 7.96 135.92 180.88 explicit monks-2 0.68 0.66 0.65 0.65 0.65 0.65 11.61 6.54 25.41 200.93 implicit 0.70 0.69 0.69 0.68 0.51 0.76 4.16 11.69 45.17 222.38 explicit monks-3 0.70 0.69 0.69 0.68 0.51 0.76 4.11 63.57 268.51 85.94 implicit APPENDIX E. IMPLICIT BACKPROPAGATION 223

Learning rate Dataset (small) SGD type 0.001 0.01 0.1 0.5 1.0 3.0 5.0 10.0 30.0 50.0

0.69 0.63 0.32 0.59 35.26 48.17 358.12 325.26 2522.81 5693.28 explicit musk-1 0.69 0.63 0.32 0.27 26.13 83.83 373.71 1592.32 5190.91 8195.80 implicit 0.69 0.68 0.65 0.53 0.50 16.61 24.44 108.62 133.12 854.10 explicit oocytes-trisopterus-nucleus-2f 0.69 0.68 0.65 0.53 0.52 22.13 109.88 27.77 631.40 859.68 implicit 1.00 0.89 0.41 0.24 0.21 7.33 15.53 14.13 42.37 516.20 explicit oocytes-trisopterus-states-5b 1.00 0.89 0.41 0.24 0.21 49.55 19.40 24.82 154.19 116.00 implicit 0.67 0.65 0.55 0.33 0.31 2.05 3.80 24.16 51.71 52.80 explicit parkinsons 0.67 0.65 0.55 0.33 0.31 20.58 14.53 90.82 37.16 468.79 implicit 0.66 0.66 0.66 0.64 0.51 0.53 8.14 27.66 112.66 79.41 explicit pima 0.66 0.66 0.66 0.64 0.51 0.52 2.47 69.41 143.07 249.05 implicit 1.19 1.15 0.90 0.77 0.77 0.74 5.41 33.57 193.72 360.10 explicit pittsburg-bridges-MATERIAL 1.19 1.15 0.90 0.77 0.77 0.74 15.25 37.79 203.28 303.36 implicit 1.19 1.17 1.06 1.00 1.00 0.97 7.93 33.08 87.76 122.83 explicit pittsburg-bridges-REL-L 1.19 1.17 1.06 1.00 1.00 0.97 11.10 25.15 124.80 330.23 implicit 1.05 1.05 1.04 1.03 1.03 0.93 14.34 36.34 172.06 135.45 explicit pittsburg-bridges-SPAN 1.05 1.05 1.04 1.03 1.03 0.93 2.40 45.68 166.73 279.46 implicit 0.83 0.78 0.47 0.35 0.35 0.35 0.35 5.05 27.71 124.39 explicit pittsburg-bridges-T-OR-D 0.83 0.78 0.47 0.35 0.35 0.35 0.35 21.00 38.70 23.87 implicit 2.00 1.98 1.80 1.59 1.58 1.56 2.51 77.31 186.86 290.00 explicit pittsburg-bridges-TYPE 2.00 1.98 1.80 1.59 1.58 1.56 5.37 82.37 272.11 350.51 implicit 0.62 0.61 0.58 0.58 0.58 0.59 34.64 19.23 30.94 453.85 explicit planning 0.62 0.61 0.58 0.58 0.58 0.59 19.01 15.86 166.14 261.24 implicit 1.21 1.17 0.94 0.71 0.69 0.68 25.50 73.03 60.28 83.88 explicit post-operative 1.21 1.17 0.94 0.71 0.69 0.68 25.03 50.13 90.23 71.85 implicit 2.70 2.68 2.50 2.15 1.83 13.21 17.81 41.57 99.73 94.32 explicit primary-tumor 2.70 2.68 2.50 2.15 1.83 10.32 15.55 35.04 70.16 326.01 implicit 1.11 1.10 1.09 0.62 0.47 0.68 8.84 13.44 28.41 32.21 explicit seeds 1.11 1.10 1.09 0.62 0.47 7.20 5.44 28.78 87.84 85.50 implicit 2.90 2.86 2.43 0.65 0.21 2.12 5.69 33.16 151.98 119.04 explicit soybean 2.90 2.86 2.43 0.66 0.22 9.36 20.48 32.44 161.14 288.93 implicit 0.70 0.69 0.67 0.44 0.45 0.57 54.69 105.29 263.16 159.21 explicit spect 0.70 0.69 0.67 0.44 0.45 0.47 18.18 159.16 21.86 567.59 implicit 0.69 0.64 0.49 0.38 0.42 20.77 6.86 28.35 622.85 585.05 explicit spectf 0.69 0.64 0.49 0.37 0.51 80.29 132.31 27.83 130.49 1229.86 implicit 0.68 0.66 0.64 0.64 0.64 0.64 19.39 35.72 31.33 78.80 explicit statlog-australian-credit 0.68 0.66 0.64 0.64 0.64 0.64 73.71 98.74 228.37 448.01 implicit 0.69 0.69 0.69 0.32 0.33 6.74 4.19 9.82 143.11 37.86 explicit statlog-heart 0.69 0.69 0.69 0.32 0.33 15.30 0.98 28.84 157.01 541.23 implicit 1.39 1.39 1.31 0.90 0.78 16.15 9.25 42.65 163.61 403.98 explicit statlog-vehicle 1.39 1.39 1.31 0.89 0.70 32.42 20.88 93.00 74.45 656.19 implicit 1.80 1.72 0.70 0.06 5.96 24.28 43.88 113.24 69.17 482.32 explicit synthetic-control 1.80 1.72 0.70 0.06 9.76 24.30 50.20 110.53 164.13 692.48 implicit 1.11 1.11 1.09 1.09 1.09 1.11 16.23 47.19 43.59 92.10 explicit teaching 1.11 1.11 1.09 1.09 1.09 1.11 12.49 21.53 266.23 199.40 implicit 0.68 0.66 0.64 0.52 0.10 2.05 4.13 45.16 156.90 81.64 explicit tic-tac-toe 0.68 0.66 0.64 0.52 0.09 10.22 40.48 53.51 47.37 200.14 implicit 0.70 0.70 0.67 0.04 0.01 0.00 16.94 8.70 153.88 503.41 explicit trains 0.70 0.70 0.67 0.04 0.01 0.00 55.28 83.36 0.00 0.00 implicit 0.68 0.66 0.63 0.58 0.37 6.00 13.78 9.92 23.02 189.26 explicit 
vertebral-column-2clases 0.68 0.66 0.63 0.58 0.37 0.39 3.75 13.45 106.58 52.45 implicit 1.10 1.09 1.05 0.77 0.54 7.37 17.75 12.02 44.07 81.04 explicit vertebral-column-3clases 1.10 1.09 1.05 0.77 0.53 2.55 9.28 7.79 42.32 40.59 implicit 1.12 1.12 1.09 0.53 0.46 5.43 3.44 9.22 43.31 62.74 explicit wine 1.12 1.12 1.09 0.53 0.47 5.78 37.86 96.27 64.82 88.60 implicit 1.92 1.91 1.83 1.57 1.01 6.70 6.23 33.11 172.96 380.68 explicit zoo 1.92 1.91 1.83 1.57 1.01 6.58 22.16 56.49 198.40 189.51 implicit