Efficient Learning of Smooth Probability Functions from Bernoulli Tests with Guarantees
Paul Rolland¹, Ali Kavis¹, Alex Immer¹, Adish Singla², Volkan Cevher¹

¹École Polytechnique Fédérale de Lausanne, Switzerland. ²Max Planck Institute for Software Systems, Saarbrücken, Germany. Correspondence to: Paul Rolland <paul.rolland@epfl.ch>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). arXiv:1812.04428v3 [cs.LG] 23 Aug 2019.

Abstract

We study the fundamental problem of learning an unknown, smooth probability function via point-wise Bernoulli tests. We provide a scalable algorithm for efficiently solving this problem with rigorous guarantees. In particular, we prove the convergence rate of our posterior update rule to the true probability function in L2-norm. Moreover, we allow the Bernoulli tests to depend on contextual features and provide a modified inference engine with provable guarantees for this novel setting. Numerical results show that the empirical convergence rates match the theory, and illustrate the superiority of our approach in handling contextual features over the state-of-the-art.

1. Introduction

One of the central challenges in machine learning is learning a continuous probability function from point-wise Bernoulli tests (Casella & Berger, 2002; Johnson & Wichern, 2002). Examples include, but are not limited to, clinical trials (DerSimonian & Laird, 1986), recommendation systems (McNee et al., 2003), sponsored search (Pandey & Olston, 2007), and binary classification. Due to the curse of dimensionality, we often require a large number of tests in order to obtain an accurate approximation of the target function. It is thus necessary to use a method that constructs this approximation scalably with the number of tests.

A widely used method for efficiently solving this problem is the Logistic Gaussian Process (LGP) algorithm (Tokdar & Ghosh, 2007). While this algorithm has no clear provable guarantees, it is very efficient in practice at approximating the target function. However, the time required for inferring the posterior distribution at some point grows cubically with the number of tests, and the method can thus become inapplicable when the amount of data is large. There has been extensive work to resolve this cubic complexity associated with GP computations (Rasmussen, 2004). However, these methods require additional approximations of the posterior distribution, which impacts efficiency and makes the overall algorithm even more complicated, leading to further difficulties in establishing theoretical convergence guarantees.

Recently, Goetschalckx et al. (2011) tackled the issues encountered by LGP and proposed a scalable inference engine based on Beta processes, called the Continuous Correlated Beta Process (CCBP), for approximating the probability function. By scalable, we mean that the algorithm's complexity scales linearly with the number of tests. However, no theoretical analysis is provided, and the approximation error saturates as the number of tests becomes large (cf. Section 5.1). Hence, it is unclear whether provable convergence and scalability can be obtained simultaneously.

This paper bridges this gap by designing a simple and scalable method for efficiently approximating the probability function with provable convergence. Our algorithm constructs a posterior distribution that allows inference in linear time (w.r.t. the number of tests) and converges in L2-norm to the true probability function (uniformly over the feature space); see Theorem 1.

In addition, we allow the Bernoulli tests to depend on contextual parameters that influence the success probabilities. To ensure convergence of the approximation, these features need to be taken into account in the inference engine. We thus provide the first algorithm that efficiently treats these contextual features while performing inference, and retains provable guarantees. As a motivation for this setting, we demonstrate how this algorithm can be used to efficiently treat bias in the data (Agarwal et al., 2018).

1.1. Basic model and the challenge

In its basic form, we seek to learn an unknown, smooth function π : X → [0, 1], X ⊂ R^d, from point-wise Bernoulli tests, where d is the feature space dimension. We model such tests as s_i ∼ Bernoulli(π(x_i)), where ∼ means "distributed as" and x_i ∈ X, and we model our knowledge of π at point x via a random variable π̃(x).

Without additional assumptions, this problem is clearly hard, since experiments are performed only at the points {x_i}_{i=1,...,t}, which constitute a negligible fraction of the space X. In this paper, we make the following assumption about the probability function:

Assumption 1. The function π is L-Lipschitz continuous, i.e., there exists a constant L ∈ R such that

    |π(x) − π(y)| ≤ L‖x − y‖    (1)

∀x, y ∈ X, for some norm ‖·‖ over X.

In order to ensure convergence of π̃(x) to π(x) for all x ∈ X, we must design a way of sharing experience among variables using this smoothness assumption. Our work uses a prior for π based on the Beta distribution and designs a simple sharing scheme to provably ensure convergence of the posterior.
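To make the static data model concrete, here is a minimal sketch that draws point-wise Bernoulli tests s_i ∼ Bernoulli(π(x_i)) from a hypothetical Lipschitz-continuous probability function on X = [0, 1]; the particular choice of π and the sampling code are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(x):
    # Hypothetical smooth (Lipschitz) probability function on X = [0, 1],
    # with values safely inside [0.1, 0.9].
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

t = 1000                            # number of Bernoulli tests
x = rng.uniform(0.0, 1.0, size=t)   # test locations x_i in X
s = rng.binomial(1, pi(x))          # outcomes s_i ~ Bernoulli(pi(x_i))
```

Given only the pairs (x_i, s_i), recovering π over the whole space requires sharing the binary outcomes across nearby locations, which is exactly what Assumption 1 makes possible.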
Dynamic setting.  In a more generic setting, which we call the "dynamic setting," we assume that each Bernoulli test can be linearly influenced by some contextual features. Each experiment is then described by a quadruplet S_i = (x_i, s_i, A_i, B_i), and we study the following simple model for its probability of success:

    Pr(s_i = 1) := A_i π(x_i) + B_i.    (2)

We have to restrict 0 ≤ B_i ≤ 1 and 0 ≤ A_i + B_i ≤ 1 to ensure that this quantity remains a probability, given that π(x_i) lies in [0, 1]. We assume that we have knowledge of estimates for A_i and B_i in expectation.

Such contextual features naturally arise in real applications (Krause & Ong, 2011). For example, in clinical trials (DerSimonian & Laird, 1986), the goal is to learn a patient's probability of succeeding at an exercise of a given difficulty. A possible contextual feature is then the patient's state of fatigue, which can influence the success probability. Here, the LGP algorithm could be used, but the contextual feature must be added as an additional parameter. We show that, if we know how this feature influences the Bernoulli tests, then we can achieve faster convergence.
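Continuing the sketch above, the snippet below illustrates the contextual observation model (2): each test carries coefficients A_i and B_i, and the outcome is drawn with probability A_i π(x_i) + B_i. The specific coefficient values (a multiplicative "fatigue" effect) are assumptions made for this example only.

```python
def sample_dynamic(x, A, B, rng):
    # Draw s_i with Pr(s_i = 1) = A_i * pi(x_i) + B_i, as in Eq. (2).
    # The conditions 0 <= B_i <= 1 and 0 <= A_i + B_i <= 1 from the text
    # guarantee that p stays in [0, 1] whenever pi(x_i) lies in [0, 1].
    assert np.all((B >= 0) & (B <= 1)) and np.all((A + B >= 0) & (A + B <= 1))
    p = A * pi(x) + B
    return rng.binomial(1, p)

# Example: a fatigue context that scales down the success probability.
A = rng.uniform(0.6, 1.0, size=t)
B = np.zeros(t)
s_ctx = sample_dynamic(x, A, B, rng)
```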
1.2. Our contributions

We summarize our contributions as follows:

1. We provide the first theoretical guarantees for the problem of learning a smooth probability function over a compact space using Beta processes.

2. We provide an efficient and scalable algorithm that is able to handle contextual parameters explicitly influencing the probability function.

3. We demonstrate the efficiency of our method on synthetic data and observe the benefit of treating contextual features in the inference. We also present a real-world application of our model.

Roadmap.  We first analyze the simple setting without contextual features (referred to as the static setting). We start by designing a Bayesian update rule for point-wise inference, and then include experience sharing in order to ensure L2 convergence over the whole space with a provable rate. We then treat the dynamic setting in the same way, and finally demonstrate our theoretical findings via extensive simulations and a case study of clinical trials for rehabilitation (cf. Section 5).

2. Related Work

Correlated inference via GPs.  The idea of sharing the experience of experiments between points with similar target function values is inspired by Gaussian Processes (GPs) (Williams & Rasmussen, 1996). GPs essentially define a prior over real-valued functions on a continuous space and use a kernel function that represents how experiments performed at different points in the space are correlated.

GP-based models are not directly applicable to our problem setting, given that our function π represents probabilities in the range [0, 1]. For our setting, a popular approach is the Logistic Gaussian Process (LGP) (Tokdar & Ghosh, 2007): it learns an intermediate GP over the space X, which is then squashed to the range [0, 1] via a logistic transformation. Experience sharing is done by modeling the covariance between tests performed at different points through a predefined kernel. This allows constructing a covariance matrix between test points, which can be used to estimate the posterior distribution at any other sample point.

The Gaussian Copula Process (GCP) (Wilson & Ghahramani, 2010) is another GP-based approach; it learns a GP and uses a copula to map it to Beta distributions over the space.

More recently, Ranganath & Blei (2017) explored correlated random measures, including a correlated Beta-Bernoulli extension. However, GPs are still used to define these correlations.

There are at least two key limitations with these "indirect" approaches. First, the posterior distribution after observing a Bernoulli outcome is analytically intractable and needs to be approximated, e.g., using the Laplace approximation (Tokdar & Ghosh, 2007). Second, the time complexity of prediction grows cubically, O(t^3), with respect to the number of samples t. There is extensive work to resolve this cubic complexity associated with GP computations (Rasmussen, 2004). However, these methods require additional approximations of the posterior distribution, which impacts efficiency and makes the overall algorithm even more complicated, leading to further difficulties in establishing theoretical guarantees.

Methods based on GPs that take context variables into account [...]

[...] propose an experience sharing method and prove convergence guarantees.

3.1. Uncorrelated case: a Bayesian approach

Suppose we do not use the smoothness assumption on π. Then a naive solution is to model each random variable π̃(x) by the conjugate prior of the Bernoulli distribution, which is the Beta distribution.
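As a sketch of this naive, uncorrelated approach, the snippet below keeps an independent Beta posterior at every tested location and applies the standard conjugate update; the uniform Beta(1, 1) prior and the dictionary-per-point representation are assumptions made for the illustration, not necessarily the construction used in the paper.

```python
from collections import defaultdict

# Independent Beta(alpha, beta) posterior per tested point, starting from a
# uniform Beta(1, 1) prior (an assumption of this sketch).
posterior = defaultdict(lambda: [1.0, 1.0])

def update(x_i, s_i):
    # Conjugate Beta-Bernoulli update: a success (s_i = 1) increments alpha,
    # a failure (s_i = 0) increments beta.
    a, b = posterior[x_i]
    posterior[x_i] = [a + s_i, b + (1 - s_i)]

def posterior_mean(x_i):
    a, b = posterior[x_i]
    return a / (a + b)
```

Without the smoothness assumption, only locations that are tested repeatedly ever concentrate; nothing is learned about untested points, which is the gap that experience sharing is meant to close.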