Efficient learning of smooth probability functions from Bernoulli tests with guarantees

Paul Rolland 1 Ali Kavis 1 Alex Immer 1 Adish Singla 2 Volkan Cevher 1

Abstract

We study the fundamental problem of learning an unknown, smooth probability function via point-wise Bernoulli tests. We provide a scalable algorithm for efficiently solving this problem with rigorous guarantees. In particular, we prove the convergence rate of our posterior update rule to the true probability function in L2-norm. Moreover, we allow the Bernoulli tests to depend on contextual features and provide a modified inference engine with provable guarantees for this novel setting. Numerical results show that the empirical convergence rates match the theory, and illustrate the superiority of our approach in handling contextual features over the state-of-the-art.

1. Introduction

One of the central challenges in machine learning relates to learning a continuous probability function from point-wise Bernoulli tests (Casella & Berger, 2002; Johnson & Wichern, 2002). Examples include, but are not limited to, clinical trials (DerSimonian & Laird, 1986), recommendation systems (McNee et al., 2003), sponsored search (Pandey & Olston, 2007), and binary classification. Due to the curse of dimensionality, we often require a large number of tests in order to obtain an accurate approximation of the target function. It is thus necessary to use a method that scalably constructs this approximation with the number of tests.

A widely used method for efficiently solving this problem is the Logistic Gaussian Process (LGP) algorithm (Tokdar & Ghosh, 2007). While this algorithm has no clear provable guarantees, it is shown to be very efficient in practice in approximating the target function. However, the time required for inferring the posterior distribution at some point grows cubically with the number of tests, and the method can thus be inapplicable when the amount of data becomes large. There has been extensive work to resolve this cubic complexity associated with GP computations (Rasmussen, 2004). However, these methods require additional approximations on the posterior distribution, which impacts the efficiency and makes the overall algorithm even more complicated, leading to further difficulties in establishing theoretical convergence guarantees.

Recently, Goetschalckx et al. (2011) tackled the issues encountered by LGP, and proposed a scalable inference engine based on Beta Processes, called Continuous Correlated Beta Process (CCBP), for approximating the probability function. By scalable, we mean that the algorithm complexity scales linearly with the number of tests. However, no theoretical analysis is provided, and the approximation error saturates as the number of tests becomes large (cf. Section 5.1). Hence, it is unclear whether provable convergence and scalability can be obtained simultaneously.

This paper bridges this gap by designing a simple and scalable method for efficiently approximating the probability function with provable convergence. Our algorithm constructs a posterior distribution that allows inference in linear time (w.r.t. the number of tests) and converges in L2-norm to the true probability function (uniformly over the feature space); see Theorem 1.

In addition, we also allow the Bernoulli tests to depend on contextual parameters influencing the success probability. To ensure convergence of the approximation, these features need to be taken into account in the inference engine. We thus provide the first algorithm that efficiently treats these contextual features while performing inference, and retains provable guarantees. As a motivation for this setting, we demonstrate how this algorithm can efficiently be used for treating bias in the data (Agarwal et al., 2018).

1 Ecole Polytechnique Fédérale de Lausanne, Switzerland. 2 Max Planck Institute for Software Systems, Saarbrücken, Germany. Correspondence to: Paul Rolland.

1.1. Basic model and the challenge

In its basic form, we seek to learn an unknown, smooth function π : X → [0, 1], X ⊂ R^d, from point-wise Bernoulli tests, where d is the feature space dimension. We model such tests as s_i ∼ Bernoulli(π(x_i)), where ∼ means "distributed as" and x_i ∈ X, and we model our knowledge of π at point x via a random variable π̃(x).

Without additional assumptions, this problem is clearly hard, since experiments are performed only at points {x_i}_{i=1,...,t}, which constitute a negligible fraction of the space X. In this paper, we will make the following assumption about the probability function:

Assumption 1. The function π is L-Lipschitz continuous, i.e., there exists a constant L ∈ R such that
\[ |\pi(x) - \pi(y)| \le L\,\|x - y\| \quad\quad (1) \]
∀x, y ∈ X, for some norm ‖·‖ over X.

In order to ensure convergence of π̃(x) to π(x) for all x ∈ X, we must design a way of sharing experience among variables using this smoothness assumption. Our work uses a prior for π based on the Beta distribution and designs a simple sharing scheme to provably ensure convergence of the posterior.

Dynamic setting  In a more generic setting that we call the "dynamic setting," we assume that each Bernoulli test can be linearly influenced by some contextual features. Each experiment is then described by a quadruplet S_i = (x_i, s_i, A_i, B_i), and we study the following simple model for its probability of success:
\[ \Pr(s_i = 1) := A_i\,\pi(x_i) + B_i. \quad\quad (2) \]
We have to restrict 0 ≤ B_i ≤ 1 and 0 ≤ A_i + B_i ≤ 1 to ensure that this quantity remains a probability given that π(x_i) lies in [0, 1]. We assume that we have knowledge of estimates for A_i and B_i in expectation.

Such contextual features naturally arise in real applications (Krause & Ong, 2011). For example, in the case of clinical trials (DerSimonian & Laird, 1986), the goal is to learn the patient's probability of succeeding at an exercise with a given difficulty. A possible contextual feature can then be the state of fatigue of the patient, which can influence the success probability. Here, the LGP algorithm could be used, but the contextual feature must be added as an additional parameter. We show that, if we know how this feature influences the Bernoulli tests, then we can achieve faster convergence.

1.2. Our contributions

We summarize our contributions as follows:

1. We provide the first theoretical guarantees for the problem of learning a smooth probability function over a compact space using Beta Processes.

2. We provide an efficient and scalable algorithm that is able to handle contextual parameters explicitly influencing the probability function.

3. We demonstrate the efficiency of our method on synthetic data, and observe the benefit of treating contextual features in the inference. We also present a real-world application of our model.

Roadmap  We first analyze the simple setting without contextual features (referred to as the static setting). We start by designing a Bayesian update rule for point-wise inference, and then include experience sharing in order to ensure L2 convergence over the whole space with a provable rate. We then treat the dynamic setting in the same way, and finally demonstrate our theoretical findings via extensive simulations and on a case study of clinical trials for rehabilitation (cf. Section 5).

2. Related Work

Correlated inference via GPs  The idea of sharing the experience of experiments between points with similar target function values is inspired by what was done with Gaussian Processes (GPs) (Williams & Rasmussen, 1996). GPs essentially define a prior over real-valued functions defined on a continuous space, and use a kernel function that represents how experiments performed at different points in the space are correlated.

GP-based models are not directly applicable to our problem setting given that our function π represents probabilities in the range [0, 1]. For our problem setting, a popular approach is Logistic Gaussian Processes (LGP) (Tokdar & Ghosh, 2007)—it learns an intermediate GP over the space X which is then squashed to the range [0, 1] via a logistic transformation. Experience sharing is then done by modeling the covariance between tests performed at different points through a predefined kernel. This allows constructing a covariance matrix between test points, which can be used to estimate the posterior distribution at any other sample point.

Gaussian Copula Processes (GCP) (Wilson & Ghahramani, 2010) are another GP-based approach that learns a GP and uses a copula to map it to Beta distributions over the space. More recently, Ranganath and Blei (2017) explored correlated random measures, including a correlated Beta-Bernoulli extension. However, GPs are still used in order to define these correlations.

There are at least two key limitations with these "indirect" approaches: First, the posterior distribution after observing a Bernoulli outcome is analytically intractable, and needs to be approximated, e.g., using the Laplace approximation (Tokdar & Ghosh, 2007). Second, the time complexity of prediction grows cubically, O(t^3), with respect to the number of samples t. There is extensive work to resolve this cubic complexity associated with GP computations (Rasmussen, 2004). However, these methods require additional approximations on the posterior distribution, which impacts the efficiency and makes the overall algorithm even more complicated, leading to further difficulties in establishing theoretical guarantees.

Methods based on GPs that take context variables into account have also been designed (Krause & Ong, 2011). However, they simply allow for the use of specific kernels for these variables and still require an increase in the feature space dimension. In this work, by directly modifying the inference process, we compute a posterior that takes contextual features into account without increasing the feature space dimension.

Correlated Beta Processes  In contrast to GPs, it is very challenging to define correlated Beta distributions. The first work introducing a Beta process without using GPs is the one of Hjort (1990), but it lacks the correlation aspect. Some other works studied multivariate Beta distributions for simple settings considering only a few variables (Gupta & Wong, 1985; Olkin & Liu, 2003).

Goetschalckx et al. (2011) proposed an approach named Continuous Correlated Beta Processes (CCBP) to deal with a continuous space of Beta distributions and to share experience between them via a kernel. CCBP is shown to achieve results comparable to the state-of-the-art approach based on LGP. Furthermore, it is shown that CCBP is much more time efficient—linear O(t) runtime for CCBP in comparison to the cubic O(t^3) runtime of GP-based methods. The CCBP approach has been used in several real-world application settings, e.g., for learning a patient's fitness function in rehabilitation (Goetschalckx et al., 2011), learning the wandering behavior of people with dementia (Hoey et al., 2012), and in the application of analyzing genetic ancestry in admixed populations (Gompert, 2016).

However, the method presented in Goetschalckx et al. (2011), by simply using a heuristic kernel, gives an approximation which does not converge to the target function as the number of samples increases. In order for the method to converge, this kernel must depend on the number of samples. In this paper, we provide an explicit kernel which ensures convergence of the approximated probability function to the target function with a provable rate.

3. Inference for the static setting

We start by analyzing the static setting, in which no contextual features influence the Bernoulli tests. We first design a Bayesian update rule for point-wise inference, then we propose an experience sharing method and prove convergence guarantees.

3.1. Uncorrelated case: a Bayesian approach

Suppose we do not use the smoothness assumption on π. Then a naive solution is to model each random variable π̃(x) by the conjugate prior of the Bernoulli distribution, which is the Beta distribution. Then, starting from a prior π̃(x) ∼ Beta(α(x), β(x)) ∀x ∈ X, the Bayesian posterior π̃(x|S) conditioned on S = {(x_i, s_i)}_{i=1,...,t} is
\[ \tilde\pi(x \mid S) \sim \mathrm{Beta}\Big(\alpha(x) + \sum_{i=1}^{t} \delta_{s_i=1}\,\delta_{x_i=x},\;\; \beta(x) + \sum_{i=1}^{t} \delta_{s_i=0}\,\delta_{x_i=x}\Big), \]
where δ_a is the Kronecker delta.

This particular update scheme does not take the smoothness assumption on the function into account, and any experiment S_i = (x_i, s_i) only influences the corresponding random variable π̃(x_i). In particular, if no experiment is performed at x, then our belief about π(x) remains unchanged. It is thus necessary to make use of the smoothness assumption.
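For concreteness, the point-wise conjugate update above can be written in a few lines. The following NumPy sketch is ours, with a hypothetical function name and toy data, and assumes a uniform Beta(1, 1) prior:

```python
import numpy as np

def pointwise_beta_posterior(x_query, X, s, alpha0=1.0, beta0=1.0):
    """Conjugate Beta update using only the tests performed exactly at x_query.

    X : (t, d) array of test locations, s : (t,) array of 0/1 outcomes.
    Returns the posterior (alpha, beta) for the success probability at x_query.
    """
    at_x = np.all(np.isclose(X, x_query), axis=1)   # tests performed exactly at x_query
    successes = int(np.sum(s[at_x]))
    failures = int(np.sum(at_x)) - successes
    return alpha0 + successes, beta0 + failures

# toy usage: three tests at the same point, one elsewhere
X = np.array([[0.3], [0.3], [0.3], [0.8]])
s = np.array([1, 1, 0, 1])
print(pointwise_beta_posterior(np.array([0.3]), X, s))   # -> (3.0, 2.0)
print(pointwise_beta_posterior(np.array([0.5]), X, s))   # no tests at 0.5: prior (1.0, 1.0) unchanged
```

As the second query illustrates, a point with no test at its exact location keeps its prior untouched, which is what motivates the experience sharing introduced next.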

3.2. Leveraging smoothness of π via experience sharing

Goetschalckx et al. (2011) propose a mechanism of experience sharing among correlated variables. To this purpose, they introduce a kernel K : X × X → [0, 1], where K(x, x_i) indicates to what extent the experience from the experiment at x_i should be shared with any other point x. Indeed, thanks to the Lipschitz continuity assumption (1), we expect close points to have similar probabilities. However, although the Beta distribution is the conjugate prior of the Bernoulli distribution, this conjugacy does not hold anymore when we use experience sharing. Instead of using the Bayesian posterior, we use the following update rule:
\[ \tilde\pi(x \mid S) \sim \mathrm{Beta}\Big(\alpha(x) + \sum_{i=1}^{t} \delta_{s_i=1}\,K(x, x_i),\;\; \beta(x) + \sum_{i=1}^{t} \delta_{s_i=0}\,K(x, x_i)\Big). \quad\quad (3) \]
With this update rule, the result of experiment S_i influences all variables π̃(x) for which K(x, x_i) > 0, and the magnitude of influence is proportional to K(x, x_i). Note that this update rule is no longer Bayesian. However, all existing methods, including LGP and GCP, also involve non-Bayesian updates.

In Goetschalckx et al. (2011), the authors do not specify any particular choice of kernel function, and the selection process is left as a heuristic. We show that proper selection of the kernel is essential for convergence to the true underlying distributions at all points. In particular, to ensure convergence in L2 norm, this kernel must shrink as the number of observations increases (Algorithm 1). Our algorithm, called Smooth Beta Process (SBP), allows for fast inference at any point x ∈ X, since it simply requires finding the tests that are performed at most ∆ far from x, and computing the posterior distribution as in (3) depending on the number of successes and failures within these tests.

Algorithm 1  Smooth Beta Process (SBP)
Input: experiment points and observations S = {x_i, s_i}_{i=1,..,t}, query point x ∈ X, prior knowledge π̃(x) ∼ B(α(x), β(x))
Output: posterior distribution π̃(x|S)
1. Set ∆ ∝ t^{-1/(d+2)}
2. Compute the posterior as in (3) using the kernel K(x, x') = δ_{‖x−x'‖ ≤ ∆}

Remark 1. The particular dependence of the kernel on the number of samples ensuring optimal convergence is not trivial, and can only be found via a theoretical analysis of the model.

Theorem 1. Let π : [0, 1]^d → [0, 1] be L-Lipschitz continuous. Suppose we measure the results of experiments S = {(x_i, s_i)}_{i=1,...,t}, where s_i is a sample from a Bernoulli distribution with parameter π(x_i). Experiment points {x_i}_{i=1,...,t} are assumed to be i.i.d. and uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0, 1]^d, the posterior π̃(x|S) obtained from SBP uniformly converges in L2-norm to π(x), i.e.,
\[ \sup_{x \in [0,1]^d} \mathbb{E}_S\, \mathbb{E}\big(\tilde\pi(x \mid S) - \pi(x)\big)^2 = O\Big(L^{\frac{2d}{d+2}}\, t^{-\frac{2}{d+2}}\Big), \quad\quad (4) \]
where the outer expectation is performed over the experiment points {x_i}_{i=1,...,t} and their results {s_i}_{i=1,...,t}. SBP also computes the point-wise posterior in time O(t).

Remark 2. This theorem provides an upper bound for the L2 norm at any point of the space, and takes into account where the experiments are performed in the feature space. If all experiments are performed at the same point, then we recover the familiar square-root rate at that point (Ghosal, 1997), but we would not converge at points that are far away.

Remark 3. The constraint that the input space is [0, 1]^d can easily be relaxed to any compact space X ⊂ R^d. This would simply modify the convergence rate by a factor equal to the volume of X.

Remark 4. The dependence of the convergence rate on the feature space dimension is due to the curse of dimensionality, and to the fact that we provide convergence uniformly over the whole space. However, this is not due to the particular algorithm we use, and we empirically show that LGP suffers the same dependence on the space dimension (see Section 5). Similar issues are also prevalent in GP optimization, despite which great application successes have been obtained (Shahriari et al., 2016).

In Appendix B, we show how this algorithm naturally applies to binary classification. Restricted to classification, our algorithm becomes similar to the fixed-radius nearest neighbour algorithm (Chen et al., 2018), but the current framework allows for error quantification and precise prior injection.
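As an illustration of Algorithm 1, the sketch below is ours (the smooth test function and constants are arbitrary, not the experimental setup of Section 5); it applies the update rule (3) with the indicator kernel and the width ∆ ∝ t^{−1/(d+2)}:

```python
import numpy as np

def sbp_posterior(x_query, X, s, alpha0=1.0, beta0=1.0):
    """Smooth Beta Process sketch: update rule (3) with the indicator kernel
    K(x, x') = 1{||x - x'|| <= Delta} and Delta proportional to t^(-1/(d+2)).

    X : (t, d) experiment locations, s : (t,) Bernoulli outcomes in {0, 1}.
    Returns posterior parameters (alpha, beta) of pi_tilde(x_query | S).
    """
    t, d = X.shape
    delta = t ** (-1.0 / (d + 2))                 # kernel width shrinks with t
    near = np.linalg.norm(X - x_query, axis=1) <= delta
    successes = np.sum(s[near])
    failures = np.sum(near) - successes
    return alpha0 + successes, beta0 + failures

# toy usage on a 1D Lipschitz probability function (our choice, for illustration)
rng = np.random.default_rng(0)
pi = lambda x: 0.2 + 0.6 * x                      # assumed true success probability
t = 2000
X = rng.uniform(size=(t, 1))
s = rng.binomial(1, pi(X[:, 0]))
a, b = sbp_posterior(np.array([0.5]), X, s)
print("posterior mean at 0.5:", a / (a + b), "true value:", pi(0.5))
```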

4. Inference for the dynamic setting

We analyze the dynamic setting, where Bernoulli tests are influenced by contextual features as in (2).

4.1. Uncorrelated case: a Bayesian approach

As previously, we start by analyzing the uncorrelated case, i.e., how to update the distribution of π̃(x) conditioned on the outputs of experiments S = {(s_i, x_i, A_i, B_i)}_{i=1,...,t} all performed at x. Since the experiments are not samples from Bernoulli variables with parameter π(x), the Bayesian update is not straightforward, but it can be achieved using sums of Beta distributions, as shown in Theorem 2.

Theorem 2. Suppose π̃(x) ∼ B(α, β) and we observe the result of a sample s ∼ Bernoulli(Aπ(x) + B). Then the Bayesian posterior for π̃(x) conditioned on this observation is given by
\[ \tilde\pi(x \mid s) \sim C_0\, \mathcal B(\alpha + 1, \beta) + C_1\, \mathcal B(\alpha, \beta + 1), \quad\quad (5) \]
where in the case of success (s = 1), we have
\[ C_0 = \frac{(A + B)\,\alpha}{(A + B)\,\alpha + B\,\beta}, \qquad C_1 = \frac{B\,\beta}{(A + B)\,\alpha + B\,\beta}, \]
and in the case of failure (s = 0), we have
\[ C_0 = \frac{(1 - A - B)\,\alpha}{(1 - A - B)\,\alpha + (1 - B)\,\beta}, \qquad C_1 = \frac{(1 - B)\,\beta}{(1 - A - B)\,\alpha + (1 - B)\,\beta}. \]

In (5), we mean that the density function of the posterior random variable π̃(x|s) is the weighted sum of the two density functions given on the right-hand side.

Then, by using this result recursively on a set of experiments S = {(s_i, x_i, A_i, B_i)}_{i=1,...,t}, we can obtain a general update rule.

Corollary 1. Suppose π̃(x) ∼ B(α, β) and we observe the outputs of experiments S = {(s_i, x, A_i, B_i)}_{i=1,...,t}, where the s_i's are sampled from Bernoulli random variables, each with parameter A_i π(x) + B_i. Then the Bayesian posterior π̃(x|S) is given by
\[ \tilde\pi(x \mid S) \sim \sum_{i=0}^{t} C_i^t\, \mathcal B(\alpha + i, \beta + t - i), \quad\quad (6) \]
where the C_i^t's can be computed via an iterative procedure starting from C_0^0 = 1 and, for all n = 0, ..., t − 1,
\[ C_i^{n+1} = \frac{1}{E_s^n}\Big( B_{n+1}\, C_i^n (\beta + n - i) + (A_{n+1} + B_{n+1})\, C_{i-1}^n (\alpha + i - 1) \Big) \]
if s_{n+1} = 1, and
\[ C_i^{n+1} = \frac{1}{E_f^n}\Big( (1 - B_{n+1})\, C_i^n (\beta + n - i) + (1 - A_{n+1} - B_{n+1})\, C_{i-1}^n (\alpha + i - 1) \Big) \]
if s_{n+1} = 0. E_s^n and E_f^n are normalization factors that ensure Σ_{i=0}^{n} C_i^n = 1 ∀n. For simplicity of notation, we use C_{−1}^n = C_{n+1}^n = 0 ∀n.

This gives us a way of updating, in a fully Bayesian manner, the distribution of π̃(x) conditioned on observations of experiments performed at x. It involves a linear combination of Beta distributions, with coefficients depending on the successes and the contextual features.
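A minimal sketch of this recursive update follows; it is our implementation of the recursion stated in Corollary 1 (function names and the toy data are ours), maintaining the mixture weights in O(t^2) total time:

```python
import numpy as np

def dynamic_uncorrelated_posterior(s_list, A_list, B_list, alpha=1.0, beta=1.0):
    """Corollary 1 sketch: after t tests at the same point x with
    Pr(s_i = 1) = A_i * pi(x) + B_i, the posterior is the Beta mixture
    sum_i C_i * Beta(alpha + i, beta + t - i).  Returns the weights C.
    """
    C = np.array([1.0])                                   # C_0^0 = 1
    for n, (s, A, B) in enumerate(zip(s_list, A_list, B_list)):
        new = np.zeros(n + 2)
        for i in range(n + 2):
            cur = C[i] if i <= n else 0.0                 # C_i^n
            prev = C[i - 1] if i >= 1 else 0.0            # C_{i-1}^n
            if s == 1:
                new[i] = B * cur * (beta + n - i) + (A + B) * prev * (alpha + i - 1)
            else:
                new[i] = (1 - B) * cur * (beta + n - i) + (1 - A - B) * prev * (alpha + i - 1)
        C = new / new.sum()                               # normalisation factor E^n
    return C

def mixture_mean(C, alpha=1.0, beta=1.0):
    """Posterior mean of the Beta mixture sum_i C_i Beta(alpha+i, beta+t-i)."""
    t = len(C) - 1
    i = np.arange(t + 1)
    return np.sum(C * (alpha + i) / (alpha + beta + t))

# toy usage: 200 tests with A=0.5, B=0.3 and assumed true pi(x)=0.6
rng = np.random.default_rng(1)
s = rng.binomial(1, 0.5 * 0.6 + 0.3, size=200)
C = dynamic_uncorrelated_posterior(s, [0.5] * 200, [0.3] * 200)
print("posterior mean of pi(x):", mixture_mean(C))        # should be close to 0.6
```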

4.2. Leveraging smoothness of π via experience sharing: simplified setting

We now introduce the use of correlations between samples in the update rule, in a similar way as done in the static setting. We first analyze a simplified setting where the contextual parameters are constant among all experiments, i.e., A_i = A, B_i = B ∀i.

Due to the more complex form of the Bayesian update in the uncorrelated case, it turns out that the previous technique is not straightforward to apply. One way to introduce the idea of experience sharing would be to modify the Bayesian update rule of Theorem 2 as
\[ \tilde\pi(x \mid s_i) \sim C_0\, \mathcal B\big(\alpha + K(x, x_i), \beta\big) + C_1\, \mathcal B\big(\alpha, \beta + K(x, x_i)\big), \]
where K is the same kernel as defined previously. However, if K(x, x') is real-valued and we apply such an update for each experiment, then it turns out that the number of terms required for describing the posterior distribution grows exponentially with the number of observations. In order to ensure the tractability of the posterior, we can restrict the kernel to values in {0, 1}, e.g., K(x, x') = δ_{‖x−x'‖ ≤ ∆} for some kernel width ∆. This means that, each time we make an observation at x_i, all random variables π̃(x) with ‖x − x_i‖ ≤ ∆ are updated as if the same experiment had been performed at x (Algorithm 2).

Algorithm 2  Inference engine for the simplified dynamic setting: constant A, B
Input: experiment descriptions S = {(x_i, s_i, A, B)}_{i=1,..,t}, point of interest x ∈ X, prior knowledge π̃(x) ∼ B(α(x), β(x))
Output: posterior distribution π̃(x|S)
1. Set ∆ ∝ t^{-1/(d+2)}
2. Build the set of neighboring experiments S_x = {(x_i, s_i, A, B) : ‖x − x_i‖ ≤ ∆}
3. Compute the posterior as in Corollary 1 (or 2 if A + B = 1) using the results of the experiments in S_x as if they were performed at x.

Certainty invariance assumption  For simplicity, we will assume that if an event is certain (i.e., occurs with probability 1), then context variables cannot lower this probability (i.e., ∀x_i, if π(x_i) = 1, then A_i π(x_i) + B_i = 1). This is trivially equivalent to the constraint A_i + B_i = 1 ∀i. If, on the contrary, there is an impossibility invariance, i.e., context variables cannot make an impossible event possible, we can make a change of variable f ↔ 1 − f (i.e., invert the meanings of "success" and "failure") in order to satisfy the certainty invariance assumption.

In Corollary 1, we see that the time complexity required for computing C_i^t ∀i = 0, ..., t is O(t^2). However, in the particular case where A_i = A, B_i = B ∀i and A + B = 1, the update rule of Corollary 1 becomes much simpler, i.e., computable in time O(t):

Corollary 2. Suppose π̃(x) ∼ B(α, β) and we observe the outputs of experiments S = {(s_i, x, 1 − B, B)}_{i=1,...,t}, where s_i ∼ Bernoulli((1 − B)π(x) + B). Then the Bayesian posterior π̃(x|S) conditioned on these observations is given by
\[ \tilde\pi(x \mid S) \sim \sum_{i=0}^{S} C_i^t\, \mathcal B(\alpha + i, \beta + t - i), \quad\quad (7) \]
where S = Σ_{i=1}^{t} s_i is the total number of successes and
\[ C_i^t \propto \binom{S}{i}\, (\alpha - 1 + i)!\, (\beta + t - 1 - i)!\, B^{S - i} \quad\quad (8) \]
∀i = 0, ..., S. We can compute all C_i^t's in time O(t) via the relation C_{i+1}^t = \frac{(S - i)(\alpha + i)}{B (i + 1)(\beta + t - 1 - i)} C_i^t.

We provide guarantees for convergence of the posterior distribution generated by Algorithm 2 under the certainty invariance assumption (Theorem 3). This constraint on A and B is equivalent to saying that the contextual parameters necessarily increase the success probability.
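The following sketch is ours; it computes the weights of Corollary 2 with the O(t) recursion (the log-space trick is only for numerical stability and is not part of the statement) and evaluates the resulting posterior mean:

```python
import numpy as np

def corollary2_weights(t, S, B, alpha=1.0, beta=1.0):
    """Mixture weights C_i^t of Corollary 2 (sketch), computed in O(t) via the
    recursion C_{i+1} = (S - i)(alpha + i) / (B (i + 1)(beta + t - 1 - i)) * C_i,
    evaluated in log-space and normalised at the end (assumes B > 0).
    """
    log_c = np.zeros(S + 1)
    for i in range(S):
        log_c[i + 1] = log_c[i] + np.log((S - i) * (alpha + i)) \
                                - np.log(B * (i + 1) * (beta + t - 1 - i))
    c = np.exp(log_c - log_c.max())
    return c / c.sum()

def posterior_mean(t, S, B, alpha=1.0, beta=1.0):
    """Mean of the Beta mixture sum_i C_i Beta(alpha + i, beta + t - i)."""
    c = corollary2_weights(t, S, B, alpha, beta)
    i = np.arange(S + 1)
    return np.sum(c * (alpha + i) / (alpha + beta + t))

# toy usage: Pr(success) = (1 - B) pi(x) + B with assumed pi(x) = 0.4 and B = 0.3
rng = np.random.default_rng(2)
B, pi_x, t = 0.3, 0.4, 5000
S = int(rng.binomial(t, (1 - B) * pi_x + B))
print("estimated pi(x):", posterior_mean(t, S, B), "true:", pi_x)
```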

Theorem 3. Let π : [0, 1]^d → ]0, 1] be L-Lipschitz continuous. Suppose we observe the results of experiments S = {(x_i, s_i, 1 − B, B)}_{i=1,...,t}, where s_i ∼ Bernoulli((1 − B)π(x_i) + B). Experiment points {x_i}_{i=1,...,t} are assumed to be i.i.d. uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0, 1]^d, the posterior π̃(x|S) obtained from Algorithm 2 uniformly converges in L2-norm to π(x), i.e.,
\[ \sup_{x \in [0,1]^d} \mathbb{E}_S\, \mathbb{E}\big(\tilde\pi(x \mid S) - \pi(x)\big)^2 = O\Big( L^{\frac{2d}{d+2}} \big((1 - B)\,t\big)^{-\frac{2}{d+2}} \Big). \]
Moreover, Algorithm 2 computes the point-wise posterior in time O(t).

Remark 5. We observe that adding the contextual parameter B does not modify the convergence rate compared to the static case. Using other algorithms such as LGP, the parameter B would have to be added to the feature space, increasing its dimension by 1, which impacts the convergence rate, as we demonstrate in the sequel.

4.3. Leveraging smoothness of π via experience sharing: general setting

We finally analyze the general setting where the contextual parameters are noisy and may vary among the experiments. Instead of using each A_i, B_i in the update rule of the posterior, we can perform this update as if all experiments were performed with the same coefficients Ā and B̄, which are the means of the coefficients A_i and B_i, respectively. This approximation simplifies the analysis and does not influence the error rate.

Algorithm 3  Contextual Smooth Beta Process (CSBP)
Input: experiment descriptions S = {(x_i, s_i, A_i, B_i)}_{i=1,..,t}, point of interest x ∈ X, prior knowledge π̃(x) ∼ B(α(x), β(x))
Output: posterior distribution π̃(x|S)
1. Set ∆ ∝ t^{-1/(d+2)}
2. Build the set of neighboring experiments S_x = {(x_i, s_i, A_i, B_i) : ‖x − x_i‖ ≤ ∆}
3. Compute the means B̄ = (1/|S_x|) Σ_{i : ‖x−x_i‖≤∆} B_i and Ā = (1/|S_x|) Σ_{i : ‖x−x_i‖≤∆} A_i
4. Compute the posterior as in Corollary 1 (or 2 if Ā + B̄ = 1) using the results of the experiments in S_x as if they were performed at x, with constant parameters Ā, B̄.

Algorithm 3, called Contextual Smooth Beta Process (CSBP), is general and can be applied to any contextual parameters A_i, B_i with no constraints. We provide guarantees for convergence of the posterior distribution generated by CSBP under the certainty invariance assumption (Theorem 4).

Theorem 4. Let π : [0, 1]^d → ]0, 1] be L-Lipschitz continuous. Suppose we observe the results of experiments S = {(x_i, s_i, 1 − B_i, B_i)}_{i=1,...,t}, where s_i ∼ Bernoulli((1 − (B_i + ε_i))π(x_i) + B_i + ε_i), i.e., the contextual features are noisy. We assume the ε_i's are independent random variables with zero mean and variance σ^2. The points {x_i}_{i=1,...,t} are assumed to be i.i.d. uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0, 1]^d, the posterior π̃(x|S) obtained from Algorithm 3 uniformly converges in L2-norm to π(x), i.e.,
\[ \sup_{x \in [0,1]^d} \mathbb{E}_S\, \mathbb{E}\big(\tilde\pi(x \mid S) - \pi(x)\big)^2 = O\Big( c(B, \sigma^2)\, L^{\frac{2d}{d+2}}\, t^{-\frac{2}{d+2}} \Big), \]
where c(B, σ^2) is a constant depending on {B_i}_{i=1,...,t} and the noise σ^2. Moreover, CSBP computes the posterior in time O(t).
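A compact sketch of CSBP restricted to the certainty-invariance case A_i = 1 − B_i follows (our code and toy data; the general case would use the Corollary 1 recursion instead of the O(t) one):

```python
import numpy as np

def csbp_posterior_mean(x_query, X, s, B, alpha=1.0, beta=1.0):
    """Contextual Smooth Beta Process sketch (certainty-invariance case).

    Neighbouring tests within the shrinking radius Delta are treated as if
    performed at x_query, with the contextual offsets replaced by their local
    average B_bar (step 3 of Algorithm 3).  Returns the posterior mean.
    """
    t, d = X.shape
    delta = t ** (-1.0 / (d + 2))
    near = np.linalg.norm(X - x_query, axis=1) <= delta
    n, S = int(near.sum()), int(s[near].sum())
    B_bar = float(np.mean(B[near])) if n > 0 else 0.0
    B_bar = max(B_bar, 1e-12)
    # O(n) recursion of Corollary 2 for the mixture weights, in log-space
    log_c = np.zeros(S + 1)
    for i in range(S):
        log_c[i + 1] = log_c[i] + np.log((S - i) * (alpha + i)) \
                                - np.log(B_bar * (i + 1) * (beta + n - 1 - i))
    c = np.exp(log_c - log_c.max())
    c /= c.sum()
    i = np.arange(S + 1)
    return float(np.sum(c * (alpha + i) / (alpha + beta + n)))

# toy usage: 1D dynamic setting with varying offsets B_i (our illustrative setup)
rng = np.random.default_rng(3)
pi = lambda x: 0.2 + 0.5 * x
t = 20000
X = rng.uniform(size=(t, 1))
B = rng.uniform(0.0, 0.5, size=t)
s = rng.binomial(1, (1 - B) * pi(X[:, 0]) + B)
print("CSBP mean at 0.5:", csbp_posterior_mean(np.array([0.5]), X, s, B), "true:", pi(0.5))
```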

 2 2d − 2  = O c(B, σ )L d+2 t d+2 , Theorem 3. Let π : [0, 1]d →]0, 1] be L-Lipschitz con- tinuous. Suppose we observe the results of experiments 2 where c(B, σ ) is a constant depending on {Bi}i=1,...,t and S = {(xi, si, 1−B,B)}i=1,...,t where si ∼ Bernoulli((1− the noise σ2. Moreover, CSBP computes the posterior in B)π(x) + B). Experiment points {xi}i=1,...,t are assumed to be i.i.d. uniformly distributed over the space. Then, start- time O(t). ing with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0, 1]d, the posterior π˜(x|S) obtained from Algorithm2 uniformly 5. Numerical experiments converges in L2-norm to π(x), i.e., We devise a set of experiments to demonstrate the capa- 2 bilities of our inference engine and validate the theoretical sup ES E (˜π(x|S) − π(x)) x∈[0,1]d bounds for static and dynamic settings. We start with syn-  2d − 2  thetic experiments in 1D and 2D, and finally reproduce the = O L d+2 ((1 − B)t) d+2 . case study used in (Goetschalckx et al., 2011) to show the efficiency of our dynamic algorithm. Moreover, Algorithm2 computes the point-wise posterior in time O(t). 5.1. Synthetic examples Remark 5. We observe that adding the contextual parame- We construct a function π : X → [0, 1], uniformly select ter B does not modify the convergence rate compared to the points {x } , and sample s ∼ Bernoulli(π(x )), i = static case. By using other algorithms such as LGP, param- i i=1,...,t i i 1, ..., t. From these data, SBP constructs the posterior distri- eter B should be added to the feature space, increasing its butions π˜(x|S) ∀x ∈ X . This experiment is performed both dimension by 1, which impacts the convergence rate as we in 1D setting using a feature space X = [0, 1], and in 2D demonstrate in the sequel. with X = [0, 1]2. We also apply LGP and CCBP (with fixed square exponential kernel) to this problem for comparison. 4.3. Leveraging smoothness of π via experience Explicit forms of the chosen functions are presented in the sharing: General setting Appendix.

For the dynamic setting, contextual parameters {B_i}_{i=1,...,t} are sampled independently and uniformly from [0, 1], and the tests are then performed by sampling s_i ∼ Bernoulli((1 − B_i)π(x_i) + B_i), i = 1, ..., t. The posterior is constructed using CSBP. We also applied LGP to this dynamic setting by including the parameter B as an additional feature. In order to evaluate π, LGP returns the approximated distribution associated with B = 0.

For the static setting (1D and 2D), Figures 1 (left) and 2 (left) show the L2 errors of the posterior distributions averaged across all x ∈ X, and over 20 runs, as functions of the number of samples t.

Figure 1. Left: L2 error for 1D static setting. Middle: Mean posterior estimates E[˜π(x|S)] generated by Algorithm1 for different kernel widths. Right: Running time for Algorithm1 and LGP

Figure 2. L2 error of posterior E[˜π(x|S)] for 2D static and 1D dynamic settings, averaged over all points x ∈ X versus number of samples.

We can observe the convergence upper bounds O(1/t^{2/3}) in 1D and O(1/√t) in 2D, as predicted by Theorem 1.

We observe that LGP and our method perform similarly, as pointed out in Goetschalckx et al. (2011). However, running LGP takes significantly more time than our method, since its time complexity is O(t^3) compared to O(t) for our algorithm, as demonstrated numerically in Figure 1 (right). We also observe that CCBP saturates after some time, since its kernel is independent of the number of samples.

Additionally, Figure 1 (left) demonstrates two sets of error curves for variations of Algorithm 1. To argue about the optimality of the kernel width specification, we run SBP with fixed kernel widths ∆_1 = 50^{−1/(d+2)} and ∆_2 = 500000^{−1/(d+2)}. When ∆ ≫ t^{−1/(d+2)}, the L2 error initially decays at a slow rate and remains larger than in the optimal setting (green curves). On the contrary, if we fix the kernel width ∆ ≪ t^{−1/(d+2)}, the error saturates at early iterations (blue curves).

Figure 1 (middle) shows how the built posterior distribution approximates the true synthetic probability function by plotting the posterior mean over the space X for the different kernel widths. We observe that using a wide kernel (∆_1) leads to a posterior which is too smooth, due to experience oversharing. On the other hand, using a narrow kernel (∆_2) leads to a highly non-smooth posterior, due to insufficient sharing.

Figure 2 (right) similarly shows the L2 error of the posterior distributions for the dynamic setting, also averaged over all x ∈ X and over 20 runs. We again observe the convergence upper bound O(1/√t) in 2D, as predicted by Theorem 4. We observe that our algorithm performs much better than LGP, since it operates on a lower-dimensional space.

5.2. Application to biased data: a case study from Goetschalckx et al. (2011)

Handling biased data is currently one of the major problems in machine learning. In this section, we investigate how CSBP can treat bias by means of contextual features.

In line with Goetschalckx et al. (2011), we conduct a case study with synthetic stroke rehabilitation data. The goal is to determine the probability that a patient succeeds at an exercise based on its difficulty. However, the patient can in some cases be fatigued, which influences the success probability and thus introduces a bias in the experiments.

Let f(x) denote the success probability function for an exercise with difficulty x ∈ [0, 1], when the patient is not fatigued. We assume that the patient has a certain level of fatigue α_f ∈ [0.5, 1], in which case the success probability function becomes α_f f.

Figure 3. Left: Mean posteriors E[˜π(x|S)] for target functions representing rested (blue) and highest fatigue (red) states. Right: L2 error for rested state averaged over all points versus sample size t.

Note that the impossibility invariance assumption holds in this case, since being fatigued cannot make possible a task which was already impossible. As mentioned previously, we can then simply make a change of variable in order to satisfy the certainty invariance assumption, and safely apply the dynamic algorithm. Alternatively, by treating the level of fatigue as a new dimension in the feature space, LGP can be applied to the rehabilitation case study.

We assume that the difficulty of the exercise influences (in an unknown way) the success probability as f(x) = 1 − x, x ∈ [0, 1]. We construct a synthetic dataset by uniformly sampling exercise difficulties x and fatigue levels α_f, and then sampling the success from α_f f(x).

Figure 3 shows the reconstructed success probability distributions when the patient is either not fatigued (rested state) or in the final fatigued state (α_f = 0.5), as well as the L2 error of the posterior for the rested state. Since LGP operates on a higher-dimensional space, we observe that its L2 error decays more slowly and its approximation of the target function for the rested state is worse than that of CSBP.
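A sketch of how such a biased dataset can be generated and prepared for CSBP via the success/failure flip described above (our code; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: 1.0 - x                        # rested success probability for difficulty x

# synthetic rehabilitation data: difficulty x, fatigue level alpha_f in [0.5, 1]
t = 5000
x = rng.uniform(size=t)
alpha_f = rng.uniform(0.5, 1.0, size=t)
s = rng.binomial(1, alpha_f * f(x))          # fatigue scales the success probability

# change of variable: invert "success" and "failure" so that the certainty
# invariance assumption A_i + B_i = 1 holds; CSBP can then be applied to
# (x_i, s'_i, A_i, B_i) with target pi'(x) = 1 - f(x)
s_flip = 1 - s                               # Pr(s'=1) = alpha_f * pi'(x) + (1 - alpha_f)
A = alpha_f
B = 1.0 - alpha_f
print("flipped dataset ready:", x.shape, s_flip.shape, A.shape, B.shape)
```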

6. Conclusions

In this paper, we build an inference engine for learning smooth probability functions from a set of Bernoulli experiments, which may be influenced by contextual features. We design an efficient and scalable algorithm for computing a posterior converging to the target function with a provable rate, and demonstrate its efficiency on synthetic and real-world problems. These characteristics, together with the simplicity of SBP, make it a competitive tool compared to LGP, which has been shown to be an important tool in many real-world applications. We thus expect practitioners to apply this method to such problems.

Discussion and future work  The current analysis can only model a particular type of contextual influence, which modifies the success probability as A_i π(x_i) + B_i. It turns out that Theorem 2 can be generalized to any polynomial transformation of the success probability (i.e., Σ_{j=0}^{p} a_i^{(j)} π(x_i)^j), allowing for a wider class of contextual influences. Moreover, the theoretical framework we provide seems to be applicable to a large class of problems, such as risk tracking, the bandit setting, active learning, etc. Extending this model to such applications would also be an interesting research direction.

7. Acknowledgement

This work was supported by the Swiss National Science Foundation (SNSF) under grant number 407540 167319.

References

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. arXiv preprint arXiv:1803.02453, 2018.

Audibert, J.-Y., Tsybakov, A. B., et al. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

Casella, G. and Berger, R. L. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.

Chen, G. H., Shah, D., et al. Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning, 10(5-6):337–588, 2018.

DerSimonian, R. and Laird, N. Meta-analysis in clinical trials. Controlled clinical trials, 7(3):177–188, 1986.

Ghosal, S. A review of consistency and convergence of posterior distribution. In Varanashi Symposium in Bayesian Inference, Banaras Hindu University, 1997.

Goetschalckx, R., Poupart, P., and Hoey, J. Continuous correlated beta processes. In IJCAI, 2011.

Gompert, Z. A continuous correlated beta process model for genetic ancestry in admixed populations. PloS one, 11(3):e0151047, 2016.

Gupta, A. K. and Wong, C. On three and five parameter bivariate beta distributions. Metrika, 32(1):85–91, 1985.

Hjort, N. L. Nonparametric bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pp. 1259–1294, 1990.

Hoey, J., Yang, X., Grzes, M., Navarro, R., and Favela, J. Modeling and learning for lacasa, the location and context-aware safety assistant. In NIPS 2012 Workshop on Machine Learning Approaches to Mobile Context, Lake Tahoe, NV, 2012.

Johnson, R. A. and Wichern, D. Multivariate analysis. Wiley Online Library, 2002.

Knapik, B. T., van der Vaart, A. W., van Zanten, J. H., et al. Bayesian inverse problems with gaussian priors. The Annals of Statistics, 39(5):2626–2657, 2011.

Krause, A. and Ong, C. S. Contextual gaussian process bandit optimization. In NIPS, pp. 2447–2455, 2011.

Krichevsky, R. and Trofimov, V. The performance of univer- sal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981.

McNee, S. M., Lam, S. K., Konstan, J. A., and Riedl, J. Interfaces for eliciting new user preferences in recom- mender systems. In International Conference on User Modeling, pp. 178–187. Springer, 2003.

Olkin, I. and Liu, R. A bivariate beta distribution. Statistics & Probability Letters, 62(4):407–412, 2003.

Pandey, S. and Olston, C. Handling advertisements of un- known quality in search advertising. In NIPS, pp. 1065– 1072, 2007.

Ranganath, R. and Blei, D. M. Correlated random measures. Journal of the American Statistical Association, pp. 1–14, 2017.

Rasmussen, C. E. Gaussian processes in machine learning. In Advanced lectures on machine learning, pp. 63–71. Springer, 2004.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Tokdar, S. T. and Ghosh, J. K. Posterior consistency of logistic gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.

van der Vaart, A. W., van Zanten, J. H., et al. Rates of contraction of posterior distributions based on gaussian process priors. The Annals of Statistics, 36(3):1435–1463, 2008.

Williams, C. K. and Rasmussen, C. E. Gaussian processes for regression. In NIPS, pp. 514–520, 1996.

Wilson, A. G. and Ghahramani, Z. Copula processes. In NIPS, pp. 2460–2468, 2010.

A. Proofs

In this appendix, we provide all proofs for the Theorems and Corollaries stated in the paper. We emphasize that we are aware of existing theoretical tools provided in van der Vaart et al. (2008) and Knapik et al. (2011), but our approach is different and specific to the current setup.

A.1. Proofs of point-wise Bayesian update in the dynamic case

Theorem 5. Suppose π̃(x) ∼ Σ_{i=0}^{n} C_i^n B(α + i, β + n − i) with Σ_{i=0}^{n} C_i^n = 1, and we observe the result s of a sample from a Bernoulli random variable with parameter Aπ(x) + B. Then the Bayesian posterior for π̃(x) conditioned on this observation is
\[ \tilde\pi(x \mid s) \sim \sum_{i=0}^{n+1} C_i^{n+1}\, \mathcal B(\alpha + i,\ \beta + n + 1 - i), \quad\quad (9) \]
where, ∀i = 0, ..., n + 1,
\[ C_i^{n+1} = \frac{1}{E_s^n}\Big( B\, C_i^n (\beta + n - i) + (A + B)\, C_{i-1}^n (\alpha + i - 1) \Big) \]
if s = 1, and
\[ C_i^{n+1} = \frac{1}{E_f^n}\Big( (1 - B)\, C_i^n (\beta + n - i) + (1 - A - B)\, C_{i-1}^n (\alpha + i - 1) \Big) \]
if s = 0. E_s^n and E_f^n are normalization factors that ensure Σ_{i=0}^{n+1} C_i^{n+1} = 1. For simplicity of notation, C_{−1}^n = C_{n+1}^n = 0 ∀n.

Proof. Suppose the observation is a success, i.e., s = 1. Let f_{π̃(x)} : [0, 1] → [0, 1] be the density function of the random variable π̃(x), and let f_{π̃(x)|s=1} : [0, 1] → [0, 1] be its density function conditioned on this observation. Then,

P r(s = 1|π˜(x) = θ)f (θ) f (θ) = π˜(x) π˜(x)|s=1 P r(s = 1) n X n ∝ (Aθ + B) Ci B(α + i, β + n − i) i=0 n X θα+i−1(1 − θ)β+n−i−1 = (B(1 − θ) + (A + B)θ) Cn i B(α + i, β + n − i) i=0 n X θα+i−1(1 − θ)β+n−i B(α + i, β + n − i + 1) = B Cn i B(α + i, β + n − i + 1) B(α + i, β + n − i) i=0 n X θα+i(1 − θ)β+n−i−1 B(α + i + 1, β + n − i) + (A + B) Cn i B(α + i + 1, β + n − i) B(α + i, β + n − i) i=0 n X β + n − i = B CnB(α + i, β + n − i + 1) i α + β + n i=0 n X α + i + (A + B) CnB(α + i + 1, β + n − i) i α + β + n i=0 n+1 X n n ∝ (BCi (β + n − i) + (A + B)Ci−1(α + i − 1))B(α + i, β + n − i) i=0 n+1 X n+1 ∝ Ci B(θ, α + i, β + n − i) i=0

where B(·, ·) is the Beta function, which satisfies B(α + 1, β)/B(α, β) = α/(α + β) and B(α, β + 1)/B(α, β) = β/(α + β).

In order to ensure that this remains a probability distribution, the coefficients C_i^{n+1} must satisfy Σ_{i=0}^{n+1} C_i^{n+1} = 1. The result for s = 0 can be shown similarly.

Theorem 2 is a special case of this result, for n = 0. Corollary 1 directly follows from this theorem, by applying it recursively for each observation.

Corollary 2. Suppose π̃(x) ∼ B(α, β) and we observe the outputs of experiments S = {(s_i, x, 1 − B, B)}_{i=1,...,t}, where s_i ∼ Bernoulli((1 − B)π(x) + B). Then the Bayesian posterior π̃(x|S) conditioned on these observations is given by
\[ \tilde\pi(x \mid S) \sim \sum_{i=0}^{S} C_i^t\, \mathcal B(\alpha + i, \beta + t - i), \quad\quad (10) \]
where S = Σ_{i=1}^{t} s_i is the total number of successes and
\[ C_i^t \propto \binom{S}{i}\, (\alpha - 1 + i)!\, (\beta + t - 1 - i)!\, B^{S - i} \quad\quad (11) \]
∀i = 0, ..., S. Using the relation C_{i+1}^t = \frac{(S - i)(\alpha + i)}{B (i + 1)(\beta + t - 1 - i)} C_i^t, we can compute all C_i^t's in time O(t).

Proof. We want to prove that the iterative process for computing the coefficients C_i^t's in Corollary 1 ends with the coefficients C_i^t's of equation (11). We prove this by induction over t. For t = 0, the result is obvious, since S = 0 and C_0^0 = 1.

Now suppose the result is true for some time n and let us prove that it remains true for time n + 1. Let Sn be the total number of successes observed up to time n, and let sn+1 be the new observation at time n + 1. Suppose sn+1 = 1. Then Sn+1 = Sn + 1, and ∀i = 1, ..., Sn+1:

n+1 n n Ci ∝ BCi (β + n − i) + Ci−1(α + i − 1)   Sn ∝ (α − 1 + i)!(β + n − 1 − i)!BSn+1−i(β + n − i) i   Sn + (α − 1 + i − 1)!(β + n − i)!BSn+1−i(α + i − 1) i − 1   Sn+1 = (α − 1 + i)!(β + (n + 1) − 1 − i)!BSn+1−i i

Similarly, if sn+1 = 0, then Sn+1 = Sn, and ∀i = 1, ..., Sn+1:

n+1 n Ci ∝ (1 − B)Ci (β + n − i)   Sn+1 ∝ (α − 1 + i)!(β + (n + 1) − 1 − i)!BSn+1−i i

In particular, we can see that the number of coefficients increases only when we observe a success.

A.2. Proof of convergence in the static case

Theorem 1. Let π : [0, 1]^d → [0, 1] be L-Lipschitz continuous. Suppose we measure the results of experiments S = {(x_i, s_i)}_{i=1,...,t}, where s_i is a sample from a Bernoulli distribution with parameter π(x_i). Experiment points {x_i}_{i=1,...,t} are assumed to be i.i.d. and uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0, 1]^d, the posterior π̃(x|S) obtained from Algorithm 1 uniformly converges in L2-norm to π(x), i.e.,
\[ \sup_{x \in [0,1]^d} \mathbb{E}_S\, \mathbb{E}\big(\tilde\pi(x \mid S) - \pi(x)\big)^2 = O\big(t^{-\frac{2}{d+2}}\big), \quad\quad (12) \]
where the outer expectation is performed over the experiment points {x_i}_{i=1,...,t} and their results {s_i}_{i=1,...,t}. Moreover, Algorithm 1 computes the posterior in time O(t).

Proof. For simplicity, suppose we start with a uniform prior for each x, i.e., π̃(x) ∼ B(1, 1). Let x ∈ X and ∆ ∈ [0, 1] be arbitrary. Suppose we fix the experiment points X = {x_i}_{i=1,...,t} and that, among these t points, n of them are at most ∆ far from x along all of the d dimensions. We assume without loss of generality that these points are x_1, ..., x_n. Let D_x be the random variable denoting the number of experiments occurring at most ∆ far from x along each dimension. Since we assume that the experiment points {x_i}_{i=1,...,t} are uniformly distributed over [0, 1]^d, it follows that D_x ∼ Bin(t, ∆^d). Let S_x denote the number of successes that occurred among these n experiments. S_x can be written as S_x = Σ_{i=1}^{n} s_i, where s = {s_i}_{i=1,...,n} are sampled independently and s_i ∼ Bernoulli(π(x_i)) denotes whether the experiment at x_i was successful or not. Thus, S_x follows a Poisson-binomial distribution, and it follows:

\[ \mathbb{E}(S_x \mid D_x = n) = \sum_{i=1}^{n} \pi(x_i) \quad\quad (13) \]
and
\[ \mathbb{E}(S_x^2 \mid D_x = n) = \sum_{i=1}^{n} \pi(x_i)\big(1 - \pi(x_i)\big) + \Big(\sum_{i=1}^{n} \pi(x_i)\Big)^2. \quad\quad (14) \]

Note that after s successes among n experiments, the update rule (3) leads to the posterior
\[ \tilde\pi(x \mid S) \sim \mathcal B(1 + s,\ 1 + n - s). \quad\quad (15) \]

Using the properties of the Beta distribution, we have
\[ \mathbb{E}\big(\tilde\pi(x \mid S) \,\big|\, S_x = s, D_x = n\big) = \frac{s + 1}{n + 2} \quad\quad (16) \]
and
\[ \mathbb{E}\big(\tilde\pi(x \mid S)^2 \,\big|\, S_x = s, D_x = n\big) = \frac{(s + 1)(n + 1 - s)}{(n + 2)^2 (n + 3)} + \frac{(s + 1)^2}{(n + 2)^2} = \frac{(s + 1)(s + 2)}{(n + 2)(n + 3)} = \frac{s^2}{(n + 2)^2} + O\Big(\frac{1}{n + 1}\Big). \]

Therefore:

t " n 2 X X 2 EX,s E (˜π(x|S) − π(x)) = P r(Dx = n)Ex1,...,xn P r(Sx = s|Dx = n) E(˜π(x|S) |Sx = s, Dx = n) n=0 s=0 2 −2π(x)E(˜π(x|S)|Sx = s, Dx = n) + π(x) t " n # X X  s2  1  s  = P r(D = n) P r(S = s|D = n) + O − 2π(x) + π(x)2 x Ex1,...,xn x x (n + 2)2 n + 1 n + 2 n=0 s=0 t   n n !2 X 1 X X = P r(D = n) π(x )(1 − π(x )) + π(x ) x Ex1,...,xn (n + 2)2  i i i  n=0 i=0 i=0 n # 2 X  1  − π(x) π(x ) + π(x)2 + O kx − x k ≤ ∆ ∀i = 1, ..., n n + 2 i n + 1 i i=0 t " n X 1 X = P r(D = n) π(x )(1 − π(x )) x Ex1,...,xn (n + 2)2 i i n=0 i=0 Efficient learning of smooth probability functions from Bernoulli tests with guarantees    n   1 X 1 +  (π(x) − π(xi))(π(x) − π(xj)) + O kx − xik ≤ ∆ ∀i = 1, ..., n (n + 2)2 n + 1 i,j=0 t X  1  1  ≤ P r(D = n) + O + L2∆2 x 4(n + 2) n + 1 n=0  1  = L2∆2 + O ∆d(t + 1)

Therefore, assuming L > 0, we can choose ∆ = L^{-\frac{2}{d+2}}\, t^{-\frac{1}{d+2}}, and we obtain
\[ \mathbb{E}_{X,s}\, \mathbb{E}\big((\tilde\pi(x) - \pi(x))^2\big) = O\Big(L^{\frac{2d}{d+2}}\, t^{-\frac{2}{d+2}}\Big). \quad\quad (17) \]

In particular, we observe that the smaller L, the larger ∆. Indeed, the smoother the function, the more we can share experience between points {xi}.

A.3. Proof of convergence in the simplified dynamic case

Theorem 3. Let π : [0, 1]^d → ]0, 1] be L-Lipschitz continuous. Suppose we observe the results of experiments S = {(x_i, s_i, 1 − B, B)}_{i=1,...,t}, where s_i ∼ Bernoulli((1 − B)π(x_i) + B). Experiment points {x_i}_{i=1,...,t} are assumed to be uniformly distributed over the space. Then, ∀x ∈ X, the posterior π̃(x|S) obtained from Algorithm 2 converges in L2-norm to π(x):
\[ \mathbb{E}_S\, \mathbb{E}\big(\tilde\pi(x) - \pi(x)\big)^2 = O\big(((1 - B)\,t)^{-\frac{2}{d+2}}\big). \quad\quad (18) \]
Moreover, Algorithm 2 computes the posterior in time O(t).

Proof. Let x ∈ X and ∆ ∈ ]0, 1] be arbitrary. Suppose we fix the experiment points X and that, among these t points, n of them are at most ∆ far from x, i.e., D_x = n, where D_x ∼ Bin(t, ∆^d) is the random variable defined in A.2. We assume without loss of generality that these points are x_1, ..., x_n. For simplicity, we treat the case where α = β = 1, i.e., the prior for π̃(x) is uniform ∀x ∈ X. Note that in this case, the coefficients C_i's in Corollary 2 can be written as
\[ C_i^n = \frac{1}{E_0} \binom{n - i}{S - i} B^{S - i}, \qquad i = 0, ..., S, \quad\quad (19) \]
where E_0 is the normalization factor and S is the number of observed successes.

n s X X i + 1 [ (˜π(x|S))|D = n] = P r(S = s) Cn,s(x) Es E x x i n + 2 s=0 i=0 n Ps n−i s−i i+1 X i=0 s−i B n+2 = P r(Sx = s) Ps n−j s−j s=0 j=0 s−j B n Ps n−s+i i i ! X s + 1 i=0 i B n+2 = P r(Sx = s) − n + 2 Ps n−s+j j s=0 j=0 j B n n+1 s !! X s + 1 B  s + 1  B = P r(S = s) − 1 − 1 − s x n + 2 1 − B n + 2 Ps n+1 j s−j s=0 j=0 j B (1 − B) 1 + Pn (B + (1 − B)π(x )) B = i=1 i − (n + 2)(1 − B) 1 − B n n+1 s n−s+1 B X  s + 1  B (1 − B) + P r(S = s) 1 − s 1 − B x n + 2 Ps n+1 j n+1−j s=0 j=0 j B (1 − B) Efficient learning of smooth probability functions from Bernoulli tests with guarantees

Pn π(x ) 1 − 2B = i=1 i + n + 2 (1 − B)(n + 2) n n+1 s n−s+1 B X  s + 1  B (1 − B) + P r(S = s) 1 − s 1 − B x n + 2 Ps n+1 j n+1−j s=0 j=0 j B (1 − B)

Ps n−j s−j Ps n+1 j s−j At the fourth equality, we used the fact that j=0 s−j B = j=0 j B (1 − B) , which can be shown by induction over s. We also used the following calculations:

s s X n − s + i X n − s + i Bii = (n − s + 1) Bi i i − 1 i=0 i=1 s−1 X n − s + 1 + i = B(n − s + 1) Bi i i=0 s−1 s−1 ! X n − s + i X n − s + i = B(n − s + 1) Bi + Bi i i − 1 i=0 i=1 s s−1 ! X n − s + i n X n − s + 1 + i n + 1 = B(n − s + 1) Bi − Bs + B Bi − Bs i s i s i=0 i=0

n+1 n+1 n Therefore, by equaling lines 2 and 4 and using s+1 = s + s , we get:

s−1 s ! X n − s + 1 + i 1 X n − s + i n + 1 Bi = Bi − Bs (20) i 1 − B i s + 1 i=0 i=0 Thus: s s ! X n − s + i B  s + 1  X n − s + i n + 1 Bii = 1 − Bi − Bs (21) i 1 − B n + 2 i s + 1 i=0 i=0

Let Z ∼ Bin(n + 1,B). Then:

n n+1 s n−s+1 t X  s + 1  B (1 − B) X P r(Z = s) P r(S = s) 1 − s ≤ P r(S = s) (22) x n + 2 Ps n+1 j n+1−j x P r(Z ≤ s) s=0 j=0 j B (1 − B) s=0 Pn We know that E(Z) = (n + 1)B and E(Sx) = nB + i=1(1 − B)π(xi). We then have:

(Z)+ (S ) n E E x n X P r(Z = s) X2 P r(Z = s) X P r(Z = s) P r(S = s) = P r(S = s) + P r(S = s) x P r(Z ≤ s) x P r(Z ≤ s) x P r(Z ≤ s) s=0 s=0 E(Z)+E(Sx) s= 2 +1  (Z) + (S )  (Z) + (S ) ≤ P r S ≤ E E x + 2P r Z ≥ E E x x 2 2 2 − (E(Sx)−E(Z)) ≤ 3e 2n 2 − (1−B) πn¯ ≤ Ce 2

where C ∈ R and π̄ = (1/n) Σ_{i=1}^{n} π(x_i) > 0. In the second step, we used Pr(Z ≤ s) ≥ 1/2 for any s ≥ E(Z). The last step follows from Hoeffding's inequality. So the previous upper bound decays exponentially to 0. We thus have:

\[ \mathbb{E}_s\big[\mathbb{E}(\tilde\pi(x \mid S)) \,\big|\, D_x = n\big] = \frac{\sum_{i=1}^{n} \pi(x_i)}{n + 2} + \frac{1 - 2B}{(1 - B)(n + 2)}. \quad\quad (23) \]

We now bound the second moment of π˜(x|S). With the same notations as previously, we have:

n s X X (i + 1)(i + 2)  (˜π(x|S)2)|D = n = P r(S = s|D = n) Cn,s Es E x x x i (n + 2)(n + 3) s=0 i=0 n s s ! X (s + 1)(s + 2) s + 1 X i X i(i − 1)  1  = P r(S = s|D = n) − 2 Cn,s + Cn,s + O x x (n + 2)(n + 3) n + 3 s−i n + 2 s−i (n + 2)(n + 3) n + 2 s=0 i=0 i=0 n n+1 s ! X (s + 1)(s + 2) B (s + 1)(n − s + 1) B = P r(S = s|D = n) − 2 1 − s x x (n + 2)(n + 3) 1 − B (n + 2)(n + 3) Ps n+1 j s−j s=0 j=0 j B (1 − B) n+1 Bs+1 n+2 s n+1 s−1 !! B2 (n − s + 1)(n − s + 2) 1 + B 2 + B + B + − s 1−B s s−1 1 − B2 (n + 2)(n + 3) 1 − B Ps n+1 j s−j j=0 j B (1 − B) n 1 X  s2 B B2   1  = P r(S = s|D = n) − 2sn + n2 + O (n + 2)(n + 3) x x (1 − B)2 (1 − B)2 (1 − B)2 (1 − B)(n + 2) s=0 n  n !2 1 X X = P r(S = s|D = n) (B + (1 − B)π(x )) (1 − B)2(n + 2)2 x x  i s=0 i=1 n ! X  1  −2Bn (B + (1 − B)π(x )) + B2n2 + O i (1 − B)(n + 2) i=1 n n !2 1 X X  1  = P r(S = s|D = n) π(x ) + O (n + 2)2 x x i (1 − B)(n + 2) s=0 i=1

Ps n+1 j s−j where the four terms with denominator j=0 j B (1 − B) in the third line can be shown to decay exponentially Ps n,s i(i−1) fast to 0 similarly as previously. We computed i=0 Cs−i (n+2)(n+3) in the second line using similar calculations as were Ps n,s i done for i=0 Cs−i n+2 :

s s−2 X n − s + i X n − s + 2 + i Bii(i − 1) = B2(n − s + 1)(n − s + 2) Bi (24) i i i=0 i=0 n+2 n  n  n Using the identity k+2 = k+2 + 2 k+1 + k , we have:

s−2 s−2 s−2 s−2 X n − s + 2 + i X n − s + i X n − s + i X n − s + i Bi = Bi + 2 Bi + Bi i i i − 1 i − 2 i=0 i=0 i=1 i=2 s−2 s−3 s−3 X n − s + i X n − s + i + 1 X n − s + i + 2 = Bi + 2 Bi+1 + Bi+2 i i i i=0 i=1 i=2 s X n − s + i n − 1 n = Bi − Bs−1 − Bs i s − 1 s i=0 s−1 X n − s + i + 1 n − 1  n  + 2B Bi − 2 Bs−1 − 2 Bs i s − 2 s − 1 i=1 s−2 X n − s + i + 2 n − 1  n  + B2 Bi − Bs−1 − Bs i s − 3 s − 2 i=2

Ps−2 n−s+2+i i Therefore, by isolating the term i=0 i B , simplifying binomial coefficients and using equation (20), we get:

s−2 s X n − s + 2 + i 1 1 + B X n − s + i n + 2 n + 1 Bs+1 Bi = Bi − Bs − 2 i 1 − B2 1 − B i s s 1 − B i=0 i=0 Efficient learning of smooth probability functions from Bernoulli tests with guarantees

n + 1  − Bs−1 s − 1

Thus:

 2   2  2 Es E((˜π(x|S) − π(x)) )|Dx = n = Es E(˜π(x|S) )|Dx = n − 2π(x)Es [Eπ˜(x|S)|Dx = n] + π(x) n Pn 2 Pn !   X π(xi) π(xi) 1 = P r(S = s|D = n) i=1 − 2π(x) i=1 + π(x)2 + O x x n + 1 n + 2 (1 − B)(n + 2) s=0 n Pn !   X i,j=1(π(xi) − π(x))(π(xj) − π(x)) 1 = P r(S = s|D = n) + O x x (n + 1)2 (1 − B)(n + 2) s=0  1  ≤ L2∆2 + O (1 − B)(n + 2)

By taking the expectation over X, we finally get:

t  2  X  2  EX,s E((˜π(x|S) − π(x)) ) = P r(Dx = n)Es E((˜π(x|S) − π(x)) )|Dx = n n=0  1  ≤ L2∆2 + O (1 − B)∆dt

If we choose ∆ = L^{-\frac{2}{d+2}}\, ((1 - B)\,t)^{-\frac{1}{d+2}}, we obtain the desired result.

A.4. Proof of convergence in the general dynamic case

Theorem 4. Let π : [0, 1]^d → ]0, 1] be L-Lipschitz continuous. Suppose we observe the results of experiments S = {(x_i, s_i, 1 − B_i, B_i)}_{i=1,...,t}, where s_i ∼ Bernoulli((1 − (B_i + ε_i))π(x_i) + B_i + ε_i), i.e., the contextual features are noisy. We assume the ε_i's are independent random variables with zero mean and variance σ^2. Experiment points {x_i}_{i=1,...,t} are assumed to be uniformly distributed over the space. Then, ∀x ∈ X, the posterior π̃(x|S) obtained from Algorithm 3 converges in L2-norm to π(x):
\[ \mathbb{E}_S\, \mathbb{E}\big(\tilde\pi(x \mid S) - \pi(x)\big)^2 = O\big(c(B, \sigma^2)\, t^{-\frac{2}{d+2}}\big), \quad\quad (25) \]
where c(B, σ^2) is a constant depending on {B_i}_{i=1,...,t} and the noise σ^2. Moreover, Algorithm 3 computes the posterior in time O(t).

Proof. The proof of Theorem 3 can be completely adapted to this new setting. Let x ∈ X and ∆ ∈ [0, 1] be arbitrary. Suppose we fix the experiment points X and that, among these t points, n of them are at most ∆ far from x. We assume without loss of generality that these points are x_1, ..., x_n. We then define B_X = (1/n) Σ_{i=1}^{n} B_i.

n s X X i + 1 [ (˜π(x|S))|D = n] = P r(S = s) Cn,s(x) Es E x x i n + 2 s=0 i=0 n Ps n−i s−i i+1 X i=0 BX = P r(S = s) s−i n+2 x Ps n−j s−j s=0 j=0 s−j BX n Ps n−s+i i i ! X s + 1 i=0 BX = P r(S = s) − i n+2 x n + 2 Ps n−s+j j s=0 j=0 j BX n   n+1 s !! X s + 1 BX s + 1 s BX = P r(Sx = s) − 1 − 1 − n + 2 1 − B n + 2 Ps n+1 j s−j s=0 X j=0 j BX (1 − BX ) Efficient learning of smooth probability functions from Bernoulli tests with guarantees

1 + Pn (B +  + (1 − B −  )π(x )) B = i=1 i i i i i − X (n + 2)(1 − BX ) 1 − BX n   n+1 s n−s+1 BX X s + 1 s BX (1 − BX ) + P r(Sx = s) 1 − 1 − B n + 2 Ps n+1 j n+1−j X s=0 j=0 j BX (1 − BX ) Pn (1 − B )π(x ) 1 − 2B + Pn  (1 − π(x )) = i=1 i i + i=1 i i (1 − BX )(n + 2) (1 − BX )(t + 2) n   n+1 s n−s+1 BX X s + 1 s BX (1 − BX ) + P r(Sx = s) 1 − 1 − B n + 2 Ps n+1 j n+1−j X s=0 j=0 j BX (1 − BX )

Let Z ∼ Bin(n + 1,BX ). Then:

n   n+1 s n−s+1 t X s + 1 s BX (1 − BX ) X P r(Z = s) P r(Sx = s) 1 − ≤ P r(Sx = s) (26) n + 2 Ps n+1 j n+1−j P r(Z ≤ s) s=0 j=0 j BX (1 − BX ) s=0

Pn Pn We know that E(Z) = (n + 1)BX and E(Sx) = nBX + i=1(1 − Bi)π(xi) + i=1 i(1 − π(xi)). Since E(i) = 0, then E(Sx) − E(Z) will also increase linearly with n and thus the previous upper bound also decreases exponentially with n to 0 with very high probability. We thus have:

Pn Pn    i=1(1 − Bi)π(xi) i=1 i(1 − π(xi)) 1 Es, [E(˜π(x|S))|Dx = n] = + E + O (1 − BX )(n + 2) (1 − BX )(t + 2) (1 − BX )(n + 2) Pn (1 − B )π(x )  1  = i=1 i i + O (1 − BX )(n + 2) (1 − BX )(n + 2)

We now bound the second moment of π˜(x|S). With the same notations as previously, we have:

n s X X (i + 1)(i + 2)  (˜π(x|S)2)|D = n = P r(S = s|D = n) Cn,s ES E x x x i (n + 2)(n + 3) s=0 i=0 n s s ! X (s + 1)(s + 2) s + 1 X i X i(i − 1)  1  = P r(S = s|D = n) − 2 Cn,s + Cn,s + O x x (n + 2)(n + 3) n + 3 s−i n + 2 s−i (n + 2)(n + 3) n + 2 s=0 i=0 i=0 n n+1 s ! X (s + 1)(s + 2) BX (s + 1)(n − s + 1) s BX = P r(Sx = s|Dx = n) − 2 1 − (n + 2)(n + 3) 1 − B (n + 2)(n + 3) Ps n+1 j s−j s=0 X j=0 j BX (1 − BX )  Bs+1  2 2n+1 X + n+2Bs + n+1Bs−1 BX (n − s + 1)(n − s + 2) 1 + BX s 1−BX s X s−1 X +  −  1 − B2 (n + 2)(n + 3) 1 − B Ps n+1 j s−j X X j=0 j BX (1 − BX ) n  2 2    1 X s BX B 1 = P r(S = s|D = n) − 2sn + X n2 + O (n + 2)(n + 3) x x (1 − B )2 (1 − B )2 (1 − B )2 (1 − B )(n + 2) s=0 X X X X n n 1 X X 2 = P r(S = s|D = n) (B +  + (1 − B −  )π(x )) (1 − B )2(n + 2)2 x x i i i i i X s=0 i=1 n ! X  1  −2B n (B +  + (1 − B −  )π(x )) + B2 n2 + O X i i i i i X (1 − B )(n + 2) i=1 X n  n n 1 X X X = P r(S = s|D = n) 2  (1 − π(x ))π(x ) +   (1 − π(x ))(1 − π(x )) (1 − B )2(n + 2)2 x x  i i j i j i j X s=0 i,j=1 i,j=1 Efficient learning of smooth probability functions from Bernoulli tests with guarantees

n !2 X  1  + (1 − B )π(x ) + O i i  (1 − B )(n + 2) i=1 X

Ps n+1 j s−j where the four terms with denominator j=0 j BX (1 − BX ) in the third line can be shown to decay exponentially fast to 0 similarly as previously. Taking the expectation over , we then get: n Pn 2 X (1 − Bi)π(xi)  (˜πt(x)2)|D = n = P r(S = s|D = n) i=1 ES, E x x x (1 − B )(n + 1) s=0 X  1 σ2  + O + 2 (1 − BX )(n + 1) (1 − BX ) (n + 1)

Thus:

 2   2  2 Es, E((˜π(x|S) − π(x)) )|Dx = n = Es, E(˜π(x|S) )|Dx = n − 2π(x)ES, [E(˜π(x|S))|Dx = n] + π(x) n Pn 2 Pn ! X (1 − Bi)π(xi) (1 − Bi)π(xi) = P r(S = s|D = n) i=1 − 2π(x) i=1 + π(x)2 x x (1 − B )(n + 1) (1 − B )(n + 2) s=0 X X  1 σ2  + O + 2 (1 − BX )(n + 2) (1 − BX ) (n + 2) n Pn ! X i,j=1(1 − Bi)(1 − Bj)(π(xi) − π(x))(π(xj) − π(x)) = P r(S = s|D = n) x x (1 − B )2(n + 2)2 s=0 X  1 σ2  + O + 2 (1 − BX )(n + 2) (1 − BX ) (n + 2)  2  2 2 1 σ ≤ L ∆ + O + 2 (1 − BX )(n + 2) (1 − BX ) (n + 2)

Finally, by taking the expectation over experiment points X, we get:

t  2  X  t 2  EX,s, E((˜π(x|S) − π(x)) ) = P r(Dx = n)ES, E((˜π (x) − π(x)) )|Dx = n n=0 C(1) C(2)σ2  ≤ L2∆2 + O + ∆dt ∆dt

where C^{(i)} = \mathbb{E}_X\big[ 1/(1 - B_X)^i \big]. Therefore, if we choose ∆ = L^{-\frac{2}{d+2}}\, t^{-\frac{1}{d+2}}, then we obtain the desired result.

B. Smooth Beta processes for classification

In this appendix, we extend the convergence rates in L2 function approximation to L1 and Bayes risk (misclassification error). These are to be understood as corollaries to the proofs presented in Sec.A. Furthermore, we establish the connection between SBPs in the static setting and nearest neighbor techniques. However, our method allows for precise prior knowledge injection, whose efficiency is empirically demonstrated on a synthetic classification experiment.

B.1. Convergence in L1 norm

Leaving out constants, Theorems 1, 3, and 4 provide convergence rates of the type O(t^{-\frac{2}{d+2}}). In all three settings, we obtain the following corollary for the error in L1 norm:

Corollary 3 (Convergence in L1). Under the assumptions of Theorems 1, 3, and 4, the corresponding Algorithms 1, 2, and 3 converge in L1 norm to π(x):
\[ \sup_{x \in [0,1]^d} \mathbb{E}_S\big(\mathbb{E}\,|\tilde\pi(x \mid S) - \pi(x)|\big) = O\big(t^{-\frac{1}{d+2}}\big), \]
where we leave out the constants of the respective theorems.

Proof. For all three cases, the statement follows from the application of Jensen's inequality. We have
\[ \mathbb{E}_S\big(\mathbb{E}\,|\tilde\pi(x \mid S) - \pi(x)|\big) = \mathbb{E}_S\,\mathbb{E}\sqrt{\big(\tilde\pi(x \mid S) - \pi(x)\big)^2} \le \sqrt{\mathbb{E}_S\,\mathbb{E}\big(\tilde\pi(x \mid S) - \pi(x)\big)^2}, \quad\quad (27) \]
which yields the presented convergence rates by taking the square root of the rates of the respective theorems for L2 convergence.

B.2. Convergence in Bayes risk

In the classification setting, it is natural to use the posterior predictive of the Beta-Bernoulli model. Therefore, we have the classifier s̃(x|S) based on the posterior parameters α̃(x), β̃(x):

\[ \tilde s(x \mid S) = \begin{cases} 1 & \text{if } \dfrac{\tilde\alpha(x)}{\tilde\alpha(x) + \tilde\beta(x)} \ge 0.5, \\ 0 & \text{otherwise.} \end{cases} \quad\quad (28) \]
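A minimal sketch of this plug-in rule in the static setting, using the SBP posterior parameters (our code; the toy labels follow π(x) = x in 1D, so the Bayes rule is 1{x ≥ 0.5}):

```python
import numpy as np

def sbp_classifier(x_query, X, s, alpha0=1.0, beta0=1.0):
    """Plug-in classifier of Eq. (28): predict 1 iff the SBP posterior mean
    alpha / (alpha + beta) at x_query is at least 0.5 (static-setting sketch).
    """
    t, d = X.shape
    delta = t ** (-1.0 / (d + 2))
    near = np.linalg.norm(X - x_query, axis=1) <= delta
    alpha = alpha0 + s[near].sum()
    beta = beta0 + near.sum() - s[near].sum()
    return int(alpha / (alpha + beta) >= 0.5)

# toy usage
rng = np.random.default_rng(6)
X = rng.uniform(size=(5000, 1))
s = rng.binomial(1, X[:, 0])
print([sbp_classifier(np.array([q]), X, s) for q in (0.2, 0.45, 0.55, 0.8)])
```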

To estimate the performance of a classifier, the agreement with the Bayes optimal classifier is used. The Bayes risk of a classification problem is minimized by the omniscient Bayes classifier:

Definition 1 (Bayes risk and optimal classifier). For any x ∈ X, the Bayes risk of a classifier s̃ : X → {0, 1} is given by
\[ R(\tilde s, x) = \mathbb{P}_{s \sim \mathcal B(\pi(x))}\big[s \neq \tilde s(x)\big]. \quad\quad (29) \]
The Bayes optimal classifier is based on the underlying probability function π(x). The corresponding decision rule is
\[ s^*(x) = \mathbb{1}_{\pi(x) \ge 0.5}, \quad\quad (30) \]
where 1_{·} denotes the indicator function. This decision rule incurs the following optimal Bayes risk:
\[ R^*(x) = R(s^*, x) = \min\{\pi(x),\ 1 - \pi(x)\}. \quad\quad (31) \]

To relate the convergence in L1 to the Bayes risk, the following simple lemma is useful and allows us to establish convergence in Bayes risk in Theorem 6.

Lemma 1. Suppose B(·) denotes a Bernoulli distribution, p, q ∈ [0, 1], and s′ ∈ {0, 1}. Then we have
\[ \mathbb{P}_{s \sim \mathcal B(p)}\big[s \neq s'\big] \le \mathbb{P}_{s \sim \mathcal B(q)}\big[s \neq s'\big] + |p - q|, \quad\quad (32) \]
which relates the misclassification probability directly to the ℓ1 loss.

Proof. Suppose s′ = 0. Then the left-hand side is p and the right-hand side gives q + |p − q|. If p ≥ q, the right-hand side is q + p − q = p and equality holds. If p < q, the right-hand side is q + q − p, and p ≤ 2q − p since 2p ≤ 2q by the assumption p < q. The same argument works for s′ = 1 by symmetry.

Theorem 6 (Convergence in Bayes risk). Under the assumptions of Theorems 1, 3, and 4, the classifier in Eq. (28), based on the posterior parameters obtained by the corresponding Algorithms 1, 2, and 3, uniformly converges to the risk of the Bayes optimal classifier s^*, i.e., for any x ∈ X,
\[ \mathbb{E}_S\big[R(\tilde s, x)\big] \le R^*(x) + O\big(t^{-\frac{1}{d+2}}\big), \quad\quad (33) \]
where the constants of the respective theorems are left out (see Sec. B.1).

Figure 4. Bayes risk of SBP with a specified informative prior, which is identical to the underlying function π(x), compared to fixed-radius NN, which cannot specify a prior in its standard framework.

Proof. Using Lemma 1, we have the following for any x ∈ X:
\[ R(\tilde s, x) = \mathbb{P}_{s \sim \mathcal B(\pi(x))}\big[s \neq \tilde s(x \mid S)\big] \le \mathbb{P}_{s \sim \mathcal B(\tilde\pi(x \mid S))}\big[s \neq \tilde s(x \mid S)\big] + |\tilde\pi(x \mid S) - \pi(x)| \]
\[ = \min\{\tilde\pi(x \mid S),\ 1 - \tilde\pi(x \mid S)\} + |\tilde\pi(x \mid S) - \pi(x)| \le \min\{\pi(x),\ 1 - \pi(x)\} + 2\,|\tilde\pi(x \mid S) - \pi(x)| = R^*(x) + 2\,|\tilde\pi(x \mid S) - \pi(x)|. \quad\quad (34) \]

Now, we can apply the convergence in L1 of Corollary 3 and get the desired result:
\[ \mathbb{E}_S\big[\mathbb{P}_{s \sim \mathcal B(\pi(x))}(s \neq \tilde s(x \mid S))\big] \le R^*(x) + O\big(t^{-\frac{1}{d+2}}\big). \quad\quad (35) \]

B.3. Related methods and practical considerations

Smooth Beta Processes are designed for probability function approximation, in which case the estimation of the standard deviation on top of the function approximation is useful. In the particular static classification setting, SBPs are tightly connected to the fixed-radius nearest neighbors (NN) classifier. SBPs have the advantage of specifying a prior, which is useful to incorporate knowledge or combat biased data. In contrast to fixed-radius NN, SBPs perform additive smoothing like the famous Krichevsky–Trofimov estimator (Krichevsky & Trofimov, 1981), by adding pseudo-counts. Despite the introduced bias, SBPs converge optimally to the Bayes classifier: the rate proven in Theorem 6 matches the lower bound established by Audibert et al. (2007) for classification.

On the practical side, faster inference methods are available for the algorithmic fixed-radius nearest neighbors problem. Both exact (e.g., k-d and ball trees) and approximate (e.g., hashing-based) methods can be used for faster inference schemes. For further practical considerations and background on the fixed-radius NN algorithm, we refer to Chen et al. (2018).

We conduct a synthetic experiment in order to show how the specification of a prior can help in the low-data regime. We compare the convergence of SBP with various priors and the standard fixed-radius NN algorithm. For an informative prior, we set the prior π̃(x) ∼ Beta(α(x), β(x)) such that E[π̃(x)] = π(x) and V[π̃(x)] = v. In Fig. 4, we compare the convergence for different values of v: in the low-data regime, SBPs can profit strongly from an informative prior. With an increasing number of observations, the approximation quality varies less, as we expect for a Bayesian method. Asymptotically, the convergence rate is the same.
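A small helper (ours, not from the paper) showing how such an informative prior can be obtained by moment matching, i.e., choosing (α, β) from a desired mean and variance:

```python
import numpy as np

def beta_prior_from_mean_var(m, v):
    """Moment-matched Beta prior: returns (alpha, beta) with mean m and variance v.
    Requires 0 < v < m (1 - m); this is our illustrative helper.
    """
    if not (0.0 < v < m * (1.0 - m)):
        raise ValueError("need 0 < v < m(1-m) for a valid Beta prior")
    k = m * (1.0 - m) / v - 1.0
    return m * k, (1.0 - m) * k

# informative prior centred at pi(x) = 0.7 with variance 0.01
print(beta_prior_from_mean_var(0.7, 0.01))   # -> (alpha, beta) = (14.0, 6.0)
```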