Efficient learning of smooth probability functions from Bernoulli tests with guarantees
Paul Rolland 1 Ali Kavis 1 Alex Immer 1 Adish Singla 2 Volkan Cevher 1
1École Polytechnique Fédérale de Lausanne, Switzerland; 2Max Planck Institute for Software Systems, Saarbrücken, Germany. Correspondence to: Paul Rolland

arXiv:1812.04428v3 [cs.LG] 23 Aug 2019

Abstract

We study the fundamental problem of learning an unknown, smooth probability function via point-wise Bernoulli tests. We provide a scalable algorithm for efficiently solving this problem with rigorous guarantees. In particular, we prove the convergence rate of our posterior update rule to the true probability function in L2-norm. Moreover, we allow the Bernoulli tests to depend on contextual features and provide a modified inference engine with provable guarantees for this novel setting. Numerical results show that the empirical convergence rates match the theory, and illustrate the superiority of our approach in handling contextual features over the state-of-the-art.

1. Introduction

One of the central challenges in machine learning relates to learning a continuous probability function from point-wise Bernoulli tests (Casella & Berger, 2002; Johnson & Wichern, 2002). Examples include, but are not limited to, clinical trials (DerSimonian & Laird, 1986), recommendation systems (McNee et al., 2003), sponsored search (Pandey & Olston, 2007), and binary classification. Due to the curse of dimensionality, we often require a large number of tests in order to obtain an accurate approximation of the target function. It is thus necessary to use a method that constructs this approximation scalably with the number of tests.

A widely used method for efficiently solving this problem is the Logistic Gaussian Process (LGP) algorithm (Tokdar & Ghosh, 2007). While this algorithm has no clear provable guarantees, it has been shown to be very efficient in practice at approximating the target function. However, the time required for inferring the posterior distribution at some point grows cubically with the number of tests, and the method can thus be inapplicable when the amount of data becomes large. There has been extensive work to resolve this cubic complexity associated with GP computations (Rasmussen, 2004). However, these methods require additional approximations of the posterior distribution, which impact efficiency and make the overall algorithm even more complicated, leading to further difficulties in establishing theoretical convergence guarantees.

Recently, Goetschalckx et al. (2011) tackled the issues encountered by LGP, and proposed a scalable inference engine based on Beta processes, called Continuous Correlated Beta Process (CCBP), for approximating the probability function. By scalable, we mean that the algorithm's complexity scales linearly with the number of tests. However, no theoretical analysis is provided, and the approximation error saturates as the number of tests becomes large (cf. Section 5.1). Hence, it is unclear whether provable convergence and scalability can be obtained simultaneously.

This paper bridges this gap by designing a simple and scalable method for efficiently approximating the probability function with provable convergence. Our algorithm constructs a posterior distribution that allows inference in linear time (w.r.t. the number of tests) and converges in L2-norm to the true probability function (uniformly over the feature space); see Theorem 1.

In addition, we also allow the Bernoulli tests to depend on contextual parameters influencing the success probabilities. To ensure convergence of the approximation, these features need to be taken into account in the inference engine. We thus provide the first algorithm that efficiently treats these contextual features while performing inference, and retains provable guarantees. As a motivation for this setting, we demonstrate how this algorithm can efficiently be used for treating bias in the data (Agarwal et al., 2018).

1.1. Basic model and the challenge
Algorithm 1 Smooth Beta Process (SBP)
Input: experiment points and observations S = {xi, si}i=1,...,t, query point x ∈ X, prior knowledge π̃(x) ∼ B(α(x), β(x))
Output: Posterior distribution π̃(x|S)
1. Set ∆ ∝ t^{−1/(d+2)}
2. Compute the posterior as in (3) using the kernel K(x, x') = δ_{‖x−x'‖≤∆}

kernel is essential for convergence to the true underlying distributions at all points. In particular, to ensure convergence in L2 norm, this kernel must shrink as the number of observations increases (Algorithm 1). We can see that our algorithm, called Smooth Beta Process (SBP), allows for fast inference at any point x ∈ X, since it simply requires finding the tests that are performed at most ∆ far from x, and computing the posterior distribution as in (3) depending on the number of successes and failures within these tests.

Remark 1. The particular dependence of the kernel on the number of samples ensuring optimal convergence is not trivial, and can only be found via a theoretical analysis of the model.

Theorem 1. Let π : [0,1]^d → [0,1] be L-Lipschitz continuous. Suppose we measure the results of experiments S = {(xi, si)}i=1,...,t where si is a sample from a Bernoulli distribution with parameter π(xi). Experiment points {xi}i=1,...,t are assumed to be i.i.d. and uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0,1]^d, the posterior π̃(x|S) obtained from SBP uniformly converges in L2-norm to π(x), i.e.,

sup_{x ∈ [0,1]^d} E_S E[(π̃(x|S) − π(x))^2] = O(L^{2d/(d+2)} t^{−2/(d+2)}),   (4)

where the outer expectation is performed over the experiment points {xi}i=1,...,t and their results {si}i=1,...,t. SBP also computes the point-wise posterior in time O(t).

Remark 2. This theorem provides an upper bound for the L2 norm over any point of the space, and takes into account where the experiments are performed in the feature space. If all experiments are performed at the same point, then we recover the familiar square-root rate at that point (Ghosal, 1997), but we would not converge at points that are far away.

Remark 3. The constraint of the input space being [0,1]^d can easily be relaxed to any compact space X ⊂ R^d. This would simply modify the convergence rate by a factor equal to the volume of X.

Remark 4. The dependence of the convergence rate on the feature space dimension is due to the curse of dimensionality, and the fact that we provide convergence uniformly over the whole space. However, this is not due to the particular algorithm we use, and we empirically show that LGP suffers the same dependence on the space dimension (see Section 5). Similar issues are also prevalent in GP optimization, despite the great application successes that have been obtained (Shahriari et al., 2016).

In Appendix B, we show how this algorithm naturally applies to binary classification. Restricted to classification, our algorithm becomes similar to the fixed-radius nearest neighbour algorithm (Chen et al., 2018), but the current framework allows for error quantification and precise prior injection.

4. Inference for the dynamic setting

We analyze the dynamic setting where Bernoulli tests are influenced by contextual features as in (2).

4.1. Uncorrelated case: a Bayesian approach

As previously, we start by analyzing the uncorrelated case, i.e., how to update the distribution of π̃(x) conditioned on the outputs of experiments S = {(si, xi, Ai, Bi)}i=1,...,t all performed at x. Since the experiments are not samples from Bernoulli variables with parameter π(x), the Bayesian update is not straightforward, but it can be achieved using sums of Beta distributions, as shown in Theorem 2.

Theorem 2. Suppose π̃(x) ∼ B(α, β) and we observe the result of a sample s ∼ Bernoulli(Aπ(x) + B). Then the Bayesian posterior for π̃(x) conditioned on this observation is given by

π̃(x|s) ∼ C0 B(α, β + 1) + C1 B(α + 1, β),   (5)

where in the case of success (s = 1), we have

C0 = Bβ / (Bβ + (A + B)α),   C1 = (A + B)α / (Bβ + (A + B)α),

and in the case of failure (s = 0), we have

C0 = (1 − B)β / ((1 − B)β + (1 − A − B)α),   C1 = (1 − A − B)α / ((1 − B)β + (1 − A − B)α).

In (5), we mean that the density function of the posterior random variable π̃(x|s) is the weighted sum of the two density functions given on the right-hand side.

Then, by using this result recursively on a set of experiments S = {(si, xi, Ai, Bi)}i=1,...,t, we can obtain a general update rule.
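Before moving on, here is a minimal Python sketch of the static SBP (Algorithm 1) at a single query point; this is our own illustration, not code from the paper. Since equation (3) is not reproduced in this chunk, we assume the standard Beta–Bernoulli count update: the posterior at x is B(α + s∆, β + f∆), where s∆ and f∆ count the successes and failures among the tests within distance ∆ of x. The function name and signature are ours.

```python
import numpy as np

def sbp_posterior(X, s, x, alpha=1.0, beta=1.0, delta=None):
    """Beta posterior of the Smooth Beta Process at a query point x (sketch).

    X: (t, d) array of experiment points; s: (t,) array of 0/1 outcomes.
    Returns the parameters (alpha', beta') of the Beta posterior at x.
    """
    X = np.atleast_2d(X)
    t, d = X.shape
    if delta is None:
        delta = t ** (-1.0 / (d + 2))               # kernel width of Algorithm 1
    mask = np.linalg.norm(X - x, axis=1) <= delta   # indicator kernel K(x, x')
    successes = int(s[mask].sum())
    failures = int(mask.sum()) - successes
    return alpha + successes, beta + failures
```

The posterior mean estimate of π(x) is then α'/(α' + β'), and inference at one point costs O(t), in line with Theorem 1.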
Corollary 1. Suppose π̃(x) ∼ B(α, β) and we observe the outputs of experiments S = {(si, x, Ai, Bi)}i=1,...,t where the si's are sampled from Bernoulli random variables, each with parameter Aiπ(x) + Bi. Then the Bayesian posterior π̃(x|S) is given by

π̃(x|S) ∼ Σ_{i=0}^{t} C_i^t B(α + i, β + t − i),   (6)

where the C_i^t's can be computed via an iterative procedure starting from C_0^0 = 1 and, ∀n = 0,...,t−1:

C_i^{n+1} = (1/E_s^n) (B_{n+1} C_i^n (β + n − i) + (A_{n+1} + B_{n+1}) C_{i−1}^n (α + i − 1))

if s_{n+1} = 1; and

C_i^{n+1} = (1/E_f^n) ((1 − B_{n+1}) C_i^n (β + n − i) + (1 − A_{n+1} − B_{n+1}) C_{i−1}^n (α + i − 1))

if s_{n+1} = 0. E_s^n and E_f^n are normalization factors that ensure Σ_{i=0}^{n} C_i^n = 1 ∀n. For simplicity of notation, we use C_{−1}^n = C_{n+1}^n = 0 ∀n.

This gives us a way of updating, in a fully Bayesian manner, the distribution of π̃(x) conditioned on observations of experiments performed at x. It involves a linear combination of Beta distributions, with coefficients depending on the successes and the contextual features.

4.2. Leveraging smoothness of π via experience sharing: Simplified setting

We now introduce the use of correlations between samples in the update rule, in a similar way as done in the static setting. We first analyze a simplified setting where the contextual parameters are constant among all experiments, i.e., Ai = A, Bi = B ∀i.

Due to the more complex form of the Bayesian update in the uncorrelated case, it turns out that the previous technique is not straightforward to apply. One way to introduce this idea of experience sharing would be to modify the Bayesian update rule of Theorem 2 as:

π̃(x|si) ∼ C0 B(α + K(x, xi), β) + C1 B(α, β + K(x, xi)),

where K is the same kernel as defined previously. However, if K(x, x') is real-valued and we apply such an update for each experiment, then it turns out that the number of terms required for describing the posterior distribution grows exponentially with the number of observations. In order to ensure the tractability of the posterior, we can restrict the kernel to values in {0, 1}, e.g., K(x, x') = δ_{‖x−x'‖≤∆} for some kernel width ∆. This means that, each time we make an observation at xi, all random variables π̃(x) with ‖x − xi‖ ≤ ∆ are updated as if the same experiments had been performed at x (Algorithm 2).

Algorithm 2 Inference engine for the simplified dynamic setting: constant A, B
Input: experiment descriptions S = {(xi, si, A, B)}i=1,..,t, point of interest x ∈ X, prior knowledge π̃(x) ∼ B(α(x), β(x))
Output: Posterior distribution π̃(x|S)
1. Set ∆ ∝ t^{−1/(d+2)}
2. Build the set of neighboring experiments Sx = {(xi, si, A, B) : ‖x − xi‖ ≤ ∆}
3. Compute the posterior as in Corollary 1 (or Corollary 2 if A + B = 1) using the results of the experiments Sx as if performed at x.

Certainty invariance assumption. For simplicity, we will assume that if an event is certain (i.e., occurs with probability 1), then context variables cannot lower this probability (i.e., ∀xi, if π(xi) = 1, then Aiπ(xi) + Bi = 1). This is trivially equivalent to the constraint Ai + Bi = 1 ∀i. If, on the contrary, there is an impossibility invariance, i.e., context variables cannot make an impossible event possible, we can make a change of variable f ↔ 1 − f (i.e., invert the meanings of "success" and "failure") in order to satisfy the certainty invariance assumption.

In Corollary 1, we see that the time complexity required for computing C_i^t ∀i = 0,...,t is O(t^2). However, in the particular case where Ai = A, Bi = B ∀i and A + B = 1, the update rule of Corollary 1 becomes much simpler, i.e., computable in time O(t):

Corollary 2. Suppose π̃(x) ∼ B(α, β) and we observe the outputs of experiments S = {(si, x, 1 − B, B)}i=1,...,t where si ∼ Bernoulli((1 − B)π(x) + B). Then the Bayesian posterior π̃(x|S) conditioned on these observations is given by

π̃(x|S) ∼ Σ_{i=0}^{S} C_i^t B(α + i, β + t − i),   (7)

where S = Σ_{i=1}^{t} si is the total number of successes and

C_i^t ∝ (S choose i) (α − 1 + i)! (β + t − 1 − i)! B^{S−i},   (8)

∀i = 0,...,S. We can compute all C_i^t's in time O(t) via the relation C_{i+1}^t = [(S − i)(α + i) / (B(i + 1)(β + t − 1 − i))] C_i^t.

We provide guarantees for convergence of the posterior distribution generated by Algorithm 2 under the certainty invariance assumption (Theorem 3). This constraint on A and B is equivalent to saying that the contextual parameters necessarily increase the success probability.
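The iterative procedure of Corollary 1 can be sketched directly in Python (our own code; the function name is ours, and the per-step normalization plays the role of E_s^n and E_f^n):

```python
import numpy as np

def mixture_coeffs(outcomes, A, B, alpha=1.0, beta=1.0):
    """Mixture weights C_i^t of Corollary 1 (sketch).

    After t observations, the posterior at x is
    sum_i C[i] * Beta(alpha + i, beta + t - i).
    outcomes, A, B: length-t sequences of s_i, A_i, B_i.
    """
    C = np.array([1.0])                       # C_0^0 = 1
    for n, (s, a, b) in enumerate(zip(outcomes, A, B)):
        new = np.zeros(n + 2)
        for i in range(n + 2):
            Ci = C[i] if i <= n else 0.0      # convention C_{n+1}^n = 0
            Cm = C[i - 1] if i >= 1 else 0.0  # convention C_{-1}^n = 0
            if s == 1:
                new[i] = b * Ci * (beta + n - i) + (a + b) * Cm * (alpha + i - 1)
            else:
                new[i] = (1 - b) * Ci * (beta + n - i) + (1 - a - b) * Cm * (alpha + i - 1)
        C = new / new.sum()                   # normalization factor E^n
    return C
```

With A = 1 and B = 0 (plain Bernoulli tests) and a uniform prior, a single success yields C = [0, 1], i.e., the familiar Beta(2, 1) posterior; with constant parameters satisfying A + B = 1, the output matches the closed form of Corollary 2. Each step updates O(n) coefficients, giving the O(t²) total cost noted in Section 4.2.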
Theorem 3. Let π : [0,1]^d → ]0,1] be L-Lipschitz continuous. Suppose we observe the results of experiments S = {(xi, si, 1 − B, B)}i=1,...,t where si ∼ Bernoulli((1 − B)π(xi) + B). Experiment points {xi}i=1,...,t are assumed to be i.i.d. uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0,1]^d, the posterior π̃(x|S) obtained from Algorithm 2 uniformly converges in L2-norm to π(x), i.e.,

sup_{x ∈ [0,1]^d} E_S E[(π̃(x|S) − π(x))^2] = O(L^{2d/(d+2)} ((1 − B)t)^{−2/(d+2)}).

Moreover, Algorithm 2 computes the point-wise posterior in time O(t).

Remark 5. We observe that adding the contextual parameter B does not modify the convergence rate compared to the static case. With other algorithms such as LGP, the parameter B would have to be added to the feature space, increasing its dimension by 1, which impacts the convergence rate, as we demonstrate in the sequel.

4.3. Leveraging smoothness of π via experience sharing: General setting

We finally analyze the general setting where the contextual parameters are noisy and may vary among the experiments. Instead of using each Ai, Bi in the update rule of the posterior, we can perform this update as if all experiments were performed with the same coefficients A and B, which are the means of the coefficients Ai and Bi respectively. This approximation simplifies the analysis and does not influence the error rate.

Algorithm 3, called Contextual Smooth Beta Process (CSBP), is general and can be applied to any contextual parameters Ai, Bi with no constraints. We provide guarantees for convergence of the posterior distribution generated by CSBP under the certainty invariance assumption (Theorem 4).

Algorithm 3 Contextual Smooth Beta Process (CSBP)
Input: experiment descriptions S = {(xi, si, Ai, Bi)}i=1,..,t, point of interest x ∈ X, prior knowledge π̃(x) ∼ B(α(x), β(x))
Output: Posterior distribution π̃(x|S)
1. Set ∆ ∝ t^{−1/(d+2)}
2. Build the set of neighboring experiments Sx = {(xi, si, Ai, Bi) : ‖x − xi‖ ≤ ∆}
3. Compute the means B̄ = (1/|Sx|) Σ_{i:‖x−xi‖≤∆} Bi and Ā = (1/|Sx|) Σ_{i:‖x−xi‖≤∆} Ai
4. Compute the posterior as in Corollary 1 (or Corollary 2 if Ā + B̄ = 1) using the results of the experiments Sx as if performed at x, and with constant parameters Ā, B̄.

Theorem 4. Let π : [0,1]^d → ]0,1] be L-Lipschitz continuous. Suppose we observe the results of experiments S = {(xi, si, 1 − Bi, Bi)}i=1,...,t where si ∼ Bernoulli((1 − (Bi + εi))π(xi) + Bi + εi), i.e., the contextual features are noisy. We assume the εi's are independent random variables with zero mean and variance σ². The points {xi}i=1,...,t are assumed to be i.i.d. uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0,1]^d, the posterior π̃(x|S) obtained from Algorithm 3 uniformly converges in L2-norm to π(x), i.e.,

sup_{x ∈ [0,1]^d} E_S E[(π̃(x|S) − π(x))^2] = O(c(B, σ²) L^{2d/(d+2)} t^{−2/(d+2)}),

where c(B, σ²) is a constant depending on the {Bi}i=1,...,t and the noise σ². Moreover, CSBP computes the posterior in time O(t).

5. Numerical experiments

We devise a set of experiments to demonstrate the capabilities of our inference engine and validate the theoretical bounds for the static and dynamic settings. We start with synthetic experiments in 1D and 2D, and finally reproduce the case study used in (Goetschalckx et al., 2011) to show the efficiency of our dynamic algorithm.

5.1. Synthetic examples

We construct a function π : X → [0,1], uniformly select points {xi}i=1,...,t, and sample si ∼ Bernoulli(π(xi)), i = 1,...,t. From these data, SBP constructs the posterior distributions π̃(x|S) ∀x ∈ X. This experiment is performed both in a 1D setting using the feature space X = [0,1], and in 2D with X = [0,1]^2. We also apply LGP and CCBP (with a fixed squared exponential kernel) to this problem for comparison. Explicit forms of the chosen functions are presented in the Appendix.

For the dynamic setting, contextual parameters {Bi}i=1,...,t are sampled independently and uniformly from [0,1], and the tests are then performed by sampling si ∼ Bernoulli((1 − Bi)π(xi) + Bi), i = 1,...,t. The posterior is constructed using CSBP. We also applied LGP to this dynamic setting by including the parameter B as an additional feature. In order to evaluate π, LGP returns the approximated distribution associated with B = 0.

For the static setting (1D and 2D), Figures 1 (left) and 2 (left) show the L2 errors of the posterior distributions averaged across all x ∈ X, and over 20 runs, as functions of the number of samples t.
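Under the certainty invariance constraint Ai + Bi = 1, steps 3–4 of Algorithm 3 at a single query point reduce to a few lines. The sketch below is our own (function name ours); it uses the averaged context B̄ and the O(t) recursion from Corollary 2, and assumes B̄ > 0:

```python
import numpy as np

def csbp_weights(s_neighbors, B_neighbors, alpha=1.0, beta=1.0):
    """CSBP mixture weights at a query point, assuming A_i + B_i = 1 (sketch).

    s_neighbors: 0/1 outcomes of the experiments within distance Delta of x.
    B_neighbors: their contextual parameters B_i.
    Returns C such that the posterior is sum_i C[i] * Beta(alpha+i, beta+t-i).
    """
    t = len(s_neighbors)
    S = int(sum(s_neighbors))               # total number of successes
    B = float(np.mean(B_neighbors))         # step 3: averaged context
    C = np.zeros(S + 1)
    C[0] = 1.0
    for i in range(S):                      # O(t) recursion of Corollary 2
        C[i + 1] = (S - i) * (alpha + i) / (B * (i + 1) * (beta + t - 1 - i)) * C[i]
    return C / C.sum()
```

Since only S + 1 weights are ever created and each costs O(1), the per-point complexity stated in Theorem 4 follows directly.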
Figure 1. Left: L2 error for the 1D static setting. Middle: Mean posterior estimates E[π̃(x|S)] generated by Algorithm 1 for different kernel widths. Right: Running time for Algorithm 1 and LGP.
Figure 2. L2 error of the posterior E[π̃(x|S)] for the 2D static and 1D dynamic settings, averaged over all points x ∈ X, versus the number of samples.
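The static 1D experiment of Section 5.1 can be reproduced in a short script; this is our own sketch, and the smooth target function below is an arbitrary stand-in, since the explicit forms used in the paper are deferred to its appendix:

```python
import numpy as np

def run_static_1d(t, seed=0):
    """1D static synthetic experiment: average L2 error of the SBP posterior mean."""
    rng = np.random.default_rng(seed)
    pi = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)   # arbitrary smooth target
    X = rng.uniform(0.0, 1.0, t)                       # uniform experiment points
    s = rng.binomial(1, pi(X))                         # Bernoulli outcomes
    delta = t ** (-1.0 / 3.0)                          # Delta ∝ t^(-1/(d+2)), d = 1
    err = []
    for x in np.linspace(0.0, 1.0, 200):
        mask = np.abs(X - x) <= delta                  # indicator kernel
        a = 1.0 + s[mask].sum()                        # uniform prior Beta(1, 1)
        b = 1.0 + mask.sum() - s[mask].sum()
        err.append((a / (a + b) - pi(x)) ** 2)         # squared error of the mean
    return float(np.mean(err))
```

Running this for increasing t shows the error shrinking roughly at the O(t^{−2/3}) rate that Theorem 1 predicts for d = 1.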
We can observe the convergence upper bounds O(1/t^{2/3}) in 1D, and O(1/√t) in 2D, as predicted by Theorem 1.

We observe that LGP and our method perform similarly, as pointed out in (Goetschalckx et al., 2011). However, running LGP takes significantly more time than our method, since its time complexity is O(t^3) compared to O(t) for our algorithm, as demonstrated numerically in Figure 1 (right). We observe that CCBP saturates after some time, since its kernel is independent of the number of samples.

Additionally, Figure 1 (left) demonstrates two sets of error curves for variations of Algorithm 1. To argue about the optimality of the kernel width specification, we run SBP with fixed kernel widths ∆1 = 50^{−1/(d+2)} and ∆2 = 500000^{−1/(d+2)}. When ∆ ≪ t^{−1/(d+2)}, the L2 error initially decays at a slow rate and remains larger than in the optimal setting (green curves). On the contrary, if we fix the kernel width ∆ ≫ t^{−1/(d+2)}, the error saturates at early iterations (blue curves).

Figure 1 (middle) shows how the built posterior distribution approximates the true synthetic probability function by plotting the posterior mean over the space X for the different kernel widths. We observe that using a wide kernel (∆1) leads to a posterior which is too smooth, due to experience oversharing. On the other hand, using a narrow kernel (∆2) leads to a highly non-smooth posterior, due to insufficient sharing.

Figure 2 (right) similarly shows the L2 error of the posterior distributions for the dynamic setting, also averaged over all x ∈ X, and over 20 runs. We again observe the convergence upper bound O(1/√t) in 2D, as predicted by Theorem 4. We observe that our algorithm performs much better than LGP, since it operates on a lower-dimensional space.

5.2. Application to biased data: a case study from Goetschalckx et al. (2011)

Handling biased data is currently one of the major problems in machine learning. In this section, we investigate how CSBP can treat bias by means of contextual features.

In line with Goetschalckx et al. (2011), we conduct a case study with synthetic stroke rehabilitation data. The goal is to determine the probability that a patient succeeds in an exercise based on its difficulty. However, the patient can in some cases be fatigued, which influences the success probability and thus introduces a bias in the experiments.

Let f(x) denote the success probability function for an exercise with difficulty x ∈ [0,1] when the patient is not fatigued. We assume that the patient has a certain level of fatigue
Figure 3. Left: Mean posteriors E[π̃(x|S)] for the target functions representing the rested (blue) and highest-fatigue (red) states. Right: L2 error for the rested state, averaged over all points, versus sample size t.
αf ∈ [0.5, 1], in which case the success probability function becomes αf·f. Note that the impossibility invariance assumption holds in this case, since being fatigued cannot make possible a task which was already impossible. As mentioned previously, we can then simply make a change of variable in order to satisfy the certainty invariance assumption, and safely apply the dynamic algorithm.

Alternatively, by treating the level of fatigue as a new dimension in the feature space, LGP can be applied to the rehabilitation case study.

We assume that the difficulty of the exercise influences (in an unknown way) the success probability as f(x) = 1 − x, x ∈ [0,1]. We construct a synthetic dataset by uniformly sampling exercise difficulties x and fatigue levels αf, and then sampling the success from αf·f(x). Figure 3 shows the reconstructed success probability distributions when the patient is either not fatigued (rested state) or in the final fatigued state (αf = 0.5), as well as the L2 error of the posterior for the rested state. Since LGP operates on a higher-dimensional space, we observe that the L2 error decays more slowly and the approximation of the target function for the rested state is worse than with CSBP.

6. Conclusions

In this paper, we build an inference engine for learning smooth probability functions from a set of Bernoulli experiments, which may be influenced by contextual features. We design an efficient and scalable algorithm for computing a posterior converging to the target function with a provable rate, and demonstrate its efficiency on synthetic and real-world problems. These characteristics, together with the simplicity of SBP, make it a competitive tool compared to LGP, which has been shown to be an important tool in many real-world applications. We thus expect practitioners to apply this method to such problems.

Discussion and future work. The current analysis can only model a particular type of contextual influence, which modifies the success probability as Aiπ(xi) + Bi. It turns out that Theorem 2 can be generalized to any polynomial transformation of the success probability (i.e., Σ_{j=0}^{p} a_i^{(j)} π(xi)^j), allowing for a wider class of contextual influences.

Moreover, the theoretical framework we provide seems to be applicable to a large class of problems, such as risk tracking, the bandit setting, active learning, etc. Extending this model to such applications would also be an interesting research direction.

7. Acknowledgement

This work was supported by the Swiss National Science Foundation (SNSF) under grant number 407540 167319.

References

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. arXiv preprint arXiv:1803.02453, 2018.

Audibert, J.-Y., Tsybakov, A. B., et al. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

Casella, G. and Berger, R. L. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.

Chen, G. H., Shah, D., et al. Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning, 10(5-6):337–588, 2018.

DerSimonian, R. and Laird, N. Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3):177–188, 1986.
Ghosal, S. A review of consistency and convergence of posterior distribution. In Varanashi Symposium in Bayesian Inference, Banaras Hindu University, 1997.

Goetschalckx, R., Poupart, P., and Hoey, J. Continuous correlated beta processes. In IJCAI, 2011.

Gompert, Z. A continuous correlated beta process model for genetic ancestry in admixed populations. PLoS ONE, 11(3):e0151047, 2016.

Gupta, A. K. and Wong, C. On three and five parameter bivariate beta distributions. Metrika, 32(1):85–91, 1985.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Tokdar, S. T. and Ghosh, J. K. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.

van der Vaart, A. W., van Zanten, J. H., et al. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435–1463, 2008.
Hjort, N. L. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pp. 1259–1294, 1990.

Hoey, J., Yang, X., Grzes, M., Navarro, R., and Favela, J. Modeling and learning for LaCasa, the location and context-aware safety assistant. In NIPS 2012 Workshop on Machine Learning Approaches to Mobile Context, Lake Tahoe, NV, 2012.

Williams, C. K. and Rasmussen, C. E. Gaussian processes for regression. In NIPS, pp. 514–520, 1996.

Wilson, A. G. and Ghahramani, Z. Copula processes. In NIPS, pp. 2460–2468, 2010.
Johnson, R. A. and Wichern, D. Multivariate analysis. Wiley Online Library, 2002.
Knapik, B. T., van der Vaart, A. W., van Zanten, J. H., et al. Bayesian inverse problems with gaussian priors. The Annals of Statistics, 39(5):2626–2657, 2011.
Krause, A. and Ong, C. S. Contextual gaussian process bandit optimization. In NIPS, pp. 2447–2455, 2011.
Krichevsky, R. and Trofimov, V. The performance of univer- sal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981.
McNee, S. M., Lam, S. K., Konstan, J. A., and Riedl, J. Interfaces for eliciting new user preferences in recom- mender systems. In International Conference on User Modeling, pp. 178–187. Springer, 2003.
Olkin, I. and Liu, R. A bivariate beta distribution. Statistics & Probability Letters, 62(4):407–412, 2003.
Pandey, S. and Olston, C. Handling advertisements of un- known quality in search advertising. In NIPS, pp. 1065– 1072, 2007.
Ranganath, R. and Blei, D. M. Correlated random measures. Journal of the American Statistical Association, pp. 1–14, 2017.
Rasmussen, C. E. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pp. 63–71. Springer, 2004.

A. Proofs

In this appendix, we provide all proofs for the Theorems and Corollaries stated in the paper. We emphasize that we are aware of the existing theoretical tools provided in (van der Vaart et al., 2008) and (Knapik et al., 2011), but our approach is different and specific to the current setup.
A.1. Proofs of point-wise Bayesian update in dynamic case

Theorem 5. Suppose π̃(x) ∼ Σ_{i=0}^{n} C_i^n B(α + i, β + n − i) with Σ_{i=0}^{n} C_i^n = 1, and we observe the result s of a sample from a Bernoulli random variable with parameter Aπ(x) + B. Then the Bayesian posterior for π̃(x) conditioned on this observation is:

π̃(x|s) ∼ Σ_{i=0}^{n+1} C_i^{n+1} B(α + i, β + n + 1 − i),   (9)

where, ∀i = 0,...,n+1:

C_i^{n+1} = (1/E_s^n) (B C_i^n (β + n − i) + (A + B) C_{i−1}^n (α + i − 1))

if s = 1, and

C_i^{n+1} = (1/E_f^n) ((1 − B) C_i^n (β + n − i) + (1 − A − B) C_{i−1}^n (α + i − 1))
if s = 0. E_s^n and E_f^n are normalization factors that ensure Σ_{i=0}^{n+1} C_i^{n+1} = 1. For simplicity of notation, C_{−1}^n = C_{n+1}^n = 0 ∀n.
Proof. Suppose the observation is a success, i.e., s = 1. Let f_{π̃(x)} : [0,1] → ℝ≥0 be the density function of the random variable π̃(x), and let f_{π̃(x)|s=1} : [0,1] → ℝ≥0 be its density function conditioned on this observation. Then,
f_{π̃(x)|s=1}(θ) = Pr(s = 1 | π̃(x) = θ) f_{π̃(x)}(θ) / Pr(s = 1)

∝ (Aθ + B) Σ_{i=0}^{n} C_i^n B(θ; α + i, β + n − i)

= (B(1 − θ) + (A + B)θ) Σ_{i=0}^{n} C_i^n θ^{α+i−1}(1 − θ)^{β+n−i−1} / B(α + i, β + n − i)

= B Σ_{i=0}^{n} C_i^n [θ^{α+i−1}(1 − θ)^{β+n−i} / B(α + i, β + n − i + 1)] · [B(α + i, β + n − i + 1) / B(α + i, β + n − i)]
  + (A + B) Σ_{i=0}^{n} C_i^n [θ^{α+i}(1 − θ)^{β+n−i−1} / B(α + i + 1, β + n − i)] · [B(α + i + 1, β + n − i) / B(α + i, β + n − i)]

= B Σ_{i=0}^{n} C_i^n B(θ; α + i, β + n − i + 1) (β + n − i)/(α + β + n)
  + (A + B) Σ_{i=0}^{n} C_i^n B(θ; α + i + 1, β + n − i) (α + i)/(α + β + n)

∝ Σ_{i=0}^{n+1} (B C_i^n (β + n − i) + (A + B) C_{i−1}^n (α + i − 1)) B(θ; α + i, β + n + 1 − i)

∝ Σ_{i=0}^{n+1} C_i^{n+1} B(θ; α + i, β + n + 1 − i),

where B(θ; a, b) denotes the density of the Beta distribution B(a, b) evaluated at θ,
and B(·,·) is the Beta function, which satisfies B(α + 1, β)/B(α, β) = α/(α + β) and B(α, β + 1)/B(α, β) = β/(α + β).
In order to ensure that this remains a probability distribution, the coefficients C_i^{n+1} must satisfy Σ_{i=0}^{n+1} C_i^{n+1} = 1. The result for s = 0 can be shown similarly.
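As a numerical sanity check (ours, not part of the paper), the n = 0 success case of this update — a Beta(α, β) prior becoming the mixture with weight ∝ Bβ on B(α, β + 1) and weight ∝ (A + B)α on B(α + 1, β) — can be compared against direct Bayesian updating on a grid:

```python
import math
import numpy as np

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) evaluated on a grid of theta values."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * theta ** (a - 1) * (1.0 - theta) ** (b - 1)

alpha, beta_, A, B = 2.0, 3.0, 0.4, 0.3          # arbitrary example values
theta = np.linspace(1e-6, 1.0 - 1e-6, 20001)
dx = theta[1] - theta[0]

# Direct Bayes: posterior density after a success with likelihood A*theta + B.
direct = (A * theta + B) * beta_pdf(theta, alpha, beta_)
direct = direct / (direct.sum() * dx)            # normalize numerically

# Mixture predicted by the theorem at n = 0.
Z = B * beta_ + (A + B) * alpha
mixture = (B * beta_ / Z) * beta_pdf(theta, alpha, beta_ + 1.0) \
        + ((A + B) * alpha / Z) * beta_pdf(theta, alpha + 1.0, beta_)

max_gap = float(np.max(np.abs(direct - mixture)))
assert max_gap < 1e-3                            # the two densities agree
```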
Theorem 2 is a special case of this result, for n = 0. Corollary 1 directly follows from this theorem, by applying it recursively for each observation.
Corollary 2. Suppose π˜(x) ∼ B(α, β) and we observe the outputs of experiments S = {(si, x, 1 − B,B)}i=1,...,t where si ∼ Bernoulli((1 − B)π(x) + B). Then the Bayesian posterior π˜(x|S) conditioned on these observations is given by
π̃(x|S) ∼ Σ_{i=0}^{S} C_i^t B(α + i, β + t − i),   (10)

where S = Σ_{i=1}^{t} si is the total number of successes and

C_i^t ∝ (S choose i) (α − 1 + i)! (β + t − 1 − i)! B^{S−i},   (11)
∀i = 0,...,S. Using the relation C_{i+1}^t = [(S − i)(α + i) / (B(i + 1)(β + t − 1 − i))] C_i^t, we can compute all C_i^t's in time O(t).
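The equivalence between the closed form (11) and the O(t) recursion can be checked numerically for a uniform prior (α = β = 1); this is our own sanity check, not part of the paper:

```python
import math
import numpy as np

alpha, beta_, B, t, S = 1, 1, 0.5, 6, 4      # example: 4 successes in 6 tests

# Closed form (11): C_i ∝ binom(S, i) (alpha-1+i)! (beta+t-1-i)! B^(S-i).
closed = np.array([
    math.comb(S, i) * math.factorial(alpha - 1 + i)
    * math.factorial(beta_ + t - 1 - i) * B ** (S - i)
    for i in range(S + 1)
])
closed = closed / closed.sum()

# O(t) recursion: C_{i+1} = (S-i)(alpha+i) / (B (i+1) (beta+t-1-i)) * C_i.
rec = np.zeros(S + 1)
rec[0] = 1.0
for i in range(S):
    rec[i + 1] = (S - i) * (alpha + i) / (B * (i + 1) * (beta_ + t - 1 - i)) * rec[i]
rec = rec / rec.sum()

assert np.allclose(closed, rec)              # both normalizations coincide
```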
Proof. We want to prove that the iterative process for computing the coefficients C_i^t in Corollary 1 ends with the coefficients C_i^t of equation (11). We prove this by induction over t. For t = 0, the result is obvious, since S = 0 and C_0^0 = 1.
Now suppose the result is true for some time n, and let us prove that it remains true for time n + 1. Let S_n be the total number of successes observed up to time n, and let s_{n+1} be the new observation at time n + 1. Suppose s_{n+1} = 1. Then S_{n+1} = S_n + 1, and ∀i = 1,...,S_{n+1}:
C_i^{n+1} ∝ B C_i^n (β + n − i) + C_{i−1}^n (α + i − 1)
∝ (S_n choose i) (α − 1 + i)! (β + n − 1 − i)! B^{S_{n+1}−i} (β + n − i)
  + (S_n choose i−1) (α − 1 + i − 1)! (β + n − i)! B^{S_{n+1}−i} (α + i − 1)
= (S_{n+1} choose i) (α − 1 + i)! (β + (n + 1) − 1 − i)! B^{S_{n+1}−i}
Similarly, if s_{n+1} = 0, then S_{n+1} = S_n, and ∀i = 1,...,S_{n+1}:
C_i^{n+1} ∝ (1 − B) C_i^n (β + n − i)
∝ (S_{n+1} choose i) (α − 1 + i)! (β + (n + 1) − 1 − i)! B^{S_{n+1}−i}
In particular, we can see that the number of coefficients increases only when we observe a success.
A.2. Proof of convergence in the static case

Theorem 1. Let π : [0,1]^d → [0,1] be L-Lipschitz continuous. Suppose we measure the results of experiments S = {(xi, si)}i=1,...,t where si is a sample from a Bernoulli distribution with parameter π(xi). Experiment points {xi}i=1,...,t are assumed to be i.i.d. and uniformly distributed over the space. Then, starting with a uniform prior α(x) = β(x) = 1 ∀x ∈ [0,1]^d, the posterior π̃(x|S) obtained from Algorithm 1 uniformly converges in L2-norm to π(x), i.e.