
MCPA: Program Analysis as Machine Learning

Marcel Böhme Monash University, [email protected]

ABSTRACT

Program analysis today takes an analytical approach which is quite suitable for a well-scoped system. Data- and control-flow is taken into account. Special cases, such as pointers, procedures, and undefined behavior, must be handled. A program is analyzed precisely on the statement level. However, the analytical approach is ill-equipped to handle implementations of the complex, large-scale, heterogeneous systems we see in the real world. Existing static analysis techniques that scale trade correctness (i.e., soundness or completeness) for scalability and rely on strong assumptions (e.g., language-specificity). Scalable static analyses are well-known to report errors that do not exist (false positives) or fail to report errors that do exist (false negatives). Then, how do we know the degree to which the analysis outcome is correct?

In this paper, we propose an approach to scale-oblivious greybox program analysis with bounded error which applies efficient approximation schemes (FPRAS) from the foundations of machine learning: PAC learnability. Given two parameters δ and ϵ, with probability at least (1 − δ), our Monte Carlo Program Analysis (MCPA) approach produces an outcome that has an average error at most ϵ. The parameters δ > 0 and ϵ > 0 can be chosen arbitrarily close to zero (0) such that the program analysis outcome is said to be probably-approximately correct (PAC). We demonstrate the pertinent concepts of MCPA using three applications: (ϵ, δ)-approximate quantitative analysis, (ϵ, δ)-approximate software verification, and (ϵ, δ)-approximate patch verification.

Figure 1: Daily number of executions of large software systems in 2018, and the lower (Ro3) and upper bounds (HFD) on the accuracy ϵ of an estimate µˆ of the probability µ that a binary program property φ holds (e.g., bug is found). Specifically, µ ∈ [µˆ − ϵ, µˆ + ϵ] with probability δ = 0.01 (i.e., 99%-CIs).

System    | User-Generated Executions Per Day      | ϵ (HFD)      | ϵ (Ro3)
OSS-Fuzz  | 200.0 billion test inputs* [46]        | 3.6 · 10^-6  | 2.3 · 10^-11
Netflix   | 86.4 billion API requests [50]         | 5.5 · 10^-6  | 5.3 · 10^-11
Youtube   | 6.2 billion videos watched [47]        | 2.1 · 10^-5  | 7.4 · 10^-10
Google    | 5.6 billion searches [47]              | 2.1 · 10^-5  | 8.2 · 10^-10
Tinder    | 2.0 billion swipes [51]                | 3.6 · 10^-5  | 2.3 · 10^-9
Facebook  | 1.5 billion user logins [48]           | 4.2 · 10^-5  | 3.1 · 10^-9
Twitter   | 681.7 million tweets posted [47]       | 6.2 · 10^-5  | 6.8 · 10^-9
Skype     | 253.8 million calls [47]               | 1.0 · 10^-4  | 1.8 · 10^-8
Visa      | 150.0 million transactions [45]        | 1.3 · 10^-4  | 3.1 · 10^-8
Instagram | 71.1 million photos uploaded [47]      | 1.9 · 10^-4  | 6.5 · 10^-8
(*) per widely-used, security-critical project, on average; 10 trillion in total.

1 INTRODUCTION

In 2018, Harman and O'Hearn launched an exciting new research agenda: the innovation of frictionless(1) program analysis techniques that thrive on industrial-scale software systems [21]. Much progress has been made. Tools like Sapienz, Infer, and Error Prone are routinely used at Facebook and Google [1, 9, 37]. Yet, many challenges remain. For instance, the developers of Infer set the clear expectation that the tool may report many false alarms, does not handle certain language features, and can only report certain types of bugs.(2) Such tools often trade soundness or completeness for scalability. Then, just how sound or complete is an analysis which ignores, e.g., expensive pointer analysis, reflection in Java, undefined behaviors in C, or third-party libraries for which code is unavailable?

We believe that static program analysis today resembles the analytical approach in the natural sciences. However, in the natural sciences, when analytical solutions are no longer tractable, other approaches are used. Monte Carlo methods are often the only means to solve very complex systems of equations ab initio [19]. For instance, in quantum mechanics the following multi-dimensional integral gives the evolution of an atom that is driven by laser light, undergoes "quantum jumps" in a time interval (0, t), and emits exactly n photons at times t_n ≥ ... ≥ t_1:

  ∫_0^t dt_n ∫_0^{t_n} dt_{n-1} ... ∫_0^{t_2} dt_1  p_[0,t)(t_1, ..., t_n)

where p_[0,t) is the elementary probability density function [14]. Solving this equation is tractable only using Monte Carlo (MC) integration [33]. MC integration solves an integral F = ∫_0^{t_i} f(x) dt_i by "executing" f(x) on random inputs x ∈ R^n. Let X_j ∈ [0, t_i] be the j-th of N samples from the uniform distribution; then F is estimated as ⟨F_N⟩ = (t_i / N) Σ_{j=1}^{N} f(X_j). MC integration guarantees that this estimate converges to the true value F at a rate of 1/√N. This is true, no matter how many variables t_i the integral has or how the function f : R^n → R is "implemented".

We argue that frictionless program analysis at the large, industrial scale requires a fundamental change of perspective. Rather than starting with a precise program analysis, and attempting to carefully trade some of this precision for scale, we advocate an inherently scale-oblivious approach. In this paper, we cast scale-oblivious program analysis as a probably-approximately correct (PAC)-learning problem [25], provide the probabilistic framework, and develop several fully polynomial-time randomized approximation schemes (FPRAS). Valiant [43] introduced the PAC framework to study the computational complexity of machine learning techniques. The objective of the learner is to receive a set of samples and generate a hypothesis that with high probability has a low generalization error. We say the hypothesis is probably, approximately correct, where both adverbs are formalized and quantified.

(1) Friction is a technique-specific resistance to adoption, such as developers' reluctance to adopt tools with high false positive rates.
(2) https://fbinfer.com/docs/limitations.html

arXiv:1911.04687v1 [cs.SE] 12 Nov 2019. Technical Report @ Arxiv, Submitted: 12 Nov. 2019, Melbourne, Australia. © 2019 held by the owner/author(s).
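To make the 1/√N convergence behavior concrete, here is a minimal Monte Carlo integration sketch in Python (our own illustration, not code from the paper); f(x) = x² stands in for the "implemented" function, and the true value of ∫_0^1 x² dx is 1/3:

```python
import random

def mc_integrate(f, lo, hi, n, rng):
    """Estimate the integral of f over [lo, hi] from n uniform samples."""
    total = sum(f(rng.uniform(lo, hi)) for _ in range(n))
    return (hi - lo) * total / n

rng = random.Random(42)
# The absolute error shrinks roughly at a rate of 1/sqrt(N).
for n in (100, 10_000, 100_000):
    est = mc_integrate(lambda x: x * x, 0.0, 1.0, n, rng)
    print(n, est, abs(est - 1 / 3))
```

Nothing about the estimator changes if f were an opaque binary instead of a lambda; only the ability to execute f on random inputs is required, which is the property MCPA exploits.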

We call our technique Monte Carlo program analysis (MCPA). The learner is the program analysis. The samples (more formally, Monte Carlo trials) are distinct program executions that are potentially generated by different users of the software system under normal workload. The hypothesis is the analysis outcome. Given two parameters δ and ϵ, an MCPA produces an outcome that with probability at least (1 − δ) has an average error at most ϵ. It is important to note that the analysis outcome is probably-approximately correct w.r.t. the distribution of Monte Carlo trials. For instance, if the executions are generated by real users, the analysis outcome is probably-approximately correct w.r.t. the software system as it is used by real users.

In contrast to existing program analysis techniques, MCPA
• allows to reason about the entire software system under normal workload, i.e., an MCPA conducted while the system is used normally produces an outcome w.r.t. normal system use,
• requires only greybox access to the software system, i.e., no source code is analyzed; instead, lightweight program instrumentation allows to monitor software properties of interest,
• provides probabilistic guarantees on the correctness and accuracy of the analysis outcome, i.e., allows to exactly quantify the scalability-correctness tradeoff,
• can be implemented in a few lines of Python code, and
• is massively parallelizable, scale-oblivious, and can be interrupted at any time.

MCPA is a general approach that can produce outcomes with bounded error for arbitrary program analyses. This is because the PAC-framework covers all of machine learning, which can learn arbitrary concepts as well. Shalev-Shwartz and Ben-David [41] provide an excellent overview of the theory of machine learning. However, in this paper we focus on (ϵ, δ)-approximations of the probability that a binary property holds for an arbitrary execution of a terminating program. This has important applications, e.g., in assessing reliability [5, 6, 16], quantifying information leaks [34], or exposing side-channels [3]. Given parameters δ > 0 and ϵ > 0, our MCPA guarantees that the reported estimate µˆ is no more than ϵ away from the true value µ with probability at least 1 − δ, i.e., P(|µˆ − µ| ≥ ϵ) ≤ δ. In statistical terms, MCPA guarantees the (1 − δ)-confidence interval µˆ ± ϵ. Note that the confidence and accuracy parameters δ and ϵ can be chosen arbitrarily close to zero (0) to minimize the probability of false negatives.

To illustrate the power of MCPA, we show how often well-known, industrial-scale software systems are executed per day (Fig. 1). For instance, Netflix handles 86.4 billion API requests (8.64 · 10^10) per day. MCPA guarantees that the following statements are true with probability greater than 99% w.r.t. the sampled distribution:
• No matter which binary property we check or how many properties we check simultaneously, we can guarantee a priori that the error ϵ of our estimate(s) µˆ of µ is bounded between 5.3 · 10^-11 ≤ ϵ ≤ 5.5 · 10^-6 such that µ ∈ µˆ ± ϵ.
• If 0 out of 86.4 billion API requests expose a security flaw, then the true probability that there exists a security flaw that has not been exposed is less than 5.3 · 10^-11 (Sec. 4).
• If 1000 of 86.4 billion API requests expose a vulnerability, then the true probability µ is probabilistically guaranteed to be within µ ∈ 1.16 · 10^-8 ± 1.45 · 10^-9 (Sec. 5).
• If a patch was submitted, it would require at most 400 million API requests (equiv. 7 minutes) to state with confidence (p-value < 10^-3) that the failure probability has indeed reduced, if none of those 400M executions exposes the bug after the patch is applied. MCPA provides an a-priori conditional probabilistic guarantee that the one-sided null-hypothesis can be rejected at significance level α = 0.01 (Sec. 6).
• For an arbitrary program analysis, let H be a finite hypothesis class containing all possible program analysis outcomes. If MCPA chooses the outcome h ∈ H which explains the analysis target (e.g., points-to-analysis) for all 5.6 billion daily Google searches, then h explains the analysis targets with a maximum error of ϵ ≤ log(|H|/δ) / (5.6 · 10^9) for "unseen" searches [41].

On a high level, MCPA combines the advantages of static and dynamic program analysis while mitigating their individual challenges. Like a static program analysis, MCPA allows to make statements over all executions of a program. However, MCPA does not take an analytical approach. A static analysis evaluates statements in the program source code, which requires strong assumptions about the programming language and its features. This poses substantial challenges for static program analysis. For instance, alias analysis is undecidable [35]. Like a dynamic program analysis, MCPA overcomes these challenges by analyzing the execution of the software system. However, MCPA does not analyze only one execution. Instead, MCPA allows to make statements over all executions generated from a given (operational) distribution.

The contributions of this article are as follows:
• We discuss the opportunities of employing Monte Carlo methods as an approximate, greybox approach to large-scale quantitative program analysis with bounded error.
• We develop several fully-polynomial randomized approximation schemes (FPRAS) that guarantee that the produced estimate µˆ of the probability µ that a program property holds is no more than ϵ away from µ with probability at least (1 − δ). We tackle an open problem in automated program repair: When is it appropriate to claim with some certainty that the bug is indeed repaired?
• We evaluate the efficiency of our MCPA probabilistically, by providing upper and lower bounds for all 0 ≤ µ ≤ 1, and empirically, in more than 10^18 simulation experiments.

2 MONTE CARLO PROGRAM ANALYSIS

Monte Carlo program analysis (MCPA) allows to derive statements about properties of a software system that are probably approximately correct [42]. Given a confidence parameter 0 < δ < 1 and an accuracy parameter ϵ > 0, with probability at least (1 − δ), MCPA produces an analysis outcome that has an average error at most ϵ.

Problem statement. In the remainder of this paper, we focus on a special case of MCPA, i.e., to assess the probability µ that a binary property φ holds (for an arbitrary number of such properties). The probability that φ does not hold (i.e., that ¬φ holds) is simply (1 − µ). Suppose, during the analysis phase, it was observed that φ holds for the proportion µˆ of sample executions. We call those sample executions in the analysis phase Monte Carlo trials. The analysis phase can be conducted during testing or during normal usage of the software system. The analysis outcome of an MCPA is expected to hold w.r.t. the concerned executions, e.g., further executions of the system during normal usage. Our objective is to guarantee for the concerned executions that the reported estimate µˆ is no more than ϵ away from the true value µ with probability at least 1 − δ, i.e., P(|µˆ − µ| ≥ ϵ) ≤ δ.

Assumptions. We adopt the classical assumptions from software testing: We assume that (i) all executions terminate, (ii) property violation is observable (e.g., an uncaught exception), and (iii) the Monte Carlo trials are representative of the concerned executions. We elaborate each assumption and its consequences in the following sections. We make no other assumptions about the program P or the property φ. For instance, P could be deterministic, probabilistic, distributed, or concurrent, written in C, Java, or Haskell, or a trained neural network, or a plain old program. We could define φ to hold if an execution yields correct behavior, if a security property is satisfied, if a soft deadline is met, if an energy threshold is not exceeded, if the number of cache-misses is sufficiently low, if no buffer overwrite occurs, if an assertion is not violated, et cetera.

2.1 Assumption 1: All Executions Terminate

We assume that all executions of the analyzed program P terminate. With terminating executions we mean independent units of behavior that have a beginning and an end. For instance, a user session runs from login to logout, a transaction runs from the initial handshake until the transaction is concluded, and a program method runs from the initial method call until the call is returned. Other examples are shown in Figure 1. This is a realistic assumption also in software testing.

If assumption does not hold. As we require all (sample and concerned) executions to terminate, MCPA cannot verify temporal properties, such as those expressed in linear temporal logic (LTL). If the reader wishes to verify temporal properties over non-terminating executions, we refer to probabilistic model checking [22, 24, 28] or statistical model checking [29, 39]. If the reader wishes to verify binary properties over non-terminating executions, we suggest to use probabilistic symbolic execution [18] on partial path constraints of bounded length k.

2.2 Assumption 2: Property Outcomes Are Observable

We assume that the outcome of property φ ∈ {0, 1} can be automatically observed for each execution. Simply speaking, we cannot estimate the proportion µˆ of concerned executions for which φ holds if we cannot observe whether φ holds for any Monte Carlo trial during the analysis phase. This is a realistic assumption that also exists in software testing.

Greybox access. Some properties φ can be observed externally without additional code instrumentation. For instance, we can measure whether latency, performance, or energy thresholds are exceeded by measuring the time it takes to respond or to compute the final result, or how much energy is consumed. Other properties can be observed by injecting lightweight instrumentation directly into the program binary, e.g., using DynamoRIO or Intel Pin. Other properties can be made observable at compile-time, causing very low runtime overhead. An example is AddressSanitizer [40], which is routinely compiled into security-critical program binaries to report (exploitable) memory-related errors, such as buffer overflows and use-after-frees.

2.3 Assumption 3: Property Outcomes Are Independent and Identically Distributed

The sequence of n MC trials is a stochastic process F = {X_m}_{m=1}^{n} where X_m ∈ F is a binomial random variable that is true if φ = 1 in the m-th trial. We assume that the property outcomes F are independent and identically distributed (IID). This is a classic assumption in testing, e.g., all test inputs are sampled independently from the same (operational) distribution [16]. More generally, throughout the analysis phase, the probability µ that φ holds is invariant; the binomial distribution over X_m is the same for all X_m ∈ F.

Testing IID. In order to test whether F is IID, there are several statistical tools available. For instance, the turning point test is a statistical test of the independence of a series of random variables.

If assumption does not hold. The assumption that executions are identically distributed does not hold for stateful programs where the outcome of φ in one execution may depend on previous executions. In such cases, we suggest to understand each fine-grained execution as a transition from one state to another within a single non-terminating execution. The state transitions can then be modelled as a Markov chain and checked using tools such as probabilistic model checkers [22, 24].

2.4 Assumption 4: Monte Carlo Trials Represent Concerned Executions

We assume that the Monte Carlo trials are representative of the concerned executions. In other words, the executions of the software system that were generated during the analysis phase are from the same distribution as the executions w.r.t. which the analysis outcome is expected to hold. This is a realistic assumption shared with software testing, and any empirical analysis in general.

Realistic Behavior. A particular strength of MCPA compared to existing techniques is that the software system can be analyzed under normal workload to derive a probably-approximately correct analysis outcome that holds w.r.t. the software system as it is normally executed.

3 MOTIVATION: CHALLENGES OF THE ANALYTICAL APPROACH

The state-of-the-art enabling technology for quantitative program analysis is probabilistic symbolic execution (PSE), which combines symbolic execution and model counting. In a 2017 LPAR keynote [44], Visser called PSE the new hammer. Indeed, we find that PSE is an exciting, new tool in the developer's repertoire for quantitative program analysis problems, particularly if an exact probability is required. However, the analytical approach of PSE introduces several challenges for the analysis of large-scale, heterogeneous software systems which MCPA is able to overcome.

3.1 Probabilistic Symbolic Execution

Suppose property φ is satisfied for all program paths I. Conceptually, for each path i ∈ I, PSE computes the probability p_i that an input exercises i and then reports µ = Σ_{i∈I} p_i. To compute the probability p_i that an input exercises path i, PSE first uses symbolic execution to translate the source code that is executed along i into a Satisfiability Modulo Theory (SMT) formula, called path condition. The path condition π(i) is satisfied by all inputs that exercise i. A model counting tool can then determine the number of inputs ⟦π(i)⟧ that exercise i. Given the size ⟦D⟧ of the entire input domain D, the probability p_i to exercise path i can be computed as p_i = ⟦π(i)⟧ / ⟦D⟧.

There are two approaches to establish the model count. Exact model counting [3, 7, 16, 18] determines the model count precisely. For instance, LattE [31] computes ⟦π(i)⟧ as the volume of a convex polytope which represents the path constraint. However, an exact count cannot be established efficiently and turns out to be intractable in practice [2, 11].(3) Hence, recent research has focussed on PSE involving approximate model counting. Approximate model counting trades accuracy for efficiency and determines ⟦π(i)⟧ approximately. Inputs are sampled from non-overlapping, axis-aligned bounding boxes that together contain all (but not only) solutions of π(i) [5, 6, 17, 32]. To determine whether a sampled input exercises i, it is plugged into π(i) and checked for satisfiability. The proportion of "hits", multiplied by the size of the bounding box and summed over all boxes, gives the approximate model count. In contrast to arbitrary polytopes, the size of an axis-aligned bounding box can be determined precisely and efficiently.

3.2 Challenges

Probabilistic symbolic execution uses an incremental encoding of the program's source code as a set of path conditions. Whether static or dynamic symbolic execution, the semantic meaning of each individual program statement is translated into an equivalent statement in the supported Satisfiability Modulo Theory (SMT). The quantitative program analysis, whether exact or approximate, is then conducted upon those path conditions. This indirection introduces several limitations.

(1) Only programs that can be represented in the available SMT theory can be analyzed precisely (e.g., bounded linear integer arithmetic [18] or strings [7]). For approximate PSE, only bounded numerical input domains are allowed. Otherwise, bounding boxes and their relative size cannot be derived. Programs that are built on 3rd-party libraries, are implemented in several languages, contain unbounded loops, involve complex data structures, or execute several threads simultaneously pose challenges for probabilistic symbolic execution.

(2) Only functional properties φ can be checked; those can be encoded within a path condition, e.g., as assertions over variable values. In contrast, non-functional properties, such as whether the response-time or energy consumption exceeds some threshold value, are difficult to check.

(3) Only a lower bound on the probability µ that φ holds can be provided. PSE cannot efficiently enumerate all paths for which property φ holds. Reachability is a hard problem. Otherwise, symbolic execution would be routinely used to prove programs correct. Hence, PSE computes µ as µ′ = Σ_{i∈I′} p_i only for a reasonable subset of paths I′ ⊆ I that satisfy φ [16], which is why the reported probability µ′ ≤ µ.

(4) No confidence intervals are available for approximate PSE. While approximate PSE trades accuracy (of the analysis outcome) for efficiency (of the analysis), it does not provide any means to assess this trade-off.(4) Approximate PSE provides only a (negatively biased) point estimate µˆ of µ. Due to the modular nature of the composition of the estimate µˆ, it is difficult to derive a reliable (1 − δ)-confidence interval [µˆ − ϵ, µˆ + ϵ].

(5) To provide an analysis outcome for programs under normal workload, a usage profile must be developed that can be integrated with the path condition. PSE requires to formally encode user behavior as a usage profile [16]. However, deriving the usage profile from a sample of typical executions can introduce uncertainty in the stochastic model that must be accounted for [30].

We believe that many of these limitations can be addressed. Yet, they do demonstrate a shortcoming of the analytical approach. A static program analysis must parse the source code and interpret each program statement on its own. It cannot rely on the compiler to inject meaning into each statement. The necessary assumptions limit the applicability of the analysis. The required machinery is a substantial performance bottleneck. For PSE, we failed to find reports of experiments on programs larger than a few hundred lines of code. Notwithstanding, we are excited about recent advances in scalable program analysis where correctness is carefully traded for scalability. However, how do we quantify the error of the analysis outcome? What is the probability of a false positive/negative?

3.3 Opportunities

Monte Carlo program analysis resolves many of these limitations.
(1) Every program can be analyzed. Apart from termination, we make no assumptions about the software system.
(2) Every binary property φ can be checked, including non-functional properties, as long as it can be automatically observed. Moreover, the number of required trials is independent of the number of simultaneously checked properties.
(3) The maximum likelihood estimate µˆ of the binomial proportion µ is unbiased. Unlike PSE, MCPA does not require to identify or enumerate program paths that satisfy φ.
(4) The error is bounded. The analysis results are probably approximately correct, i.e., given the parameters δ and ϵ, with probability (1 − δ), we have that µ ∈ [µˆ − ϵ, µˆ + ϵ].
(5) The best model of a system is the running system itself. The system can be analyzed directly during normal execution, with analysis results modulo the concerned executions.

It is interesting to note that approximate probabilistic symbolic execution employs Monte Carlo techniques upon each path condition, effectively simulating the execution of the program. Our key observation is that the compiler already imbues each statement with meaning, and program binaries can be executed concretely many orders of magnitude faster than they can be simulated.

(3) Gulwani et al. [38] observe that the "exact volume determination for real polyhedra is quite expensive, and often runs out of time/memory on the constraints that are obtained from our benchmarks". Time is exponential in the number of input variables. While a simple constraint involving four variables is quantified in less than 3 minutes in some experiments, a similarly simple constraint involving eight variables is quantified in just under 24 hours [31].
(4) We note that the sample variance σˆ is not an indicator of estimator accuracy. Suppose µ = 0.5 such that the (true) variance σ² = 0.25/n is maximal. Due to randomness, we may take n samples where φ does not hold. Our estimate µˆ = 0 is clearly inaccurate, yet the sample variance σˆ² = 0 gives no indication of this inaccuracy.
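The contrast with PSE can be summarized in a short sketch: instead of enumerating path conditions, a greybox MCPA simply executes the program on workloads drawn from the operational distribution and monitors the property outcome. The program, property monitor, and workload below are hypothetical stand-ins of our own for illustration:

```python
import random

def mcpa_estimate(program, phi, workload, n):
    """Greybox MCPA sketch: run n Monte Carlo trials drawn from an
    operational workload and report the proportion of trials in which
    the binary property phi holds. No path condition is ever built."""
    holds = 0
    for _ in range(n):
        outcome = program(workload())   # one concrete execution
        holds += phi(outcome)           # lightweight property monitor
    return holds / n

# Hypothetical stand-ins: the "program" divides, and phi holds
# iff the execution terminates without an exception.
def program(x):
    try:
        return ("ok", 1000 // x)
    except ZeroDivisionError:
        return ("crash", None)

phi = lambda outcome: outcome[0] == "ok"
rng = random.Random(7)
workload = lambda: rng.randint(0, 9)    # input 0 triggers the crash
mu_hat = mcpa_estimate(program, phi, workload, 100_000)
```

Here the true µ is 0.9 (input 0 occurs with probability 1/10), and µˆ converges toward it at the Monte Carlo rate, regardless of how the program is implemented internally.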

Algorithm 1 (ϵ, δ)-Approximate Software Verification
Input: Program P, Binary properties Φ
Input: Accuracy ϵ, Confidence δ
1: Let n = log(δ)/log(1 − ϵ)
2: for each of n Monte Carlo trials do
3:   Execute program P and observe property outcomes
4:   for φ_i ∈ Φ do
5:     if φ_i does not hold in current execution
6:     then return "Property φ_i violated for current execution"
7:   end for
8: end for
return "For all φ_i ∈ Φ, we estimate the probability that φ_i holds in P as µˆ_L = (n+1)/(n+2) and guarantee that the true probability µ_i ∈ [1 − ϵ, 1] with probability at least (1 − δ)."

Figure 2: The number n of Monte Carlo trials required, as confidence δ and accuracy ϵ vary. The surface plot (left) is a 3D plot of the function. The contour plot (right) visualizes how n changes as δ and ϵ vary. (Plots computed by Wolfram|Alpha, https://www.wolframalpha.com/input/?i=plot+log(delta)/log(1-epsilon)+for+delta=0..1,+epsilon=0..1, accessed June 23, 2019.)

4 (ϵ, δ)-APPROXIMATE SOFTWARE VERIFICATION

During the analysis of large software systems, the probability µ that a binary program property φ holds for an execution is often almost or exactly one.(5) For a very large number of successive Monte Carlo trials, we might not ever observe that φ is violated. Still, there is always some residual risk that ¬φ holds for an infinitesimal proportion of concerned executions, i.e., (1 − µ) > 0 [4]. Hence, (ϵ, δ)-approximate software verification allows to specify an allowable residual risk ϵ. It guarantees 0 ≤ (1 − µ) ≤ ϵ with probability (1 − δ).

If the binary program property φ holds for all of n Monte Carlo trials, our empirical estimate µˆ of the probability µ that φ holds is µˆ = n/n = 1. In this section, we provide a better point estimate µˆ_L for µ in the absence of empirical evidence for the violation of φ. Now, Hoeffding's inequality already provides probabilistic error bounds (Sec. 5): Given the confidence parameter δ, we can compute the accuracy ϵ of the estimate µˆ such that µ is within µˆ ± ϵ with probability (1 − δ). However, in this special case we can leverage a generalization of the rule-of-three (Ro3) to reduce the radius ϵ of the guaranteed confidence interval by many orders of magnitude.

Point estimate. Given that the sun has risen ever since we were born n days ago, what is the probability that the sun will rise tomorrow? This riddle is known as the sunrise problem, and the solution is due to Laplace. Obviously, our plugin estimator µˆ = 1 is positively biased, as there is always some residual probability that the sun will fail to rise tomorrow (i.e., that ¬φ holds). If we have observed that φ holds in all of n trials, the Laplace estimate µˆ_L of µ is computed as

  µˆ_L = (n + 1)/(n + 2)    (1)

Error bounds. The rule-of-three [20] is widely used in the evaluation of medical trials. It yields the 95%-confidence interval [0, 3/n] for the probability of adverse effects given a trial where none of the n participants experienced adverse effects. Given an accuracy parameter ϵ > 0, Hanley and Lippmann-Hand [20] provide an upper bound on the probability that the estimate µˆ deviates more than ϵ from the true value µ, and since µˆ = 1, also on the probability that µ ≤ 1 − ϵ:

  P(|µˆ − µ| ≥ ϵ) ≤ (1 − ϵ)^n    (2)

Given a confidence parameter δ : 0 < δ ≤ 1, we can compute the number n of Monte Carlo trials required to guarantee the (1 − δ)-confidence interval [µˆ − ϵ, µˆ + ϵ] as

  n ≥ log(δ)/log(1 − ϵ)    (3)

Analogously, given n Monte Carlo trials, with probability at least 1 − δ, the absolute difference between the estimate µˆ and the true value µ is at most ϵ where

  ϵ ≤ 1 − δ^(1/n)    (4)

Note that for a 95%-CI (i.e., δ = 0.05), we have that ϵ ≤ 3/n, giving rise to the name rule-of-three.

Efficiency. Our Algorithm 1 runs in time that is polynomial in (log(1/(1 − ϵ)))^-1 < 1/ϵ and in log(1/δ). It is thus a fully polynomial randomized approximation scheme (FPRAS). The efficiency of Algorithm 1 is visualized in Figure 2. Reducing ϵ by an order of magnitude increases the required number of Monte Carlo trials also only by an order of magnitude. This is further demonstrated for the real-world examples shown in Figure 1.

(5) Note that we can reduce the alternative case, where the probability µ that φ holds is almost or exactly zero, to the case that we consider here. In that case, the probability (1 − µ) that φ does not hold (i.e., ¬φ holds) is almost or exactly one.

5 (ϵ, δ)-APPROXIMATE QUANTITATIVE ANALYSIS

Quantitative program analysis is concerned with the proportion µ of executions that satisfy some property φ of interest. Quantitative analysis has important applications, e.g., in assessing software reliability [5, 6, 16], computing performance distributions [10], quantifying information leaks [34], quantifying the semantic difference across two program versions [15], or exposing side-channels [3]. We propose (ϵ, δ)-approximate quantitative analysis, which yields an estimate µˆ that with probability at least (1 − δ) has an average error at most ϵ. While the generalized rule-of-three (Ro3) provides a lower bound on the number of trials needed to compute an (ϵ, δ)-approximation of µ, Hoeffding's inequality provides an upper bound.
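Algorithm 1 and Eqs. (1) and (3) from the previous section are short enough to sketch directly in Python (a minimal illustration under our own naming; execute_and_check is a hypothetical stand-in for one instrumented Monte Carlo trial):

```python
import math

def approx_verify(execute_and_check, eps, delta):
    """Sketch of Algorithm 1: run n = ceil(log(delta)/log(1 - eps))
    Monte Carlo trials, where execute_and_check() performs one trial
    and returns True iff all monitored properties held."""
    n = math.ceil(math.log(delta) / math.log(1.0 - eps))  # Eq. (3)
    for _ in range(n):
        if not execute_and_check():
            return ("violated", None, n)
    mu_L = (n + 1) / (n + 2)   # Laplace point estimate, Eq. (1)
    # With probability at least 1 - delta, the true mu is in [1 - eps, 1].
    return ("verified", mu_L, n)

# E.g., residual risk eps = 1e-3 at 99% confidence needs ~4.6k trials.
status, mu_L, n = approx_verify(lambda: True, eps=1e-3, delta=0.01)
```

Note the rule-of-three shape of the trial count: for δ = 0.05 the required n is roughly 3/ϵ, per Eq. (4).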

Algorithm 2: (ϵ, δ)-Approximate Quantitative Program Analysis (Fully-Polynomial Randomized Approximation Scheme)

    Input: Program P, Binary property φ
    Input: Accuracy ϵ, Confidence δ
     1: Let n = log(2/δ)/(2ϵ²)
     2: Let m = 1
     3: repeat
     4:   Execute P, observe outcome of φ, and increment m
     5:   Let ϵ′ be the radius of the current CP interval (cf. Eq. (9))
     6: until (m == n) or (ϵ′ ≤ ϵ)
     7: Let µˆ = X/m where X is the total frequency φ has held
        return "We estimate the prob. that φ holds in P is µˆ and guarantee that the true prob. µ ∈ µˆ ± ϵ with prob. at least (1 − δ)."

Figure 3: The number n of Monte Carlo trials required, as confidence δ and accuracy ϵ vary. The surface plot (left) is a 3D plot of the function. The contour plot (right) visualizes how n changes as δ and ϵ vary. The bright area at the bottom demonstrates substantial growth in n as ϵ approaches 0. (Computed by Wolfram|Alpha: Wolfram Alpha LLC. 2019. https://www.wolframalpha.com/input/?i=plot+log(2/delta)/(2epsilon^2)+for+delta=0..1+epsilon=0..1, accessed June 23, 2019.)

5.1 Höffding's Inequality

Höffding's inequality provides probabilistic bounds on the accuracy of an estimate µˆ of the probability µ = P(φ) that a property φ holds for an execution of a program. The number of times φ holds is a binomial random variable X = Bin(µ, n). The probability µ that φ holds is a binomial proportion. An unbiased estimator of the binomial proportion is µˆ = X/n, i.e., the proportion of trials in which φ was observed to hold. Given an accuracy parameter ϵ > 0, Höffding [23] provides an upper bound on the probability that the estimate µˆ deviates more than ϵ from the true value µ:

    P(|µˆ − µ| ≥ ϵ) ≤ 2 exp(−2nϵ²)        (5)

Given a confidence parameter δ : 0 < δ ≤ 1, we can compute the number n of sample executions required to guarantee the (1 − δ)-confidence interval [µˆ − ϵ, µˆ + ϵ] as

    n ≥ log(2/δ) / (2ϵ²)        (6)

Analogously, given n Monte Carlo trials, with probability at least 1 − δ, the absolute difference between the estimate µˆ and the true value µ is at most ϵ, where

    ϵ ≤ √(log(2/δ) / (2n))        (7)

Note that the probability that the absolute difference between estimate and true value exceeds our accuracy threshold ϵ decays exponentially in the number of sample executions.

5.2 Binomial Proportion Confidence Intervals

CP interval. Clopper and Pearson [12] discuss a statistical method⁶ to compute a (1 − δ)-confidence interval CI = [p_L, p_U] for a binomial proportion µ, such that µ ∈ CI with probability at least (1 − δ) for all µ : 0 < µ < 1. The Clopper-Pearson (CP) interval is constructed by inverting the equal-tailed test based on the binomial distribution. Given the number of times X that property φ has been observed to hold in n MC trials (i.e., X : 0 ≤ X ≤ n), the confidence interval

    CI_CP = [p_L, p_U]        (8)

is found as solution to the constraints

    Σ_{k=X}^{n} C(n,k) p_L^k (1 − p_L)^{n−k} = δ/2    and    Σ_{k=0}^{X} C(n,k) p_U^k (1 − p_U)^{n−k} = δ/2

where C(n,k) = n!/(k!(n−k)!) is the binomial coefficient; p_L = 0 when X = 0, while p_U = 1 when X = n. The radius ϵ of the CP interval is

    ϵ = (p_U − p_L) / 2        (9)

Given the same number of Monte Carlo trials, the radius ϵ of the CP interval is lower-bounded by the generalized rule-of-three and upper-bounded by Höffding's inequality. The CP interval is conservative, i.e., the probability that µ ∈ CI may actually be higher than (1 − δ), particularly for probabilities µ that are close to zero or one.

Wald interval. The most widely-used confidence intervals for binomial proportions are approximate and cannot be used for (ϵ, δ)-approximate quantitative analysis. The CP interval is computationally expensive and conservative. Hence, the most widely-used CIs are based on the approximation of the binomial with the normal distribution. For instance, the well-known Wald interval [8] is

    CI_Wald = [µˆ − z_{δ/2} √(µˆ(1 − µˆ)/n), µˆ + z_{δ/2} √(µˆ(1 − µˆ)/n)]        (10)

where z_{δ/2} is the (1 − δ/2) quantile of a standard normal distribution. However, as we observe in our experiments, the Wald interval has poor coverage properties, i.e., for small µ, the nominal 95%-confidence interval actually contains µ only with a 10% probability.

Wilson score interval. The radius of the Wald interval is unreasonably small when the estimand µ gets closer to zero or one. The Wilson score CI mitigates this issue and has become the recommended interval for binomial proportions [8]. The Wilson score interval [52] is

    CI_Wilson = [µˆ′ − ϵ′, µˆ′ + ϵ′]        (11)

where µˆ′ is the relocated center estimate

    µˆ′ = (µˆ + z²_{δ/2}/(2n)) / (1 + z²_{δ/2}/n)        (12)

and ϵ′ is the corrected standard deviation

    ϵ′ = (z_{δ/2} / (1 + z²_{δ/2}/n)) · √(µˆ(1 − µˆ)/n + z²_{δ/2}/(4n²))        (13)

⁶ A probabilistic method starts with the underlying process and makes predictions about the observations the process generates. In contrast, a statistical analysis starts with the observations and attempts to derive the parameters of the underlying random process that explain the observations. In other words, a statistical method requires observations to compute the estimates while a probabilistic method provides a-priori bounds on such estimates (e.g., based on Ro3 or HfD).
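To make Eq. (6) concrete, the following Python sketch estimates µ with the worst-case Höffding trial budget. The property check `run_trial` is a hypothetical stand-in for executing P and observing φ; there is no Clopper-Pearson early exit here, so this is Algorithm 2 reduced to its upper bound:

```python
import math
import random

def hoeffding_trials(eps: float, delta: float) -> int:
    """Worst-case number of Monte Carlo trials n >= log(2/delta)/(2*eps**2), Eq. (6)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def mc_estimate(run_trial, eps: float, delta: float) -> float:
    """(eps, delta)-approximation of the probability that phi holds: with
    probability at least 1 - delta, the true mu lies in [mu_hat - eps, mu_hat + eps]."""
    n = hoeffding_trials(eps, delta)
    hits = sum(1 for _ in range(n) if run_trial())
    return hits / n

# Hypothetical program under analysis: phi holds with (unknown) probability 0.3.
random.seed(0)
mu_hat = mc_estimate(lambda: random.random() < 0.3, eps=0.01, delta=0.01)
# With probability at least 0.99, mu_hat is within 0.01 of the true value 0.3.
```

For ϵ = δ = 0.01 the budget is ⌈log(200)/0.0002⌉ = 26,492 trials; tightening ϵ by one order of magnitude multiplies the budget by one hundred, as Eq. (6) predicts.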

Algorithm 3: (ϵ, δ)-Approximate Quantitative Program Analysis (Reduced number of Clopper-Pearson CI calculations)

    Input: Program P, Binary property φ
    Input: Accuracy ϵ, Confidence δ
     1: Let n = n_next = log(δ)/log(1 − ϵ)
     2: for each of n_next Monte Carlo trials do
     3:   Execute P and observe the outcome of φ
     4: end for
     5: Let µˆ = X/n where X is the total frequency φ has held
     6: if 0 < µˆ < 1 then
     7:   repeat
     8:     Let n_next = ⌈(z²p₀q₀ + z√(z²p₀²q₀² + 2ϵp₀q₀) + ϵ)/(2ϵ²)⌉
            where p₀ = µˆ and q₀ = 1 − p₀ and z = z_{δ/2}
     9:     for each of n_next Monte Carlo trials do
    10:       Execute P and observe the outcome of φ
    11:     end for
    12:     Let n = n + n_next
    13:     Let ϵ′ be the radius of the current CP interval (cf. Eq. (9))
    14:     Let µˆ = X/n where X is the total frequency φ has held
    15:   until ϵ′ ≤ ϵ
    16: end if
        return "We estimate the prob. that φ holds in P is µˆ and guarantee that the true prob. µ ∈ µˆ ± ϵ with prob. at least (1 − δ)."

Evaluation. We experimentally investigate properties of the three confidence intervals for binomial proportions in Section 7. We find that only the Clopper-Pearson confidence interval guarantees that µ ∈ CI with probability at least (1 − δ) for all µ : 0 ≤ µ ≤ 1. The other two intervals cannot be used for (ϵ, δ)-approximate quantitative analysis.

5.3 Quantitative Monte Carlo Program Analysis

Algorithm 2 shows the procedure of the MCPA. Höffding's inequality provides the upper bound on the number of Monte Carlo trials required for an (ϵ, δ)-approximation of the probability µ that property φ holds in program P (Eq. (6)). The conservative Clopper-Pearson confidence interval provides an early exit condition (Line 6).

Efficiency. Algorithm 2 runs in time that is polynomial in 1/ϵ and in log(1/δ) and is thus a fully-polynomial randomized approximation scheme (FPRAS). The worst-case running time is visualized in Figure 3. Reducing δ (i.e., increasing confidence) by an order of magnitude less than doubles execution time, while reducing ϵ (i.e., increasing accuracy) by an order of magnitude increases execution time by two orders of magnitude. However, computing the Clopper-Pearson (CP) CI after each Monte Carlo trial is expensive. In our experiments, computing the CP interval 10⁵ times takes 58 seconds, on average. Computing the CP interval for each of the n = 2 · 10¹¹ test inputs that OSS-Fuzz generates on a good day (cf. Figure 1) would take about 3.8 years.

Optimization. Hence, Algorithm 3 predicts the number of Monte Carlo trials needed for a Clopper-Pearson interval with radius ϵ. The estimator in Line 8 was developed by Krishnamoorthy and Peng [27] and requires an initial guess p₀ of µ. The algorithm computes the first guess p₀ = µˆ_L after running the minimal number of trials required for the given confidence and accuracy parameters (Lines 1–6)—as provided by the generalized rule-of-three (Line 1). Subsequent guesses are computed from the improved maximum likelihood estimate (Line 14).

6 APPROXIMATE PATCH VERIFICATION

"It turns out that detecting whether a crash is fixed or not is an interesting challenge, and one that would benefit from further scientific investigation by the research community. [..] How long should we wait, while continually observing no re-occurrence of a failure (in testing or production) before we claim that the root cause(s) have been fixed?" [1]

Let µ_bug and µ_fix be the probability to expose an error before and after the bug was fixed, respectively. Suppose we have n_bug executions of the buggy program, out of which X_bug = Bin(µ_bug, n_bug) exposed the bug. Hence, an unbiased estimator of µ_bug is µˆ_bug = X_bug/n_bug. We call an execution successful if no bug (of interest) was observed. Given a confidence parameter δ : 0 < δ < 1, we ask how many successful executions n_fix of the fixed program (i.e., 0 = Bin(µ_fix, n_fix) = X_fix) are needed to reject the null hypothesis with statistical significance α = δ.

To reject the null hypothesis at significance-level δ, we require that there is no overlap between both (1 − δ)-confidence intervals. For the buggy program version, we can compute the Clopper-Pearson interval CI_CP = [p_L, p_U] (cf. Sec. 5.2). Recall that the probability that a property φ holds is simply the complement of the probability that ¬φ holds. For the fixed program version, we leverage the generalized rule-of-three CI_Ro3 = [0, 1 − δ^(1/n_fix)] (cf. Eq. (4)). In order to reject the null, we seek n_fix such that

    1 − δ^(1/n_fix) < p_L        (14)

which is true for

    n_fix > log(δ) / log(1 − p_L)        (15)

where p_L is the lower limit of the Clopper-Pearson CI (cf. Eq. (8)).

Algorithm 4: Approximate Patch Verification

    Input: Program P_fix, Binary property φ, Confidence δ
    Input: Total n_bug and unsuccessful trials X_bug in P_bug
     1: Let n_fix = log(δ)/log(1 − p_L) where p_L is the lower limit of the Clopper-Pearson interval (cf. Eq. (8)).
     2: for each of n_fix Monte Carlo trials do
     3:   Execute program P_fix and observe outcome of φ
     4:   if φ does not hold in the current execution
     5:   then return "Property φ violated for current execution of P_fix"
     6: end for
        return "The null hypothesis can be rejected at significance-level at least δ, to accept the alternative that the failure rate has decreased."

Efficiency. The CP limit p_L is upper- and lower-bounded by the generalized rule-of-three and Höffding's inequality, respectively,

    µˆ_bug − √(log(2/δ)/(2n_bug)) ≤ p_L ≤ µˆ_bug − (1 − δ^(1/n_bug))        (16)

for all 0 ≤ µ_bug ≤ 1. Hence, Algorithm 4 is an FPRAS that runs in time that is polynomial in (log(1/(1 − µˆ_bug)))⁻¹ and in log(1/δ). The worst-case efficiency of Algorithm 4 is visualized in Figure 4. For instance, if 800k of 200 billion executions expose a bug in P_bug, then it requires less than 1.8 million successful executions to reject the null at significance level p < 0.001 in favor of the alternative that the probability of exposing a bug has indeed decreased for P_fix.
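The bound of Eq. (15) can be checked with a few lines of Python for the 800k-of-200-billion example. Note that this sketch substitutes the normal-approximation (Wald) lower limit for the Clopper-Pearson p_L — an assumption that is harmless at this sample size but not what Algorithm 4 prescribes:

```python
import math

def n_fix_required(p_l: float, delta: float) -> int:
    """Smallest n_fix satisfying 1 - delta**(1/n_fix) < p_l, i.e. Eq. (15):
    n_fix > log(delta) / log(1 - p_l)."""
    return math.floor(math.log(delta) / math.log(1.0 - p_l)) + 1

# Example: 800k of 200 billion executions expose the bug in P_bug.
n_bug, x_bug, delta = 2e11, 8e5, 0.001
mu_bug = x_bug / n_bug                                       # ~4e-6
z = 3.29                                                     # ~ z_{delta/2} for delta = 0.001
p_l = mu_bug - z * math.sqrt(mu_bug * (1 - mu_bug) / n_bug)  # Wald stand-in for CP p_L
print(n_fix_required(p_l, delta))                            # about 1.7 million
```

Since log(1 − p_L) ≈ −p_L for small failure probabilities, n_fix grows like −log(δ)/p_L, which is the (log(1/(1 − µˆ_bug)))⁻¹ scaling noted above.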

Figure 4: The number n_fix of Monte Carlo trials required for the patched program to reject the Null and conclude that the failure rate has indeed decreased, as significance-level δ and failure rate µˆ_bug for the buggy program vary (n_bug = 4 · 10⁶). The contour plot shows how n_fix changes as δ and µ_bug vary. (Computed by Wolfram|Alpha: Wolfram Alpha LLC. 2019. https://www.wolframalpha.com/input/?i=plot+log(delta)%2Flog(1-mu%2Bsqrt(log(2%2Fdelta)%2F(2*4e6)))+where+mu=0.001..1,delta=0.001..1, accessed June 23, 2019.)

Figure 6: Required MC trials n for an (ϵ, δ)-approximation of µ. The rule-of-three (Ro3) provides the lower while Höffding's inequality (HfD) provides the upper bound for all 0 ≤ µ ≤ 1. In the first row, δ = 0.01. In the second row, ϵ = 0.01. (Panels: ϵ = 0.1, 0.01, 0.001 and δ = 0.1, 0.01, 0.001; lines: Algorithm 2, Algorithm 3, Upper Bound (HfD), Lower Bound (Ro3).)

7 EXPERIMENTS

We implemented Algorithms 1–4 in 300 lines of R code; R is a statistical programming language. The binom package [13] was used to compute various kinds of confidence intervals.

7.1 Approximate Quantitative Analysis

RQ2 (Efficiency). How efficient are Algorithms 2 and 3 w.r.t. the probabilistic upper and lower bounds? Here, efficiency is measured as the number of MC trials required to compute an (ϵ, δ)-approximation of the probability µ that φ holds.

The upper and lower bounds for all µ : 0 ≤ µ ≤ 1 are given by Höffding's inequality (HfD) and the generalized rule-of-three (Ro3), respectively. Figure 6 shows the efficiency of both algorithms as µ varies, as well as the probabilistic bounds, for various values of ϵ and δ. Each experiment was repeated 1000 times and average values are reported.

Figure 5: The observed probability that µ ∈ CI given the nominal probability (1 − δ) = 95% as µ varies (panels: Wald, Wilson, Clopper-Pearson). We generated 100 thousand repetitions of 100 million trials for one thousand values of µ, i.e., ≈ 10¹⁸ Monte Carlo trials in total.

Both algorithms approach the probabilistic lower bound (Ro3) as µ → 1 (or µ → 0, resp.). For instance, for δ = 0.01, ϵ = 0.001, at least 4,603 and at most 2.6 million Monte Carlo trials are required; for µ = 10⁻⁴, Algorithm 2 requires 4,809 trials, which is just 4% above the lower bound. Moreover, Algorithm 3 requires only slightly more Monte Carlo trials than Algorithm 2 but reduces the number of computed intervals substantially, from n to at most 2.
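The two bound values quoted here (4,603 and 2.6 million for δ = 0.01, ϵ = 0.001) can be reproduced directly from the rule-of-three and Höffding formulas; a small Python sketch:

```python
import math

def ro3_lower_bound(eps: float, delta: float) -> int:
    """Fewest MC trials for an (eps, delta)-approximation: the generalized
    rule-of-three (Eq. 4) requires the smallest n with 1 - delta**(1/n) <= eps."""
    return math.ceil(math.log(delta) / math.log(1 - eps))

def hoeffding_upper_bound(eps: float, delta: float) -> int:
    """Most MC trials needed in the worst case, by Hoeffding's inequality (Eq. 6)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(ro3_lower_bound(0.001, 0.01))        # 4603
print(hoeffding_upper_bound(0.001, 0.01))  # 2649159, i.e. ~2.6 million
```

The gap between the two bounds is what the adaptive Clopper-Pearson exit condition of Algorithms 2 and 3 closes as µ approaches 0 or 1.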

RQ1. Can the Clopper-Pearson interval CI_CP be used for (ϵ, δ)-approximate quantitative analysis? An (ϵ, δ)-approximation µˆ of a binomial proportion µ guarantees that µ ∈ [µˆ − ϵ, µˆ + ϵ] with probability at least (1 − δ). Figure 5 shows the results for 10¹⁸ simulation experiments. For various values of µ, we generated n = 10⁸ trials by sampling from µ and computed the 95% confidence interval according to the methods of Wald, Wilson, and Clopper-Pearson. We repeated each experiment 10⁵ times and measured the proportion of intervals that contain µ. This proportion gives the observed confidence-level, while 95% is the nominal confidence-level.

Yes. The Clopper-Pearson 95%-confidence interval CI_CP contains µ with probability at least 95% for all values of µ. Algorithms 2 and 3 provide valid (ϵ, δ)-approximations. However, both the Wald and the Wilson procedures provide 95%-CIs that contain µ with probability less than 95%, especially for µ → 1 (or µ → 0, resp.). For instance, if µ = 1 − 10⁻⁸, the Wald 95%-confidence interval CI_Wald, which is the most widely-used interval for binomial proportions, contains µ with probability P(µ ∈ CI_Wald) < 80%.

Figure 7: Branch probabilities computed by Algorithm 3 (µˆ) and JPF-ProbSym (µ). We generated sufficient MC trials to guarantee (ϵ, δ)-approximations for all 0 ≤ µ ≤ 1 s.t. δ = ϵ = 10⁻³. That is, 3.8 · 10⁶ executions of random values v on BinaryTree (4.6 seconds) and BinomialHeap (0.5 seconds).

    BinaryTree, v ∈ [0, 9]:
    Loc    µˆ        µ
    0      0.9374    0.9375
    1      0.4589    0.4592
    2      0.5740    0.5745
    3      0.4591    0.4592
    4      0.5742    0.5745
    8      0.0202    0.0203
    9      0.0189    0.0189
    10     0.0347    0.0346
    11     0.0361    0.0361
    12     0.0361    0.0361
    13     0.1077    0.1077
    14     0.5742    0.5745
    15     0.5739    0.5745

    BinomialHeap:
           v ∈ [0, 9]          v ∈ [0, 499]
    Loc    µˆ        µ         µˆ        µ
    1      0.64464   0.64451   0.74833   0.74799
    4      0.22842   0.22831   0.31051   0.31062
    6      0.60359   0.60350   0.68618   0.68599
    7      0.26947   0.26932   0.37265   0.37263
    8      0.22842   0.22831   0.31051   0.31062
    9      0.38148   0.38161   0.40648   0.40583
    10     0.30421   0.30389   0.40398   0.40416
    13     0.00767   0.00765   0.00025   0.00025
    14     0.11879   0.11863   0.00275   0.00274
    15     0.00767   0.00765   0.00025   0.00025
    16     0.05551   0.05552   0.00150   0.00149
    17     0.06640   0.06621   0.00138   0.00137
    18     0.04784   0.04787   0.00125   0.00124
    19     0.00454   0.00456   0.00012   0.00012
    20     0.00767   0.00765   0.00025   0.00024
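The three intervals compared in RQ1 can be sketched with the Python standard library alone; the Clopper-Pearson limits are obtained here by bisection on the binomial tails rather than via the beta quantiles that the R binom package uses:

```python
import math
from statistics import NormalDist

def wald_ci(x: int, n: int, delta: float):
    """Wald interval (Eq. 10)."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    mu = x / n
    r = z * math.sqrt(mu * (1 - mu) / n)
    return mu - r, mu + r

def wilson_ci(x: int, n: int, delta: float):
    """Wilson score interval (Eqs. 11-13)."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    mu = x / n
    center = (mu + z * z / (2 * n)) / (1 + z * z / n)
    r = (z / (1 + z * z / n)) * math.sqrt(mu * (1 - mu) / n + z * z / (4 * n * n))
    return center - r, center + r

def clopper_pearson_ci(x: int, n: int, delta: float):
    """Clopper-Pearson interval (Eq. 8): invert the equal-tailed binomial
    test by bisection on the two tail probabilities."""
    def pmf(k, p):
        return math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    def solve(tail, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(100):  # bisection to ~2**-100, far below float precision
            mid = (lo + hi) / 2
            if (tail(mid) < delta / 2) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    p_lo = 0.0 if x == 0 else solve(lambda p: sum(pmf(k, p) for k in range(x, n + 1)), True)
    p_up = 1.0 if x == n else solve(lambda p: sum(pmf(k, p) for k in range(0, x + 1)), False)
    return p_lo, p_up

# Degenerate case x = 0: Wald collapses to a zero-width interval (hence its
# poor coverage), while Clopper-Pearson yields [0, 1 - (delta/2)**(1/n)].
print(wald_ci(0, 100, 0.05))             # (0.0, 0.0)
print(clopper_pearson_ci(0, 100, 0.05))  # (0.0, ~0.0362)
```

The bisection version is quadratic in n per interval, which illustrates why computing the CP interval after every Monte Carlo trial is expensive and why Algorithm 3 computes at most two intervals.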


RQ3 (Experience with PSE). Geldenhuys et al. [18] developed what is now the state-of-the-art of exact quantitative program analysis, probabilistic symbolic execution (PSE). The authors implemented PSE in the JPF-ProbSymb tool and evaluated it using the two programs BinaryTree and BinomialHeap. For both programs, Figure 7 shows the exact branch probabilities µ as computed by PSE and the corresponding (ϵ, δ)-approximations as computed by Algorithm 3. To generate a Monte Carlo trial, we randomly sampled from the same domain that PSE is given.

As we can see, our Monte Carlo program analysis (MCPA) approximates the branch probabilities µ within the specified minimum accuracy. For δ = ϵ = 0.001, our estimates µˆ are usually within ±0.0005 of the true values µ. We would like to highlight that approximate PSE [5, 6, 17, 32] does not provide any such guarantees on the accuracy of the produced estimates. We also observed that PSE is tractable only for very small integer domains (e.g., v ∈ [0, 9] [18] or v ∈ [0, 100] [10]), while our MCPA works with arbitrarily large domains just as efficiently. In fact, unlike PSE, MCPA works for executions generated while the software is deployed and used.

The results of Alg. 3 are comparable to those produced by JPF-ProbSym, with an error ϵ that can be set arbitrarily close to zero. However, without the additional machinery of PSE, our MCPA algorithm produces the analysis result much faster. For BinomialHeap, our algorithm took half a second (0.5s) while JPF-ProbSymb took 57 seconds. For BinaryTree, our algorithm took about 5 seconds while JPF-ProbSymb took about seven minutes. In general, we expect that a sufficiently large number of Monte Carlo trials can be generated quickly for arbitrarily large programs (cf. Figure 1 on page 1), which makes MCPA effectively scale-oblivious. Versus JPF-ProbSym, our MCPA is orders of magnitude faster.

7.2 Approximate Software Verification

RQ4 (Residual Risk). For a given allowable residual risk ϵ, how efficient is Alg. 1 in providing the probabilistic guarantee that no bug exists when none has been observed? Figure 8 demonstrates the efficiency of approximate software verification. Alg. 1 exhibits a performance that is nearly inversely proportional to the given allowable residual risk. For instance, it requires about 3 · 10⁵ trials in which no error is observed to guarantee that no error exists with an error that exceeds a residual risk of ϵ = 10⁻⁵ with probability at most 5%. Decreasing ϵ by an order of magnitude to ϵ = 10⁻⁶ also increases the number of required trials only by an order of magnitude, to 3 · 10⁶.

Figure 8: Efficiency versus residual risk. Both plots show the number of required Monte Carlo trials n, in which φ is observed to hold, to guarantee—with an error that exceeds ϵ with probability at most δ = 0.05—that φ always holds. (Lines: rule-of-three, Clopper-Pearson, Hoeffding's inequality.)

Algorithm 1 is highly efficient. It exhibits a performance that is nearly inversely proportional to the given allowable residual risk. Moreover, the Clopper-Pearson interval CI_CP provides an upper limit that is very close to the probabilistic lower bound as given by the rule-of-three—confirming our observation in RQ2.

7.3 Approximate Patch Verification

RQ5 (Efficiency). How efficient is Algorithm 4 in rejecting the null hypothesis at a desired p-value compared to existing hypothesis testing methodologies? In Figure 9, we compare our MCPA to Fisher's exact test and the Mann-Whitney U test. Fisher's exact test is the standard two-sample hypothesis test for binomial proportions when one of the estimates (here µˆ_fix) is zero or one.⁷ The Mann-Whitney U test is the standard test used in the A/B testing platform at Netflix [49, 50] to assess, e.g., performance degradation between versions. The U test is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

Figure 9: Efficiency versus significance. The number of required successful executions n_fix of P_fix to reject the Null at significance α = δ (panels: µ_bug = 0.9, 0.99, 0.999; lines: Approx. Patch Verif., Mann-Whitney U Test, Fisher's Exact Test). The blue, dashed line shows α = 0.05. We conducted 10¹⁵ simulation experiments in total, i.e., 10³ repetitions of n_bug = 10⁶ trials on P_bug for each of 10² values of n_fix ≤ 10⁴ on P_fix, and 3 values of µ_bug.

Algorithm 4 is more efficient than existing techniques for high levels of confidence (e.g., p-value = α ≤ 0.05) on this task of approximate patch verification. The Mann-Whitney test that Netflix uses for A/B testing performs worst on this task; the difference to our MCPA increases as α → 0. Fisher's exact test approaches the performance of our MCPA as α → 0.

⁷ We note that the popular Chi-squared test, while easier to compute, is an approximation method that cannot be used when one of the estimates µˆ_bug, µˆ_fix is 0 or 1.

7.4 Threats to Validity

As in every empirical study, there are threats to the validity of the results. In terms of external validity, the assumptions of Monte Carlo program analysis (MCPA) may not be realistic. A particular advantage of simulation experiments is their generality: Our results apply to arbitrary software systems of arbitrary size, as long as the stochastic process of generating the Monte Carlo trials satisfies the assumptions of MCPA. Thus, whether or not the assumptions are realistic is the only threat to external validity. In Section 2, we discuss these assumptions, how realistic they are, and mitigation strategies in case they do not hold. Particularly, most assumptions are shared with those of automatic software testing.

In terms of internal validity, we cannot guarantee that our implementation is correct, but we make our script available for the reader to scrutinize the implementation and reproduce our results.⁸ To minimize the probability of bugs, we make use of standard R packages, e.g., binom for the computation of the confidence intervals.

8 RELATED WORK

We are interested in the probability that a binary property⁹ φ holds for an execution of a terminating program. This problem is also tackled by probabilistic symbolic execution [18]. After a discussion of the relationship to probabilistic symbolic execution, we discuss differences to A/B testing and runtime verification techniques.

Probabilistic Symbolic Execution (PSE) computes for each path i that satisfies the property φ the probability p_i, using symbolic execution and exact [16, 18] or approximate model counting [5, 6, 17, 32], and then computes µ = Σ_i p_i to derive the probability that φ holds. In contrast to MCPA, PSE allows to analyse non-terminating programs by fixing the maximum length of the analyzed path. However, in contrast to PSE, MCPA does not require a formal encoding of the operational distribution as usage profiles. Instead, the analysis results are directly derived from actual usage of the software system. While PSE is a static quantitative program analysis that never actually executes the program, MCPA is a dynamic quantitative analysis technique that requires no access to the source code or the heavy machinery of constraint encoding and solving. In contrast to approximate PSE [5, 6, 17, 32], MCPA provides probabilistic guarantees on the accuracy of the produced point estimate µˆ. In contrast to statistical PSE [17], MCPA does not require an exact model count for the branch probabilities of each if-conditional. Other differences to PSE are discussed in Section 3.

A/B Testing [26, 49] provides a statistical framework to evaluate the performance of two software systems (often the new version versus the old version) within quantifiable error. In contrast to A/B testing, (ϵ, δ)-approximate quantitative analysis as well as (ϵ, δ)-approximate software verification pertain to a single program. Moreover, in simulation experiments, our (ϵ, δ)-approximate patch verification outperforms Fisher's exact test and the Mann-Whitney U test, which is the two-sample hypothesis test of choice for A/B testing at Netflix [49, 50]. We also note that A/B testing is subject to the same assumptions.

Statistical/Probabilistic Model Checking. Informally speaking, (δ, ϵ)-approximate software verification is to probabilistic and statistical model checking as software verification is to classical model checking. Our MCPA techniques are neither focused on temporal properties nor concerned with a program's state space. More specifically, instead of verifying the expanding prefix of an infinite path through the program's state space, MCPA requires several distinct, terminating executions (Sec. 8). In this setting, the work on probabilistic symbolic execution [6, 16, 18] is most related.

In model checking, there is the problem of deciding whether a model S of a potentially non-terminating stochastic system satisfies a temporal logic property ϕ with a probability greater than or equal to some threshold θ: S |= P≥θ(ϕ). The temporal property ϕ is specified in a probabilistic variant of linear time logic (LTL) or computation tree logic (CTL). For instance, we could check: "When a shutdown occurs, the probability of a system recovery being completed between 1 and 2 hours without further failure is greater than 0.75":

    S |= down → P>0.75[¬fail U^[1,2] up]

Probabilistic model checking (PMC) [22, 24, 28] checks such properties using a (harder-than-NP) analytical approach. Hence, PMC quickly becomes intractable as the number of states increases. PMC also requires a formal model of the stochastic system S as a discrete- or continuous-time Markov chain. Statistical model checking (SMC) [29, 39] does not take an analytical but a statistical approach. SMC checks such properties by performing hypothesis testing on a number of (fixed-length) simulations of the stochastic system S. Similar to MCPA, Sen et al. [39] propose to execute the stochastic system directly. However, Sen et al. assume that a "trace" is available, i.e., S can identify and report the sequence of states which S visits and how long each state transition takes. The properties that are checked concern particular state sequences and time intervals. In contrast, the focus of our current work is on binary (rather than temporal) properties for executions of terminating (rather than non-terminating) systems. In our setting, the concept of time is rather secondary. Moreover, (δ, ϵ)-approximate software and patch verification are just two particular instantiations of MCPA.

9 CONCLUSION

We are excited about the tremendous advances that have been made in scalable program analysis, particularly in the area of separation logic. For instance, the Error Prone static analysis tool is routinely used at the scale of Google's two-billion-line codebase [37]. The Infer tool has had substantial success at finding bugs at Facebook scale [9]. Yet, there still remain several open challenges; e.g., the developers of Infer set the clear expectation that the tool may report many false alarms, does not handle certain language features, and can only report certain types of bugs.¹⁰ We strongly believe that many of these challenges are going to be addressed in the future. However, we also feel that it is worthwhile to think about alternative approaches to scalable program analysis—approaches that do not attempt to maintain formal guarantees at an impractical cost. If we are ready to give up on soundness or completeness in favor of scalability, we should at least be able to quantify the trade-off. The probabilistic and statistical methodologies that are presented in this paper represent significant progress in this direction.

⁸ https://www.dropbox.com/sh/fxiz3ycvvbi51xg/AACk0UyO_nG0UlPMEH645jbXa?dl=0
⁹ Quantitative program properties, such as energy, are considered by checking whether they exceed a threshold (for an arbitrary number of thresholds).
¹⁰ https://fbinfer.com/docs/limitations.html

ACKNOWLEDGMENTS

We thank Prof. David Rosenblum for his inspiring ASE'16 keynote speech on probabilistic thinking [36]. This research was partially funded by the Australian Government through an Australian Research Council Discovery Early Career Researcher Award (DE190100046).

REFERENCES

[1] Nadia Alshahwan, Xinbo Gao, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, Taijin Tei, and Ilya Zorin. 2018. Deploying Search Based Software Engineering with Sapienz at Facebook. In Search-Based Software Engineering. 3–45.
[2] Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach (1st ed.). Cambridge University Press, New York, NY, USA.
[3] Lucas Bang, Abdulbaki Aydin, Quoc-Sang Phan, Corina S. Păsăreanu, and Tevfik Bultan. 2016. String Analysis for Side Channels with Segmented Oracles. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). 193–204.
[4] Marcel Böhme. 2019. Assurance in Software Testing: A Roadmap. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER '19). IEEE Press, Piscataway, NJ, USA, 5–8.
[5] Mateus Borges, Antonio Filieri, Marcelo D'Amorim, and Corina S. Păsăreanu. 2015. Iterative Distribution-aware Sampling for Probabilistic Symbolic Execution. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). 866–877.
[6] Mateus Borges, Antonio Filieri, Marcelo D'Amorim, Corina S. Păsăreanu, and Willem Visser. 2014. Compositional Solution Space Quantification for Probabilistic Software Analysis. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). 123–132.
[7] Tegan Brennan, Nestan Tsiskaridze, Nicolás Rosner, Abdulbaki Aydin, and Tevfik Bultan. 2017. Constraint Normalization and Parameterized Caching for Quantitative Program Analysis. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 535–546.
[8] Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. 2001. Interval Estimation for a Binomial Proportion. Statist. Sci. 16, 2 (05 2001), 101–133.
[9] Cristiano Calcagno, Dino Distefano, Peter W. O'Hearn, and Hongseok Yang. 2011. Compositional Shape Analysis by Means of Bi-Abduction. J. ACM 58, 6, Article 26 (Dec. 2011), 66 pages.
[10] Bihuan Chen, Yang Liu, and Wei Le. 2016. Generating Performance Distributions via Probabilistic Symbolic Execution. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). 49–60.
[11] Dmitry Chistikov, Rayna Dimitrova, and Rupak Majumdar. 2017. Approximate counting in SMT and value estimation for probabilistic programs. Acta Informatica 54, 8 (01 Dec 2017), 729–764.
[12] C. J. Clopper and E. S. Pearson. 1934. The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 26, 4 (12 1934), 404–413.
[13] Sundar Dorai-Raj. 2015. R – Binom package. https://cran.r-project.org/web/packages/binom/binom.pdf. Accessed: 2019-06-17.
[14] R. Dum, P. Zoller, and H. Ritsch. 1992. Monte Carlo simulation of the atomic master equation for spontaneous emission. Physical Review A 45 (Apr 1992), 4879–4887. Issue 7.
[15] A. Filieri, C. S. Pasareanu, and G. Yang. 2015. Quantification of Software Changes through Probabilistic Symbolic Execution (N). In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE '15). 703–708.
[16] Antonio Filieri, Corina S. Păsăreanu, and Willem Visser. 2013. Reliability Analysis in Symbolic Pathfinder. In Proceedings of the 2013 International Conference on Software Engineering (ICSE '13). 622–631.
[17] Antonio Filieri, Corina S. Păsăreanu, Willem Visser, and Jaco Geldenhuys. 2014.
[24] Joost-Pieter Katoen. 2016. The Probabilistic Model Checking Landscape. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS '16). 31–45.
[25] Michael J. Kearns and Umesh V. Vazirani. 1994. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA.
[26] Ron Kohavi and Roger Longbotham. 2017. Online Controlled Experiments and A/B Testing. Springer US, 922–929.
[27] K. Krishnamoorthy and Jie Peng. 2007. Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation. Communications in Statistics - Simulation and Computation 36, 6 (2007), 1171–1186.
[28] Marta Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. In Proceedings of the 23rd International Conference on Computer Aided Verification (CAV '11). 585–591.
[29] Axel Legay, Benoît Delahaye, and Saddek Bensalem. 2010. Statistical Model Checking: An Overview. In Proceedings of the First International Conference on Runtime Verification (RV '10). 122–135.
[30] Yamilet R. Serrano Llerena, Marcel Böhme, Marc Brünink, Guoxin Su, and David S. Rosenblum. 2018. Verifying the Long-Run Behavior of Probabilistic System Models in the Presence of Uncertainty. In Proceedings of the 12th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE). 1–11.
[31] Jesús A. De Loera, Raymond Hemmecke, Jeremiah Tauzer, and Ruriko Yoshida. 2004. Effective lattice point counting in rational convex polytopes. Journal of Symbolic Computation 38, 4 (2004), 1273–1302.
[32] Kasper Luckow, Corina S. Păsăreanu, Matthew B. Dwyer, Antonio Filieri, and Willem Visser. 2014. Exact and Approximate Probabilistic Symbolic Execution for Nondeterministic Programs. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE '14). 575–586.
[33] Klaus Mølmer, Yvan Castin, and Jean Dalibard. 1993. Monte Carlo wave-function method in quantum optics. Journal of the Optical Society of America B 10, 3 (Mar 1993), 524–538.
[34] Quoc-Sang Phan, Pasquale Malacaria, Corina S. Păsăreanu, and Marcelo D'Amorim. 2014. Quantifying Information Leaks Using Reliability Analysis. In Proceedings of the 2014 International SPIN Symposium on Model Checking of Software (SPIN 2014). 105–108.
[35] G. Ramalingam. 1994. The Undecidability of Aliasing. ACM Trans. Program. Lang. Syst. 16, 5 (Sept. 1994), 1467–1471.
[36] D. S. Rosenblum. 2016. The power of probabilistic thinking. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 3–3.
[37] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. 2018. Lessons from Building Static Analysis Tools at Google. Commun. ACM 61, 4 (March 2018), 58–66.
[38] Sriram Sankaranarayanan, Aleksandar Chakarov, and Sumit Gulwani. 2013. Static Analysis for Probabilistic Programs: Inferring Whole Program Properties from Finitely Many Paths. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). 447–458.
[39] Koushik Sen, Mahesh Viswanathan, and Gul Agha. 2004. Statistical Model Checking of Black-Box Probabilistic Systems. In Computer Aided Verification. 202–215.
[40] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. 2012. AddressSanitizer: A Fast Address Sanity Checker. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC '12). USENIX Association, Berkeley, CA, USA, 28–28.
[41] Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA.
[42] Leslie Valiant. 2013. Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Basic Books, Inc.
[43] L. G. Valiant. 1984. A Theory of the Learnable. Commun. ACM 27, 11 (Nov. 1984), 1134–1142.
[44] Willem Visser. 2017. Probabilistic Symbolic Execution: A New Hammer.
https: Statistical Symbolic Execution with Informed Sampling. In Proceedings of the 22Nd //easychair.org/smart-program/LPAR-21/LPAR-21-s3-Visser.pdf Keynote at the ACM SIGSOFT International Symposium on Foundations of Software Engineering 21st International Conference on Logic for Programming, Artificial Intelligence (FSE 2014). 437–448. and Reasoning (LPAR-21). [18] Jaco Geldenhuys, Matthew B. Dwyer, and Willem Visser. 2012. Probabilistic Sym- [45] Website. 2010. Visa: Transactions per Day. https://usa.visa.com/ bolic Execution. In Proceedings of the 2012 International Symposium on Software run-your-business/small-business-tools/retail.html. Accessed: 2019-06-17. Testing and Analysis (ISSTA 2012). 166–176. [46] Website. 2017. OSS-Fuzz: Five Months Later. https://testing.googleblog.com/ [19] Brian L Hammond, William A Lester, and Peter James Reynolds. 1994. Monte 2017/05/oss-fuzz-five-months-later-and.html. Accessed: 2017-11-13. Carlo methods in ab initio quantum chemistry. Vol. 1. World Scientific. [47] Website. 2018. Domo: Data Never Sleeps Report 6.0. https://www.domo.com/ [20] JA Hanley and A Lippman-Hand. 1983. If nothing goes wrong, is everything all blog/data-never-sleeps-6/. Accessed: 2018-11-16. right? Interpreting zero numerators. Journal of the American Medical Association [48] Website. 2018. Facebook: Company Info and Statistics. https://newsroom.fb.com/ 249, 13 (1983), 1743–1745. company-info/. Accessed: 2018-11-16. [21] Mark Harman and Peter O’Hearn. 2018. From Start-ups to Scale-ups: Opportuni- [49] Website. 2018. Netflix Tech Blog: Automated Canary Analysis ties and Open Problems for Static and Dynamic Program Analysis. Keynote at at Netflix with Kayenta. https://medium.com/netflix-techblog/ the 18th IEEE International Working Conference on Source Code Analysis. automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69. Ac- [22] T. Hérault, R. Lassaigne, F. Magniette, and S. Peyronnet. 2004. Approximate cessed: 2018-10-13. 
Probabilistic Model Checking. In Proceedings of the 5th International Conference [50] Website. 2018. Netflix TechBlog: Edge Load Balancing. https://medium.com/ on Verification, Model Checking and (VMCAI’04). 307–329. netflix-techblog/netflix-edge-load-balancing-695308b5548c. Accessed: 2018-11- [23] Wassily Hoeffding. 1963. Probability Inequalities for Sums of Bounded Random 16. Variables. J. Amer. Statist. Assoc. 58, 301 (1963), 13–30. 11 Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne Australia Marcel Böhme

[51] Website. 2019. Tinder: Swipes per Day. https://www.gotinder.com/press?locale=en. Accessed: 2019-06-17.
[52] Edwin B. Wilson. 1927. Probable Inference, the Law of Succession, and Statistical Inference. J. Amer. Statist. Assoc. 22, 158 (1927), 209–212.
