Branching Processes for QuickCheck Generators (extended version)

Agustín Mista, Universidad Nacional de Rosario, Rosario, Argentina
Alejandro Russo, Chalmers University of Technology, Gothenburg, Sweden
John Hughes, Chalmers University of Technology, Gothenburg, Sweden

Abstract

In QuickCheck (or, more generally, random testing), it is challenging to control random data generators' distributions—especially when it comes to user-defined algebraic data types (ADTs). In this paper, we adapt results from an area of mathematics known as branching processes, and show how they help to analytically predict (at compile-time) the expected number of generated constructors, even in the presence of mutually recursive or composite ADTs. Using our probabilistic formulas, we design heuristics capable of automatically adjusting probabilities in order to synthesize generators whose distributions are aligned with users' demands. We provide a Haskell implementation of our mechanism in a tool called DRAGEN and perform case studies with real-world applications. When generating random values, our synthesized QuickCheck generators show improvements in code coverage when compared with those automatically derived by state-of-the-art tools.

CCS Concepts • Software and its engineering → Software testing and debugging;

Keywords Branching process, QuickCheck, Testing, Haskell

ACM Reference Format:
Agustín Mista, Alejandro Russo, and John Hughes. 2018. Branching Processes for QuickCheck Generators (extended version). In Proceedings of the 11th ACM SIGPLAN International Haskell Symposium (Haskell '18), September 27–28, 2018, St. Louis, MO, USA. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3242744.3242747

1 Introduction

Random property-based testing is an increasingly popular approach to finding bugs [3, 17, 18]. In the Haskell community, QuickCheck [10] is the dominant tool of this sort. QuickCheck requires developers to specify testing properties describing the expected software behavior. Then, it generates a large number of random test cases and reports those violating the testing properties. QuickCheck generates random data by employing random test data generators, or QuickCheck generators for short. The generation of test cases is guided by the types involved in the testing properties. It defines default generators for many built-in types like booleans, integers, and lists. However, when it comes to user-defined ADTs, developers are usually required to specify the generation process. The difficulty is, however, that it might become intricate to define generators so that they result in a suitable distribution or enforce data invariants.

The state-of-the-art tools to derive generators for user-defined ADTs can be classified based on the automation level as well as the sort of invariants enforced at the data generation phase. QuickCheck and SmallCheck [27] (a tool for writing generators that synthesize small test cases) use type-driven generators written by developers. As a result, generated random values are well-typed and preserve the structure described by the ADT. Rather than manually writing generators, the libraries derive [24] and MegaDeTH [14, 15] automatically synthesize generators for a given user-defined ADT. The library derive provides no guarantees that the generation process terminates, while MegaDeTH pays almost no attention to the distribution of values. In contrast, Feat [12] provides a mechanism to uniformly sample values from a given ADT. It enumerates all the possible values of a given ADT so that sampling uniformly from ADTs becomes sampling uniformly from the set of natural numbers. Feat's authors subsequently extended their approach to uniformly generate values constrained by user-defined predicates [9]. Lastly, Luck is a domain-specific language for manually writing QuickCheck properties in tandem with generators so that it becomes possible to finely control the distribution of generated values [19].

In this work, we consider the scenario where developers are not fully aware of the properties and invariants that input data must fulfill. This constitutes a valid assumption for penetration testing [2], where testers often apply fuzzers in an attempt to make programs crash—an anomaly which might lead to a vulnerability. We believe that, in contrast, if users can recognize specific properties of their systems, then it is preferable to spend time writing specialized generators for that purpose (e.g., by using Luck) instead of considering automatically derived ones.

Our realization is that branching processes [29], a relatively simple stochastic model conceived to study the evolution of populations, can be applied to predict the generation distribution of ADTs' constructors in a simple and automatable manner. To the best of our knowledge, this stochastic model has not yet been applied to this field, and we believe it may be a promising foundation to develop future extensions. The contributions of this paper can be outlined as follows:

▶ We provide a mathematical foundation which helps to analytically characterize the distribution of constructors in derived QuickCheck generators for ADTs.
▶ We show how to use type reification to simplify our prediction process and extend our model to mutually recursive and composite types.
▶ We design (compile-time) heuristics that automatically search for probability parameters so that distributions of constructors can be adjusted to what developers might want.
▶ We provide an implementation of our ideas in the form of a Haskell library¹ called DRAGEN (the Danish word for dragon, here standing for Derivation of RAndom GENerators).
▶ We evaluate our tool by generating inputs for real-world programs, where it manages to obtain significantly more code coverage than those random inputs generated by MegaDeTH's generators.

Overall, our work addresses a timely problem with a neat mathematical insight that is backed by a complete implementation and experience on third-party examples.

¹Available at https://bitbucket.org/agustinmista/dragen

2 Background

In this section, we briefly illustrate how QuickCheck random generators work. We consider the following implementation of binary trees:

data Tree = LeafA | LeafB | LeafC | Node Tree Tree

In order to help developers write generators, QuickCheck defines the Arbitrary type-class with the overloaded symbol arbitrary :: Gen a, which denotes a monadic generator for values of type a. Then, to generate random trees, we need to provide an instance of the Arbitrary type-class for the type Tree. Figure 1 shows a possible implementation. At the top level, this generator simply uses QuickCheck's primitive oneof :: [Gen a] → Gen a to pick a generator from a list of generators with uniform probability. This list consists of a random generator for each possible choice of data constructor of Tree. We use applicative style [22] to describe each one of them idiomatically. So, pure LeafA is a generator that always generates LeafAs, while Node ⟨$⟩ arbitrary ⟨∗⟩ arbitrary is a generator that always generates Node constructors, "filling" its arguments by calling arbitrary recursively on each of them.

instance Arbitrary Tree where
  arbitrary = oneof
    [pure LeafA, pure LeafB, pure LeafC
    , Node ⟨$⟩ arbitrary ⟨∗⟩ arbitrary]

Figure 1. Random generator for Tree.

Although it might seem easy, writing random generators becomes cumbersome very quickly. Particularly, if we want to write a random generator for a user-defined ADT T, it is also necessary to provide random generators for every user-defined ADT inside of T as well! What remains of this section is focused on explaining the state-of-the-art techniques used to automatically derive generators for user-defined ADTs via type-driven approaches.

2.1 Library derive

The simplest way to automatically derive a generator for a given ADT is the one implemented by the Haskell library derive [24]. This library uses Template Haskell [28] to automatically synthesize a generator for the data type Tree semantically equivalent to the one presented in Figure 1. While the library derive is a big improvement for the testing process, its implementation has a serious shortcoming when dealing with recursively defined data types: in many cases, there is a non-zero probability of generating a recursive type constructor every time a recursive type constructor gets generated, which can lead to infinite generation loops. A detailed example of this phenomenon is given in Appendix B.1. In this work, we only focus on derivation tools which accomplish terminating behavior, since we consider this an essential component of well-behaved generators.

2.2 MegaDeTH

The second approach we will discuss is the one taken by MegaDeTH, a meta-programming tool used intensively by QuickFuzz [14, 15]. Firstly, MegaDeTH derives random generators for ADTs as well as all of their nested types—a useful feature not supported by derive. Secondly, MegaDeTH avoids potentially infinite generation loops by setting an upper bound to the random generation recursive depth.

Figure 2 shows a simplified (but semantically equivalent) version of the random generator for Tree derived by MegaDeTH. This generator uses QuickCheck's function sized :: (Int → Gen a) → Gen a to build a random generator based on a function (of type Int → Gen a) that limits the possible recursive calls performed when creating random values. The integer passed to sized's argument is called the generation size. When the generation size is zero (see definition gen 0), the generator only chooses between Tree's terminal constructors, thus ending the generation process. If the generation size is strictly positive, it is free to randomly generate any Tree constructor (see definition gen n). When it chooses to generate a recursive constructor, it reduces the generation size for its subsequent recursive calls by a factor that depends on the number of recursive arguments this constructor has (div n 2). In this way, MegaDeTH ensures that all generated values are finite.

instance Arbitrary Tree where
  arbitrary = sized gen where
    gen 0 = oneof [pure LeafA, pure LeafB, pure LeafC ]
    gen n = oneof
      [pure LeafA, pure LeafB, pure LeafC
      , Node ⟨$⟩ gen (div n 2) ⟨∗⟩ gen (div n 2)]

Figure 2. MegaDeTH generator for Tree.

Although MegaDeTH generators always terminate, they have a major practical drawback: in our example, the use of oneof to uniformly decide the next constructor to be generated produces a generator that generates leaves approximately three quarters of the time (note this also applies to the generator obtained with derive from Figure 1). This entails a distribution of constructors heavily concentrated on leaves, with a very small number of complex values with nested nodes, regardless of how large the chosen generation size is—see Figure 3 (left).

Figure 3. Size distribution of 100000 randomly generated Tree values using MegaDeTH (▲) with generation size 10, and Feat (■) with generation size 400.

2.3 Feat

The last approach we discuss is Feat [12]. This tool determines the distribution of generated values in a completely different way: it uses uniform generation based on an exhaustive enumeration of all the possible values of the ADTs being considered. Feat automatically establishes a bijection between all the possible values of a given type T and a finite prefix of the natural numbers. Then, it guarantees a uniform generation over the complete space of values of a given data type T up to a certain size.² However, the distribution of size, given by the number of constructors in the generated values, is highly dependent on the structure of the data type being considered.

Figure 3 (right) shows the overall distribution shape of a QuickCheck generator derived using Feat for Tree using a generation size of 400, i.e., generating values of up to 400 constructors.³ Notice that all the generated values are close to the maximum size! This phenomenon follows from the exponential growth in the number of possible Trees of n constructors as we increase n. In other words, the space of Trees up to 400 constructors is composed to a large extent of values with around 400 constructors, and (proportionally) very few with a smaller number of constructors. Hence, a generation process based on uniform generation of a natural number (which thus ignores the structure of the type being generated) is biased very strongly towards values made up of a large number of constructors. In our tests, no Tree with less than 390 constructors was ever generated. In practice, this problem can be partially solved by using a variety of generation sizes in order to get more diversity in the generated values. However, deciding which generation sizes are the best choices is not a trivial task either. As a consequence, in this work we consider only the case of fixed-size random generation.

²We avoid including any source code generated by Feat, since it works by synthesizing Enumerable type-class instances instead of Arbitrary ones. Such instances give no insight into how the derived random generators work.
³We choose to use this generation size here since it helps us to compare MegaDeTH and Feat with the results of our tool in Section 8.

As we have shown, by using both MegaDeTH and Feat, the user is tied to the fixed generation distribution that each tool produces, which tends to be highly dependent on the particular data type under consideration in each case. Instead, this work aims to provide a theoretical framework able to predict and later tune the distributions of automatically derived generators, giving the user a more flexible testing environment, while keeping it as automated as possible.
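Distributions like the ones plotted in Figure 3 can be reproduced empirically by sampling a generator many times and counting constructors. The following is a minimal measurement sketch of our own (not part of derive, MegaDeTH, or Feat): it redefines Tree locally for self-containment, and the generator under study—e.g., one like Figure 1 or Figure 2—is passed in as an argument.

import Test.QuickCheck (Gen, generate, resize)
import Control.Monad (replicateM)
import qualified Data.Map.Strict as Map

data Tree = LeafA | LeafB | LeafC | Node Tree Tree

-- Number of constructors in a Tree.
size :: Tree -> Int
size (Node l r) = 1 + size l + size r
size _          = 1

-- Histogram of value sizes: how many of the 'n' sampled values
-- have each possible number of constructors, sampling with the
-- given generation size.
sizeHistogram :: Int -> Int -> Gen Tree -> IO (Map.Map Int Int)
sizeHistogram n genSize g = do
  trees <- replicateM n (generate (resize genSize g))
  pure (Map.fromListWith (+) [ (size t, 1) | t <- trees ])

For instance, sizeHistogram 100000 10 arbitrary (with an Arbitrary instance such as the one in Figure 2 in scope) reproduces a measurement comparable to the MegaDeTH curve in Figure 3.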
3 Simple-Type Branching Processes

Galton-Watson branching processes (or branching processes for short) are a particular case of Markov processes that model the growth and extinction of populations. Originally conceived to study the extinction of family names in the Victorian era, this formalism has been successfully applied to a wide range of research areas in biology and physics—see the textbook by Haccou et al. [16] for an excellent introduction. In this section, we show how to use this theory to model QuickCheck's distribution of constructors.

We start by analyzing the generation process for the Node constructors in the data type Tree as described by the generators in Figures 1 and 2. From the code, we can observe that the stochastic process they encode satisfies the following assumptions (which coincide with the assumptions of Galton-Watson branching processes): i) with a certain probability, it starts with some initial Node constructor; ii) at any step, the probability of generating a Node is not affected by the Nodes generated before or after; iii) the probability of generating a Node is independent of where in the tree that constructor is about to be placed.

The original Galton-Watson process is a simple stochastic process that counts the population sizes at different points in time called generations. For our purposes, populations consist of Node constructors, and generations are obtained by selecting tree levels.

Figure 4 illustrates a possible generated value. It starts by generating a Node constructor at generation (i.e., depth) zero (G0), then another two Node constructors as left and right subtrees in generation one (G1), etc. (Dotted edges denote further constructors which are not drawn, as they are not essential for the point being made.) This process repeats until the population of Node constructors becomes extinct or stable, or alternatively grows forever.

Figure 4. Generation of Node constructors.

The mathematics behind the Galton-Watson process allows us to predict the expected number of offspring at the nth generation, i.e., the number of Node constructors at depth n in the generated tree. Formally, we start by introducing the random variable R to denote the number of Node constructors in the next generation generated by a Node constructor in this generation—the R comes from "reproduction" and the reader can think of it as a Node constructor reproducing Node constructors. To be a bit more general, let us consider the Tree random generator automatically generated using derive (Figure 1), but where the probability of choosing between the constructors is no longer uniform. Instead, we have a constant probability $p_C$ of choosing the constructor C. These probabilities are external parameters of the prediction mechanism, and Section 7 explains how they can later be instantiated with actual values found by optimization, enabling the user to tune the generated distribution.

We write $p_{Leaf}$ for the probability of generating a leaf of any kind, i.e., $p_{Leaf} = p_{LeafA} + p_{LeafB} + p_{LeafC}$. In this setting, and assuming a parent constructor Node, the probabilities of generating R Node offspring in the next generation (i.e., in the recursive calls of arbitrary) are as follows:

$$P(R = 0) = p_{Leaf} \cdot p_{Leaf}$$
$$P(R = 1) = p_{Node} \cdot p_{Leaf} + p_{Leaf} \cdot p_{Node} = 2 \cdot p_{Node} \cdot p_{Leaf}$$
$$P(R = 2) = p_{Node} \cdot p_{Node}$$

One manner to understand the equations above is by considering what QuickCheck does when generating the subtrees of a given node. For instance, the case of generating exactly one Node as descendant (P(R = 1)) occurs in two situations: when the left subtree is a Node and the right one is a Leaf, and vice versa. The probability of each of those events is $p_{Node} \cdot p_{Leaf}$ and $p_{Leaf} \cdot p_{Node}$, respectively. Then, the probability of having exactly one Node as a descendant is given by the sum of the probabilities of both events—the other cases follow a similar reasoning.

Now that we have determined the distribution of R, we proceed to introduce the random variables $G_n$ to denote the population of Node constructors in the nth generation. We write $\xi_i^n$ for the random variable which captures the number of (offspring) Node constructors at the nth generation produced by the ith Node constructor at the (n−1)th generation. It is easy to see that it must be the case that $G_n = \xi_1^n + \xi_2^n + \cdots + \xi_{G_{n-1}}^n$. To deduce $E[G_n]$, i.e., the expected number of Nodes in the nth generation, we apply the (standard) Law of Total Expectation⁴ $E[X] = E[E[X|Y]]$ with $X = G_n$ and $Y = G_{n-1}$ to obtain:

$$E[G_n] = E[E[G_n \mid G_{n-1}]] \qquad (1)$$

⁴$E[X|Y]$ is a function of the random variable Y, i.e., $E[X|Y]_y = E[X|Y = y]$, and is therefore a random variable itself. In this light, the law says that if we observe the expectations of X given the different ys, and then take the expectation of all those values, we obtain the expectation of X.

By expanding $G_n$, we deduce that:

$$E[G_n \mid G_{n-1}] = E[\xi_1^n + \xi_2^n + \cdots + \xi_{G_{n-1}}^n \mid G_{n-1}] = E[\xi_1^n \mid G_{n-1}] + E[\xi_2^n \mid G_{n-1}] + \cdots + E[\xi_{G_{n-1}}^n \mid G_{n-1}]$$

Since $\xi_1^n$, $\xi_2^n$, ..., and $\xi_{G_{n-1}}^n$ are all governed by the distribution captured by the random variable R (recall the assumptions at the beginning of the section), we have that:

$$E[G_n \mid G_{n-1}] = E[R \mid G_{n-1}] + E[R \mid G_{n-1}] + \cdots + E[R \mid G_{n-1}]$$

Since R is independent of the generation where Node constructors decide to generate other Node constructors, we have that:

$$E[G_n \mid G_{n-1}] = \underbrace{E[R] + E[R] + \cdots + E[R]}_{G_{n-1}\ \text{times}} = E[R] \cdot G_{n-1} \qquad (2)$$

From now on, we introduce m to denote the mean of R, i.e., the mean of reproduction. Then, by rewriting $m = E[R]$, we obtain:

$$E[G_n] \overset{(1)}{=} E[E[G_n \mid G_{n-1}]] \overset{(2)}{=} E[m \cdot G_{n-1}] \overset{m\ \text{is constant}}{=} E[G_{n-1}] \cdot m$$

By unfolding this recursive equation many times, we obtain:

$$E[G_n] = E[G_0] \cdot m^n \qquad (3)$$

As the equation indicates, the expected number of Node constructors at the nth generation is affected by the mean of reproduction. Although we obtained this intuitive result using a formalism that may look overly complex, it is useful to understand the methodology used here. In the next section, we will derive the main result of this work following the same reasoning line under a more general scenario.

We can now also predict the total expected number of individuals up to the nth generation. For that purpose, we introduce the random variable $P_n$ to denote the population of Node constructors up to the nth generation. It is then easy to see that $P_n = \sum_{i=0}^{n} G_i$ and consequently:

$$E[P_n] = \sum_{i=0}^{n} E[G_i] \overset{(3)}{=} \sum_{i=0}^{n} E[G_0] \cdot m^i = E[G_0] \cdot \left(\frac{1 - m^{n+1}}{1 - m}\right) \qquad (4)$$

where the last equality holds by the geometric series definition. This is the general formula provided by the Galton-Watson process. In this case, the mean of reproduction for Node is given by:

$$m = E[R] = \sum_{k=0}^{2} k \cdot P(R = k) = 2 \cdot p_{Node} \qquad (5)$$
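Formulas (3)–(5) are directly executable. The following is a small sketch of our own (not part of any of the tools discussed) that evaluates the expected Node population for the Tree example; the numeric results match, up to rounding, the figures quoted in the next paragraph.

-- Mean of reproduction for Node, formula (5): m = 2 * pNode.
meanR :: Double -> Double
meanR pNode = 2 * pNode

-- Expected number of Nodes at depth n, formula (3): E[Gn] = E[G0] * m^n,
-- where E[G0] = pNode is the expected number of initial Nodes.
expectedAtDepth :: Double -> Int -> Double
expectedAtDepth pNode n = pNode * meanR pNode ^ n

-- Expected number of Nodes up to depth n, formula (4).
-- (Assumes m /= 1; for m == 1 the sum is simply pNode * fromIntegral (n + 1).)
expectedUpTo :: Double -> Int -> Double
expectedUpTo pNode n = pNode * (1 - m ^ (n + 1)) / (1 - m)
  where m = meanR pNode

-- expectedUpTo 0.25 10  ~ 0.4997   (uniform constructor choice)
-- expectedUpTo 0.7  10  ~ 69.12    (a Node-biased choice of probabilities)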

By (4) and (5), the expected number of Node constructors up to generation n is given by the following formula:

$$E[P_n] = E[G_0] \cdot \left(\frac{1 - m^{n+1}}{1 - m}\right) = p_{Node} \cdot \left(\frac{1 - (2 \cdot p_{Node})^{n+1}}{1 - 2 \cdot p_{Node}}\right)$$

If we apply the previous formula to predict the distribution of constructors induced by MegaDeTH in Figure 2, where pLeafA = pLeafB = pLeafC = pNode = 0.25, we obtain an expected number of Node constructors up to level 10 of 0.4997, which denotes a distribution highly biased towards small values, since we can only produce further subterms by producing Nodes. However, if we set pLeafA = pLeafB = pLeafC = 0.1 and pNode = 0.7, we can predict that, as expected, our general random generator will generate much bigger trees, containing an average number of 69.1173 Nodes up to level 10! Unfortunately, we cannot apply this reasoning to predict the distribution of constructors of derived generators for ADTs with more than one non-terminal constructor. For instance, let us consider the following data type definition:

data Tree′ = Leaf | NodeA Tree′ Tree′ | NodeB Tree′

In this case, we need to separately consider that a NodeA can generate not only NodeA but also NodeB offspring (and similarly for NodeB). A stronger mathematical formalism is needed. The next section explains how to predict the generation of this kind of data types by using an extension of Galton-Watson processes known as multi-type branching processes.

4 Multi-Type Branching Processes

In this section, we present the basis for our main contribution: the application of multi-type branching processes to predict the distribution of constructors. We will illustrate the technique by considering the Tree′ ADT that we concluded with in the previous section.

Before we dive into technicalities, Figure 5 shows the automatically derived generator for Tree′ that our tool produces. Our generators depend on the (possibly) different probabilities that constructors have to be generated—variables pLeaf, pNodeA, and pNodeB. These probabilities are used by the function chooseWith :: [(Double, Gen a)] → Gen a, which picks a random generator of type a from a list, with an explicitly given probability for each alternative. This function can be easily expressed by using QuickCheck's primitive operations and therefore we omit its implementation. Additionally note that, like MegaDeTH, our generators use sized to limit the number of recursive calls to ensure termination. We note that the theory behind branching processes is able to predict the termination behavior of our generators, and we could have used this ability to ensure their termination without the need of a depth-limiting mechanism like sized. However, using sized provides more control over the obtained generator distributions.

instance Arbitrary Tree′ where
  arbitrary = sized gen where
    gen 0 = pure Leaf
    gen n = chooseWith
      [(pLeaf , pure Leaf )
      , (pNodeA, NodeA ⟨$⟩ gen (n−1) ⟨∗⟩ gen (n−1))
      , (pNodeB , NodeB ⟨$⟩ gen (n−1))]

Figure 5. DRAGEN generator for Tree′.
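Since the paper omits chooseWith's definition, the following is one possible way to express it—a sketch of our own, assuming integer scaling of the probabilities, and not necessarily DRAGEN's actual implementation.

import Test.QuickCheck (Gen, frequency)

-- One possible definition of chooseWith in terms of QuickCheck's
-- 'frequency', which picks a generator with probability proportional
-- to its integer weight. We simply scale the probabilities to weights;
-- all probabilities are assumed to be positive.
chooseWith :: [(Double, Gen a)] -> Gen a
chooseWith gens = frequency [ (round (p * 10000), g) | (p, g) <- gens ]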

To predict the distribution of constructors provided by DRAGEN generators, we introduce a generalization of the previous Galton-Watson branching process called the multi-type Galton-Watson branching process. This generalization allows us to consider several kinds of individuals, i.e., constructors in our setting, which procreate (generate) different kinds of offspring (constructors). Additionally, this approach allows us to consider not just one constructor, as we did in the previous section, but rather all of them at the same time.

Before we present the mathematical foundations, which follow a similar line of reasoning as that in Section 3, Figure 6 illustrates a possible generated value of type Tree′.

Figure 6. A generated value of type Tree′.

In the generation process, it is assumed that the kind (i.e., the constructor) of the parent might affect the probabilities of reproducing (generating) offspring of a certain kind. Observe that this is the case for a wide range of derived ADT generators, e.g., choosing a terminal constructor (e.g., Leaf) affects the probabilities of generating non-terminal ones (by setting them to zero). The population at the nth generation is then characterized as a vector of random variables $G_n = (G_n^1, G_n^2, \cdots, G_n^d)$, where d is the number of different kinds of constructors. Each random variable $G_n^i$ captures the number of occurrences of the ith constructor of the ADT at the nth generation. Essentially, $G_n$ "groups" the population at level n by the constructors of the ADT. By estimating the expected shape of the vector $G_n$, it is possible to obtain the expected number of constructors at the nth generation. Specifically, we have that $E[G_n] = (E[G_n^1], E[G_n^2], \cdots, E[G_n^d])$. To deduce $E[G_n]$, we focus on deducing each component of the vector.

As explained above, the reproduction behavior is determined by the kind of the individual. In this light, we introduce the random variable $R_{ij}$ to denote a parent ith constructor reproducing a jth constructor. As we did before, we apply the equation $E[X] = E[E[X|Y]]$ with $X = G_n^j$ and $Y = G_{n-1}$ to obtain $E[G_n^j] = E[E[G_n^j \mid G_{n-1}]]$. To calculate the expected number of jth constructors at level n produced by the constructors present at level (n − 1), i.e., $E[G_n^j \mid G_{n-1}]$, it is enough to count the expected number of children of kind j produced by the different parents of kind i, i.e., $E[R_{ij}]$, times the number of parents of kind i found at level (n − 1), i.e., $G_{n-1}^i$. This result is expressed by the following equation, marked as (⋆), and is formally verified in Appendix B.2.

$$E[G_n^j \mid G_{n-1}] \overset{(\star)}{=} \sum_{i=1}^{d} G_{n-1}^i \cdot E[R_{ij}] = \sum_{i=1}^{d} G_{n-1}^i \cdot m_{ij} \qquad (6)$$

Similarly as before, we rewrite $E[R_{ij}]$ as $m_{ij}$, which now represents a single expectation of reproduction indexed by the kind of both the parent and the child constructor.

Mean matrix of constructors  In the previous section, m was the expectation of reproduction of a single constructor. Now we have $m_{ij}$ as the expectation of reproduction indexed by the parent and child constructor. In this light, we define $M_C$, the mean matrix of constructors (or mean matrix for simplicity), such that each $m_{ij}$ stores the expected number of jth constructors generated by the ith constructor. $M_C$ is a parameter of the Galton-Watson multi-type process and can be built at compile-time using statically known type information. We are now able to deduce $E[G_n^j]$:

$$E[G_n^j] = E[E[G_n^j \mid G_{n-1}]] \overset{(6)}{=} E\left[\sum_{i=1}^{d} G_{n-1}^i \cdot m_{ij}\right] = \sum_{i=1}^{d} E[G_{n-1}^i \cdot m_{ij}] = \sum_{i=1}^{d} E[G_{n-1}^i] \cdot m_{ij}$$

Using this last equation, we can rewrite $E[G_n]$ as follows:

$$E[G_n] = \left(\sum_{i=1}^{d} E[G_{n-1}^i] \cdot m_{i1}, \ \cdots, \ \sum_{i=1}^{d} E[G_{n-1}^i] \cdot m_{id}\right)$$

By linear algebra, we can rewrite the vector above as the matrix multiplication $E[G_n]^T = E[G_{n-1}]^T \cdot M_C$. By repeatedly unfolding this definition, we obtain that:

$$E[G_n]^T = E[G_0]^T \cdot (M_C)^n \qquad (7)$$

This equation is a generalization of (3) when considering many constructors. As we did before, we introduce a random variable $P_n = \sum_{i=0}^{n} G_i$ to denote the population up to the nth generation. It is now possible to obtain the expected population of all the constructors, but in a clustered manner:

$$E[P_n]^T = E\left[\sum_{i=0}^{n} G_i\right]^T = \sum_{i=0}^{n} E[G_i]^T \overset{(7)}{=} \sum_{i=0}^{n} E[G_0]^T \cdot (M_C)^i \qquad (8)$$

It is possible to write the resulting sum as the closed formula:

$$E[P_n]^T = E[G_0]^T \cdot \left(\frac{I - (M_C)^{n+1}}{I - M_C}\right) \qquad (9)$$

where I represents the identity matrix of the appropriate size. Note that equation (9) only holds when $(I - M_C)$ is non-singular; however, this is the usual case. When $(I - M_C)$ is singular, we resort to using equation (8) instead. Without losing generality, and for simplicity, we consider equations (8) and (9) as interchangeable. They are the general formulas for the Galton-Watson multi-type branching processes.

Then, to predict the distribution of our Tree′ data type example, we proceed to build its mean matrix $M_C$. For instance, the mean number of Leafs generated by a NodeA is:

$$m_{NodeA,Leaf} = \underbrace{1 \cdot p_{Leaf} \cdot p_{NodeA} + 1 \cdot p_{Leaf} \cdot p_{NodeB}}_{\text{one Leaf as left subtree}} + \underbrace{1 \cdot p_{NodeA} \cdot p_{Leaf} + 1 \cdot p_{NodeB} \cdot p_{Leaf}}_{\text{one Leaf as right subtree}} + \underbrace{2 \cdot p_{Leaf} \cdot p_{Leaf}}_{\text{Leaf as left and right subtree}} = 2 \cdot p_{Leaf} \qquad (10)$$

The rest of $M_C$ can be similarly computed, obtaining (with rows and columns indexed by Leaf, NodeA, NodeB, in that order):

$$M_C = \begin{pmatrix} 0 & 0 & 0 \\ 2 \cdot p_{Leaf} & 2 \cdot p_{NodeA} & 2 \cdot p_{NodeB} \\ p_{Leaf} & p_{NodeA} & p_{NodeB} \end{pmatrix} \qquad (11)$$

Note that the first row, corresponding to the Leaf constructor, is filled with zeros. This is because Leaf is a terminal constructor, i.e., it cannot generate further subterms of any kind.⁵

With the mean matrix in place, we define $E[G_0]$ (the initial vector of mean probabilities) as $(p_{Leaf}, p_{NodeA}, p_{NodeB})$. By applying (9) with $E[G_0]$ and $M_C$, we can predict the expected number of generated non-terminal NodeA constructors (and analogously NodeB) with a size parameter n as follows:

$$E[NodeA] = \left(E[P_{n-1}]^T\right)\!.NodeA = \left(E[G_0]^T \cdot \frac{I - (M_C)^{n}}{I - M_C}\right)\!.NodeA$$

The function $(\_).C$ simply projects the value corresponding to constructor C from the population vector. It is very important to note that the sum only includes the population up to level (n − 1). This choice comes from the fact that our QuickCheck generator can choose between only terminal constructors at the last generation level (recall that gen 0 generates only Leafs in Figure 5). As an example, if we assign our generation probabilities for Tree′ as pLeaf ↦ 0.2, pNodeA ↦ 0.5 and pNodeB ↦ 0.3, then the formula predicts that our QuickCheck generator with a size parameter of 10 will generate on average 21.322 NodeAs and 12.813 NodeBs. This result can easily be verified by sampling a large number of values with a generation size of 10, and then averaging the number of generated NodeAs and NodeBs across the generated values.
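The same prediction can be reproduced mechanically by iterating equation (7) and summing the generations as in (8). The sketch below does that for Tree′ using plain lists as vectors and matrices; the helper names are ours and not part of DRAGEN's API, and the quoted numbers are approximate.

import Data.List (transpose)

-- Vectors and matrices indexed by constructor, in the order [Leaf, NodeA, NodeB].
type Vec = [Double]
type Mat = [[Double]]

-- Mean matrix of constructors for Tree', equation (11).
meanMatrix :: Double -> Double -> Double -> Mat
meanMatrix pLeaf pNodeA pNodeB =
  [ [ 0         , 0          , 0          ]  -- Leaf generates nothing
  , [ 2 * pLeaf , 2 * pNodeA , 2 * pNodeB ]  -- NodeA fills two subterms
  , [ pLeaf     , pNodeA     , pNodeB     ]  -- NodeB fills one subterm
  ]

-- One generation step, equation (7): E[Gn]^T = E[G_{n-1}]^T * Mc.
step :: Mat -> Vec -> Vec
step m v = [ sum (zipWith (*) v col) | col <- transpose m ]

-- Expected population vector up to generation n, equation (8).
expectedPopulation :: Mat -> Vec -> Int -> Vec
expectedPopulation m g0 n =
  foldr1 (zipWith (+)) (take (n + 1) (iterate (step m) g0))

-- expectedPopulation (meanMatrix 0.2 0.5 0.3) [0.2, 0.5, 0.3] 9
--   gives roughly 21.3 NodeAs and 12.8 NodeBs up to level 9,
--   in line with the prediction quoted above.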
In this section, we obtained a prediction of the expected number of non-terminal constructors generated by DRAGEN generators. Predicting terminal constructors, however, requires a special treatment, as discussed in the next section.

⁵The careful reader may notice that there is a pattern in the mean matrix if inspected together with the definition of Tree′. We prove in Section 6 that each $m_{ij}$ can be automatically calculated by simply exploiting type information.

5 Terminal constructors

In this section we introduce the special treatment required to predict the generated distribution of terminal constructors, i.e., constructors with no recursive arguments.

Consider the generator in Figure 5. It generates terminal constructors in two situations, i.e., in the definitions of gen 0 and gen n. In other words, the random process introduced by our generators can be considered to be composed of two independent parts when it comes to terminal constructors—refer to Appendix B.3 for a graphical interpretation. In principle, the number of terminal constructors generated by the stochastic process described in gen n is captured by the multi-type branching process formulas. However, to predict the expected number of terminal constructors generated by exercising gen 0, we need to separately consider a random process that only generates terminal constructors in order to terminate. For this purpose, and assuming a maximum generation depth n, we need to calculate the number of terminal constructors required to stop the generation process at the recursive arguments of each non-terminal constructor at level (n − 1). In our Tree′ example, this corresponds to two Leafs for every NodeA and one Leaf for every NodeB constructor at level (n − 1).

Since both random processes are independent, to predict the overall expected number of terminal constructors, we can simply add the expected number of terminal constructors generated in each one of them. Recalling our previous example, we obtain the formula for Tree′ terminals as follows:

$$E[Leaf] = \underbrace{\left(E[P_{n-1}]^T\right)\!.Leaf}_{\text{branching process}} + \underbrace{2 \cdot \left(E[G_{n-1}]^T\right)\!.NodeA}_{\text{case }(NodeA\ Leaf\ Leaf)} + \underbrace{1 \cdot \left(E[G_{n-1}]^T\right)\!.NodeB}_{\text{case }(NodeB\ Leaf)}$$

The formula counts the Leafs generated by the multi-type branching process up to level (n − 1) and adds the expected number of Leafs generated at the last level.

Although we can now predict the expected number of generated Tree′ constructors regardless of whether they are terminal or not, this approach only works for data types with a single terminal constructor. If we have a data type with multiple terminal constructors, we have to consider the probabilities of choosing each one of them when filling the recursive arguments of non-terminal constructors at the previous level. For instance, consider the following ADT:

data Tree′′ = LeafA | LeafB | NodeA Tree′′ Tree′′ | NodeB Tree′′

Figure 7 shows the corresponding DRAGEN generator for Tree′′. Note that there are two sets of probabilities to choose terminal nodes, one for each random process. The p∗LeafA and p∗LeafB probabilities are used to choose between terminal constructors at the last generation level. These probabilities preserve the same proportion as their non-starred versions, i.e., they are normalized to form a probability distribution:

$$p^*_{LeafA} = \frac{p_{LeafA}}{p_{LeafA} + p_{LeafB}} \qquad\qquad p^*_{LeafB} = \frac{p_{LeafB}}{p_{LeafA} + p_{LeafB}}$$

In this manner, we can use the same generation probabilities for terminal constructors in both random processes—therefore reducing the complexity of our prediction engine implementation (described in Section 7).

instance Arbitrary Tree′′ where
  arbitrary = sized gen where
    gen 0 = chooseWith
      [(p∗LeafA, pure LeafA), (p∗LeafB , pure LeafB )]
    gen n = chooseWith
      [(pLeafA, pure LeafA), (pLeafB , pure LeafB )
      , (pNodeA, NodeA ⟨$⟩ gen (n−1) ⟨∗⟩ gen (n−1))
      , (pNodeB , NodeB ⟨$⟩ gen (n−1))]

Figure 7. Derived generator for Tree′′.

To compute the overall expected number of terminals, we need to predict the expected number of terminal constructors at the last generation level which could be descendants of non-terminal constructors at level (n − 1). More precisely:

$$E[LeafA] = \underbrace{\left(E[P_{n-1}]^T\right)\!.LeafA}_{\text{branching process}} + \underbrace{2 \cdot p^*_{LeafA} \cdot \left(E[G_{n-1}]^T\right)\!.NodeA}_{\text{expected leaves to fill }NodeA\text{s}} + \underbrace{1 \cdot p^*_{LeafA} \cdot \left(E[G_{n-1}]^T\right)\!.NodeB}_{\text{expected leaves to fill }NodeB\text{s}}$$

where the case of E[LeafB] follows analogously.
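Translated directly into code, the terminal correction above amounts to one normalization followed by a weighted sum. This is an illustrative helper of ours (not DRAGEN's API); the vector components would come from an expected-population computation such as the one sketched in Section 4.

-- Terminal-constructor correction for Tree'' (two terminal constructors).
expectedLeafA
  :: Double  -- (E[P_{n-1}]).LeafA : LeafAs produced by the branching process
  -> Double  -- (E[G_{n-1}]).NodeA : expected NodeAs at the second-to-last level
  -> Double  -- (E[G_{n-1}]).NodeB : expected NodeBs at the second-to-last level
  -> Double  -- pLeafA
  -> Double  -- pLeafB
  -> Double
expectedLeafA leafAs nodeAs nodeBs pLeafA pLeafB =
  leafAs + 2 * pStarLeafA * nodeAs + 1 * pStarLeafA * nodeBs
  where
    -- normalized probability of picking LeafA when only terminals are allowed
    pStarLeafA = pLeafA / (pLeafA + pLeafB)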

6 Mutually-recursive and composite ADTs

In this section, we introduce some extensions to our model that allow us to derive DRAGEN generators for data types found in existing off-the-shelf Haskell libraries. We start by showing how multi-type branching processes naturally extend to mutually recursive ADTs. Consider the mutually recursive ADTs T1 and T2 with their automatically derived generators shown in Figure 8. Note the use of QuickCheck's function resize :: Int → Gen a → Gen a, which resets the generation size of a given generator to a new value. We use it to decrement the generation size at the recursive calls of arbitrary that generate subterms of a mutually recursive data type.

data T1 = A | B T1 T2
data T2 = C | D T1

instance Arbitrary T1 where
  arbitrary = sized gen where
    gen 0 = pure A
    gen n = chooseWith
      [(pA, pure A)
      , (pB , B ⟨$⟩ gen (n−1) ⟨∗⟩ resize (n−1) arbitrary)]

instance Arbitrary T2 where
  arbitrary = sized gen where
    gen 0 = pure C
    gen n = chooseWith
      [(pC , pure C), (pD , D ⟨$⟩ resize (n−1) arbitrary)]

Figure 8. Mutually recursive types T1 and T2 and their DRAGEN generators.

The key observation is that we can ignore that A, B, C and D are constructors belonging to different data types and just consider each of them as a kind of offspring on its own. Figure 9 visualizes the possible offspring generated by the non-terminal constructor B (belonging to T1), with the corresponding probabilities as labeled edges. Following the figure, we obtain the expected number of Ds generated by B constructors as follows:

$$m_{BD} = 1 \cdot p_A \cdot p_D + 1 \cdot p_B \cdot p_D = p_D \cdot (p_A + p_B) = p_D$$

Doing similar calculations, we obtain the mean matrix $M_C$ for A, B, C, and D as follows (with rows and columns indexed by A, B, C, D, in that order):

$$M_C = \begin{pmatrix} 0 & 0 & 0 & 0 \\ p_A & p_B & p_C & p_D \\ 0 & 0 & 0 & 0 \\ p_A & p_B & 0 & 0 \end{pmatrix} \qquad (12)$$

We define the mean of the initial generation as $E[G_0] = (p_A, p_B, 0, 0)$—we assign $p_C = p_D = 0$ since we choose to start by generating a value of type T1. With $M_C$ and $E[G_0]$ in place, we can apply the equations explained throughout Section 4 to predict the expected number of A, B, C and D constructors.

While this approach works, it completely ignores the types T1 and T2 when calculating $M_C$! For a large set of mutually recursive data types involving a large number of constructors, handling $M_C$ like this results in a high computational cost. We show next how we can not only shrink this mean matrix of constructors but also compute it automatically by making use of data type definitions.

Mean matrix of types  If we analyze the mean matrices of Tree′ (11) and the mutually recursive types T1 and T2 (12), it seems that determining the expected number of offspring generated by a non-terminal constructor requires us to count the number of occurrences in the ADT of the type to which the offspring belongs. For instance, $m_{NodeA,Leaf}$ is $2 \cdot p_{Leaf}$ (10), where 2 is the number of occurrences of Tree′ in the declaration of NodeA. Similarly, $m_{BD}$ is $1 \cdot p_D$, where 1 is the number of occurrences of T2 in the declaration of B. This observation means that instead of dealing with constructors, we could directly deal with types!

We can think about a branching process as generating "place holders" for constructors, where place holders can only be populated by constructors of a certain type. Figure 10 illustrates offspring as types for the definitions T1, T2, and Tree′. A place holder of type T1 can generate a place holder for type T1 and a place holder for type T2. A place holder of type T2 can generate a place holder of type T1. A place holder of type Tree′ can generate two place holders of type Tree′ when generating NodeA, one place holder when generating NodeB, or zero place holders when generating a Leaf (this last case is not shown in the figure since it is void). With these considerations, the mean matrix of types for Tree′, written $M_{Tree'}$, and that of types T1 and T2, written $M_{T_1T_2}$, are defined as follows:

$$M_{Tree'} = \begin{pmatrix} 2 \cdot p_{NodeA} + p_{NodeB} \end{pmatrix} \qquad\qquad M_{T_1T_2} = \begin{pmatrix} p_B & p_B \\ p_D & 0 \end{pmatrix}$$

Note how $M_{Tree'}$ shows that the mean matrices of types might reduce a multi-type branching process to a simple-type one. Having the type matrix in place, we can use the following equation (formally stated and proved in Appendix A) to soundly predict the expected number of constructors of a given set of (possibly) mutually recursive types:

$$(E[G_n^C]).C_i^t = (E[G_n^T]).T_t \cdot p_{C_i^t} \qquad (\forall n \geq 0)$$

where $G_n^C$ and $G_n^T$ denote the nth generations of constructors and type place holders, respectively, and $C_i^t$ represents the ith constructor of the type $T_t$. The equation establishes that the expected number of constructors $C_i^t$ at generation n consists of the expected number of type place holders of its type (i.e., $T_t$) at generation n, times the probability of generating that constructor. This equation allows us to simplify many of our calculations above by simply using the mean matrix for types instead of the mean matrix for constructors.
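To make the simplification concrete, the sketch below (illustrative helper names of ours, not DRAGEN's API) iterates the 2×2 type matrix for T1 and T2 and recovers a constructor expectation by multiplying the corresponding type place holder count with the constructor's probability, exactly as the equation above states.

import Data.List (transpose)

-- Mean matrix of *types* for T1 and T2, indexed by [T1, T2].
typeMatrix :: Double -> Double -> [[Double]]
typeMatrix pB pD =
  [ [ pB, pB ]  -- a T1 place holder spawns one T1 and one T2 when B is chosen
  , [ pD, 0  ]  -- a T2 place holder spawns one T1 when D is chosen
  ]

-- Expected type place holders at generation n, starting from a single T1.
typePlaceholders :: Double -> Double -> Int -> [Double]
typePlaceholders pB pD n = iterate step [1, 0] !! n
  where step v = [ sum (zipWith (*) v col) | col <- transpose (typeMatrix pB pD) ]

-- Expected number of D constructors at generation n: (E[Gn^T]).T2 * pD.
expectedD :: Double -> Double -> Int -> Double
expectedD pB pD n = (typePlaceholders pB pD n !! 1) * pD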


Figure 9. Possible offspring of constructor B.

Figure 10. Offspring as types.

6.1 Composite types

In this subsection, we extend our approach in a modular manner to deal with composite ADTs, i.e., ADTs which use already defined types in their constructors' arguments and which are not involved in the branching process. We start by considering the ADT Tree modified to carry booleans at the leaves:

data Tree = LeafA Bool | LeafB Bool Bool | · · ·

where · · · denotes the constructors that remain unmodified. To predict the expected number of True (and analogously of False) constructors, we calculate the multi-type branching process for Tree and multiply each expected number of leaves by the number of arguments of type Bool present in each one:

$$E[True] = p_{True} \cdot (\underbrace{1 \cdot E[LeafA]}_{\text{case }LeafA} + \underbrace{2 \cdot E[LeafB]}_{\text{case }LeafB})$$

In this case, Bool is a ground type like Int, Float, etc. Predictions become more interesting when considering richer composite types involving, for instance, instantiations of polymorphic types. To illustrate this point, consider a modified version of Tree where LeafA now carries a value of type Maybe Bool:

data Tree = LeafA (Maybe Bool) | LeafB Bool Bool | · · ·

In order to calculate the expected number of Trues, now we need to consider the cases where a value of type Maybe Bool actually carries a boolean value, i.e., when a Just constructor gets generated:

$$E[True] = p_{True} \cdot (1 \cdot E[LeafA] \cdot p_{Just} + 2 \cdot E[LeafB])$$

In the general case, for constructor arguments utilizing other ADTs, it is necessary to know the chain of constructors required to generate "foreign" values—in our example, a True value gets generated if a LeafA gets generated with a Just constructor "in between." To obtain such information, we create a constructor dependency graph (CDG), that is, a directed graph where each node represents a constructor and each edge represents its dependency. Each edge is labeled with its corresponding generation probability. Figure 11 shows the CDG for Tree starting from the LeafA constructor. Having this graph together with the application of the multi-type branching process, we can predict the expected number of constructors belonging to external ADTs. It is enough to multiply the probabilities at each edge of the path between every constructor involved in the branching process and the desired external constructor.

Figure 11. Constructor dependency graph.

The extensions described so far enable our tool (presented in the next section) to make predictions about QuickCheck generators for ADTs defined in many existing Haskell libraries.

7 Implementation

DRAGEN is a tool chain written in Haskell that implements the multi-type branching processes (Sections 4 and 5) and their extensions (Section 6), together with a distribution optimizer which calibrates the probabilities involved in generators to fit developers' demands. DRAGEN synthesizes generators by calling the Template Haskell function dragenArbitrary :: Name → Size → CostFunction → Q [Dec], where developers indicate the target ADT for which they want to obtain a QuickCheck generator; the desired generation size, needed by our prediction mechanism in order to calculate the distribution at the last generation level; and a cost function encoding the desired generation distribution.

The design decision to use a probability optimizer rather than search for an analytical solution is driven by two important aspects of the problem we aim to solve. Firstly, the computational cost of exactly solving a non-linear system of equations (such as those arising from branching processes) can be prohibitively high when dealing with a large number of constructors, and thus a large number of unknowns to be solved for. Secondly, the existence of such exact solutions is not guaranteed, due to the implicit invariants the data types under consideration might have. In such cases, we believe it is much more useful to construct a distribution that approximates the user's goal than to abort the entire compilation process. We give an example of this approximate solution finding behavior later in this section.

7.1 Cost functions

The optimization process is guided by a user-provided cost function. In our setting, a cost function assigns a real number (a cost) to the combination of a generation size (chosen by the user) and a mapping from constructors to probabilities:

type CostFunction = Size → ProbMap → Double

The type ProbMap encodes the mapping from constructor names to real numbers. Our optimization algorithm works by generating several ProbMap candidates that are evaluated through the provided cost function in order to choose the most suitable one. Cost functions are expected to return a smaller positive number as the predicted distribution obtained from their parameters gets closer to a certain target distribution, which depends on what property that particular cost function is intended to encode. Then, the optimizer simply finds the best ProbMap by minimizing the provided cost function.

Currently, our tool provides a basic set of cost functions to easily describe the expected distribution of the derived generator. For instance, uniform :: CostFunction encodes constructor-wise uniform generation, an interesting property that naturally arises from our generation process formalization. It guides the optimization process to a generation distribution that minimizes the difference between the expected number of each generated constructor and the generation size.
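Putting the pieces together, a user would derive a generator with a Template Haskell splice such as the one below. This is a usage sketch based only on the signature quoted above; the module name Dragen and the numeric literal for Size are assumptions and may differ from the released library.

{-# LANGUAGE TemplateHaskell #-}
module TreeGen where

import Dragen  -- assumed module name exposing dragenArbitrary and uniform

data Tree = LeafA | LeafB | LeafC | Node Tree Tree
  deriving Show

-- Derive an Arbitrary instance for Tree at compile time, asking the
-- optimizer for a constructor-wise uniform distribution with
-- generation size 10 (assuming Size is a numeric type).
$(dragenArbitrary ''Tree 10 uniform)

After this splice, QuickCheck's arbitrary :: Gen Tree follows the compile-time optimized distribution.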

Moreover, the user can restrict the generation distribution to a certain subset of constructors using the cost functions only :: [Name] → CostFunction and without :: [Name] → CostFunction to describe these restrictions. In this case, the whitelisted constructors are then generated following the uniform behavior. Similarly, if the branching process involves mutually recursive data types, the user can restrict the generation to a certain subset of data types by using the functions onlyTypes and withoutTypes. Additionally, when the user wants to generate constructors according to certain proportions, weighted :: [(Name, Int)] → CostFunction allows encoding this property, e.g., three times more LeafA's than LeafB's.

Table 1 shows the number of expected and observed constructors of different Tree generators obtained by using different cost functions. The observed expectations were calculated by averaging the number of constructors across 100000 generated values. Firstly, note how the generated distributions are soundly predicted by our tool. In our tests, the small differences between predictions and actual values disappear as we increase the number of generated values. As for the cost functions' behavior, there are some interesting aspects to note. For instance, in the uniform case the optimizer cannot do anything to break the implicit invariant of the data type: every binary tree with n nodes has n + 1 leaves. Instead, it converges to a solution that "approximates" a uniform distribution around the generation size parameter. We believe this is desirable behavior: to find an approximate solution when certain invariants prevent the optimization process from finding an exact solution. This way the user does not have to be aware of the possible invariants that the target data type may have, obtaining a solution that is good enough for most purposes. On the other hand, notice that in the weighted case in the second row of Table 1, the expected number of generated Nodes is considerably large. This constructor is not listed in the proportions list, hence the optimizer can freely adjust its probability to satisfy the proportions specified for the leaves.

7.2 Derivation Process

DRAGEN's derivation process starts at compile-time with a type reification stage that extracts information about the structure of the types under consideration. It follows an intermediate stage composed of the optimizer for the probabilities used in generators, which is guided by our multi-type branching process model, parametrized on the cost function provided. This optimizer is based on a standard local-search optimization algorithm that recursively chooses the best mapping from constructors to probabilities in the current neighborhood. Neighbors are ProbMaps, determined by individually varying the probabilities for each constructor with a predetermined ∆. Then, to determine the "best" probabilities, the local search applies our prediction mechanism to the immediate neighbors that have not yet been visited, evaluating the cost function to select the most suitable next candidate. This process continues until a local minimum is reached, i.e., when there are no new neighbors to evaluate, or when each step's improvement is lower than a minimum predetermined ε.

The final stage synthesizes an Arbitrary type-class instance for the target data types using the optimized generation probabilities. For this stage, we extend some functionality present in MegaDeTH in order to derive generators parametrized by our previously optimized probabilities. Refer to Appendix B.4 for further details on the cost functions and algorithms addressed in this section.

8 Case Studies

We start by comparing the generators for the ADT Tree derived by MegaDeTH and Feat, presented in Section 2, with the corresponding generator derived by DRAGEN using a uniform cost function. We used a generation size of 10 both for MegaDeTH and DRAGEN, and a generation size of 400 for Feat—that is, Feat will generate test cases of at most 400 constructors, since this is the maximum number of constructors generated by our tool using the generation size cited above. Figure 12 shows the differences between the complexity of the generated values in terms of the number of constructors. As shown in Figure 3, generators derived by MegaDeTH and Feat produce very narrow distributions, being unable to generate a diverse variety of values of different sizes. In contrast, the DRAGEN-optimized generator provides a much wider distribution, i.e., from smaller to bigger values.

Figure 12. MegaDeTH (▲) vs. Feat (■) vs. DRAGEN (•) generated distributions for type Tree.

It is likely that the richer the values generated, the better the chances of covering more code, and thus of finding more bugs. The next case studies provide evidence in that direction.

Although DRAGEN can be used to test Haskell code, we follow the same philosophy as QuickFuzz, targeting three complex and widely used external programs to evaluate how well our derived generators behave. These applications are GNU bash 4.4—a widely used Unix shell, GNU CLISP 2.49—the GNU Common Lisp implementation, and giffix—a small test utility from the GIFLIB 5.1 library focused on reading and writing Gif images. It is worth noticing that these applications are not written in Haskell. Nevertheless, there are Haskell libraries designed to inter-operate with them: language-bash, atto-lisp, and JuicyPixels, respectively. These libraries provide ADT definitions which we used to synthesize generators for the inputs of the aforementioned applications.

Table 1. Predicted and actual distributions for Tree generators using different cost functions.

Cost Function                                      | Predicted Expectation           | Observed Expectation
                                                   | LeafA  LeafB  LeafC  Node       | LeafA  LeafB  LeafC  Node
uniform                                            | 5.26   5.26   5.21   14.73      | 5.27   5.26   5.21   14.74
weighted [('LeafA, 3), ('LeafB, 1), ('LeafC, 1)]   | 30.07  9.76   10.15  48.96      | 30.06  9.75   10.16  48.98
weighted [('LeafA, 1), ('Node, 3)]                 | 10.07  3.15   17.57  29.80      | 10.08  3.15   17.58  29.82
only ['LeafA, 'Node]                               | 10.41  0      0      9.41       | 10.43  0      0      9.43
without ['LeafC]                                   | 6.95   6.95   0      12.91      | 6.93   6.92   0      12.86

Moreover, they also come with serialization functions that allow us to transform the randomly generated Haskell values into the actual test files that we used to test each external program. The case studies contain mutually recursive and composite ADTs with a wide number of constructors (e.g., GNU bash spans 31 different ADTs and 136 different constructors)—refer to Appendix B.5 for a rough estimation of the scale of such data types and the data types involved with them.

For our experiments, we use the coverage measure known as execution path employed by American Fuzzy Lop (AFL) [21]—a well-known fuzzer. It was chosen in this work since it is also used in the work by Grieco et al. [15] to compare MegaDeTH with other techniques. The process consists of the instrumentation of the binaries under test, making them able to return the path in the code taken by each execution. Then, we use AFL to count how many different executions are triggered by a set of randomly generated files—also known as a corpus. In this evaluation, we compare how different QuickCheck generators, derived using MegaDeTH and using our approach, result in different code coverage when testing external programs, as a function of the size of a set of independently, randomly generated corpora. We have not been able to automatically derive such generators using Feat, since it does not work with some Haskell extensions used in the bridging libraries.

We generated each corpus using the same ADTs and generation sizes for each derivation mechanism. We used a generation size of 10 for CLISP and bash files, and a size of 5 for Gif files. For DRAGEN, we used uniform cost functions to reduce any external bias. In this manner, any observed difference in the code coverage triggered by the corpora generated using each derivation mechanism is entirely caused by the optimization stage that our predictive approach performs, which does not represent an extra effort for the user. Moreover, we repeated each experiment 30 times using independently generated corpora for each combination of derivation mechanism and corpus size.

Figure 13 compares the mean number of different execution paths triggered by each pair of generators and corpus sizes, with error bars indicating 95% confidence intervals of the mean. It is easy to see how the DRAGEN generators synthesize test cases capable of triggering a much larger number of different execution paths in comparison to MegaDeTH ones. Our results indicate average increases of approximately between 35% and 41%, with a standard error close to 0.35%, in the number of different execution paths triggered in the programs under test.

An attentive reader might remember that MegaDeTH tends to derive generators which produce very small test cases. If we consider that small test cases should take less time (on average) to be tested, it is fair to think there is a trade-off between being able to test a bigger number of smaller test cases or a smaller number of bigger ones in the same available time. However, when testing external software as in our experiments, it is important to consider the time overhead introduced by running each test against an external program. In this scenario, it is much more preferable to test interesting values over smaller ones. In our tests, size differences between the generated values of each tool do not result in significant differences in the runtimes required to test each corpus—refer to Appendix B.5. A user is most likely to get better results by using our tool instead of MegaDeTH, with virtually the same effort.

We also remark that, if we run sufficiently many tests, then the expected code coverage will tend towards 100% of the reachable code in both cases. However, in practice, our approach is more likely to achieve higher code coverage for the same number of test cases.

9 Related Work

Fuzzers are tools to test programs against randomly generated unexpected inputs. QuickFuzz [14, 15] is a tool that synthesizes data with rich structure, that is, well-typed files which can be used as initial "seeds" for state-of-the-art fuzzers—a workflow which discovered many unknown vulnerabilities. Our work could help to improve the variation of the generated initial seeds by varying the distribution of QuickFuzz generators—an interesting direction for future work.

SmallCheck [27] provides a framework to exhaustively test data sets up to a certain (small) size. The authors also propose a variation called Lazy SmallCheck, which avoids the generation of multiple variants which are passed to the test, but not actually used.

QuickCheck has been used to generate well-typed lambda terms in order to test GHC [25]. Recently, Midtgaard et al. extended such a technique to test compilers for impure programming languages [23].

Figure 13. Path coverage comparison between MegaDeTH (▲) and our approach (•).

Luck [19] is a domain specific language for writing testing properties and QuickCheck generators at the same time. We see Luck's approach as orthogonal to ours, which is mostly intended to be used when we do not know any specific property of the system under test, although we consider that borrowing some functionality from Luck into our tool is an interesting path for future work.

Recently, Lampropoulos et al. proposed a framework to automatically derive random generators for a large subclass of Coq's inductively defined relations [20]. This derivation process also provides proof terms certifying that each derived generator is sound and complete with respect to the inductive relation it was derived from.

Boltzmann models [11] are a general approach to randomly generating combinatorial structures such as trees and graphs—also extended to work with closed simply-typed lambda terms [5]. By implementing a Boltzmann sampler, it is possible to obtain a random generator built around such models which uniformly generates values of a target size with a certain size tolerance. However, this approach has practical limitations. Firstly, the framework is not expressive enough to represent complex constrained data structures, e.g., red-black trees. Secondly, Boltzmann samplers give the user no control over the distribution of generated values besides ensuring size-uniform generation. They work well in theory, but further work is required to apply them to complex structures [26]. Conversely, our tool provides a simple mechanism to predict and tune the overall distribution of constructors analytically at compile time, using statically known type information, and requiring no runtime reinforcements to ensure the predicted distributions. Future work will explore the connections between branching processes and Boltzmann models.

Similarly to our work, Feldt and Poulding propose GödelTest [13], a search-based framework for generating biased data. It relies on non-determinism to generate a wide range of data structures, along with metaheuristic search to optimize the parameters governing the desired biases in the generated data. Rather than using metaheuristic search, our approach employs a completely analytical process to predict the generation distribution at each optimization step. A strength of the GödelTest approach is that it can optimize the probability parameters even when there is no specific target distribution over the constructors—this allows exploiting the behavior of the software under test to guide the parameter optimization.

The efficiency of random testing is improved if the generated inputs are evenly spread across the input domain [6]. This is the main idea of Adaptive Random Testing (ART) [7]. However, this work only covers the particular case of testing programs with numerical inputs, and it has also been argued that adaptive random testing has inherent inefficiencies compared to random testing [1]. This strategy was later extended to object-oriented programs [8]. These approaches present no analysis of the distribution obtained by the heuristics used, therefore we see them as orthogonal work to ours.

10 Final Remarks
We discover an interplay between the stochastic theory of branching processes and the structure of algebraic data types. This connection enables us to describe a solid mathematical foundation to capture the behavior of our derived QuickCheck generators. Based on our formulas, we implement a heuristic to automatically adjust the expected number of constructors being generated as a way to control generation distributions.

One holy grail in testing is the generation of structured data which fulfills certain invariants. We believe that our work could be used to enforce some invariants on data "up to some degree." For instance, by inspecting programs' source code, we could extract the pattern-matching patterns from programs (e.g., (Cons (Cons x))) and derive generators which ensure that such patterns get exercised a certain amount of times (on average)—intriguing thoughts to drive our future work.

Acknowledgments
We would like to thank Michał Pałka, Nick Smallbone, Martin Ceresa and Gustavo Grieco for comments on an early draft. This work was funded by the Swedish Foundation for Strategic Research (SSF) under the projects Octopi (Ref. RIT17-0023) and WebSec (Ref. RIT17-0011), as well as the Swedish research agency Vetenskapsrådet.

References
[1] A. Arcuri and L. Briand. 2011. Adaptive Random Testing: An Illusion of Effectiveness?. In Proc. of the International Symposium on Software Testing and Analysis (ISSTA '11). ACM.
[2] B. Arkin, S. Stender, and G. McGraw. 2005. Software penetration testing. IEEE Security & Privacy (2005).

[3] T. Arts, J. Hughes, U. Norell, and H. Svensson. 2015. Testing AUTOSAR software with QuickCheck. In Proc. of the IEEE International Conference on Software Testing, Verification and Validation, ICST Workshops.
[4] N. Balakrishnan, V. Voinov, and M. S. Nikulin. 2013. Chi-Squared Goodness of Fit Tests with Applications (1st ed.). Academic Press.
[5] M. Bendkowski, K. Grygiel, and P. Tarau. 2017. Boltzmann Samplers for Closed Simply-Typed Lambda Terms. In Proc. of the International Symposium on Practical Aspects of Declarative Languages. ACM.
[6] F. T. Chan, T. Y. Chen, I. K. Mak, and Y. T. Yu. 1996. Proportional sampling strategy: guidelines for software testing practitioners. Information and Software Technology 38, 12 (1996), 775–782.
[7] T. Y. Chen, H. Leung, and I. K. Mak. 2005. Adaptive Random Testing. In Advances in Computer Science - ASIAN 2004. Higher-Level Decision Making, Michael J. Maher (Ed.). Springer Berlin Heidelberg.
[8] I. Ciupa, A. Leitner, M. Oriol, and B. Meyer. 2008. ARTOO: adaptive random testing for object-oriented software. In Proc. of the International Conference on Software Engineering. ACM/IEEE.
[9] K. Claessen, J. Duregård, and M. H. Palka. 2014. Generating Constrained Random Data with Uniform Distribution. In Proc. of Functional and Logic Programming, FLOPS.
[10] K. Claessen and J. Hughes. 2000. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. In Proc. of the ACM SIGPLAN International Conference on Functional Programming (ICFP).
[11] P. Duchon, P. Flajolet, G. Louchard, and G. Schaeffer. 2004. Boltzmann Samplers for the Random Generation of Combinatorial Structures. Combinatorics, Probability and Computing 13 (2004).
[12] J. Duregård, P. Jansson, and M. Wang. 2012. Feat: Functional enumeration of algebraic types. In Proc. of the ACM SIGPLAN Symposium on Haskell.
[13] R. Feldt and S. Poulding. 2013. Finding test data with specific properties via metaheuristic search. In Proc. of the International Symposium on Software Reliability Engineering (ISSRE). IEEE.
[14] G. Grieco, M. Ceresa, and P. Buiras. 2016. QuickFuzz: An automatic random fuzzer for common file formats. In Proc. of the International Symposium on Haskell. ACM.
[15] G. Grieco, M. Ceresa, A. Mista, and P. Buiras. 2017. QuickFuzz testing for fun and profit. Journal of Systems and Software 134, Supp. C (2017).
[16] P. Haccou, P. Jagers, and V. Vatutin. 2005. Branching processes: Variation, growth, and extinction of populations. Cambridge University Press.
[17] J. Hughes, B. C. Pierce, T. Arts, and U. Norell. 2016. Mysteries of DropBox: Property-Based Testing of a Distributed Synchronization Service. In Proc. of the International Conference on Software Testing, Verification and Validation, ICST.
[18] J. Hughes, U. Norell, N. Smallbone, and T. Arts. 2016. Find more bugs with QuickCheck!. In Proc. of the International Workshop on Automation of Software Test, AST@ICSE.
[19] L. Lampropoulos, D. Gallois-Wong, C. Hritcu, J. Hughes, B. C. Pierce, and L. Xia. 2017. Beginner's luck: a language for property-based generators. In Proc. of the ACM SIGPLAN Symposium on Principles of Programming Languages, POPL.
[20] L. Lampropoulos, Z. Paraskevopoulou, and B. C. Pierce. 2017. Generating Good Generators for Inductive Relations. Proc. ACM on Programming Languages 2, POPL, Article 45 (2017).
[21] M. Zalewski. 2010. American Fuzzy Lop: a security-oriented fuzzer. http://lcamtuf.coredump.cx/afl/. (2010).
[22] C. McBride and R. Paterson. 2008. Applicative Programming with Effects. Journal of Functional Programming 18, 1 (Jan. 2008).
[23] J. Midtgaard, M. N. Justesen, P. Kasting, F. Nielson, and H. R. Nielson. 2017. Effect-driven QuickChecking of compilers. Proc. ACM on Programming Languages 1, ICFP (2017).
[24] N. Mitchell. 2007. Deriving Generic Functions by Example. In Proc. of the 1st York Doctoral Symposium. Tech. Report YCS-2007-421, Department of Computer Science, University of York, UK, 55–62.
[25] M. Pałka, K. Claessen, A. Russo, and J. Hughes. 2011. Testing an Optimising Compiler by Generating Random Lambda Terms. In Proc. of the IEEE/ACM International Workshop on Automation of Software Test (AST 2011).
[26] S. M. Poulding and R. Feldt. 2017. Automated Random Testing in Multiple Dispatch Languages. In Proc. of the IEEE International Conference on Software Testing, Verification and Validation (ICST).
[27] C. Runciman, M. Naylor, and F. Lindblad. 2008. Smallcheck and Lazy Smallcheck: automatic exhaustive testing for small values. In Proc. of the ACM SIGPLAN Symposium on Haskell.
[28] T. Sheard and S. L. Peyton Jones. 2002. Template meta-programming for Haskell. SIGPLAN Notices 37, 12 (2002), 60–75.
[29] H. W. Watson and F. Galton. 1875. On the probability of the extinction of families. The Journal of the Anthropological Institute of Great Britain and Ireland (1875).

A Demonstrations
In this appendix, we provide the formal development to show that the mean matrix of types can be used to predict the distribution of constructors. We start by defining some terminology.

Definition. Let $T_t$ be a data type defined as a sum of type constructors:

    T_t := C_1^t + C_2^t + \cdots + C_n^t

where each constructor is defined as a product of data types:

    C_c^t := T_1 \times T_2 \times \cdots \times T_m

We will define the following observation functions:

    cons(T_t) = \{C_c^t\}_{c=1}^{n}        args(C_c^t) = \{T_j\}_{j=1}^{m}        |T_t| = |cons(T_t)| = n

We will also define the branching factor from $C_i^u$ to $T_v$ as the natural number $\beta(T_v, C_i^u)$ denoting the number of occurrences of $T_v$ in the arguments of $C_i^u$:

    \beta(T_v, C_i^u) = |\{T_k \in args(C_i^u) \mid T_k = T_v\}|

Before showing our main theorem, we need some preliminary propositions. The following one relates the mean of reproduction of constructors with their types and the number of occurrences in the ADT declaration.

Proposition A.1. Let $M_C$ be the mean matrix for constructors for a given set of (possibly mutually recursive) data types $\{T_t\}_{t=1}^{n}$ and type constructors $\{C_i^t\}_{i=1}^{|T_t|}$. Assuming $p_{C_i^t}$ to be the probability of generating a constructor $C_i^t \in cons(T_t)$ whenever a value of type $T_t$ is needed, then it holds that:

    m_{C_i^u C_j^v} = \beta(T_v, C_i^u) \cdot p_{C_j^v}    (13)

Proof. Let $m_{C_i^u C_j^v}$ be an element of $M_C$. We know that $m_{C_i^u C_j^v}$ represents the expected number of constructors $C_j^v \in cons(T_v)$ generated whenever a constructor $C_i^u \in cons(T_u)$ is generated. Since every constructor is composed of a product of (possibly) many arguments, we need to sum the expected number of constructors $C_j^v$ generated by each argument of $C_i^u$ of type $T_v$—the expected number of constructors $C_j^v$ generated by an argument of a type different from $T_v$ is null. For this, we define the random variable $X_k^{C_i^u C_j^v}$ capturing the number of constructors $C_j^v$ generated by the $k$-th argument of $C_i^u$ as follows:

    X_k^{C_i^u C_j^v} : cons(T_v) \to \mathbb{N}        X_k^{C_i^u C_j^v}(C_c^v) = 1 if c = j, 0 otherwise

We can calculate the probabilities of generating zero or one constructors $C_j^v$ by the $k$-th argument of $C_i^u$ as follows:

    P(X_k^{C_i^u C_j^v} = 0) = 1 - p_{C_j^v}        P(X_k^{C_i^u C_j^v} = 1) = p_{C_j^v}

Then, we can calculate the expectation of each $X_k^{C_i^u C_j^v}$:

    E[X_k^{C_i^u C_j^v}] = 1 \cdot P(X_k^{C_i^u C_j^v} = 1) + 0 \cdot P(X_k^{C_i^u C_j^v} = 0) = p_{C_j^v}    (14)

Finally, we can calculate the expected number of constructors $C_j^v$ generated whenever we generate a constructor $C_i^u$ by adding the expected number of $C_j^v$ generated by each argument of $C_i^u$ of type $T_v$:

    m_{C_i^u C_j^v} = \sum_{\{T_k \in args(C_i^u) \mid T_k = T_v\}} E[X_k^{C_i^u C_j^v}]
                    = \sum_{\{T_k \in args(C_i^u) \mid T_k = T_v\}} p_{C_j^v}                      (by (14))
                    = p_{C_j^v} \cdot \sum_{\{T_k \in args(C_i^u) \mid T_k = T_v\}} 1              (p_{C_j^v} is constant)
                    = p_{C_j^v} \cdot |\{T_k \in args(C_i^u) \mid T_k = T_v\}|                     (\sum_S 1 = |S|)
                    = p_{C_j^v} \cdot \beta(T_v, C_i^u)                                            (by def. of \beta)
■

The next proposition relates the mean of reproduction of types with their constructors.

Proposition A.2. Let $M_T$ be the mean matrix for types for a given set of (possibly mutually recursive) data types $\{T_t\}_{t=1}^{n}$ and type constructors $\{C_i^t\}_{i=1}^{|T_t|}$. Assuming $p_{C_i^t}$ to be the probability of generating a constructor $C_i^t \in cons(T_t)$ whenever a value of type $T_t$ is needed, then it holds that:

    m_{T_u T_v} = \sum_{C_k^u \in cons(T_u)} \beta(T_v, C_k^u) \cdot p_{C_k^u}    (15)

Proof. Let $m_{T_u T_v}$ be an element of $M_T$. We know that $m_{T_u T_v}$ represents the expected number of placeholders of type $T_v$ generated whenever a placeholder of type $T_u$ is generated, i.e., by any of its constructors. Therefore, we need to average the number of placeholders of type $T_v$ appearing in each constructor of $T_u$. For that, we introduce the random variable $Y^{uv}$ capturing this behavior:

    Y^{uv} : cons(T_u) \to \mathbb{N}        Y^{uv}(C_k^u) = \beta(T_v, C_k^u)

And we can obtain $m_{T_u T_v}$ by calculating the expected value of $Y^{uv}$ as follows:

    m_{T_u T_v} = E[Y^{uv}]
                = \sum_{C_k^u \in cons(T_u)} \beta(T_v, C_k^u) \cdot P(Y^{uv} = C_k^u)    (def. of E[Y^{uv}])
                = \sum_{C_k^u \in cons(T_u)} \beta(T_v, C_k^u) \cdot p_{C_k^u}            (def. of p_{C_k^u})
■
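To make the two propositions concrete, here is a small worked example of ours (it does not appear in the original development). Consider the recursive type data T = A | B T T | C T T generated with uniform probabilities $p_{C_A} = p_{C_B} = p_{C_C} = 1/3$; since there is a single type, both mean matrices collapse to scalars:

    \beta(T, C_A) = 0,    \beta(T, C_B) = \beta(T, C_C) = 2
    m_{C_B C_B} = \beta(T, C_B) \cdot p_{C_B} = 2 \cdot \tfrac{1}{3} = \tfrac{2}{3}    (by (13))
    m_{T T} = \beta(T, C_A)\, p_{C_A} + \beta(T, C_B)\, p_{C_B} + \beta(T, C_C)\, p_{C_C} = 0 + \tfrac{2}{3} + \tfrac{2}{3} = \tfrac{4}{3}    (by (15))

Since $m_{TT} = 4/3 > 1$, each T placeholder spawns more than one T placeholder on average, which is consistent with the non-termination analysis of this same type in Appendix B.1.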

The next proposition relates one entry in $M_T$ with its corresponding entries in $M_C$.

Proposition A.3. Let $M_C$ and $M_T$ be the mean matrices for constructors and types respectively, for a given set of (possibly mutually recursive) data types $\{T_t\}_{t=1}^{n}$ and type constructors $\{C_i^t\}_{i=1}^{|T_t|}$. Assuming $p_{C_i^t}$ to be the probability of generating a type constructor $C_i^t \in cons(T_t)$ whenever a value of type $T_t$ is needed, then it holds that:

    p_{C_i^v} \cdot m_{T_u T_v} = \sum_{C_j^u \in cons(T_u)} m_{C_j^u C_i^v} \cdot p_{C_j^u}    (16)

Proof. Let $C_i^u$ and $C_j^v$ be type constructors of $T_u$ and $T_v$ respectively. Then, by (13) and (15) we have:

    m_{C_i^u C_j^v} = \beta(T_v, C_i^u) \cdot p_{C_j^v}                              (17)
    m_{T_u T_v} = \sum_{C_k^u \in cons(T_u)} \beta(T_v, C_k^u) \cdot p_{C_k^u}       (18)

Now, we can rewrite (17) as follows:

    \beta(T_v, C_i^u) = m_{C_i^u C_j^v} / p_{C_j^v}    (if p_{C_j^v} \neq 0)          (19)

(In the case that $p_{C_j^v} = 0$, the equation in this proposition holds trivially by (17).) And by replacing (19) in (18) we obtain:

    m_{T_u T_v} = \sum_{C_k^u \in cons(T_u)} (m_{C_k^u C_j^v} / p_{C_j^v}) \cdot p_{C_k^u}
    m_{T_u T_v} = (1 / p_{C_j^v}) \cdot \sum_{C_k^u \in cons(T_u)} m_{C_k^u C_j^v} \cdot p_{C_k^u}    (p_{C_j^v} constant)
    p_{C_j^v} \cdot m_{T_u T_v} = \sum_{C_k^u \in cons(T_u)} m_{C_k^u C_j^v} \cdot p_{C_k^u}
■

Now, we proceed to prove our main result.

Theorem 1. Consider a QuickCheck generator for (possibly mutually recursive) data types $\{T_t\}_{t=1}^{k}$ and type constructors $\{C_i^t\}_{i=1}^{|T_t|}$. We assume $p_{C_i^t}$ to be the probability of generating a type constructor $C_i^t \in cons(T_t)$ when a value of type $T_t$ is needed. We will call $T_r$ ($1 \le r \le k$) the generation root data type, and $M_C$ and $M_T$ the mean matrices of the multi-type branching process capturing the generation behavior of type constructors and types respectively. The branching process predicting the expected number of type constructors at level $n$ is governed by the formula:

    E[G_n^C] = E[G_0^C] \cdot \frac{I - (M_C)^{\,n+1}}{I - M_C}

In the same way, the branching process predicting the expected number of type placeholders at level $n$ is given by:

    E[G_n^T] = E[G_0^T] \cdot \frac{I - (M_T)^{\,n+1}}{I - M_T}

where $G_n^C$ denotes the constructors population at level $n$, and $G_n^T$ denotes the type placeholders population at level $n$. The expected number of constructors $C_i^t$ at the $n$-th level is given by the expected constructors population at level $n$, $E[G_n^C]$, indexed by the corresponding constructor. Similarly, the expected number of placeholders of type $T_t$ at the $n$-th level is given by the expected types population at level $n$, $E[G_n^T]$, indexed by the corresponding type. The initial constructors population $E[G_0^C]$ is defined as the probability of each constructor if it belongs to the root data type, and zero if it belongs to any other data type:

    E[G_0^C].C_i^t = p_{C_i^t} if t = r, 0 otherwise

The initial type placeholders population is defined as the almost-sure probability for the root type, and zero for any other type:

    E[G_0^T].T_t = 1 if t = r, 0 otherwise

Finally, it holds that:

    (E[G_n^C]).C_i^t = (E[G_n^T]).T_t \cdot p_{C_i^t}

In other words, the expected number of constructors $C_i^t$ at the $n$-th level consists of the expected number of placeholders of its type (i.e., $T_t$) at level $n$ times the probability of generating that constructor.

Proof. By induction on the generation size $n$.

• Base case. We want to prove $(E[G_0^C]).C_i^t = (E[G_0^T]).T_t \cdot p_{C_i^t}$. Let $T_t$ be a data type from the Galton-Watson branching process.
  – If $T_t = T_r$, then by the definitions of the initial type constructors and type placeholders populations we have:

        (E[G_0^C]).C_i^t = p_{C_i^t}        (E[G_0^T]).T_t = 1

    and the theorem trivially holds by replacing $(E[G_0^C]).C_i^t$ and $(E[G_0^T]).T_t$ with the previous equations in the goal.
  – If $T_t \neq T_r$, then by the definitions of the initial type constructors and type placeholders populations we have:

        (E[G_0^C]).C_i^t = 0        (E[G_0^T]).T_t = 0

    and once again, the theorem trivially holds by replacing $(E[G_0^C]).C_i^t$ and $(E[G_0^T]).T_t$ with the previous equations in the goal.

• Inductive case. We want to prove $(E[G_n^C]).C_i^t = (E[G_n^T]).T_t \cdot p_{C_i^t}$. For simplicity, we will call $\Gamma = \{T_t\}_{t=1}^{k}$.

    (E[G_n^C]).C_i^t
      = E[ \sum_{T_k \in \Gamma} ( \sum_{C_j^k \in cons(T_k)} (G_{n-1}^C).C_j^k \cdot m_{C_j^k C_i^t} ) ]      (by G.W. proc.)
      = E[ \sum_{T_k \in \Gamma} \sum_{C_j^k \in cons(T_k)} (G_{n-1}^C).C_j^k \cdot m_{C_j^k C_i^t} ]          (by prob.)
      = \sum_{T_k \in \Gamma} ( \sum_{C_j^k \in cons(T_k)} E[ (G_{n-1}^C).C_j^k \cdot m_{C_j^k C_i^t} ] )      (by prob.)
      = \sum_{T_k \in \Gamma} ( \sum_{C_j^k \in cons(T_k)} E[ (G_{n-1}^C).C_j^k ] \cdot m_{C_j^k C_i^t} )      (by prob.)
      = \sum_{T_k \in \Gamma} ( \sum_{C_j^k \in cons(T_k)} (E[G_{n-1}^C]).C_j^k \cdot m_{C_j^k C_i^t} )        (by linear alg.)
      = \sum_{T_k \in \Gamma} ( \sum_{C_j^k \in cons(T_k)} (E[G_{n-1}^T]).T_k \cdot p_{C_j^k} \cdot m_{C_j^k C_i^t} )    (by I.H.)
      = \sum_{T_k \in \Gamma} (E[G_{n-1}^T]).T_k \cdot \sum_{C_j^k \in cons(T_k)} p_{C_j^k} \cdot m_{C_j^k C_i^t}        (by linear alg.)
      = \sum_{T_k \in \Gamma} (E[G_{n-1}^T]).T_k \cdot p_{C_i^t} \cdot m_{T_k T_t}                             (by (16))
      = \sum_{T_k \in \Gamma} (E[G_{n-1}^T]).T_k \cdot m_{T_k T_t} \cdot p_{C_i^t}                             (rearrange)
      = \sum_{T_k \in \Gamma} E[ (G_{n-1}^T).T_k ] \cdot m_{T_k T_t} \cdot p_{C_i^t}                           (by linear alg.)
      = \sum_{T_k \in \Gamma} E[ (G_{n-1}^T).T_k \cdot m_{T_k T_t} ] \cdot p_{C_i^t}                           (by prob.)
      = E[ \sum_{T_k \in \Gamma} (G_{n-1}^T).T_k \cdot m_{T_k T_t} ] \cdot p_{C_i^t}                           (by prob.)
      = (E[G_n^T]).T_t \cdot p_{C_i^t}                                                                         (by G.W. proc.)
■
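As a complement to the closed form above, the following self-contained Haskell sketch (our illustration, with hypothetical names; it is not the paper's implementation) evaluates the geometric sum $I + M + M^2 + \cdots + M^n$ that the quotient $(I - M^{\,n+1})/(I - M)$ denotes, and multiplies it by an initial population vector, using naive list-based matrix operations.

import Data.List (transpose)

-- Row vectors and square matrices indexed by constructors (for M_C)
-- or by types (for M_T).
type Vec = [Double]
type Mat = [[Double]]

-- Multiply a row vector by a matrix.
vecMat :: Vec -> Mat -> Vec
vecMat v m = [sum (zipWith (*) v col) | col <- transpose m]

-- Matrix product and identity matrix.
matMul :: Mat -> Mat -> Mat
matMul a b = [vecMat row b | row <- a]

identity :: Int -> Mat
identity n = [[if i == j then 1 else 0 | j <- [1 .. n]] | i <- [1 .. n]]

-- expectedPopulation g0 m n computes g0 . (I + M + M^2 + ... + M^n),
-- the population expectation that Theorem 1 writes as
-- E[G_0] . (I - M^(n+1)) / (I - M).
expectedPopulation :: Vec -> Mat -> Int -> Vec
expectedPopulation g0 m n = vecMat g0 (foldr1 addMat (take (n + 1) powers))
  where powers = iterate (`matMul` m) (identity (length m))
        addMat = zipWith (zipWith (+))

Instantiating g0 with $E[G_0^C]$ and m with $M_C$ for a concrete data type gives the per-constructor expectations that Theorem 1 predicts.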

B Additional information
This appendix is meant to provide further analyses for the aspects presented throughout this work that would not fit into the available space.

B.1 Termination issues with library derive
As we have introduced in Section 2, the library derive provides an easy alternative to automatically synthesize random generators at compile time. However, in the presence of recursive data types, the generators obtained with this tool lack mechanisms to ensure termination. For instance, consider the following data type definition and its corresponding generator obtained with derive:

data T = A | B T T | C T T

instance Arbitrary T where
  arbitrary = oneof [pure A
                    , B ⟨$⟩ arbitrary ⟨∗⟩ arbitrary
                    , C ⟨$⟩ arbitrary ⟨∗⟩ arbitrary]

When using this generator, every constructor has the same probability of being chosen. Additionally, at each point of the generation process, if we randomly generate a recursive type constructor (either B or C), then we also need to generate two new T values in order to fill the arguments of the chosen type constructor. As a result, it is expected (on average) that each time QuickCheck generates a recursive constructor (i.e., B or C) at one level, more than one recursive constructor is generated at the next level—thus frequently leading to an infinite generation loop. This behavior can be formalized using the concept known as the probability generating function, where it is proven that the extinction probability d of a generated value (and thus the termination of the generation) can be calculated by finding the smallest fixed point of the generation recurrence. In our example, this is the smallest d such that d = P_A + (P_B + P_C) · d² = (1/3) + (2/3) · d², where P_i denotes the probability of generating constructor i. In this case d = 1/2.

Figure 14 provides an empirical verification of this non-terminating behavior. It shows the distribution (in terms of the amount of constructors) of 100000 randomly generated T values obtained using the derive generator shown above. The black bar on the right represents the amount of values that induced an infinite generation loop. Such values were recognized using a sufficiently big timeout. The random generation gets stuck in an infinite generation loop almost exactly half of the times we generate a random T value.

Figure 14. Distribution of (the amount of) T constructors induced by derive.

In practice, this non-terminating behavior gets worse as we increase either the number of recursive constructors or the number of their recursive arguments in the data type definition, since this increases the probability of choosing a recursive constructor each time we need to generate a subterm.
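To double-check the value d = 1/2 above, the following small Haskell sketch (ours, not part of derive or of our tool) approximates the least fixed point of the recurrence by iterating it from 0; the iteration converges to the extinction probability from below.

-- Iterate d -> 1/3 + (2/3) * d^2, the generating-function recurrence of
-- the derive generator for T, starting from 0. The limit is the least
-- fixed point, i.e. the probability that the generation terminates.
extinctionProb :: Double
extinctionProb = go 0 (0 :: Int)
  where step d = 1/3 + (2/3) * d * d
        go d k
          | k > 1000 || abs (step d - d) < 1e-12 = d
          | otherwise = go (step d) (k + 1)

-- ghci> extinctionProb   -- approximately 0.5, matching d = 1/2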
B.2 Multi-type Branching Processes
We will verify the soundness of the step noted as (⋆), used to deduce $E[G_n^j \mid G_{n-1}]$ in Section 4. In the first place, note that $E[G_n^j \mid G_{n-1}]$ can be rewritten as:

    E[G_n^j \mid G_{n-1}] = E[ \sum_{i=1}^{d} \sum_{p=1}^{G_{n-1}^i} \xi_{ij}^p ]

where the symbol $\xi_{ij}^p$ denotes the number of offspring of kind $j$ that the parent $p$ of kind $i$ produces. If the parent $p$ is not of kind $i$, then $\xi_{ij}^p = 0$. Essentially, the sums simply iterate over all of the different kinds of parents present in the $n$-th generation, counting the number of offspring of kind $j$ that they produce. Then, since the expectation of a sum is the sum of the expectations, we have that:

    E[G_n^j \mid G_{n-1}] = \sum_{i=1}^{d} \sum_{p=1}^{G_{n-1}^i} E[\xi_{ij}^p]

In the inner sum, there are some terms which are 0 and others which are the expected offspring of kind $j$ that a parent of kind $i$ produces. As introduced in Section 4, we capture with the random variable $R_{ij}$ the distribution governing that a parent of kind $i$ produces offspring of kind $j$. Finally, by filtering out all the terms which are 0 in the inner sum, i.e., those where the parent $p$ is not of kind $i$, we obtain the expected result:

    E[G_n^j \mid G_{n-1}] = \sum_{i=1}^{d} G_{n-1}^i \cdot E[R_{ij}]

B.3 Terminal constructors
As we explained in Section 5, our tool synthesizes random generators for which the generation of terminal constructors can be thought of as two different random processes. More specifically, the first (n − 1) generations of the branching process are composed of a mix of non-terminal and terminal constructors. The last level, however, only contains terminal constructors, since the size limit has been reached. Figure 15 shows a graphical representation of the overall process.

Figure 15. Generation processes of non-terminal (•) and terminal (■) constructors.

B.4 Implementation
In this subsection, we give more details on the implementation of our tool. Firstly, Figure 16 shows a schema of the automatic derivation pipeline our tool performs. The user provides a target data type, a cost function and a desired generation size, and our tool returns an optimized random generator. The components marked in red are heavily dependent on Template Haskell and refer to the type introspection and code generation stages, while the intermediate stages (in blue) are composed of our prediction mechanism and the probabilities optimizer.

Figure 16. Generation schema (type reification → distribution prediction → probabilities optimization → code generation, driven by the target data type, the cost function and the generation size).

Cost functions. The probabilities optimizer that our tool implements essentially works by minimizing a provided cost function that encodes the desired distribution of constructors in the optimized generator. As shown in Section 7, our tool comes with a minimal set of useful cost functions. Such functions are built around the Chi-Square Goodness of Fit Test [4], a statistical test used to quantify how much the observed value of a given phenomenon differs from its expected value:

    \chi^2 = \sum_{C_i \in \Gamma} (observed_i - expected_i)^2 / expected_i

where $\Gamma$ is a subset of the constructors involved in the generation process; $observed_i$ corresponds to the predicted number of generated $C_i$ constructors; and $expected_i$ corresponds to the amount of $C_i$ constructors desired in the distribution of the optimized generator. This fitness test was chosen for empirical reasons, since it provides better results in practice when finding probabilities that ensure certain distributions.

In this appendix we pay special attention to the weighted cost function, since it is the most general one that our tool provides—the remaining cost functions could be expressed in terms of weighted. This function uses our previously discussed prediction mechanism to obtain a prediction of the constructors distribution under the currently given probabilities and the generation size (see obs), and uses it to calculate the Chi-Square Goodness of Fit Test. A simplified implementation of this cost function is as follows.

weighted :: [(Name, Double)] → CostFunction
weighted weights size probs = chiSquare obs exp
  where chiSquare = sum ◦ zipWith (λo e → (o − e) ^ 2 / e)
        obs = predict size probs
        exp = map weight (Map.keys probs)
        weight con = case lookup con weights of
          Just w  → w ∗ size
          Nothing → 0

Note how we multiply each weight by the generation size provided by the user (case Just w), as a simple way to control the relative size of the generated values. Moreover, the generation probabilities for the constructors not listed in the proportions list do not contribute to the cost (case Nothing), and thus they can be freely adjusted by the optimizer to fit the proportions of the listed constructors. In this light, the uniform cost function can be seen as a special case of weighted, where every constructor is listed with weight 1.
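Following that observation, and assuming the ProbMap and CostFunction types used by the snippets in this appendix (with CostFunction standing for Size → ProbMap → Double), uniform could be expressed as a thin wrapper over weighted. This is a sketch of ours, not necessarily the tool's actual definition:

-- Sketch: uniform as weighted with weight 1 for every constructor
-- present in the probability map.
uniform :: CostFunction
uniform size probs = weighted [(con, 1) | con <- Map.keys probs] size probs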

Optimization algorithm. As introduced in Section 7, our tool makes use of an optimization mechanism in order to obtain a suitable generation probabilities assignment for its derived generators. Figure 17 illustrates a simplified implementation of our optimization algorithm. This optimizer works by recursively selecting the most suitable neighbor, i.e., a probability assignment that is close to the current one and that minimizes the output of the provided cost function. This process is repeated until a local minimum is found, either when there are no further neighbors that remain unvisited, or when the step improvement is below a predetermined minimum ε.

In our setting, neighbors are obtained by taking the current probability distribution and constructing a list of paired probability distributions, where each one is constructed from the current distribution by adjusting each constructor probability by ±∆. This behavior is shown in Figure 18. Note the need for bound checking and normalization of the new neighbors in order to enforce a probability distribution (max 0 and norm). Each pair of neighbors is then joined together and returned as the immediate neighborhood of the current probability distribution.

optimize :: CostFunction → Size → ProbMap → ProbMap
optimize cost size init = localSearch init []
  where localSearch focus visited
          | null new  = focus
          | gain ⩽ ε  = focus
          | otherwise = localSearch best frontier
          where best     = minimumBy (comparing (cost size)) new
                new      = neighbors focus \\ (focus : visited)
                frontier = new ++ visited
                gain     = cost size focus − cost size best

Figure 17. Optimization algorithm.

neighbors :: ProbMap → [ProbMap]
neighbors probs = concatMap perturb (Map.keys probs)
  where perturb con = [norm (adj (+∆) con)
                      , norm (adj (max 0 ◦ subtract ∆) con)]
        norm m    = fmap (/ sum (Map.elems m)) m
        adj f con = Map.adjust f con probs

Figure 18. Immediate neighbors of a probability distribution.
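Putting Figures 17 and 18 together, a typical invocation of the optimizer would look roughly as follows. This usage sketch is ours: initialProbs stands for some starting probability assignment (e.g., a uniform one), and the constructor names are only illustrative.

-- Tune the generation probabilities so that, at generation size 10,
-- 'LeafA is expected roughly three times as often as 'Node.
tunedProbs :: ProbMap
tunedProbs = optimize (weighted [('LeafA, 3), ('Node, 1)]) 10 initialProbs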

B.5 Case studies
As explained in Section 8, our test cases targeted three complex programs to evaluate the power of our derivation tool, i.e., GNU CLISP 2.49, GNU bash 4.4 and GIFLIB 5.1. We derived random generators for each test case input format using some existing Haskell libraries. Each one of these libraries contains data type definitions encoding the structure of the input format of its corresponding test case, as well as serialization functions that we use to convert randomly generated Haskell values into actual test input files. Table 2 illustrates the complexity of the bridging libraries used in our case studies.

Table 2. Type information for ADTs used in the case studies.

Case Study | Number of involved types | Number of involved constructors | Mutually recursive types | Composite types
Lisp       | 7                        | 14                              | Yes                      | Yes
Bash       | 31                       | 136                             | Yes                      | Yes
Gif        | 16                       | 30                              | Yes                      | No

Testing runtimes. As we have shown, MegaDeTH tends to derive generators which produce very small test cases. However, in our tests, the size differences in the test cases generated by each tool do not produce remarkable differences in the runtimes required to test each corpus. Figure 19 shows the execution time required to test each case of the biggest corpora previously generated by each tool, consisting of 1000 test cases.

Figure 19. Execution time required to test the biggest randomly generated corpora, consisting of 1000 files.