
arXiv:1906.11199v1 [cs.PL] 20 Jun 2019

Deployable Probabilistic Programming

David Tolpin
PUB+, Israel
[email protected]

Abstract

We propose design guidelines for a probabilistic programming facility suitable for deployment as a part of a production software system. As a reference implementation, we introduce Infergo, a probabilistic programming facility for Go, a modern programming language of choice for server-side software development. We argue that a similar probabilistic programming facility can be added to most modern general-purpose programming languages.

Probabilistic programming enables automatic tuning of program parameters and algorithmic decision making through probabilistic inference based on the data. To facilitate addition of probabilistic programming capabilities to other programming languages, we share implementation choices and techniques employed in the development of Infergo. We illustrate the applicability of Infergo to various use cases on several case studies, and evaluate Infergo's performance on benchmarks, comparing Infergo to dedicated inference-centric probabilistic programming frameworks.

ACM Reference Format:
David Tolpin. 2019. Deployable Probabilistic Programming. In Proceedings of SPLASH Onward! 2019 (Onward! 2019), October 20-25, 2019, Athens, Greece. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction

Probabilistic programming [13, 14, 26, 62] represents statistical models as programs written in an otherwise general programming language that provides syntax for the definition and conditioning of random variables. Inference can be performed on probabilistic programs to obtain the posterior distribution or point estimates of the variables. The inference algorithms are usually provided by the probabilistic programming framework, and each algorithm is applicable to a wide class of probabilistic programs in a black-box manner. The algorithms include Metropolis-Hastings [26, 59, 63], Hamiltonian Monte Carlo [10], expectation propagation [29], extensions of Sequential Monte Carlo [33, 38, 60], variational inference [23, 62], gradient-based optimization [7, 55], and others.

There are two polar views on the role of probabilistic programming. One view is that probabilistic programming is a flexible framework for specification and analysis of statistical models [11, 51, 59]. Proponents of this view perceive probabilistic programming as a tool for data science practitioners. A typical workflow consists of data acquisition and pre-processing, followed by several iterations of exploratory model design and testing of inference algorithms. Once a sufficiently robust statistical model is obtained, analysis results are post-processed, visualized, and occasionally integrated into algorithms of a production software system.

The other, emerging, view is that probabilistic programming is an extension of the regular programming toolbox, allowing algorithms implemented in a general-purpose programming language to have learnable parameters. In this view, statistical models are integrated into the software system, and inference and optimization take place during data processing, prediction, and algorithmic decision making in production. Probabilistic programming code runs unattended, and inference results are an integral part of the algorithms. Natural applications arise whenever the algorithm is a generative model of a mental or physical process, in particular when latent variables of the model are used in decision making [15, 61]; for example, for inference about user behavior in social networks, ongoing health monitoring, or online motion planning of autonomous robotic devices.

In this work we are concerned with the latter view on probabilistic programming. Our objective is to pave a path to deployment of probabilistic programs in production systems, promoting a wider adoption of probabilistic programming as a flexible tool for ubiquitous statistical inference. Most probabilistic programming frameworks are suited, to a certain extent, to be used either in the exploratory or in the production scenario. However, the proliferation of probabilistic programming languages and frameworks¹ on one hand, and the scarcity of success stories about production use of probabilistic programming on the other hand, suggest that there is a need for new ideas and approaches to make probabilistic programming models available for production use.

¹ There are 45 probabilistic programming languages listed on Wikipedia [58]. 18 probabilistic programming languages were presented during the developer meetup at PROBPROG'2018, the inaugural conference on probabilistic programming, http://probprog.cc/, half of which are not on the Wikipedia list.

Our attitude differs from the established approach in that instead of proposing yet another probabilistic programming language, either genuinely different [10, 28] or derived from an existing general-purpose programming language by extending the syntax or changing the semantics [11, 13, 51], we advocate enabling probabilistic inference on models written in a general-purpose programming language, the same one as the language used to program the bulk of the system. We formulate guidelines for such an implementation and, based on the guidelines, introduce a probabilistic programming facility for the Go programming language [50]. By explaining our implementation choices and techniques, we argue that a similar facility can be added to any modern general-purpose programming language, giving a production system modeling and inference capabilities which do not fall short of, and sometimes exceed, those of probabilistic programming-centric frameworks with custom languages and runtimes. Availability of such facilities in major general-purpose programming languages would free probabilistic programming researchers and practitioners alike from language wars [48] and foster practical applications of probabilistic programming.

Contributions

This work brings the following major contributions:

• Guidelines for design and implementation of a probabilistic programming facility deployable as a part of a production software system.
• A reference implementation of a deployable probabilistic programming facility for the Go programming language.
• Implementation highlights outlining important choices we made in the implementation of the probabilistic programming facility, and providing a basis for implementation of a similar facility for other general-purpose programming languages.

2 Related Work

Our work is related to three interconnected areas of research:

1. Design and implementation of probabilistic programming languages.
2. Integration and interoperability between probabilistic programming models and the surrounding software systems.
3. Differentiable programming for machine learning.

Probabilistic programming is traditionally associated with the design and implementation of probabilistic programming languages. An apparent justification for a new programming language is that a probabilistic program has different semantics [16, 46] than a 'regular' program of similar structure. Goodman and Stuhlmüller [14] offer a systematic approach to the design and implementation of probabilistic programming languages based on extension of the syntax of a (subset of a) general-purpose programming language and transformational compilation to continuation-passing style [2, 3]. van de Meent et al. [54] give a comprehensive overview of available choices of design and implementation of first-order and higher-order probabilistic programming languages. Stan [10] is built around an imperative probabilistic programming language with program structure tuned for most efficient execution of inference on the model. SlicStan [17] is built on top of Stan as a source-to-source compiler, providing the model developer with a more intuitive language while retaining the performance benefits of Stan. Our work differs from existing approaches in that we advocate enabling probabilistic programming in a general-purpose programming language by leveraging existing capabilities of the language, rather than building a new language on top of, or beside, an existing one.

The need for integration between probabilistic programs and the surrounding software environment is well understood in the literature [7, 51]. Probabilistic programming languages are often implemented as embedded domain-specific languages [11, 42, 51] and include a mechanism for calling functions from libraries of the host language. Our work goes a step further in this direction: since the language of implementation of probabilistic models coincides with the host language, any library or user-defined function of the host language can be directly called from the probabilistic model, and model methods can be directly called from the host language.

Automatic differentiation is widely employed in machine learning [5, 56], where it is also known as 'differentiable programming', and is responsible for enabling efficient inference in many probabilistic programming frameworks [7, 10, 11, 21, 52]. Different automatic differentiation techniques [18] allow different compromises between flexibility, efficiency, and feature-richness [21, 34, 44]. Automatic differentiation is usually implemented through either operator overloading [10, 11, 34] or source code transformation [21, 57]. Our work, too, relies on automatic differentiation, implemented as source code transformation. However, a novelty of our approach is that instead of using explicit calls or directives to denote parts of code which need to be differentiated, we rely on the type system of Go to selectively differentiate the code relevant for inference, thus combining advantages of both operator overloading and source code transformation.

3 Challenges

Incorporating a probabilistic program, or rather a probabilistic procedure, within a larger code body appears to be rather straightforward: one implements the model in the probabilistic programming language, fetches and preprocesses the data in the host programming language, passes the data and the model to an inference algorithm, and post-processes the results in the host programming language again to make algorithmic decisions based on inference outcomes. To choose the best option for each particular application, probabilistic programming languages and frameworks are customarily compared by expressive power (the classes of probabilistic models that can be represented), conciseness (e.g. the number of lines of code needed to describe a particular model), and time and space efficiency of inference algorithms [11, 35, 37, 62]. However, complex software systems make integration of probabilistic inference challenging, and considerations beyond expressiveness and performance come into play.

Simulation vs. inference. Probabilistic models often follow the design pattern of simulation-inference: a significant part of the model is a simulator running an algorithm with given parameters; the optimal parameters, or their distribution, are to be inferred. The inferred parameters are then used by the software system to execute the simulation independently of inference, for prediction and decision making. This pattern suggests re-use of the simulator: instead of implementing the simulator twice, in the probabilistic model and in the host environment, one can invoke the same code for both purposes. However, to achieve this, the host language must coincide with the implementation language of the probabilistic model, on one hand, and allow a computationally efficient implementation of the simulation, on the other hand. Some probabilistic programming frameworks (Figaro [36], Anglican [51], Turing [11]) are built with tight integration with the host environment in mind; more often than not, though, the probabilistic code is not trivial to re-use.

Data interface. The data for inference may come from a variety of sources — networks, databases, distributed file systems — and in many different formats. Efficient inference depends on fast data access and updating. Libraries for data access and manipulation are available in the host environment. While the host environment can be used as a proxy retrieving and transforming the data, such as in the case of Stan [10] integrations, sometimes direct access from the probabilistic code is the preferred option, for example when the data is streamed or retrieved conditionally, as in active learning [49].

A flexible data interface is also beneficial for support of mini-batch optimization [24]. Mini-batch optimization is used when the data set is too big to fit in memory, and the objective function gradient is estimated based on mini-batches — small portions of data. Different models and inference objectives may require different mini-batch loading schemes. For best performance of mini-batch optimization, it should be possible to program loading of mini-batches on a case-by-case basis.
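To make the last point concrete, below is a hypothetical sketch of a model whose log-likelihood method conditions on one mini-batch per call; the NextBatch loader, and the Observe method and Normal distribution helpers it anticipates, follow the Infergo interface introduced in Section 5 — the names are illustrative assumptions, not code from this paper.

    // MiniBatchModel draws the next mini-batch from an application-supplied
    // loader (a database cursor, a file reader, ...) on every Observe call,
    // so the optimizer sees a mini-batch estimate of the gradient.
    type MiniBatchModel struct {
        NextBatch func(n int) []float64 // returns the next n observations
        N         int                   // mini-batch size
    }

    func (m *MiniBatchModel) Observe(x []float64) float64 {
        mu, sigma := x[0], math.Exp(x[1])
        ll := Normal.Logps(0, 1, x...) // prior on the parameters
        for _, y := range m.NextBatch(m.N) {
            ll += Normal.Logp(mu, sigma, y)
        }
        return ll
    }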

Integration and deployment. Deployment of software systems is a delicate process involving automatic builds and maintenance of dependencies. Adding a component which possibly introduces additional software dependencies, or even a separate runtime, complicates deployment. Minimizing the burden of probabilistic programming on the integration and deployment process should be a major consideration in the design or selection of probabilistic programming tools. Probabilistic programming frameworks that are implemented in, or provide an interface in, a popular programming language, e.g. Python (Edward [52], Pyro [7]), are easier to integrate and deploy; however, the smaller the footprint of the probabilistic programming framework, the easier the adoption. An obstacle to integration of probabilistic programming also lies in the adoption of the probabilistic programming language [27], if the latter differs from the implementation language of the rest of the system.

4 Guidelines

Based on the experience of developing and deploying solutions using different probabilistic programming environments, we came up with guidelines for the implementation of a probabilistic programming facility for server-side applications. We believe that these guidelines, when followed, ease integration of probabilistic programming inference into large-scale software systems.

1. A probabilistic model should be programmed in the host programming language. The facility may impose a discipline on model implementation, such as through interface constraints, but should otherwise support unrestricted use of the host language for implementation of the model.
2. Built-in and user-defined data structures and libraries should be accessible in the probabilistic programming model. Inference techniques relying on the code structure, such as those based on automatic differentiation, should support the use of common data structures of the host language.
3. The model code should be reusable between inference and simulation. The code which is not required solely for inference should be written once for both inference of the parameters and use of the parameters in the host environment. It should be possible to run simulation outside the probabilistic model without runtime or memory overhead imposed by inference needs.

5 Probabilistic Programming in Go

In line with the guidelines, we have implemented a probabilistic programming facility for the Go programming language, Infergo. We have chosen Go because Go is a small but expressive programming language with an efficient implementation that has recently become quite popular for computation-intensive server-side programming. Infergo is used in a production environment for inference of mission-critical algorithm parameters.

5.1 An Overview of Infergo

Infergo comprises a command-line tool that augments the source code of a model to facilitate inference, a collection of basic models (distributions) serving as building blocks for other models, and a library of inference algorithms.

A probabilistic model in Infergo is an implementation of an interface requiring a single method Observe², which accepts a vector (a Go slice) of floats, the parameters to infer, and returns a single float, interpreted as the unnormalized log-likelihood of the posterior distribution. The model object usually encapsulates the observations, either as a data field or as a reference to a data source. Implementation of model methods can be written in virtually unrestricted Go and use any Go libraries.

² The method name follows Venture [26], Anglican [51], and other probabilistic programming languages in which observe is the language construct for conditioning of the model on observations.

For inference, Infergo relies on automatic differentiation. The source code of the model is translated by the command-line tool into an equivalent model with reverse-mode automatic differentiation of the log-likelihood with respect to the parameters applied. The differentiation operates on the built-in floating-point type and incurs only a small computational overhead. However, even this overhead is avoided when the model code is executed outside inference algorithms: both the original and the differentiated model are simultaneously available to the rest of the program code, so the methods can be called on the differentiated model for inference, and on the original model for the most efficient execution with the inferred parameters.

5.2 A Basic Example

Let us illustrate the use of Infergo on a basic example serving as a probabilistic programming equivalent of the "Hello, World!" program — inferring the parameters of a unidimensional Normal distribution. In this example, the model object holds the set of observations from which the distribution parameters are to be inferred:

    type ExampleModel struct {
        Data []float64
    }

The Observe method computes the log-likelihood of the distribution parameters. The Normal distribution has two parameters: mean µ and standard deviation σ. To ease the inference, positive σ is transformed to unrestricted log σ, and the parameter vector x is [µ, log σ]. Observe first imposes a prior on the parameters, and then conditions the posterior distribution on the observations:

    // x[0] is the mean,
    // x[1] is the log stddev of the distribution
    func (m *ExampleModel) Observe(x []float64) float64 {
        // Our prior is a unit normal ...
        ll := Normal.Logps(0, 1, x...)
        // ... but the posterior is based
        // on data observations.
        ll += Normal.Logps(x[0], math.Exp(x[1]), m.Data...)
        return ll
    }

For inference, we supply the data and optionally initialize the parameter vector. Here the data is hard-coded, but in general it can be read from a file, a database, a network connection, or dynamically generated by another part of the program:

    // Data
    m := &ExampleModel{[]float64{
        -0.854, 1.067, -1.220, 0.818, -0.749,
        0.805, 1.443, 1.069, 1.426, 0.308}}

    // Parameters
    x := []float64{0, 0}

The fastest and most straightforward inference is maximum a posteriori (MAP) estimation, which can be done using a gradient descent algorithm (Adam [22] in this case):

    opt := &infer.Adam{
        Rate: 0.01,
    }
    for iter := 0; iter != 1000; iter++ {
        opt.Step(m, x)
    }

To recover the full posterior distribution of the parameters we use Hamiltonian Monte Carlo [30]:

    hmc := &infer.HMC{
        Eps: 0.1,
    }
    samples := make(chan []float64)
    hmc.Sample(m, x, samples)
    for i := 0; i != 5000; i++ {
        x = <-samples
    }
    hmc.Stop()
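Instead of merely keeping the last sample, the loop above can accumulate any summary the application needs. For example, a minimal sketch (not part of the original example) that estimates the posterior means of µ and σ from the drawn samples:

    // Estimate posterior means of the mean and the standard deviation
    // from n HMC samples; x[1] is the log stddev, as in Observe above.
    n := 5000
    var mu, sigma float64
    for i := 0; i != n; i++ {
        s := <-samples
        mu += s[0]
        sigma += math.Exp(s[1])
    }
    mu /= float64(n)
    sigma /= float64(n)
    hmc.Stop()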

Models are of course not restricted to linear code, and may comprise multiple methods containing loops, conditional statements, function and method calls, and recursion. The case studies (Section 7) provide more model examples.

5.3 Model Syntax and Rules of Differentiation

In Infergo, inference is performed on a model. Just like in Stan [10] models, latent random variables in Infergo models are continuous. Additionally, Infergo can express and perform inference on non-deterministic models by including stochastic choices in the model code.

An Infergo model is a data type and methods defined on the data type. Inference algorithms accept a model object as an argument. A model object usually encapsulates observations — the data to which the parameter values are adapted. A model is implemented in Go, virtually unrestricted³. Because of this, the rules of model specification and differentiation are quite straightforward. Nonetheless, the differentiation rules ensure that, provided the model's Observe method (including calls to other model methods) is differentiable, the gradient is properly computed [18].

³ There are insignificant implementation limitations which are gradually lifted as Infergo is being developed.

A model is defined in its own package. The model must implement the interface Model, containing a single method Observe. Observe accepts a slice of float64 — the model parameters — and returns a float64 scalar. For probabilistic inference, the returned value is interpreted as the log-likelihood of the model parameters. Inference relies on computing the gradient of the returned value with respect to the model parameters through automatic differentiation. In the model's source code:

1. Methods on the type implementing Model returning a single float64 or nothing are differentiated.
2. Within the methods, the following is differentiated:
   • assignments to float64 (including parallel assignments if all values are of type float64);
   • returns of float64;
   • standalone calls to methods on the type implementing Model (apparently called for side effects on the model).

Functions for which the gradient is provided rather than derived through automatic differentiation are called elementals [18]. Derivatives do not propagate through a function that is not an elemental or a call to a model method. If a derivative is not registered for an elemental, calling the elemental in a differentiated context will cause a run-time error.

Functions are considered elementals (and must have a registered derivative) if their signature is of the kind func (float64, float64*) float64; that is, one or more non-variadic float64 arguments and a float64 return value. For example, the function func (float64, float64, float64) float64 is considered elemental, while the functions

• func (...float64) float64
• func ([]float64) float64
• func (int, float64) float64

are not. Gradients for selected functions from the math package are pre-defined (Sqrt, Exp, Log, Pow, Sin, Cos, Tan). Auxiliary elemental functions with pre-defined gradients are provided in a separate package.

Any user-defined function of the appropriate kind can be used as an elemental inside model methods. The gradient must be registered for the function using a call to

    func RegisterElemental(
        f interface{},
        g func(value float64, params ...float64) []float64)

For example, the call

    RegisterElemental(Sigm,
        func(value float64,
            _ ...float64) []float64 {
            return []float64{value*(1-value)}
        })

registers the gradient for the function Sigm(x) ≔ 1/(1 + e⁻ˣ).

5.4 Inference

Out of the box, Infergo offers

• maximum a posteriori (MAP) estimation by stochastic gradient descent [40];
• full posterior inference by Hamiltonian Monte Carlo [19, 30].

When only a point estimate of the model parameters is required, as, for example, in the simulation-inference case, stochastic gradient descent is a fast and robust inference family. Infergo offers stochastic gradient descent with momentum and Adam [22]. The inference is performed by calling the Step method of the optimizer until a termination condition is met:

    for !<termination condition> {
        optimizer.Step(model, parameters)
    }

When the full posterior must be recovered for integration, Hamiltonian Monte Carlo (HMC) can be used instead.

Vanilla HMC as well as the No-U-Turn sampler (NUTS) [19] are provided. To support lazy any-time computation, the inference runs in a separate goroutine, connected by a channel to the caller, and asynchronously writes samples to the channel. The caller reads as many samples from the channel as needed, and further postprocesses them:

    samples := make(chan []float64)
    hmc.Sample(model, parameters, samples)
    for !<enough samples> {
        sample = <-samples
        // postprocess the sample
    }
    hmc.Stop()

Other inference schemes, most notably automatic differentiation variational inference [23], are planned for addition in future versions of Infergo. However, an inference algorithm does not have to be a part of Infergo to work with Infergo models. Since Infergo models operate on a built-in Go floating point type, float64, any third-party inference or optimization library can be used as well.

As an example, the Gonum [4] library for numerical computations contains an optimization package offering several algorithms for gradient-based optimization. A Gonum optimization algorithm requires the function to optimize, as well as the function's gradient, as parameters. An Infergo model can be trivially wrapped for use with Gonum, and Infergo provides a convenience function

    func FuncGrad(m Model) (
        Func func(x []float64) float64,
        Grad func(grad []float64, x []float64))

which accepts a model and returns the model's Observe method and its gradient, suitable for passing to Gonum optimization algorithms. Among the available algorithms are advanced versions of stochastic gradient descent, as well as algorithms of the BFGS family, in particular L-BFGS [25], providing for a more efficient MAP estimation than stochastic gradient descent at the cost of higher memory consumption:

    function, gradient := FuncGrad(model)
    problem := gonum.Problem{Func: function,
        Grad: gradient}
    result, err := gonum.Minimize(problem, parameters)

Other third-party implementations of optimization algorithms can be used just as easily, thanks to Infergo models being pure Go code operating on built-in Go data types.

6 Implementation Highlights

In our journey towards implementation and deployment of Infergo, we considered options and made implementation decisions which we believe to have been crucial for addressing the challenges and following the guidelines we formulated for implementation of a deployable probabilistic programming facility. While different choices may be as well suited for the purpose as the ones we made, we feel it is important to share our contemplations and justify decisions, to which the rest of this section is dedicated, from choosing the implementation language, through details of automatic differentiation, to support for inference on streaming and stochastic observations.

6.1 Choice of Programming Language

Python [39], Scala [31], and Julia [6] have been popular choices for development of probabilistic programming languages and frameworks integrated with general-purpose programming languages [7, 9, 11, 21, 36, 41, 57]. Each of these languages has advantages for implementing a probabilistic programming facility. However, for a reference implementation of a facility best suited for probabilistic inference in a production software system, we were looking for a language that

• compiles into fast and compact code;
• does not incur dependency on a heavy runtime; and
• is a popular choice for building server-side software.

In addition, since we were going to automatically differentiate the language's source code, we were looking for a relatively small language, preferably one for which parsing and program analysis tools are readily available.

We chose Go [50]. Go is a small but expressive programming language widely used for implementing computationally intensive server-side software systems. The standard libraries and development environment offer capabilities which made implementation of Infergo affordable and streamlined:

• The Go parser and abstract syntax tree serializer are a part of the standard library. Parsing, transforming, and generating Go source code is straightforward and effortless.
• Type inference (or type checking, as it is called in the Go ecosystem), also provided in the standard library, augments parsing and allows transformation-based automatic differentiation to be applied selectively, based on static expression types.
• Go provides reflection capabilities for almost any aspect of the running program, without compromising efficiency. Reflection gives the flexibility highly desirable to implement features such as calling Go library functions from the model code.
• Go compiles and runs fast. Fast compilation and execution speeds allow the same facility to be used both for exploratory design of probabilistic models and for inference in the production environment.
• Go is available on many platforms and for a wide range of operating systems and environments, from portable devices to supercomputers. Go programs can be cross-compiled and reliably developed and deployed in heterogeneous environments.
• Go offers efficient parallel execution as a first-class feature, via so-called goroutines. Goroutines streamline implementation of sampling-based inference algorithms. Sample generators and consumers are run in parallel, communicating through channels. Inference is easy to parallelize in order to exploit hardware multi-processing, and samples are retrieved lazily for postprocessing.

Considering and choosing Go for development of what later became Infergo also affected the shape of the software system in which Infergo was deployed first. Formerly implemented mostly in Python — both enjoying a wide choice of libraries for machine learning and data processing and suffering from the limitations of Python as a language for implementation of computation-intensive algorithms — the system has been gradually drifting towards adopting Go, with performance-critical parts re-implemented and running faster due to better hardware and network utilization, and with simpler and more maintainable code for data processing, analysis, and decision making.

6.2 Automatic Differentiation

Infergo employs reverse-mode automatic differentiation via source code transformation [18]. The reverse mode is the default choice in machine learning applications because it works best when differentiating a scalar return value with respect to multiple parameters. Automatic differentiation is usually implemented using either operator overloading or source code transformation. Operator overloading is often the method of choice [10, 11, 34] due to its relative simplicity, but the differentiated code must use a special type (on which the operators are overloaded) for representing floating point numbers. Regardless of the issue of computational overhead, this restricts the language used to specify the model to operations on that special type and impairs interoperability: library functions cannot be called directly but have to go through an adaptor, or a 'lifted' version of the standard library must be supplied. Similarly, the data must be cast into the special type before it is passed to the model for inference.

Reverse-mode source code transformation. Source code transformation allows program code operating on the built-in floating point type to be differentiated automatically. This method involves generation of new code, for the function gradient or for the function itself, ahead of time or on the fly. To achieve the best performance, some implementations generate static code for computing the function gradient [20, 57]. This requires rather elaborate source code analysis, and although significant progress has been made in this direction, not all language constructs are supported. Alternatively, the code of the function itself can be transformed by augmenting arithmetic operations and function calls on the floating point type with function calls that record the computations on a tape, just as in the case of operator overloading. Then, the backward pass is identical to that of operator overloading. This method allows the use of the built-in floating point type along with a virtually unrestricted language for model specification, at the cost of a slight performance loss.

This latter method is used in Infergo: the model is written in Go and is placed in a separate Go package. A command-line tool is run on the package to generate a sub-package containing a differentiated version of the model. Both the original and the differentiated model are available simultaneously in the calling code, so that model methods that do not require computation of gradients can be invoked without the overhead of automatic differentiation. However, the performance loss is rarely a major issue — differentiated versions of model methods are only 2–4 times slower on models from the case studies (Section 7), and the backward pass, which computes the gradient, is roughly as fast as the original, undifferentiated version. At least partially, the fast backward pass is due to careful implementation and thorough profiling of the tape.

Automatic differentiation of models. It may seem desirable to implement or use an automatic differentiation library that is more general than needed for probabilistic programming. However, the use of such a library, in particular with source code transformation, may make the implementation restricted and less efficient: first, if a model comprises several functions, all functions must have a signature suitable for differentiation (that is, accept and return floating point scalars or vectors); second, the gradient of every nested call must be fully computed first and then used to compute the gradient of the caller. In Infergo, only the gradient of Observe is needed, and any other differentiated model method can only be called from Observe or another differentiated method. Consequently, only the gradient of Observe is computed explicitly, and other differentiated methods may have arbitrary arguments. In particular, derivatives may propagate through fields of the model object and structured arguments.
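To make the tape idea described above concrete, here is a small self-contained sketch of reverse-mode accumulation on a tape. It is a generic illustration only, using closures to record propagation steps; it is not Infergo's generated code or API.

    package main

    import (
        "fmt"
        "math"
    )

    // node holds one intermediate value and its accumulated adjoint.
    type node struct {
        value, adjoint float64
    }

    // tape records, for each operation, a closure that propagates
    // adjoints backwards during the backward pass.
    type tape struct {
        steps []func()
    }

    func lift(v float64) *node { return &node{value: v} }

    // mul records c = a*b together with the local adjoint propagation.
    func (t *tape) mul(a, b *node) *node {
        c := lift(a.value * b.value)
        t.steps = append(t.steps, func() {
            a.adjoint += b.value * c.adjoint
            b.adjoint += a.value * c.adjoint
        })
        return c
    }

    // log records c = log(a).
    func (t *tape) log(a *node) *node {
        c := lift(math.Log(a.value))
        t.steps = append(t.steps, func() {
            a.adjoint += c.adjoint / a.value
        })
        return c
    }

    // backward seeds the output adjoint and replays the tape in reverse.
    func (t *tape) backward(out *node) {
        out.adjoint = 1
        for i := len(t.steps) - 1; i >= 0; i-- {
            t.steps[i]()
        }
    }

    func main() {
        t := &tape{}
        x, y := lift(2), lift(3)
        z := t.log(t.mul(x, y)) // z = log(x*y)
        t.backward(z)
        fmt.Println(x.adjoint, y.adjoint) // dz/dx = 1/x = 0.5, dz/dy = 1/y ≈ 0.333
    }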

6.3 Model Composition

Just as it is possible to compose library and user-defined functions into more complex functions, it should be possible to compose models to build other models [43, 52]. Infergo supports model composition organically: a differentiated model method can be called from another differentiated method, not necessarily a method of the same model object or type. Derivatives can propagate through methods of different models without restriction. Each model is a building block for another model. For example, if there are two independent single-parameter models A and B describing the data, then the product model AB can be obtained by composition of A and B (Figure 1). Combinators such as product, mixture, or hierarchy can be realized as functions operating on Infergo models.

    type A struct {Data []float64}
    func (model A) Observe(x []float64) float64 {
        ...
    }

    type B struct {Data []float64}
    func (model B) Observe(x []float64) float64 {
        ...
    }

    type AB struct {Data []float64}
    func (model AB) Observe(x []float64) float64 {
        return A{model.Data}.Observe(x[:1]) +
            B{model.Data}.Observe(x[1:])
    }

Figure 1. Model composition. Model AB is the product of A and B.
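Figure 1 composes two concrete models; the same idea extends to a generic combinator. The sketch below is a hypothetical helper, not part of Infergo: it relies only on the Model interface described in Section 5.3 and on the fact that the joint log-likelihood of independent models is the sum of their log-likelihoods.

    // Product is a hypothetical product combinator over two models.
    // NA is the number of parameters consumed by the first model;
    // the remaining parameters are passed to the second one.
    type Product struct {
        A, B Model
        NA   int
    }

    // Observe adds up the log-likelihoods of the two sub-models.
    func (m Product) Observe(x []float64) float64 {
        return m.A.Observe(x[:m.NA]) + m.B.Observe(x[m.NA:])
    }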

Distributions. A probabilistic programming facility should provide a library of distributions for definition and conditioning of random variables. To use a distribution in a differentiable model, the distribution's log probability density function must be differentiable both by the distribution's parameters and by the value of the random variable. Distributions are usually supplied as a part of the probabilistic programming framework [10, 11, 51], and, if at all possible, adding a user-defined distribution requires programming at a lower level of abstraction than that used for specifying a probabilistic model [1, 41, 51].

In Infergo, distributions are just models, and adding and using a custom distribution is indistinguishable from adding a model and calling the model's method from another model. By convention, library distributions, in addition to Observe, also have methods Logp and Logps for the scalar and vector versions of the log probability density function, for convenience of use in models. The distribution library in Infergo is just a model package which is automatically differentiated like any other model. New distributions can be added in any package. Figure 2 shows the Infergo definition of the exponential distribution.

    // Exponential distribution.
    type expon struct{}

    // Exponential distribution, singleton instance.
    var Expon expon

    // Observe implements the Model interface. The
    // parameter vector is lambda, observations.
    func (dist expon) Observe(x []float64) float64 {
        lambda, y := x[0], x[1:]
        if len(y) == 1 {
            return dist.Logp(lambda, y[0])
        } else {
            return dist.Logps(lambda, y...)
        }
    }

    // Logp computes the log pdf of a single
    // observation.
    func (_ expon) Logp(lambda float64, y float64) float64 {
        logl := math.Log(lambda)
        return logl - lambda*y
    }

    // Logps computes the log pdf of a vector
    // of observations.
    func (_ expon) Logps(lambda float64, y ...float64) float64 {
        ll := 0.
        logl := math.Log(lambda)
        for i := range y {
            ll += logl - lambda*y[i]
        }
        return ll
    }

Figure 2. The exponential distribution in Infergo.

6.4 Stochasticity and Streaming

Perhaps somewhat surprisingly, most probabilistic programs are semantically deterministic [47] — probabilistic programs define distributions, and although they manipulate random variables, there is no randomization or non-determinism per se. Randomization arises in certain inference algorithms (notably of the Monte Carlo family) for tractable approximate inference, but this does not introduce non-determinism into the programs themselves. Non-deterministic probabilistic programs arise in applications. For example, van de Meent et al. [53] explored using probabilistic programming for policy search in non-deterministic domains. To represent non-determinism, appropriate constructs had to be added to the probabilistic programming language used in the study.

In Infergo, non-determinism can be easily introduced through data. In the basic example of Section 5.2, the data is encapsulated in the model as a vector, populated before inference. However, this is not at all required by Infergo. Instead of being stored in a Go slice, the data can be read from a Go channel or a network connection and generated by another goroutine or process. Figure 3 shows a modified model from the basic example, with the data gradually read from a channel rather than passed as a whole in a field of the model object.

    type StreamingModel struct {
        Data chan float64 // data is a channel
        N    int          // batch size
    }

    func (m *StreamingModel) Observe(x []float64) float64 {
        ll := Normal.Logps(0, 1, x...)
        // observe a batch of data from the channel
        for i := 0; i != m.N; i++ {
            ll += Normal.Logp(x[0], math.Exp(x[1]),
                <-m.Data)
        }
        return ll
    }

Figure 3. The basic example with the data read from a channel.

A similar approach can be applied to streaming data: the model may read the data gradually as it arrives, and the inference algorithm will emit samples from the time process inferred from the data. Work is still underway on algorithms ensuring robust and sound inference on streams, but Infergo models connected to data sources via channels are already used in practice.

7 Case Studies

In this section, we present three case studies of probabilistic programs, each addressing a different aspect of Infergo. Eight schools (Section 7.1) is a popular example we use to compare model specification in Stan and Infergo. Linear regression (Section 7.2), formulated as a probabilistic program, lets us explore a model comprising multiple methods and illustrates the simulation-inference pattern. The Gaussian mixture model (Section 7.3) is the most complex model of the three, featuring nested loops and conditionals.

By no means do the models presented in this section reflect the complexity or the size of real-world Infergo applications, the latter reaching hundreds of lines of code in length and performing inference on tens of thousands of data entries. Rather, their purpose is to provide a general picture of probabilistic programming with Infergo.

7.1 Eight Schools

The eight schools problem [12] is an application of the hierarchical normal model that often serves as an introductory example of Bayesian statistical modeling. Without going into details, in this problem the effect of coaching programs on test outcomes is compared among eight schools. Figure 4 shows specifications of the same model in Stan and Infergo, side by side.

Stan models are written in a probabilistic programming language tailored to statistical modeling. Infergo models are implemented in Go. One can see that the models have roughly the same length and structure. The definition of the model type in Go (lines 1–5) corresponds to the data section in Stan (lines 1–5). Destructuring of the parameters of Observe in Go (lines 9–11) names the parameters just like the parameters section in Stan (lines 8–10). Introduction of theta in Go (line 16) corresponds to the transformed parameters section in Stan (lines 13–16). The only essential difference is the explicit loop over the data in Go (lines 15–18) versus vectorized computations in Stan (lines 15 and 20). However, vectorized computations are used in Stan to speed up execution more than to improve readability. If repeated calls to Normal.Logp (line 17) become a bottleneck, an optimized vectorized version can be implemented in Go in the same package. While subjective, our feeling is that the Go version is as readable as the Stan one. In addition, a Go programmer is less likely to resist adoption of statistical modeling if the models are implemented in Go.

a. Stan

    1  data {
    2    int J;
    3    vector[J] y;
    4    vector[J] sigma;
    5  }
    6
    7  parameters {
    8    real mu;
    9    real tau;
    10   vector[J] eta;
    11 }
    12
    13 transformed parameters {
    14   vector[J] theta;
    15   theta = mu + tau * eta;
    16 }
    17
    18 model {
    19   eta ~ normal(0, 1);
    20   y ~ normal(theta, sigma);
    21 }

b. Infergo

    1  type SchoolsModel struct {
    2    J     int
    3    Y     []float64
    4    Sigma []float64
    5  }
    6
    7
    8  func (m *SchoolsModel) Observe(x []float64) float64 {
    9    mu := x[0]
    10   tau := math.Exp(x[1])
    11   eta := x[2:]
    12
    13   ll := Normal.Logps(0, 1, eta...)
    14
    15   for i, y := range m.Y {
    16     theta := mu + tau*eta[i]
    17     ll += Normal.Logp(theta, m.Sigma[i], y)
    18   }
    19
    20   return ll
    21 }

Figure 4. Eight schools: Stan vs. Infergo. The Go implementation has a similar length and structure to the Stan model.

7.2 Linear Regression

The model in Figure 5 specifies linear regression on unidimensional data: an improper uniform prior is placed on the parameters α, β, and log σ (lines 9–10), which are then conditioned on observations x, y such that

    y ∼ Normal(α + βx, σ)    (1)

(lines 12–17). If only the maximum a posteriori is estimated, the parameter values maximize their likelihood given the observations:

    (α, β, σ)_MAP = argmax_{α,β,σ} ∏_i pdf_Normal(y_i; α + βx_i, σ)    (2)

    1  type LRModel struct {
    2      Data [][]float64
    3  }
    4
    5  func (m *LRModel) Observe(x []float64)
    6      float64 {
    7      ll := 0.
    8
    9      alpha, beta := x[0], x[1]
    10     sigma := math.Exp(x[2])
    11
    12     for i := range m.Data {
    13         ll += Normal.Logp(
    14             m.Simulate(m.Data[i][0], alpha, beta),
    15             sigma,
    16             m.Data[i][1])
    17     }
    18     return ll
    19 }
    20
    21 // Simulate simulates y based on x and
    22 // regression parameters.
    23 func (m *LRModel) Simulate(x, alpha, beta float64)
    24     float64 {
    25     y := alpha + beta*x
    26     return y
    27 }

Figure 5. Linear regression: the simulator is re-used in both inference and prediction.

Once the values of α and β are estimated, they are used to predict y for unseen values of x. Hence, the linear transformation α + βx is computed in two different contexts: first, to compute the likelihood of the parameter values; then, to predict y based on a given x. This is an example of the simulation-inference pattern (Section 3): the method Simulate simulates y based on x and is invoked both from Observe and then directly for prediction. Since both the original and the differentiated model are simultaneously available in the source code, Simulate can be called for prediction on the original model, without any differentiation overhead.

In this greatly simplified case study, the simulator is a one-line function. In a real-world scenario, though, the simulator can be laborious to re-implement correctly and efficiently; for example, a simulator that executes a parameterized policy in an uncertain or dynamic domain and depends on both parameter values and observations [53]. The ability to re-use a single implementation both as a part of the probabilistic program, where the simulator must be augmented for inference (e.g. by automatic differentiation), and for prediction or decision making, where the simulator may be executed repeatedly and under time pressure, saves the effort otherwise wasted on rewriting the simulator and removes the need to maintain consistency between different implementations.
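As a usage note, prediction on new inputs simply calls Simulate on the original, undifferentiated model. A short sketch, assuming x holds the MAP estimate obtained as in Section 5.2 and xNew is a new input (both names are illustrative):

    // x = [alpha, beta, log sigma] after MAP estimation.
    alpha, beta := x[0], x[1]
    yPredicted := m.Simulate(xNew, alpha, beta) // no differentiation overhead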

7.3 Gaussian Mixture Model

The Gaussian mixture model is a probabilistic model of a mixture of components, where each component comes from a Gaussian distribution. The model variant in Figure 6 accepts a vector of single-dimensional observations and the number of components, and infers the distribution of the parameters of each component in the mixture. Following the Stan User's Guide [45], component memberships are summed out to make the model differentiable. The model contains most of the steps commonly found in Infergo models. First, a prior on the model parameters is imposed (lines 8–9). Then, the parameter vector is destructured into a form convenient for formulation of conditioning (lines 11–17). Finally, the parameters are conditioned on the observations (lines 19–34).

The conditioning of parameters involves a nested loop over all observations and, for each observation, over all mixture components. Computing the log-likelihood of the model parameters implies summing up likelihoods over components for each observation. In the log domain, a trick called 'log-sum-exp' is required to avoid loss of precision. The trick is implemented as the function logSumExp and called for each observation and component (line 28).

logSumExp could have been implemented as a model method and differentiated automatically. However, this function is called on each iteration of the inner loop. Hence, efficiency of inference in general is affected by an efficient implementation of logSumExp (this can also be confirmed by profiling). To save on the cost of a differentiated call and speed up gradient computation, logSumExp is implemented as an elemental (lines 36–43). The hand-coded gradient is supplied for logSumExp (lines 45–54) to make automatic differentiation work.

    1  type GMModel struct {
    2      Data  []float64 // observations
    3      NComp int       // number of components
    4  }
    5
    6  func (m *GMModel) Observe(x []float64) float64 {
    7      ll := 0.0
    8      // Impose a prior on component parameters
    9      ll += Normal.Logps(0, 1, x...)
    10
    11     // Fetch the parameters
    12     mu := make([]float64, m.NComp)
    13     sigma := make([]float64, m.NComp)
    14     for j := 0; j != m.NComp; j++ {
    15         mu[j] = x[2*j]
    16         sigma[j] = math.Exp(x[2*j+1])
    17     }
    18
    19     // Compute log likelihood of the mixture
    20     for i := 0; i != len(m.Data); i++ {
    21         var l float64
    22         for j := 0; j != m.NComp; j++ {
    23             lj := Normal.Logp(mu[j], sigma[j],
    24                 m.Data[i])
    25             if j == 0 {
    26                 l = lj
    27             } else {
    28                 l = logSumExp(l, lj)
    29             }
    30         }
    31         ll += l
    32     }
    33     return ll
    34 }
    35
    36 // logSumExp computes log(exp(x)+exp(y)) robustly.
    37 func logSumExp(x, y float64) float64 {
    38     z := x
    39     if y > z {
    40         z = y
    41     }
    42     return z + math.Log(math.Exp(x-z)+math.Exp(y-z))
    43 }
    44
    45 // logSumExp gradient must be supplied.
    46 func init() {
    47     ad.RegisterElemental(logSumExp,
    48         func(_ float64, params ...float64)
    49             []float64 {
    50             z := math.Exp(params[1] - params[0])
    51             t := 1 / (1 + z)
    52             return []float64{t, t * z}
    53         })
    54 }

Figure 6. Gaussian mixture model: an Infergo model with dynamic control structure and a user-defined elemental.

While still a very simple model compared to real-world Infergo models, this case study illustrates common parts and coding patterns of an elaborated model. The model code may be quite complicated algorithmically and may manipulate Go data structures freely. When efficiency becomes an issue, parts of the model can be implemented at a lower level of abstraction, but nonetheless in the same programming language as the rest of the model.

8 Performance

It is customary to include performance evaluation on benchmark problems in publications on new probabilistic programming frameworks [11, 32, 51, 62]. We provide here a comparison of running times of Infergo and two other probabilistic programming frameworks: Turing [11] and Stan [10]. Turing is a Julia [6] library and DSL offering exploratory statistical modeling and inference composition. Stan is a compiled language and a library optimized for high performance.

We use three models of different size and structure for the comparison: two of the models, Eight schools and the Gaussian mixture model, were introduced as a part of the case studies (Section 7); the code for the third model, Latent Dirichlet allocation [8], is included in Appendix A. The code and data for all models were originally included in Stan examples and translated for use with Turing and Infergo.

                                   compilation time, s      execution time, s
    model                         Infergo  Turing  Stan    Infergo  Turing  Stan
    Eight schools                    0.50       -    50       0.60     2.8  0.12
    Gaussian mixture model           0.50       -    54       32       14   4.9
    Latent Dirichlet allocation      0.50       -    54       8.9      12   3.7

Table 1. Compilation and execution times for 1000 iterations of HMC with 10 leapfrog steps.

Eight schools is the smallest model; one may expect that most of the execution time is spent executing the boilerplate code rather than the model. The Gaussian mixture model is a more elaborated model which benefits greatly from vectorization. The greater the number of observations (1000 data points were used in the evaluation), the more significant is the benefit of vectorized computations. Latent Dirichlet allocation is, again, a relatively elaborated model, which is, however, not as easy to vectorize.

We report both compilation time (for Stan and Infergo) and execution time. Compilation time is an important characteristic of a probabilistic programming framework — statistical modeling often involves many cycles of model modifications and numerical experiments on subsampled data. The ability to modify and instantly re-run a model is crucial for exploratory data analysis and model design. Turing is based on Julia, which is a dynamic language with a JIT compiler, so there is no separate compilation time. For Stan, the compilation time includes compiling the Stan model to C++, and then compiling the C++ source into executable code for inference. For Infergo, the compilation time consists of automatic differentiation of the model, and then building a command-line executable that calls an inference algorithm on the model.

Table 1 shows running time measurements on the models. The measurements were obtained on a 1.25GHz Intel(R) Core(TM) i5 CPU with 8GB of memory, for 1000 iterations of Hamiltonian Monte Carlo with 10 leapfrog steps, and averaged over 10 runs. In execution, Infergo is slower than Stan, which comes as little surprise — Stan is a mature, highly optimized inference and optimization library. However, Infergo and Turing show similar execution times, despite Turing relying on Julia's numerical libraries. The slowest relative execution time is on the Gaussian mixture model, where Stan is 6–7 times faster, apparently because the Infergo model is not vectorized; the model could be made more efficient at the cost of verbosity and poorer readability, but we opted to use idiomatic Infergo models for the comparison. On the Latent Dirichlet allocation model, run on Stan example data with 25 documents and 262 word instances, Infergo is only 2.5 times slower than Stan, and faster than Turing — an illustration of the benefits of transformation-based differentiation and of the efficiency of Go.

Infergo programs compile very fast, in particular in comparison with Stan models. Automatic differentiation and code generation combined take a fraction of a second. Compilation time of Stan models is dominated by compilation of the generated C++ code, which takes close to a minute on our hardware. The ability to modify a model and then re-run inference almost instantly, and with reasonable performance, helps in exploratory model development with Infergo, which was conceived with server-side, unattended execution in mind, but fits the exploration and development stage just as well.

9 Conclusion

Building a production software system which employs probabilistic inference as an integral part of prediction and decision making algorithms is challenging. To address the challenges, we proposed guidelines for implementation of a deployable probabilistic programming facility. Based on the guidelines, we introduced Infergo, a probabilistic programming facility in Go, as a reference implementation in accordance with the guidelines. Infergo allows inference models to be programmed in virtually unrestricted Go, and inference to be performed on the models using either supplied or third-party algorithms. We demonstrated on case studies how Infergo overcomes the challenges by following the guidelines.

A probabilistic programming facility similar to Infergo can be added to most modern general-purpose programming languages, in particular those used for implementation of large-scale software systems, making probabilistic programming inference more accessible in large-scale server-side applications. We described Infergo implementation choices and techniques with the intent of encouraging broader adoption of probabilistic programming in production software systems and easing implementation of similar facilities in other languages and software environments.

Infergo is an ongoing project. Development of Infergo is influenced both by feedback from the open-source Go community and by the needs of the software system in which Infergo was initially deployed, and which still supplies the most demanding use cases in terms of flexibility, scalability, and ease of integration and maintenance. Deploying probabilistic programming as a part of a production software system has the mutual benefit of improving the core algorithms of the system and of helping to shape and improve probabilistic modeling and inference, thus further advancing the field of probabilistic programming.

References

[1] Jeffrey Annis, Brent J. Miller, and Thomas J. Palmeri. 2017. Bayesian inference with Stan: A tutorial on adding custom distributions. Behavior Research Methods 49, 3 (01 Jun 2017), 863–886.
[2] Andrew W. Appel. 2007. Compiling with Continuations. Cambridge University Press, New York, NY, USA.
[3] A. W. Appel and T. Jim. 1989. Continuation-passing, Closure-passing Style. In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '89). ACM, New York, NY, USA, 293–302.
[4] Gonum authors. 2017. Gonum numerical packages. http://gonum.org/.
[5] Atılım Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: A Survey. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 5595–5637.
[6] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2014. Julia: A Fresh Approach to Numerical Computing. CoRR abs/1411.1607 (2014). arXiv:1411.1607
[7] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2019. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research 20, 28 (2019), 1–6.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2002. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2002), 2003.
[9] Avi Bryant. 2018. Rainier. https://github.com/stripe/rainier.
[10] Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software, Articles 76, 1 (2017), 1–32.
[11] Hong Ge, Kai Xu, and Zoubin Ghahramani. 2018. Turing: Composable inference for probabilistic programming. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain. 1682–1690.
[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. 2013. Bayesian Data Analysis, Third Edition. Taylor & Francis.
[13] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: a language for generative models. In Proceedings of Uncertainty in Artificial Intelligence.
[14] N. D. Goodman and A. Stuhlmüller. 2014. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org/ electronic; retrieved 2019/3/29.
[15] Noah D. Goodman, Joshua B. Tenenbaum, and The ProbMods Contributors. 2016. Probabilistic Models of Cognition. http://probmods.org/v2. Accessed: 2019-4-29.
[16] Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic Programming. In International Conference on Software Engineering (ICSE, FOSE track).
[17] Maria I. Gorinova, Andrew D. Gordon, and Charles Sutton. 2019. Probabilistic Programming with Densities in SlicStan: Efficient, Flexible, and Deterministic. Proceedings of the ACM on Programming Languages 3, POPL, Article 35 (Jan. 2019), 30 pages.
[18] Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (second ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
[19] Matthew D. Hoffman and Andrew Gelman. 2011. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. arXiv:1111.4246
[20] Michael Innes. 2018. Don't Unroll Adjoint: Differentiating SSA-Form Programs. arXiv:1810.07951
[21] Michael Innes, Elliot Saba, Keno Fischer, Dhairya Gandhi, Marco Concetto Rudilosso, Neethu Mariya Joy, Tejan Karmali, Avik Pal, and Viral Shah. 2018. Fashionable Modelling with Flux. arXiv:1811.01457
[22] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference for Learning Representations, San Diego. arXiv:1412.6980
[23] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2017. Automatic Differentiation Variational Inference. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 430–474.
[24] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. 2014. Efficient Mini-batch Training for Stochastic Optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 661–670.
[25] Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1 (01 Aug 1989), 503–528.
[26] Vikash K. Mansinghka, Daniel Selsam, and Yura N. Perov. 2014. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv:1404.0099
[27] Leo A. Meyerovich and Ariel S. Rabkin. 2012. Socio-PLT: Principles for Programming Language Adoption. In Proceedings of the ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2012). ACM, New York, NY, USA, 39–54.
[28] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. 2007. BLOG: Probabilistic Models with Unknown Objects. In Statistical Relational Learning, Lise Getoor and Ben Taskar (Eds.). MIT Press.
[29] T. Minka, J. Winn, J. Guiver, and D. Knowles. 2010. Infer.NET 2.4, Microsoft Research Cambridge.
[30] Radford M. Neal. 2012. MCMC using Hamiltonian dynamics. Published as Chapter 5 of the Handbook of Markov Chain Monte Carlo, 2011. arXiv:1206.1901
[31] Martin Odersky. 2006. The Scala Experiment: Can We Provide Better Language Support for Component Systems?. In Conference Record of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '06). ACM, New York, NY, USA, 166–167.
[32] Brooks Paige and Frank Wood. 2014. A compilation target for probabilistic programming languages. In Proceedings of The 31st International Conference on Machine Learning. 1935–1943.
[33] B. Paige, F. Wood, A. Doucet, and Y. W. Teh. 2014. Asynchronous Anytime Sequential Monte Carlo. In Advances in Neural Information Processing Systems.
[34] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
[35] Yura N. Perov. 2016. Applications of Probabilistic Programming (Master's thesis, 2015). CoRR abs/1606.00075 (2016).
[36] Avi Pfeffer. 2009. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report 137 (2009), 96.
[37] Tom Rainforth. 2017. Automating Inference, Learning, and Design using Probabilistic Programming. Ph.D. Dissertation.
[38] Tom Rainforth, Christian A. Naesseth, Fredrik Lindsten, Brooks Paige, Jan-Willem van de Meent, Arnaud Doucet, and Frank Wood. 2016. Interacting Particle Markov Chain Monte Carlo. In Proceedings of the 33rd International Conference on Machine Learning (JMLR: W&CP), Vol. 48.

33rd International Conference on Machine Learning (JMLR: W&CP), [56] Bart van Merrienboer, Olivier Breuleux, Arnaud Bergeron, and Pas- Vol. 48. cal Lamblin. 2018. Automatic differentiation in ML: Where we are [39] Guido Rossum. 1995. Python Reference Manual. Technical Report. and where we should be going. In Advances in Neural Information Amsterdam, The Netherlands, The Netherlands. Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grau- [40] Sebastian Ruder. 2016. An overview of gradient descent optimization man, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., algorithms. arXiv:1609.04747 8757–8767. [41] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. [57] Bart van Merrienboer, Dan Moldovan, and Alexander Wiltschko. 2018. Probabilistic programming in Python using PyMC3. PeerJ Computer Tangent: Automatic differentiation using source-code transformation Science 2 (apr 2016), e55. for dynamically typed array programming. In Advances in Neural In- [42] Adam Scibior, Zoubin Ghahramani, and Andrew D. Gordon. 2015. formation Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, Practical probabilistic programming with monads. In Proceedings of K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Asso- the 8th ACM SIGPLAN Symposium on Haskell, Haskell 2015, Vancou- ciates, Inc., 6256–6265. ver, BC, Canada, September 3-4, 2015. 165–176. [58] Wikipedia. 2019. Probabilistic programming language — Wikipedia, [43] Eli Sennesh, Adam Scibior, Hao Wu, and Jan-Willem van de The Free Encyclopedia. [Online; accessed 16-April-2019]. Meent. 2018. Composing Modeling and Inference Operations with [59] David Wingate, Andreas Stuhlmüller, and Noah D. Goodman. 2011. Probabilistic Program Combinators. CoRR abs/1811.05965 (2018). Lightweight Implementations of Probabilistic Programming Lan- arXiv:1811.05965 hp://arxiv.org/abs/1811.05965 guages Via Transformational Compilation. In Proceedings of the 14th [44] Stan Development Team. 2014. Stan: A C++ Library for Probability Artificial Intelligence and Statistics. and Sampling, Version 2.4. (2014). [60] David Wingate and Theophane Weber. 2013. Automated variational [45] Stan Development Team. 2018. Stan Modeling Language User’s Guide inference in probabilistic programming. arXiv:1301.1299 and Reference Manual, Version 2.18.0. hp://mc-stan.org/ [61] John Winn, Christopher M. Bishop, Thomas Diethe, and Yordan Za- [46] S. Staton, H. Yang, C. Heunen, O. Kammar, and F. Wood. 2016. Se- ykov. 2019. Model Based Machine Learning. hp://mbmlbook.com/ mantics for probabilistic programming: higher-order functions, con- electronic; retrieved 2019/4/21. tinuous distributions, and soft constraints. In Thirty-First Annual [62] Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. 2014. ACM/IEEE Symposium on Logic In Computer Science. A New Approach to Probabilistic Programming Inference. In Artificial [47] Sam Staton, Hongseok Yang, Frank Wood, Chris Heunen, and Ohad Intelligence and Statistics. Kammar. 2016. Semantics for Probabilistic Programming: Higher- [63] Lingfeng Yang, Pat Hanrahan, and Noah D Goodman. 2014. Gener- order Functions, Continuous Distributions, and Soft Constraints. In ating Efficient MCMC Kernels from Probabilistic Programs. In Pro- Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in ceedings of the Seventeenth International Conference on Artificial Intel- Computer Science (LICS ’16). ACM, New York, NY, USA, 525–534. ligence and Statistics. 1068–1076. 
hps://doi.org/10.1145/2933575.2935313 [48] Andreas Stefik and Stefan Hanenberg. 2014. The Programming Lan- guage Wars: Questions and Responsibilities for the Programming Lan- guage Community. In Proceedings of the 2014 ACM International Sym- posium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward! 2014). ACM, New York, NY, USA, 283–299. [49] Q. Sun, A. Laddha, and D. Batra. 2015. Active learning for structured probabilistic models with histogram approximation. In IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR). 3612–3621. [50] The Go team. 2009. The Go programming language. hp://golang.org/. [51] David Tolpin, Jan-Willem van de Meent, Hongseok Yang, and Frank Wood. 2016. Design and Implementation of Probabilistic Program- ming Language Anglican. In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Lan- guages (IFL 2016). ACM, New York, NY, USA, Article 6, 12 pages. [52] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep probabilistic program- ming. In International Conference on Learning Representations. [53] Jan-Willem van de Meent, Brooks Paige, David Tolpin, and Frank Wood. 2016. Black-Box Policy Search with Probabilistic Programs. In Proceedings of the 19th International Conference on Artificial Intelli- gence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016. 1195– 1204. [54] Jan-Willem van de Meent, Brooks Paige, Hongseok Yang, and Frank Wood. 2018. An Introduction to Probabilistic Programming. arXiv:1809.10756 [55] Jan-Willem van de Meent, Hongseok Yang, Vikash Mansinghka, and Frank Wood. 2015. Particle Gibbs with Ancestor Sampling for Probabilistic Programs. In Artificial Intelligence and Statistics. arXiv:1501.06769 Onward! 2019, October 20-25, 2019, Athens, Greece David Tolpin

A Latent Dirichlet Allocation Model

Infergo:

type LDAModel struct {
    K     int       // num topics
    V     int       // num words
    M     int       // num docs
    N     int       // total word instances
    Word  []int     // word n
    Doc   []int     // doc for word n
    Alpha []float64 // topic prior
    Beta  []float64 // word prior
}

func (m *LDAModel) Observe(x []float64) float64 {
    ll := 0.0

    // Regularize the parameter vector
    ll += Normal.Logps(0, 1, x...)

    // Destructure parameters
    theta := make([][]float64, m.M)
    m.FetchSimplices(&x, m.K, theta)
    phi := make([][]float64, m.K)
    m.FetchSimplices(&x, m.V, phi)

    // Impose priors
    ll += Dirichlet{m.K}.Logps(m.Alpha, theta...)
    ll += Dirichlet{m.V}.Logps(m.Beta, phi...)

    // Condition on observations
    gamma := make([]float64, m.K)
    for in := 0; in != m.N; in++ {
        for ik := 0; ik != m.K; ik++ {
            gamma[ik] = math.Log(theta[m.Doc[in]-1][ik]) +
                math.Log(phi[ik][m.Word[in]-1])
        }
        ll += D.LogSumExp(gamma)
    }
    return ll
}

func (m *LDAModel) FetchSimplices(
    px *[]float64,
    k int,
    simplices [][]float64,
) {
    for i := range simplices {
        simplices[i] = make([]float64, k)
        D.SoftMax(model.Shift(px, k), simplices[i])
    }
}

Stan:

data {
    int K;           // num topics
    int V;           // num words
    int M;           // num docs
    int N;           // total word instances
    int w[N];        // word n
    int doc[N];      // doc for word n
    vector[K] alpha; // topic prior
    vector[V] beta;  // word prior
}

parameters {
    simplex[K] theta[M]; // topic dist for doc m
    simplex[V] phi[K];   // word dist for topic k
}

model {
    for (m in 1:M)
        theta[m] ~ dirichlet(alpha); // prior
    for (k in 1:K)
        phi[k] ~ dirichlet(beta);    // prior

    for (n in 1:N) {
        real gamma[K];
        for (k in 1:K)
            gamma[k] <- log(theta[doc[n],k])
                + log(phi[k,w[n]]);
        increment_log_prob(log_sum_exp(gamma));
    }
}
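The listings above define the models only. As a hypothetical usage sketch, not part of the original appendix, the Infergo model can be instantiated on a small made-up corpus and its log posterior evaluated directly through Observe; the corpus, hyperparameters, and main function below are illustrative assumptions, and in practice the model instance and parameter vector would be handed to one of Infergo's inference algorithms instead.

package main

import "fmt"

// Assumes the LDAModel type and its helpers from the listing above
// are available in the same package, with the imports they require.
func main() {
    m := &LDAModel{
        K:     2,                  // topics
        V:     3,                  // vocabulary size
        M:     2,                  // documents
        N:     4,                  // total word instances
        Word:  []int{1, 3, 2, 1},  // 1-based word ids
        Doc:   []int{1, 1, 2, 2},  // 1-based document ids
        Alpha: []float64{1, 1},    // symmetric topic prior
        Beta:  []float64{1, 1, 1}, // symmetric word prior
    }

    // Observe expects M unconstrained K-vectors (document-topic
    // distributions) followed by K unconstrained V-vectors
    // (topic-word distributions); FetchSimplices maps each slice
    // to a simplex through SoftMax.
    x := make([]float64, m.M*m.K+m.K*m.V)

    // Log posterior at the all-zero point, which corresponds to
    // uniform topic and word distributions.
    fmt.Println(m.Observe(x))
}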