
Proceedings of the 11th Python in Science Conference

July 16–21, 2012 • Austin, Texas

Aron Ahmadia • Jarrod Millman • Stéfan van der Walt

PROCEEDINGS OF THE 11TH PYTHON IN SCIENCE CONFERENCE

Edited by Aron Ahmadia, Jarrod Millman, and Stéfan van der Walt.

SciPy 2012, Austin, Texas, July 16–21, 2012

Copyright © 2012. The articles in the Proceedings of the Python in Science Conference are copyrighted and owned by their original authors. This is an open-access publication and is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For more information, please see: http://creativecommons.org/licenses/by/3.0/

ISSN: 2575-9752
https://doi.org/10.25080/Majora-54c7f2c8-00d

ORGANIZATION

Conference Chairs
   WARREN WECKESSER, Enthought, Inc.
   STÉFAN VAN DER WALT, UC Berkeley

Program Chairs
   MATT MCCORMICK, Kitware, Inc.
   ANDY TERREL, TACC, University of Texas

Program Committee
   ARON AHMADIA, KAUST
   W. NATHAN BELL, NVIDIA
   JOHN HUNTER, TradeLink
   MATTHEW G. KNEPLEY, University of
   KYLE MANDLI, University of Texas
   MIKE MCKERNS, Enthought, Inc.
   DAN SCHULT, Colgate University
   MATTHEW TERRY, LLNL
   MATTHEW TURK
   PETER WANG, Continuum Analytics

Tutorial Chairs
   DHARHAS POTHINA, Texas Water Development Board
   JONATHAN ROCHER, Enthought, Inc.

Sponsorship Chairs
   CHRIS COLBERT, Enthought, Inc.
   WILLIAM SPOTZ, Sandia National Laboratories

Program Staff
   JODI HAVRANEK, Enthought, Inc.
   JIM IVANOFF, Enthought, Inc.
   LAUREN JOHNSON, Enthought, Inc.
   LEAH JONES, Enthought, Inc.

SCHOLARSHIP RECIPIENTS


JUMP TRADING AND NUMFOCUS DIVERSITY SCHOLARSHIP RECIPIENTS

CONTENTS

Parallel High Performance Bootstrapping in Python   1
   Aakash Prasad, David Howard, Shoaib Kamil, Armando Fox

A Computational Framework for Plasmonic Nanobiosensing   6
   Adam Hughes

A Tale of Four Libraries   11
   Alejandro Weinstein, Michael Wakin

Total Recall: flmake and the Quest for Reproducibility   16
   Anthony Scopatz

Python’s Role in VisIt   23
   Cyrus Harrison, Harinarayan Krishnan

PythonTeX: Fast Access to Python from within LaTeX   30
   Geoffrey M. Poore

Self-driving Lego Mindstorms Robot   37
   Iqbal Mohomed

The Reference Model for Disease Progression   41
   Jacob Barhak

Fcm - A python library for flow cytometry   46
   Jacob Frelinger, Adam Richards, Cliburn Chan

Uncertainty Modeling with SymPy Stats   51
   Matthew Rocklin

QuTiP: A framework for the dynamics of open quantum systems using SciPy and Cython   56
   Robert J. Johansson, Paul D. Nation

cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications   62
   Mads Ruben Burgdorff Kristensen, Simon Andreas Frimann Lund, Troels Blum, Brian Vinter

OpenMG: A New Multigrid Implementation in Python   69
   Tom S. Bertalan, Akand W. Islam, Roger B. Sidje, Eric Carlson

PROC. OF THE 11th PYTHON IN SCIENCE CONF. (SCIPY 2012)

Parallel High Performance Bootstrapping in Python

Aakash Prasad∗‡, David Howard‡, Shoaib Kamil‡, Armando Fox‡


We use a combination of code-generation, code lowering, and just-in-time compilation techniques called SEJITS (Selective Embedded JIT Specialization) to generate highly performant parallel code for Bag of Little Bootstraps (BLB), a statistical sampling algorithm that solves the same class of problems as general bootstrapping, but which parallelizes better. We do this by embedding a very small domain-specific language into Python for describing instances of the problem and using expert-created code generation strategies to generate code at runtime for a parallel multicore platform. The resulting code can sample gigabyte datasets with performance comparable to hand-tuned parallel code, achieving near-linear strong scaling on a 32-core CPU, yet the Python expression of a BLB problem instance remains source- and performance-portable across platforms. This work represents another case study in a growing list of algorithms we have "packaged" using SEJITS in order to make high-performance implementations of the algorithms available to Python programmers across diverse platforms.

* Corresponding author: [email protected]
‡ University of California, Berkeley

Copyright © 2012 Aakash Prasad et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

A common task domain experts are faced with is performing statistical analysis on data. The most prevalent methods for doing this task (e.g. coding in Python) often fail to take advantage of the power of parallelism, which restricts domain experts from performing analysis on much larger data sets, and doing it much faster than they would be able to with pure Python.

The rate of growth of scientific data is rapidly outstripping the rate of single-core processor speedup, which means that scientific productivity is now dependent upon the ability of domain-expert, non-specialist programmers (productivity programmers) to harness both hardware and software parallelism. However, parallel programming has historically been difficult for productivity programmers, whose primary concern is not mastering platform-specific programming frameworks. At the same time, the methods available to harness parallel hardware platforms become increasingly arcane and specialized in order to expose maximum performance potential to efficiency programming experts. Several methods have been proposed to bridge this disparity, with varying degrees of success.

High-performance natively compiled scientific libraries (such as SciPy) seek to provide a portable, high-performance interface for common tasks, but the usability and efficiency of an interface often varies inversely with its generality. In addition, SciPy's implementations are sequential, due to both the wide variety of parallel programming models and the difficulty of selecting parameters such as degree of concurrency, thread fan-out, etc.

SEJITS [SEJITS] provides the best of both worlds by allowing very compact Domain-Specific Embedded Languages (DSELs) to be embedded in Python. Specializers are mini-compilers for these DSELs, themselves implemented in Python, which perform code generation and compilation at runtime; the specializers only intervene during those parts of the Python program that use classes belonging to the DSEL. BLB is the latest such specializer in a growing collection.

ASP ("ASP is SEJITS for Python") is a powerful framework for bringing parallel performance to Python using targeted just-in-time code transformation. The ASP framework provides a skinny-waist interface which allows multiple applications to be built and run upon multiple parallel frameworks by using a single run-time compiler, or specializer. Each specializer is a Python class which contains the tools to translate a function or functions written in Python into an equivalent function or functions written in one or more low-level efficiency languages. In addition to providing support for interfacing productivity code to multiple efficiency code back-ends, ASP includes several tools which help the efficiency programmer lower and optimize input code, as well as define the front-end DSL. Several specializers already use these tools to solve an array of problems relevant to scientific programmers [SEJITS].

Though creating a compiler for a DSL is not a new problem, it is one with which efficiency experts may not be familiar. ASP eases this task by providing accessible interfaces for AST transformation. The NodeTransformer interface in the ASP toolkit includes and expands upon CodePy's [CodePy] C++ AST structure, as well as providing automatic translation from Python to C++ constructs. By extending this interface, efficiency programmers can define their DSEL by modifying only those constructs which differ from standard Python, or intercepting specialized constructs such as special function names. This frees the specializer writer from re-writing boilerplate for common constructs such as branches or arithmetic operations.

ASP also provides interfaces for managing source variants and platforms, to complete the task of code lowering. The ASP framework allows the specializer writer to specify Backends, which represent distinct parallel frameworks or platforms. Each backend may store multiple specialized source variants, and includes simple interfaces for selecting new or best-choice variants, as well as compiling and running the underlying efficiency source codes. Coupled with the Mako templating language and ASP's AST transformation tools, efficiency programmers are relieved of writing and maintaining platform-specific boilerplate and tools, and can focus on providing the best possible performance for their specializer.
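ASP's NodeTransformer interface builds on the same visitor pattern as Python's standard ast.NodeTransformer. The following self-contained sketch (our own toy example, not ASP code) shows the kind of selective rewriting a specializer performs: only one special function name is intercepted, and every other construct passes through unchanged.

```python
import ast
import copy

class InlineMean(ast.NodeTransformer):
    """Toy transformer: rewrite calls to mean(x) into sum(x)/len(x),
    analogous to how a specializer intercepts only DSEL-specific
    constructs and leaves standard Python alone."""
    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "mean":
            arg = node.args[0]
            return ast.BinOp(
                left=ast.Call(func=ast.Name(id="sum", ctx=ast.Load()),
                              args=[arg], keywords=[]),
                op=ast.Div(),
                right=ast.Call(func=ast.Name(id="len", ctx=ast.Load()),
                               args=[copy.deepcopy(arg)], keywords=[]))
        return node

tree = ast.parse("result = mean(data)")
tree = ast.fix_missing_locations(InlineMean().visit(tree))
ns = {"data": [1.0, 2.0, 3.0]}
exec(compile(tree, "<dsel>", "exec"), ns)
print(ns["result"])  # 2.0
```

A real specializer would emit C++ AST nodes via CodePy at this point rather than Python ones, but the mechanics of intercepting one construct and passing the rest through are the same.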

Related Work

Prior work on BLB includes a serial implementation of the algorithm, as described in "The Big Data Bootstrap", and a Scala implementation that runs on the Spark cluster computing framework [Spark], as described in "A Scalable Bootstrap for Massive Data". The first paper shows that the BLB algorithm produces statistically robust results on a small data set with a linear estimator function. The second paper describes how BLB scales with large data sets in distributed environments.

BLB

BLB ("Bag of Little Bootstraps") is a method to assess the quality of a statistical estimator, θ(X), based upon subsets of a sample distribution X. θ might represent such quantities as the parameters of a regressor, or the test accuracy of a machine learning classifier. In order to calculate θ, subsamples of size K^γ, where K = |X| and γ is a real number between 0 and 1, are drawn n times without replacement from X, creating the independent subsets X_1, X_2, ..., X_n. Next, K elements are resampled with replacement from each subset X_i, m times. This procedure of resampling with replacement is referred to as bootstrapping. The estimator θ is applied to each bootstrap. These results are reduced using a statistical aggregator (e.g. mean, variance, margin of error, etc.) to form an intermediate estimate θ′(X_i). Finally, the mean of θ′ for each subset is taken as the estimate for θ(X). This method is statistically rigorous, and in fact reduces bias in the estimate compared to other bootstrap methods [BLB]. In addition, its structural properties lend themselves to efficient parallelization.

The sampling parameters, which comprise n, m, and γ, are not represented directly in the DSEL; in fact, they are not referenced anywhere therein. This is because the sampling parameters have pattern-level consequences, and have no direct bearing on the execution of users' computations. These values can be passed as keyword arguments to the specializer object when it is created, or the specializer may be left to choose reasonable defaults.

The final components of a problem instance are the input data. Much of the necessary information about the input data is gleaned by the specializer without referring to the DSEL. However, a major component of what to do with the input data is expressed using the DSEL's annotation capability. Argument annotations, as seen in figure 1 below, are used to determine whether or not a given input should be subsampled as part of the BLB pattern. This is essential for many tasks, because it allows the user to pass non-data information (e.g. a machine learning model vector) into the computation. Though the annotations are ultimately removed, the information they provide propagates as changes to the pattern within the execution template.

An example application of BLB is to do model verification. Suppose we have trained a classifier π : R^d → C, where d is the dimension of our feature vectors and C is the set of classes. We can define θ[Y] to be error[Y]/|Y|, where the error function is 1 if π(y) is not the true class of y, and 0 elsewhere. If we then choose arithmetic mean as a statistical aggregator, the BLB method using the θ we defined will provide an estimate of the test error of our classifier.
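The sampling scheme described above can be sketched in a few lines of pure, serial Python (a reference sketch for clarity only, not the specializer's generated code; all names are ours):

```python
import random
import statistics

def blb(data, estimator, reducer=statistics.mean,
        n=25, m=100, gamma=0.7):
    """Bag of Little Bootstraps: draw n subsamples of size K**gamma
    without replacement; take m bootstraps of size K (with replacement)
    from each subsample; reduce per subsample, then average."""
    K = len(data)
    sub_size = int(K ** gamma)
    subsample_estimates = []
    for _ in range(n):
        subsample = random.sample(data, sub_size)   # without replacement
        boots = [estimator(random.choices(subsample, k=K))  # with replacement
                 for _ in range(m)]
        subsample_estimates.append(reducer(boots))
    return statistics.mean(subsample_estimates)

random.seed(0)
data = [float(i) for i in range(1000)]
# Small n and m keep the demo fast; the true mean of `data` is 499.5.
est = blb(data, statistics.mean, n=5, m=20)
```

Note that each bootstrap has the size of the *full* data set, K, even though it is drawn from a subsample of size K^γ; this is the property the specializer later exploits with weight sets.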

DSEL for BLB

A BLB problem instance is defined by the estimators and reducers it uses, its sampling parameters, and its input data. Our BLB specializer exposes a simple but expressive interface which allows the user to communicate all of these elements using either pure Python or a simple DSEL. The DSEL, which is formally specified in Appendix A, is designed to concisely express the most common features of BLB estimator computations: position-independent iteration over large data sets, and dense linear algebra. The BLB algorithm was designed for statistical and loss-minimization tasks. These tasks share the characteristic of position-independent computation; they depend only on the number and value of the unique elements of the argument data sets, and not upon the position of these data points within the set. For this reason, the DSEL provides a pythonic interface for iteration, instead of a position-oriented style (i.e., subscripts and incrementing index variables) which is common in lower-level languages. Because most data sets which BLB operates on will have high-dimensional data, the ability to efficiently express vector operations is an important feature of the DSEL. All arithmetic operations and function calls which operate on data are replaced in the final code with optimized, inlined functions which automatically handle data of any size without changes to the source code. In addition to these facilities, common dense linear algebra operations may also be accessed via special function calls in the DSEL.

Figure 1. User-supplied code for model verification application using BLB specializer.

The Specializer: A Compiler for the BLB DSEL

The BLB specializer combines various tools, as well as components of the ASP framework and a few thousand lines of custom code, to inspect and lower productivity code at run time.

The BLB DSEL is accessed by creating a new Python class which uses the base specializer class, blb.BLB, as a parent. Specific methods corresponding to the estimator and reducer functions are written with the DSEL, allowing the productivity programmer to easily express aspects of a BLB computation which can be difficult to write efficiently. Though much of this code is converted faithfully from Python to C++ by the specializer, two important sets of constructs are intercepted and rewritten in an optimized way when they are lowered to efficiency code. The first such construct is the for loop. In the case of the estimator theta, these loops must be re-written to co-iterate over a weight set. As mentioned above, the bootstrap step of the algorithm samples with replacement a number of data points exponentially larger than the size of the set. A major optimization of this operation is to re-write the estimator to work with a weight set the same size as the subsample, whose weights sum to the size of the original data set. This is accomplished within the DSEL by automatically converting for loops over subsampled data sets into weighted loops, with weight sets drawn from an appropriate multinomial distribution for each bootstrap. When this is done, the specializer converts all the operations in the interior of the loop to weighted operations, which is why only augmented assignments are permitted in the interior of loops (Appendix A). The other set of constructs handled specially by the specializer are operators and function calls. These constructs are specialized as described in the previous section.

The specializer also decides which inputs should be loaded in and which should not, by comparing the size of a subsample to the size of the shared L2 cache; if the memory needed for a single thread would consume more than 40% of the resources, then subsamples will not be loaded in. The value of 40% is empirical, and determined for the particular experiments herein. In the future, this and other architecture-level optimizations will be made automatically by specializers by comparing the performance effects of such decisions on past problem instances.

The other major pattern-level decision for a BLB computation is choice of sampling parameters. These constitute the major efficiency/accuracy trade-off of the BLB approach.
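The weight-set rewrite described above can be made concrete with a small pure-Python sketch (illustrative only; the specializer emits C++): rather than materializing K resampled elements per bootstrap, draw one multinomial weight vector over the subsample, whose entries sum to K, and co-iterate.

```python
import random
from collections import Counter

def bootstrap_weights(sub_size, K):
    """Multinomial weights: equivalent to resampling K elements with
    replacement from a subsample of size sub_size, recording only how
    many times each index was drawn. The weights always sum to K."""
    counts = Counter(random.randrange(sub_size) for _ in range(K))
    return [counts.get(i, 0) for i in range(sub_size)]

def weighted_mean(subsample, weights):
    # Loop-interior operations become weighted: each element contributes
    # weight-many times, which is why only augmented accumulation is
    # needed inside the rewritten loop.
    return sum(w * x for w, x in zip(weights, subsample)) / sum(weights)

random.seed(1)
subsample = [2.0, 4.0, 6.0, 8.0]
w = bootstrap_weights(len(subsample), K=1000)
```

This turns each bootstrap from O(K) data movement into O(K^γ) arithmetic over weights, which is the source of much of the specializer's efficiency.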
By default, the specializer sets these parameters conservatively, favoring accuracy heavily over efficiency; the default sampling parameters are n = 25 subsamples, m = 100 bootstraps per subsample, and γ = 0.7. Though each of these values has clear performance implications, the specializer does not adjust them based on platform parameters because it does not include a mechanism to evaluate acceptable losses in accuracy.

Empirical evidence shows that accuracy declines sharply using γ less than 0.5 [BLB], though it does not increase much using a value higher than 0.7. A change of 0.1 in this value leads to an order-of-magnitude change in subsample size for data sets in the 10-100 GB range, so the smallest value which will attain the desired accuracy should be chosen. The number of subsamples taken also has a major impact on performance. The run time of a specialized computation in these experiments could be approximated to within 5% error using the formula t = ⌈n/c⌉ · s, where t is the total running time, c is the number of cores in use, and s is the time to compute the bootstraps of a single subsample in serial. Though the result from bootstraps of a given subsample will likely be close to the true estimate, at least 20 subsamples were needed in the experiments detailed here to reduce variance in the estimate to an acceptable level. Finally, the number of bootstraps per subsample determines how accurate an estimate is produced for each subsample. In the experiments described below, 40 bootstraps were used. In experiments not susceptible to noise, as few as 25 were used with acceptable results. Because the primary effect of additional bootstraps is to reduce the effect of noise and improve accuracy, care should be taken not to use too few.

Introspection begins when a specializer object is instantiated. When this occurs, the specializer uses Python's inspect module to extract the source code from the specializer object's methods named compute_estimate, reduce_bootstraps, and average. The specializer then uses Python's ast module to generate a Python abstract syntax tree for each method.

The next stage of specialization occurs when the specialized function is invoked. When this occurs, the specializer extracts salient information about the problem, such as the size and data type of the inputs, and combines it with information about the platform gleaned using ASP's platform detector. Along with this information, each of the three estimator ASTs is passed to a converter object, which transforms the Python ASTs to C++ equivalents, as well as performing optimizations. The converter objects referred to above perform the most radical code transformations, and more so than any other part of the specializer might be called a run-time compiler (with the possible exception of the C++ compiler invoked later on). Once each C++ AST is produced, it is converted into a Python string whose contents are a valid C++ function of the appropriate name. These function-strings, along with platform- and problem-specific data, are used as inputs to Mako templates to generate a C++ source file tailored for the platform and problem instance. Finally, CodePy is used to compile the generated source file and return a reference to the compiled function to Python, which can then be invoked.
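The introspection stage relies only on Python's standard inspect and ast modules. A minimal self-contained illustration of the AST-extraction step follows (in the real specializer, inspect.getsource pulls the method bodies out of the user's blb.BLB subclass at instantiation time; here we start from source text directly so the sketch runs on its own):

```python
import ast

# Hypothetical user-supplied estimator, as the specializer would see it.
src = '''
def compute_estimate(sample):
    total = 0.0
    for x in sample:
        total += x
    return total / len(sample)
'''

tree = ast.parse(src)        # Python AST, the input to the converter stage
func = tree.body[0]
assert isinstance(func, ast.FunctionDef)

# The converter would now locate each for loop and rewrite it to
# co-iterate over a bootstrap weight set.
loops = [n for n in ast.walk(func) if isinstance(n, ast.For)]
print(func.name, len(loops))  # compute_estimate 1
```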
In addition to code lowering and parallelization, the specializer is equipped to make pattern-level optimization decisions. These optimizations change the steps of the execution pattern, but do not affect the user's code. The best example of this in the BLB specializer is the decision of whether or not to load in subsamples. Subsamples of the full data set can be accessed by indirection to individual elements (a subsample is an array of pointers) or by loading the subsampled elements into a new buffer (loading in). Loading in subsamples encourages caching, and our experiments showed performance gains of up to 3x for some problem/platform combinations using this technique. However, as data sizes grow, the time spent moving data or contending for shared resources outweighs the caching benefit. Because the specializer has some knowledge of the platform and of the input data sizes, it is able to make predictions about how beneficial loading in will be, and can modify the efficiency-level code accordingly.

Evaluation

We evaluated the performance gains from using our SEJITS specializer by performing model verification of an SVM classifier on a subset of the Enron email corpus [ENRON]. We randomly selected 10% (approximately 120,000 emails) from the corpus to serve as our data set. From each email, we extracted the counts of all words in the email, as well as the user-defined directory the email was filed under. We then aggregated the word counts of all the emails to construct a Bag-of-Words model of our data set, and assigned classes based upon directory. In the interest of classification efficiency, we filtered the emails to use only those from the 20 most common classes, which preserved approximately 98% of our original data set. In the final count, our test data consisted of approximately 126,000 feature vectors and tags, with each feature vector composed of approximately 96,000 8-bit features.

Using the SVM-Multiclass [SVM] library, we trained an SVM classifier to decide the likeliest storage directory for an email based upon its bag-of-words representation. We trained the classifier on 10% of our data set, reserving the other 90% as a test set. We then applied the specialized code shown in figure 1 to estimate the accuracy of the classifier. We benchmarked the performance and accuracy of the specializer on a system using 4 Intel X7560 processors.

Our experiments indicate that our specialized algorithm was able to achieve performance gains of up to 31.6x with regards to the serial version of the same algorithm, and up to 22.1x with respect to other verification techniques. These gains did not come at the cost of greatly reduced accuracy; the results from repeated runs of the specialized code were both consistent and very close to the true population statistic.

Figure 2. Efficiency gains from specialized code.

As is visible from figure 2 above, our specialized code achieved near-perfect strong scaling. In the serial case, the computation took approximately 3478 seconds. By comparison, when utilizing all 32 available hardware contexts, the exact same productivity-level code returned in just under 110 seconds.

We also used SVM Multiclass' native verification utility to investigate the relative performance and accuracy of the specializer. SVM Multiclass' utility differs critically from our own in several ways: the former uses an optimized sparse linear algebra system, whereas the latter uses a general dense system; the former provides only a serial implementation; and the algorithm (traditional cross-validation) is different from ours. All of these factors should be kept in mind as results are compared. Nevertheless, the specializer garnered order-of-magnitude performance improvements once enough cores were in use. SVM Multiclass' utility determined the true population statistic in approximately 2200 seconds, making it faster than the serial incarnation of our specializer, but less efficient than even the dual-threaded version.

The native verification utility determined that the true error rate of the classifier on the test data was 67.86%. Our specializer's estimates yielded a mean error rate of 67.24%, with a standard deviation of 0.36 percentage points. Though the true statistic was outside one standard deviation from our estimate's mean, the specializer was still capable of delivering a reasonably accurate estimate very quickly.

Limitations and Future Work

Some of the limitations of our current specializer are that the targets are limited to OpenMP and Cilk. We would like to implement a GPU and a cloud version of the BLB algorithm as additional targets for our specializer. We'd like to explore the performance of a GPU version implemented in CUDA. A cloud version will allow us to apply the BLB specializer to problems involving much larger data sets than are currently supported. Another feature we'd like to add is the ability for our specializer to automatically determine targets and parameters based on the input data size and platform specifications.

Conclusion

Using the SEJITS framework, productivity programmers are able to easily express high-level computations while simultaneously gaining order-of-magnitude performance benefits. Because the parallelization strategy for a particular pattern of computation and hardware platform is often similar, efficiency expert programmers can make use of DSLs embedded in higher-level languages, such as Python, to provide parallel solutions to large families of similar problems. We were able to apply the ASP framework and the BLB pattern of computation to efficiently perform the high-level task of model verification on a large data set. This solution was simple to develop with the help of the BLB specializer, and efficiently took advantage of all available parallel resources.

The BLB specializer provides the productivity programmer not only with performance, but with performance portability. Many techniques for bringing performance benefits to scientific programming, such as pre-compiled libraries, autotuning, or parallel framework languages, tie the user to a limited set of platforms. With SEJITS, productivity programmers gain the performance benefits of a wide variety of platforms without changes to source code.

This specializer is just one of a growing catalogue of such tools, which will bring to bear expert parallelization techniques to a variety of the most common computational patterns. With portable, efficient, high-level interfaces, domain expert programmers will be able to easily create and maintain code bases in the face of evolving parallel hardware and networking trends.

Acknowledgements

Armando Fox and Shoaib Kamil provided constant guidance in the development of this specializer, as well as the ASP project. Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael Jordan developed the BLB algorithm, and published the initial paper on the subject, Bootstrapping Big Data. They also consulted on effective parallelization strategies for that algorithm. John Duchi and Yuchen Zhang helped finalize the experiment plan and select appropriate test data sets. Richard Xia and Peter Birsinger developed the first BLB specializer interface, and continued work on the shared-nothing cloud version of this specializer.

Appendix A: Formal Specification of DSEL

## NAME indicates a valid python name, with the added
## stipulation it not start with '_blb_'
## INT and FLOAT indicate decimal representations of
## 64 bit integers and IEEE floating point numbers,
## respectively
## NEWLINE, INDENT, and DEDENT stand for the respective
## whitespace elements

P ::= OUTER_STMT* RETURN_STMT
AUG ::= '+=' | '-=' | '*=' | '/='
NUM ::= INT | FLOAT
OP ::= '+' | '-' | '*' | '/' | '**'
COMP ::= '>' | '<' | '==' | '!=' | '<=' | '>='
BRANCH ::= 'if' NAME COMP NAME ':'
RETURN_STMT ::= 'return' NAME | 'return' CALL

CALL ::= 'sqrt(' NAME ')' | 'len(' NAME ')'
       | 'mean(' NAME ')' | 'pow(' NAME ',' INT ')'
       | 'dim(' NAME [',' INT] ')' | 'dtype(' NAME ')'
       | 'MV_solve(' NAME ',' NAME ',' NAME ')'
       | NAME OP CALL | CALL OP NAME | CALL OP CALL | NAME OP NAME
       | NAME '*' NUM | CALL '*' NUM
       | NAME '/' NUM | CALL '/' NUM
       | NAME '**' NUM | CALL '**' NUM

INNER_STMT ::= NAME '=' NUM
             | NAME '=' 'vector(' INT [',' INT]* ', type=' NAME ')'
             | NAME AUG CALL
             | NAME '=' 'index(' [INT] ')' OP NUM
             | NAME '=' NUM OP 'index(' [INT] ')'
             | BRANCH NEWLINE INDENT INNER_STMT* DEDENT
             | 'for' NAME [',' NAME]* 'in' NAME [',' NAME]* ':'
                   NEWLINE INDENT INNER_STMT* DEDENT

OUTER_STMT ::= NAME '=' NUM
             | NAME '=' 'vector(' INT [',' INT]* ', type=' NAME ')'
             | NAME '=' CALL
             | NAME AUG CALL
             | 'for' NAME [',' NAME]* 'in' NAME [',' NAME]* ':'
                   NEWLINE INDENT INNER_STMT* DEDENT
             | BRANCH NEWLINE INDENT OUTER_STMT* DEDENT
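For concreteness, here is a hypothetical estimator pair written in (roughly) this DSEL subset. The mean helper below is a plain-Python stand-in for the DSEL's mean() intrinsic, and the function names mirror the methods the specializer looks for; this is our illustration, not code from the paper:

```python
def mean(v):
    # Plain-Python stand-in for the DSEL's mean() intrinsic, which the
    # specializer would replace with an optimized inlined C++ function.
    return sum(v) / len(v)

def compute_estimate(sample):
    # OUTER_STMT: assignment from NUM, then a for loop whose body uses
    # only augmented assignment, as the DSEL grammar requires so that
    # the loop can be lowered to a weighted loop.
    total = 0.0
    for x in sample:
        total += x
    return total / len(sample)

def reduce_bootstraps(bootstraps):
    # RETURN_STMT of the 'return' CALL form.
    return mean(bootstraps)
```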

REFERENCES

[SEJITS] S. Kamil, D. Coetzee, A. Fox. "Bringing Parallel Performance to Python with Domain-Specific Selective Embedded Just-In-Time Specialization". In SciPy 2011.
[BLB] A. Kleiner, A. Talwalkar, P. Sarkar, M. Jordan. "Bootstrapping Big Data". In NIPS 2011.
[CodePy] CodePy Homepage: http://mathema.tician.de/software/codepy
[ENRON] B. Klimt and Y. Yang. "The Enron corpus: A new dataset for email classification research". In ECML 2004.
[SVM] SVM-Multiclass Homepage: http://svmlight.joachims.org/svm_multiclass.html
[Spark] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". In USENIX NSDI 2012.

A Computational Framework for Plasmonic Nanobiosensing

Adam Hughes∗‡


Abstract—Basic principles in biosensing and nanomaterials precede the introduction of a novel fiber optic sensor. Software limitations in the biosensing domain are presented, followed by the development of a Python-based simulation environment. Finally, the current state of spectral data analysis within the Python ecosystem is discussed.

Index Terms—gold nanoparticles, fiber optics, biosensor, Python, immunoassay, plasmonics, proteins, metallic , IPython, Traits, Chaco, Pandas, SEM

Introduction

Because of their unique optical properties, metallic colloids, especially gold nanoparticles (AuNPs), have found novel applications in biology. They are utilized in the domain of nanobiosensing as platforms for biomolecule recognition. Nanobiosensing refers to the incorporation of nanomaterials into biosensing instrumentation. Sensors whose primary signal transduction mechanism is the interaction of light and metallic colloids are known as plasmonic sensors.1

Plasmonic sensors are constructed by depositing metallic layers (bulk or colloidal) onto a substrate such as glass, or in our case, onto a stripped optical fiber. Upon illumination, they relay continuous information about their surrounding physical and chemical environment. These sensors behave similarly to conventional assays with the added benefits of increased sensitivity, compact equipment, reduced sample size, low cost, and real-time data acquisition. Despite these benefits, nanobiosensing research in general is faced with several hindrances.

It is often difficult to objectively compare results between research groups, and sometimes even between experimental trials. This is mainly because the performance of custom sensors is highly dependent on design specifics as well as experimental conditions. The extensive characterization process found in commercial biosensors2 exceeds the resources and capabilities of the average research group. This is partially due to a disproportionate investment in supplies and manpower; however, it is also due to a dearth of computational resources.3 The ad-hoc nature of empirical biosensor characterization often leads to asystematic experimental designs, implementations and conclusions between research groups. To compound matters, dedicated software is not evolving fast enough to keep up with new biosensing technology. This lends an advantage to commercial biosensors, which use highly customized software to both control the experimental apparatus and extract underlying information from the data. Without a general software framework to develop similar tools, it is unreasonable to expect the research community to achieve the same breadth in application when pioneering new nanobiosensing technology.

Publications on novel biosensors often belaud improvement in sensitivity and cost over commercial alternatives; however, the aforementioned shortcomings relegate many new biosensors to prototype limbo. Until the following two freeware components are developed, new biosensors, despite any technical advantages over their commercial counterparts, will fall short in applicability:

1) A general and systematic framework for the development and objective quantification of nanobiosensors.
2) Domain-tailored software tools for conducting simulations and interpreting experimental data.

In regard to both points, analytical methods have been developed to interpret various aspects of plasmonic sensing [R1]; however, they have yet to be translated into a general software framework. Commercial software for general optical system design is available; however, it is expensive and not designed to encompass nanoparticles and their interactions with biomolecules. In the following sections, an effort to begin such computational endeavors is presented. The implications are relevant to plasmonic biosensing in general.

Optical Setup

We have developed an operational benchtop setup which records rapid spectroscopic measurements in the reflected light from the end of an AuNP-coated optical fiber. The nanoparticles are deposited on the flat endface of the fiber, in contrast to the commonly encountered method of depositing the AuNPs axially along an etched region of the fiber [R2], [R3]. In either configuration, only the near-field interaction affects the signal, with no interference from far-field effects. The simple design is outlined in Fig. 1 (left). Broadband emission from a white LED is focused through a 10× objective (not shown) into the 125 µm core diameter of an optical fiber. AuNP-coated probes are connected into the setup via an optical splice.

* Corresponding author: [email protected]
‡ The George Washington University

Copyright © 2012 Adam Hughes. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. This exposition is germane to plasmonic sensors, more so than to other nanobiosensor subgroups.
2. Biacore© and ForteBio© are examples of prominent nanobiosensing companies.

Fig. 1: Left: Bench-top fiber optic configuration schematic, adapted from [R4]. Right: Depiction, from bottom to top, of the fiber endface, APTMS monolayer, AuNPs, and antibody-antigen coating.

optical splice. The probes are dipped into solutions containing biomolecules, and the return light is captured by an OceanOptics© USB2000 benchtop spectrometer and output as ordered series data.

Fiber Surface Functionalization

16 nm gold nanospheres are attached to the optical fiber via a linker molecule, (3-aminopropyl)trimethoxysilane, or APTMS.4 The surface of the gold may be further modified to the specifications of the experiment. One common modification is to covalently bind a ligand to the AuNPs using Dithiobis[succinimidyl propionate] (Lomant's reagent), and then use the fiber to study specificity in antibody-antigen interactions. This is depicted in Fig. 1 (right).

Fig. 2: Three size regimes of the optical setup. Top: Optical fiber with an AuNP-coated endface. Left: Coarse approximation of a multilayered material. Right: Individual nanoparticles with protein shells.

Modeling the Optical System in Python

The simulation codebase may be found at http://github.com/hugadams/fibersim.

Nanobiosensing resides at an intersection of optics, biology, and material science. To simulate such a system requires background in all three fields and new tools to integrate the pieces seamlessly. Nanobiosensor modeling must describe phenomena at three distinct length scales. In order of increasing length, these are:

1) A description of the optical properties of nanoparticles with various surface coatings.
2) The properties of light transmission through multilayered materials at the fiber endface.
3) The geometric parameters of the optics (e.g. fiber diameter, placement of the nanoparticle monolayer, etc.).

The size regimes, shown in Fig. 2, will be discussed separately in the following subsections. It is important to note that the computational description of a material is identical at all three length scales. As such, general classes have been created and interfaced to accommodate material properties from models [R5] and datasets [R6]. This allows a wide variety of experimental and theoretical materials to be easily incorporated into the simulation environment.

Modeling Nanoparticles

AuNPs respond to their surrounding environment through a phenomenon known as surface plasmon resonance. Incoming light couples to free electrons and induces surface oscillations on the nanoparticle. The magnitude and dispersion of these oscillations are highly influenced by the dielectric media in direct contact with the particle's surface. As such, the scattering and absorption properties of the gold particles will change in response to changes in solution, as well as to the binding of biomolecules.

To model AuNPs, the complex dielectric function5 of gold is imported from various sources, both from material models [R5] and datasets [R6]. The optical properties of bare and coated spheroids are described analytically by Mie theory [R7]. Scattering and absorption coefficients are computed using spherical Bessel functions from the scipy.special library of mathematical functions. Special routines and packages are available for computing the optical properties of non-spheroidal colloids; however, they have not yet been incorporated in this package.

AuNP modeling is straightforward; however, parametric analysis is uncommon. Enthought's Traits and Chaco packages are used extensively to provide interactivity. To demonstrate a use case, consider a gold nanoparticle with a shell of protein coating. The optical properties of the core-shell particle may be obtained analytically using Mie theory;6 however, analysis performed at a coarser scale requires this core-shell system to be approximated as a single composite particle (Fig. 3). With Traits, it is very easy for the user to interactively adjust the mixing parameters to ensure that the scattering properties of the approximated composite are as close as possible to those of the analytical core-shell particle. In this example, and in others, interactivity is favorable over complex optimization techniques.

Modeling Material Layers

The fiber endface at a more coarse resolution resembles a multilayered dielectric stack of homogeneous materials, also referred to as a thin film (Fig. 5). In the limits of this approximation, the reflectance, transmittance, and absorbance through the slab can

3. Axial deposition allows for more control of the fiber's optical properties; however, it makes probe creation more difficult and less reproducible.
4. APTMS is a heterobifunctional crosslinker that binds strongly to glass and gold through silane and amine functional groups, respectively.
5. The dielectric function and shape of the particle are the only parameters required to compute its absorption and scattering cross sections.
6. This assumes that the shell is perfectly modeled; in practice, the optical properties of protein mixtures are approximated by a variety of mixing models and methods.

PROC. OF THE 11th PYTHON IN SCIENCE CONF. (SCIPY 2012)

Fig. 3: Left: A nanoparticle with heterogeneous core and shell dielectrics (ε1, ε2) and radius r = r1 + r2. Right: Composite approximation of a homogeneous material with effective dielectric ε0 and radius r0.

Fig. 5: Left: Electromagnetic field components at each interface of a dielectric slab [R7]. Right: Illustration of a multilayered material whose optical properties would be described by such treatment.
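The single-sphere Mie computation described in the Modeling Nanoparticles section can be sketched with SciPy's spherical Bessel functions. This is an illustrative implementation of the standard Bohren and Huffman scattering coefficients [R7] for a non-absorbing sphere, not the fibersim code; the index m = 1.5 and size parameter x = 0.1 are placeholder values rather than values for gold (which requires a complex dielectric function and hence Bessel functions evaluated at a complex argument).

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def mie_q(m, x):
    """Extinction and scattering efficiencies of a homogeneous sphere.

    m: relative refractive index (real here, i.e. a non-absorbing sphere).
    x: size parameter 2*pi*r/wavelength.
    """
    n_max = int(x + 4.0 * x ** (1.0 / 3.0) + 2.0)   # standard truncation rule
    n = np.arange(1, n_max + 1)

    # Riccati-Bessel functions psi_n(z) = z j_n(z), xi_n(z) = z (j_n + i y_n)(z)
    jx, jx1 = spherical_jn(n, x), spherical_jn(n - 1, x)
    yx, yx1 = spherical_yn(n, x), spherical_yn(n - 1, x)
    jm, jm1 = spherical_jn(n, m * x), spherical_jn(n - 1, m * x)

    psi, dpsi = x * jx, x * jx1 - n * jx
    psim, dpsim = m * x * jm, m * x * jm1 - n * jm
    h, h1 = jx + 1j * yx, jx1 + 1j * yx1
    xi, dxi = x * h, x * h1 - n * h

    # Bohren & Huffman coefficients a_n, b_n [R7]
    a = (m * psim * dpsi - psi * dpsim) / (m * psim * dxi - xi * dpsim)
    b = (psim * dpsi - m * psi * dpsim) / (psim * dxi - m * xi * dpsim)

    q_ext = (2.0 / x ** 2) * np.sum((2 * n + 1) * (a + b).real)
    q_sca = (2.0 / x ** 2) * np.sum((2 * n + 1) * (abs(a) ** 2 + abs(b) ** 2))
    return q_ext, q_sca

# Placeholder values: a small dielectric sphere, not an AuNP.
q_ext, q_sca = mie_q(1.5, 0.1)
```

For a small non-absorbing sphere, q_sca approaches the Rayleigh limit (8/3) x^4 ((m^2-1)/(m^2+2))^2 and q_ext equals q_sca, which makes the sketch easy to sanity-check.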

be calculated recursively for n-layered systems [R8]. Thin film optical software is commercially available and used extensively in optical engineering, for example, in designing coatings for sunglasses. Unfortunately, a free, user-friendly alternative is not available.7 In addition, these packages are usually not designed for compatibility with nanomaterials; therefore, we have begun development of an extensible thin film Python API that incorporates nanomaterials. This is ideal, for example, in simulating a fiber immersed in a solvent with a variable refractive index (e.g. a solution with changing salinity). The program will ensure that as the solvent changes, the surrounding shell of the nanoparticle, and hence its extinction spectra, will update accordingly.

Fig. 4: Screenshot of an interactive TraitsUI program for modeling the scenario in Fig. 3: the extinction spectra of a protein-coated AuNP (blue) compared to that of an equivalent core-shell composite (red).

Optical Configurations and Simulation Environment

With the material and multilayer APIs in place, it is straightforward to incorporate an optical fiber platform. The light source and fiber parameters merely constrain the initial conditions of light entering the multilayer interface; thus, once the correct multilayered environment is established, it is easy to compare performance between different fiber optic configurations. Built-in parameters already account for the material makeup and physical dimensions of many commercially available optical fibers. A phase angle has been introduced to distinguish nanomaterial deposition on the fiber endface from axial deposition. This amounts to a 90° rotation of the incident light rays at the multilayered interface.8

The entire application was designed for exploratory analysis, so adjusting most parameters will automatically trigger system-wide updates. To run simulations, one merely automates the setting of Trait attributes in an iterative manner. For example, by iterating over a range of values for the index of refraction of the AuNP shells, one effectively simulates materials binding to the AuNPs. After each iteration, NumPy arrays are stored for the updated optical variables such as the extinction spectra of the particles, dielectric functions of the mixed layers, and the total light reflectance at the interface. All data output is formatted as ordered series to mimic the actual output of experiments; thus, simulations and experiments can be analyzed side-by-side without further processing. With this workflow, it is quite easy to run experiments and simulations in parallel as well as compare a variety of plasmonic sensors objectively.

Data Analysis

Our workflow is designed to handle ordered series spectra generated from both experiment and simulation. The Python packages IPython, Traits, and Pandas synergistically facilitate swift data processing and visualization. Biosensing results are information-rich, both in the spectral and temporal dimensions. Molecular interactions on the AuNP's surface have spectral signatures discernible from those of environmental changes. For example, the slow timescale of protein binding events is orders of magnitude longer than the rapid temporal response to environmental changes.

Fig. 6 illustrates a fiber whose endface has been coated with gold nanoparticles and subsequently immersed in water. The top left plot shows the reflected light spectrum as a function of time. When submerged in water, the signal is very stable. Upon the addition of

7. Open-source thin film software is often limited in scope and seldom provides a user interface, making an already complex physical system more convoluted.
8. The diameter of the optical fiber, as well as the angle at which light rays interact with the material interface, has a drastic effect on the system because each light mode contributes differently to the overall signal, which is the summation over all modes.
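The recursive thin-film calculation cited above [R8] can be sketched with the standard characteristic-matrix method at normal incidence. This is a minimal illustration, not the thin film API under development; the indices and thicknesses are invented (a quarter-wave antireflection coating on an n = 2.25 substrate, which nulls the reflectance when the film index is the square root of the substrate index).

```python
import numpy as np

def reflectance(n_layers, d_layers, n_in, n_out, wavelength):
    """Normal-incidence reflectance of a dielectric stack via the
    characteristic (transfer) matrix method."""
    M = np.eye(2, dtype=complex)
    for n, d in zip(n_layers, d_layers):
        delta = 2.0 * np.pi * n * d / wavelength   # phase thickness of the layer
        layer = np.array([[np.cos(delta), 1j * np.sin(delta) / n],
                          [1j * n * np.sin(delta), np.cos(delta)]])
        M = M @ layer
    B, C = M @ np.array([1.0, n_out])
    r = (n_in * B - C) / (n_in * B + C)
    return abs(r) ** 2

# Bare interface vs. a quarter-wave coating (illustrative values).
R_bare = reflectance([], [], 1.0, 2.25, 500.0)
R_coated = reflectance([1.5], [500.0 / (4 * 1.5)], 1.0, 2.25, 500.0)
```

With no layers the function reduces to the Fresnel result ((n_in - n_out)/(n_in + n_out))^2, and the quarter-wave case drives the reflectance to zero, which provides two quick checks on the recursion.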

Fig. 6: Temporal evolution (top) and spectral absorbance (bottom) of the light reflectance at the fiber endface due to a protein-protein interaction (left) as opposed to the stepwise addition of glycerin (right).

micromolar concentrations of Bovine Serum Albumin (BSA), the signal steadily increases as the proteins in the serum bind to the gold. About an hour after BSA addition, the nanoparticle binding sites saturate and the signal plateaus.

Fig. 6 (top right) corresponds to a different situation. Again, an AuNP-coated fiber is immersed in water. Instead of proteins, glycerin droplets are added. The fiber responds to these refractive index changes in an abrupt, stepwise fashion. Whereas the serum binding event evolves over a timescale of about two hours, the response to an abrupt environmental change takes mere seconds. This is a simple demonstration of how timescale provides insights into the physiochemical nature of the underlying process.

The dataset's spectral dimension can be used to identify physiochemical phenomena as well. Absorbance plots corresponding to BSA binding and glycerin addition are shown at the bottom of Fig. 6. These profiles tend to depend on the size of the biomolecules in the interaction. The spectral profile of BSA-AuNP binding, for example, is representative of other large proteins binding to gold. Similarly, index changes from saline, buffers, and other viscous solutions are consistent with the dispersion profile of glycerin. Small biomolecules such as amino acids have yet another spectral signature (not shown), as well as a timestamp that splits the difference between protein binding and refractive index changes. This surprising relationship between the physiochemistry of an interaction and its temporal and spectral profiles aids in the interpretation of convoluted results in complex experiments.

Consistent binding profiles require similar nanoparticle coverage between fibers. If the coating process is lithographic, it is easier to ensure consistent coverage; however, many plasmonic biosensors are created through a wet crosslinking process similar to the APTMS deposition described here. Wet methods are more susceptible to extraneous factors; yet remarkably, we can use the binding profile as a tool to monitor and control nanoparticle deposition in realtime.

Fig. 7 (top) is an absorbance plot of the deposition of gold nanoparticles onto the endface of an optical fiber (the dataset begins at y = 1). As the nanoparticles accumulate, they initially absorb signal, resulting in a drop in light reflectance; however, eventually the curves invert and climb rapidly. This seems to suggest the existence of a second process; however, simulations have confirmed that this inflection is merely a consequence of the nanoparticle film density and its orientation on the fiber. The spectral signature of the AuNPs may be observed by timeslicing the data (yellow curves) and renormalizing to the first curve in the subset. This is plotted in Fig. 7 (bottom), and clearly shows spectral dispersion with major weight around λ = 520 nm, the surface plasmon resonance peak of our gold nanoparticles.

Fig. 7: Top: Absorbance plot of the real-time deposition of AuNPs onto an optical fiber. Bottom: A time-slice later in the dataset shows that the signal is dominated by the signal at the resonance peak for gold, λSPR ≈ 520 nm. This exemplifies the correct timescale over which spectral events manifest.

This approach to monitoring AuNP deposition not only allows one to control coverage,9 but also provides information on deposition quality. Depending on various factors, gold nanoparticles may tend to aggregate into clusters, rather than form a monolayer. When this occurs, red-shifted absorbance profiles appear in the timeslicing analysis. Because simple plots like Fig. 7 contain so much quantitative and qualitative information about nanoparticle coverage, we have begun an effort to calibrate these curves to measured particle coverage using scanning electron microscopy (SEM) (Fig. 8).

The benefits of such a calibration are two-fold. First, it turns out that the number of AuNPs on the fiber is a crucial parameter for predicting relevant biochemical quantities such as the binding affinity of two ligands. Secondly, it is important to find several coverages that optimize sensor performance. There are situations when maximum dynamic range at low particle coverage is desirable, for example in measuring non-equilibrium binding kinetics. Because of mass transport limitations, estimations of binding affinity tend to be in error for densely populated monolayers. In addition, there are coverages that impair dynamic range. Thus, it is important to optimize and characterize sensor performance at various particle coverages. Although simulations can estimate this relationship, it should also be confirmed experimentally.

Since most non-trivial biosensing experiments contain multiple phases (binding, unbinding, purging of the sensor surface, etc.), the subsequent data analysis requires the ability to rescale, resample and perform other manual curations on-the-fly. Pandas provides a great tool set for manipulating series data in such a

9. The user merely removes the fiber from the AuNP solution when the absorbance reaches a preset value.
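A note for readers using current versions of pandas: the `.ix` indexer used in this paper's slicing examples has since been removed from the library, and label-based row slicing is now done with `.loc`. A self-contained sketch of the same ordered-series pattern, with invented wavelengths and column names:

```python
from io import StringIO
from pandas import read_csv

# Invented ordered-series data: rows indexed by wavelength (nm),
# columns are acquisition times.
tsv = ("wavelength\ttime1\ttime2\n"
       "450.0\t0.10\t0.12\n"
       "520.0\t0.35\t0.40\n"
       "750.0\t0.08\t0.09\n")
data = read_csv(StringIO(tsv), sep='\t', index_col='wavelength')

subset = data[['time1', 'time2']]   # select data by column index
band = data.loc[500.0:750.0]        # slice rows by wavelength label (was .ix)
```

Because `.loc` slices by label rather than position, the 500.0–750.0 slice keeps every row whose wavelength falls in that closed interval, which is exactly the spectral-window operation described in the text.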

proving invaluable for interacting with and visualizing results. Unexpected physiochemical identifiers appear consistently within experimental results. These binding profiles not only provide new qualitative insights, but with the help of SEM imaging, may soon open new avenues towards the difficult task of quantifying biosen- sor output. Python has proven invaluable to our research, and just as it has suffused the domains of astronomy and finance, seems primed to emerge as the de-facto design platform in biosensing and its related fields.

Fig. 8: SEM images of fiber endfaces with 25% (left) and 5% (right) AuNP surface coverage at 30,000× magnification.

Acknowledgements

I would like to thank my advisor, Dr. Mark Reeves, for his devoted guidance and support. I owe a great debt to Annie Matsko for her dedication in the lab and assistance in drafting this document. In regard to the programming community, I must foremost thank Enthought and the other sponsors of the SciPy 2012 conference. Their generous student sponsorship program made it possible for me to attend, and for that I am grateful. Although indebted to countless members of the Python community, I must explicitly thank Jonathan March, Robert Kern and Stéfan van der Walt for their patience in helping me through various programming quandaries. Thank you De-Hao Tsai and Vincent Hackley at the Material Measurement Laboratory at NIST for your helpful discussions and for allowing our group to use your zeta-potential instrumentation. Finally, I must thank the Research Fellowship Program, the Luther Rice Collaborative Research Fellowship program, and the George Washington University for the Knox fellowship for generous financial support.

manner. For example, slicing a set of ordered series data by rows (spectral dimension) and columns (temporal dimension) is quite simple:

## Read series data from a tab-delimited
## file into a pandas DataFrame object
from pandas import read_csv
data = read_csv('path to file', sep='\t')

## Select data by column index
data[['time1', 'time2']]

## Slice data by row label (wavelength)
data.ix[500.0:750.0]

By interfacing to Chaco, and to the Pandas plotting interface, one can slice, resample and visualize interesting regions in the dataspace quite easily. Through these packages, it is possible for non-computer scientists to not just visualize, but to dynamically explore the dataset. The prior examples of BSA and glycerin demonstrated just how much information could be extracted from the data using only simple, interactive methods.

REFERENCES
[R1] Anuj K. Sharma, B. D. Gupta. Fiber Optic Sensor Based on Surface Plasmon Resonance with Nanoparticle Films. Photonics and Nanostructures, 3:30-37, 2005.
[R2] Ching-Te Huang, Chun-Ping Jen, Tzu-Chien Chao. A Novel Design of Grooved Fibers for Fiber-optic Localized Plasmon Resonance Biosensors. Sensors, 9:15, August 2009.
[R3] Wen-Chi Tsai, Pi-Ju Rini Pai. Surface Plasmon Resonance-based Immunosensor with Oriented Immobilized Antibody Fragments on a Mixed Self-Assembled Monolayer for the Determination of Staphylococcal Enterotoxin B. Microchimica Acta, 166(1-2):115–122, February 2009.
[R4] Mitsui, Handa, Kajikawa. Optical Fiber Affinity Biosensor Based on Localized Surface Plasmon Resonance. Applied Physics Letters, 85(18):320–340, November 2004.
[R5] Etchegoin, Le Ru, Meyer. An Analytic Model for the Optical Properties of Gold. The Journal of Chemical Physics, 125:164705, 2006.
[R6] Johnson, Christy. Optical Constants of the Noble Metals. Physical Review B, 6:4370–4379, 1972.
[R7] Bohren, Huffman. Absorption and Scattering of Light by Small Particles. Wiley Publishing, 1983.
[R8] Orfanidis, Sophocles. Electromagnetic Waves and Antennas. 2008.
[R9] Isao Noda, Yukihiro Ozaki. Two-Dimensional Correlation Spectroscopy. Wiley, 2004.

Our interactive approach is in contrast to popular all-in-one analysis methods. In Two-Dimensional Correlation Analysis (2DCA) [R9], for example, cross correlations of the entire dataset are consolidated into two contour plots. These plots tend to be difficult to interpret,10 and become intractable for multi-staged events. Additionally, under certain experimental conditions they cannot be interpreted at all. It turns out that much of the same information provided by 2DCA can be ascertained using the simple, dynamic analysis methods presented here. This is not to suggest that techniques like 2DCA are disadvantageous, merely that some of the results may be obtained more simply. Perhaps in the future, transparent, interactive approaches will constitute the core of the spectral data analysis pipeline, with sophisticated techniques like 2DCA adopting a complementary role.

Conclusions

A benchtop nanobiosensor has been developed for the realtime detection of biomolecular interactions. It, as well as other emergent biosensing technologies, is hindered by a lack of dedicated open-source software. In an effort to remedy this, prototypical simulation and analysis tools have been developed to assist with our plasmonic sensor and certainly have the potential for wider applicability. Scientific Python libraries, especially Chaco and Pandas, reside at the core of our data analysis toolkit and are

10. 2DCA decomposes series data into orthogonal synchronous and asynchronous components. By applying the so-called Noda's rules, one can then analyze the resultant contour maps and infer information about events unfolding in the system.

A Tale of Four Libraries

Alejandro Weinstein‡∗, Michael Wakin‡


Abstract—This work describes the use of some scientific Python tools to solve information gathering problems using Reinforcement Learning. In particular, we focus on the problem of designing an agent able to learn how to gather information in linked datasets. We use four different libraries—RL-Glue, Gensim, NetworkX, and scikit-learn—during different stages of our research. We show that, by using NumPy arrays as the default vector/matrix format, it is possible to integrate these libraries with minimal effort.

Index Terms—reinforcement learning, latent semantic analysis, machine learning

* Corresponding author: [email protected]
‡ The EECS Department of the Colorado School of Mines

Copyright © 2012 Alejandro Weinstein et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

In addition to bringing efficient array computing and standard mathematical tools to Python, the NumPy/SciPy libraries provide an ecosystem where multiple libraries can coexist and interact. This work describes a success story where we integrate several libraries, developed by different groups, to solve some of our research problems.

Our research focuses on using Reinforcement Learning (RL) to gather information in domains described by an underlying linked dataset. We are interested in problems such as the following: given a Wikipedia article as a seed, find other articles that are interesting relative to the starting point. Of particular interest is to find articles that are more than one click away from the seed, since these articles are in general harder to find by a human.

In addition to the staples of scientific Python computing NumPy, SciPy, Matplotlib, and IPython, we use the libraries RL-Glue [Tan09], NetworkX [Hag08], Gensim [Reh10], and scikit-learn [Ped11].

One of the contributions of our research is the idea of representing the items in the datasets as vectors belonging to a linear space. To this end, we build a Latent Semantic Analysis (LSA) [Dee90] model to project documents onto a vector space. This allows us, in addition to being able to compute similarities between documents, to leverage a variety of RL techniques that require a vector representation. We use the Gensim library to build the LSA model. This library provides all the machinery to build, among other options, the LSA model. One place where Gensim shines is in its capability to handle big data sets, like the entirety of Wikipedia, that do not fit in memory. We also store the vector representation of the items as a property of the NetworkX nodes.

Finally, we also use the manifold learning capabilities of scikit-learn, like the ISOMAP algorithm [Ten00], to perform some exploratory data analysis. By reducing the dimensionality of the LSA vectors obtained using Gensim from 400 to 3, we are able to visualize the relative position of the vectors together with their connections.

Source code to reproduce the results shown in this work is available at https://github.com/aweinstein/a_tale.

Reinforcement Learning

The RL paradigm [Sut98] considers an agent that interacts with an environment described by a Markov Decision Process (MDP). Formally, an MDP is defined by a state space X, an action space A, a transition function P, and a reward function r. At a given sample time t = 0, 1, ... the agent is at state x_t ∈ X, and it chooses action a_t ∈ A. Given the current state x and selected action a, the probability that the next state is x' is determined by P(x, a, x'). After reaching the next state x', the agent observes an immediate reward r(x'). Figure 1 depicts the agent-environment interaction. In an RL problem, the objective is to find a function π : X → A, called the policy, that maximizes the total expected reward

    R = E[ ∑_{t=1}^∞ γ^t r(x_t) ],

where γ ∈ (0, 1) is a given discount factor. Note that typically the agent does not know the functions P and r, and it must find the optimal policy by interacting with the environment. See Szepesvári [Sze10] for a detailed review of the theory of MDPs and the different algorithms used in RL.

We use the RL-Glue library for our RL experiments. This library provides the infrastructure to connect an environment and an agent, each one described by an independent Python program.

We represent the linked datasets we work with as graphs. For this we use NetworkX, which provides data structures to efficiently represent graphs, together with implementations of many classic graph algorithms. We use NetworkX graphs to describe the environments implemented using RL-Glue. We also use these graphs to create, analyze and visualize graphs built from unstructured data.

We implement the RL algorithms using the RL-Glue library [Tan09]. The library consists of the RL-Glue Core program and a set of codecs for different languages1 to communicate with the library. To run an instance of an RL problem one needs to write three different programs: the environment, the agent, and the experiment. The environment and the agent programs match exactly the corresponding elements of the RL framework, while

In the context of a collection of documents, or corpus, word frequency is not enough to assess the importance of a term.
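When the functions P and r are known, the optimal policy defined in the Reinforcement Learning section can be computed offline by value iteration; the following library-free sketch uses an invented two-state, two-action MDP (the paper's agents, by contrast, learn by interacting through RL-Glue without knowing P and r).

```python
import numpy as np

# Invented MDP: P[x, a, x'] is the transition probability, r[x'] is the
# reward observed on arriving at state x', matching the text's convention.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # in state 0, action 0 stays in state 0
P[0, 1, 1] = 1.0   # in state 0, action 1 moves to state 1
P[1, 0, 0] = 1.0
P[1, 1, 1] = 1.0
r = np.array([0.0, 1.0])
gamma = 0.9

# Value iteration: V(x) <- max_a sum_x' P(x, a, x') [r(x') + gamma V(x')]
V = np.zeros(2)
for _ in range(500):
    Q = (P * (r + gamma * V)).sum(axis=2)   # Q[x, a]
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)   # the greedy policy pi(x)
```

Since the rewarding state is absorbing under action 1, the optimal policy moves to (and stays in) state 1, and its value converges to 1/(1 - γ).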
For this reason, we introduce the quantity document frequency dft , defined to be the number of documents in the collection that contain term t. We can now define the inverse document frequency (idf) as N Fig. 1: The agent-environment interaction. The agent observes the idft = log , current state x and reward r; then it executes action π(x) = a. dft where N is the number of documents in the corpus. The idf is a measure of how unusual a term is. We define the tf-idf weight of the experiment orchestrates the interaction between these two. The term t in document d as following code snippets show the main methods that these three tf-idf = tf idf . programs must implement: t,d t,d × t ################# environment.py ################# This quantity is a good indicator of the discriminating power of a class env(Environment): term inside a given document. For each document in the corpus def env_start(self): # Set the current state we compute a vector of length M, where M is the total number of terms in the corpus. Each entry of this vector is the tf-idf weight return current_state for each term (if a term does not exist in the document, the weight is set to 0). We stack all the vectors to build the M N term- def env_step(self, action): × # Set the new state according to document matrix C. # the current state and given action. Note that since typically a document contains only a small fraction of the total number of terms in the corpus, the columns of return reward the term-document matrix are sparse. The method known as Latent #################### agent.py #################### Semantic Analysis (LSA) [Dee90] constructs a low-rank approx- class agent(Agent): imation Ck of rank at most k of C. The value of k, also known def agent_start(self, state): as the latent dimension, is a design parameter typically chosen # First step of an experiment to be in the low hundreds. This low-rank representation induces return action a projection onto a k-dimensional space. 
The similarity between the vector representation of the documents is now computed after def agent_step(self, reward, obs): projecting the vectors onto this subspace. One advantage of LSA # Execute a step of the RL algorithm is that it deals with the problems of synonymy, where different return action words have the same meaning, and polysemy, where one word has different meanings. ################# experiment.py ################## RLGlue.init() Using the Singular Value Decomposition (SVD) of the term- T RLGlue.RL_start() document matrix C = UΣV , the k-rank approximation of C is RLGlue.RL_episode(100) # Run an episode given by C = U V T , Note that RL-Glue is only a thin layer among these programs, k kΣk k allowing us to use any construction inside them. In particular, as where Uk, Σk, and Vk are the matrices formed by the k first described in the following sections, we use a NetworkX graph to columns of U, Σ, and V, respectively. The tf-idf representation model the environment. of a document q is projected onto the k-dimensional subspace as 1 T qk = Σk− Uk q. Computing the Similarity between Documents Note that this projection transforms a sparse vector of length M To be able to gather information, we need to be able to quantify into a dense vector of length k. how relevant an item in the dataset is. When we work with In this work we use the Gensim library [Reh10] to build the documents, we use the similarity between a given document and vector space model. To test the library we downloaded the top 100 the seed to this end. Among the several ways of computing most popular books from project Gutenberg.2 After constructing similarities between documents, we choose the Vector Space the LSA model with 200 latent dimensions, we computed the Model [Man08]. Under this setup, each document is represented similarity between Moby Dick, which is in the corpus used to by a vector. 
The similarity between two documents is estimated build the model, and 6 other documents (see the results in Table by the cosine similarity of the document vector representations. 1). The first document is an excerpt from Moby Dick, 393 words The first step in representing a piece of text as a vector is long. The second one is an excerpt from the Wikipedia Moby to build a bag of words model, where we count the occurrences Dick article. The third one is an excerpt, 185 words long, of The of each term in the document. These word frequencies become Call of the Wild. The remaining two documents are excerpts from the vector entries, and we denote the term frequency of term t Wikipedia articles not related to Moby Dick. The similarity values in document d by tft,d. Although this model ignores information we obtain validate the model, since we can see high values (above related to the order of the words, it is still powerful enough to 0.8) for the documents related to Moby Dick, and significantly produce meaningful results. smaller values for the remaining ones.
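The tf-idf weighting, SVD truncation, and projection steps above can be sketched with plain NumPy (the paper uses Gensim; this toy corpus and its vocabulary are invented for illustration):

```python
import numpy as np

# Toy corpus: terms are rows, documents are columns.
docs = ["whale sea ship", "whale ship captain", "dog sled snow", "dog snow cold"]
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# tf-idf weighting, idf_t = log(N / df_t) as in the text.
N = tf.shape[1]
df = (tf > 0).sum(axis=1)
C = tf * np.log(N / df)[:, None]          # M x N term-document matrix

# Rank-k LSA via the SVD: C_k = U_k Sigma_k V_k^T,
# and a document q is projected as q_k = Sigma_k^-1 U_k^T q.
k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def project(q):
    return (Uk.T @ q) / sk

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

Ck = (Uk * sk) @ Vt[:k]                   # the rank-k approximation of C
sim_01 = cosine(project(C[:, 0]), project(C[:, 1]))
```

Two properties make the sketch easy to check: projecting column j of C recovers the j-th column of V_k^T, and the Frobenius norm of C - C_k equals the root-sum-square of the discarded singular values.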

1. Currently there are codecs for Python, C/C++, Java, Lisp, MATLAB, and Go.
2. As per the April 20, 2011 list, http://www.gutenberg.org/browse/scores/top.

Text description                                 LSA similarity
Excerpt from Moby Dick                           0.87
Excerpt from Wikipedia Moby Dick article         0.83
Excerpt from The Call of the Wild                0.48
Excerpt from Wikipedia Jewish Calendar article   0.40
Excerpt from Wikipedia article                   0.33

TABLE 1: Similarity between Moby Dick and other documents.

Fig. 2: Graph for the "Army" article in the simple Wikipedia with 97 nodes and 99 edges. The seed article is in light blue. The size of the nodes (except for the seed node) is proportional to the similarity. In red are all the nodes with similarity greater than 0.5. We found two articles ("Defense" and "Weapon") similar to the seed three links ahead.

Next, we build the LSA model for Wikipedia that allows us to compute the similarity between Wikipedia articles. Although this is a lengthy process that takes more than 20 hours, once the model is built, a similarity computation is very fast (on the order of 10 milliseconds). Results in the next section make use of this model.

Note that although in principle it is simple to compute the
LSA model of a given corpus, the size of the datasets we are interested in makes doing this a significant challenge. The two main difficulties are that in general (i) we cannot hold the vector representation of the corpus in RAM, and (ii) we need to compute the SVD of a matrix whose size is beyond the limits of what standard solvers can handle. Here Gensim does stellar work by being able to handle both of these challenges.

Representing the State Space as a Graph

We are interested in the problem of gathering information in domains described by linked datasets. It is natural to describe such domains by graphs. We use the NetworkX library [Hag08] to build the graphs we work with. NetworkX provides data structures to represent different kinds of graphs (undirected, weighted, directed, etc.), together with implementations of many graph algorithms. NetworkX allows one to use any hashable Python object as a node identifier. Also, any Python object can be used as a node, edge, or graph attribute. We exploit this capability by using the LSA vector representation of a Wikipedia article, which is a NumPy array, as a node attribute.

The following code snippet shows a function³ used to build a directed graph where nodes represent Wikipedia articles, and the edges represent links between articles. Note that we compute the LSA representation of the article (line 11), and that this vector is used as a node attribute (line 13). The function obtains up to n_max articles by breadth-first crawling the Wikipedia, starting from the article defined by page.

 1 def crawl(page, n_max):
 2     G = nx.DiGraph()
 3     n = 0
 4     links = [(page, -1, None)]
 5     while n < n_max:
 6         link = links.pop()
 7         page = link[0]
 8         dist = link[1] + 1
 9         page_text = page.edit().encode('utf-8')
10         # LSI representation of page_text
11         v_lsi = get_lsi(page_text)
12         # Add node to the graph
13         G.add_node(page.name, v=v_lsi)
14         if link[2]:
15             source = link[2]
16             dest = page.name
17             if G.has_edge(source, dest):
18                 # Link already exists
19                 continue
20             else:
21                 sim = get_similarity(page_text)
22                 G.add_edge(source,
23                            dest,
24                            weight=sim,
25                            d=dist)
26         new_links = [(l, dist, page.name)
27                      for l in page.links()]
28         links = new_links + links
29         n += 1
30
31     return G

We now show the result of running the code above for two different setups. In the first instance we crawl the Simple English Wikipedia⁴ using "Army" as the seed article. We set the limit on the number of articles to visit to 100. The result is depicted in Fig. 2,⁵ where the node corresponding to the seed article is in light blue and the remaining nodes have a size proportional to the similarity with respect to the seed. Red nodes are the ones with similarity bigger than 0.5. We observe two nodes, "Defense" and "Weapon", with similarities 0.7 and 0.53 respectively, that are three links away from the seed.

In the second instance we crawl Wikipedia using the article "James Gleick"⁶ as seed. We set the limit on the number of articles to visit to 2000. We show the result in Fig. 3, where, as in the previous example, the node corresponding to the seed is in light blue and the remaining nodes have a size proportional to the similarity with respect to the seed. The eleven red nodes are the ones with similarity greater than 0.7. Of these, nine are more than one link away from the seed. We see that the article with the biggest similarity, with a value of 0.8, is about "Robert Wright (journalist)", and it is two links away from the seed (passing through the "Slate magazine" article). Robert Wright writes books about science, history, and religion. It is very reasonable to consider him an author similar to James Gleick.

Another place where graphs can play an important role in the RL problem is in finding basis functions to approximate the value function.

3. The parameter page is a mwclient page object. See http://sourceforge.net/apps/mediawiki/mwclient/.
4. The Simple English Wikipedia (http://simple.wikipedia.org) has articles written in simple English and has a much smaller number of articles than the standard Wikipedia. We use it because of its simplicity.
5. To generate this figure, we save the NetworkX graph in GEXF format, and create the diagram using Gephi (http://gephi.org/).
6. James Gleick is "an American author, journalist, and biographer, whose books explore the cultural ramifications of science and technology".
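A self-contained variant of the crawl function above shows the same NetworkX idioms (NumPy vectors as node attributes, similarity-weighted edges, breadth-first expansion). Here LINKS, VECS, get_lsi, and get_similarity are invented stand-ins for the Wikipedia link structure and the paper's LSA helpers:

```python
import networkx as nx
import numpy as np

# Hypothetical in-memory link structure standing in for Wikipedia.
LINKS = {
    "Army":    ["Weapon", "Soldier"],
    "Weapon":  ["Army", "Gun"],
    "Soldier": ["Army"],
    "Gun":     ["Weapon"],
}

# Random vectors standing in for the LSA representations.
rng = np.random.default_rng(0)
VECS = {name: rng.standard_normal(5) for name in LINKS}

def get_lsi(name):
    return VECS[name]

def get_similarity(seed_vec, name):
    """Cosine similarity between the seed vector and an article vector."""
    v = VECS[name]
    return float(seed_vec @ v / (np.linalg.norm(seed_vec) * np.linalg.norm(v)))

def crawl(seed, n_max):
    """Breadth-first crawl of LINKS, mirroring the paper's function."""
    G = nx.DiGraph()
    seed_vec = get_lsi(seed)
    links = [(seed, -1, None)]   # (article, distance of source, source)
    n = 0
    while n < n_max and links:
        page, d, source = links.pop()
        dist = d + 1
        G.add_node(page, v=get_lsi(page))   # LSA vector as node attribute
        if source is not None:
            if G.has_edge(source, page):
                continue                    # link already exists
            G.add_edge(source, page,
                       weight=get_similarity(seed_vec, page), d=dist)
        # Prepend newly discovered links; pop() from the end gives BFS.
        links = [(l, dist, page) for l in LINKS[page]] + links
        n += 1
    return G

G = crawl("Army", 20)
```

On this four-article toy graph the crawl terminates once every link has been visited, and every edge carries a weight attribute with the cosine similarity to the seed.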

Fig. 3: Graph obtained by crawling Wikipedia with "James Gleick" as the seed article (node labels omitted).
Mather Critical exponent Amandla (magazine) Beijing Review Stratford, Connecticut Ltd Westchester County Noseweek Schoharie County, New York MasatoshiPhilosophiae KoshibaNaturalis Principia Mathematica Administrative divisions of NewGoogle York Book Search Troy, New York Waterbury, Connecticut Great Irish Famine New York Stock Exchange Hearst Tower (New York City) Ketamine Shelton, Connecticut Rapid transit Tsung-Dao LeeFrits ZernikeBoseEinstein statistics Paul Krugman Heidelberg Music of New York City New Haven, Connecticut Ripponden Time Warner History of New York TheCity New York Times Barbara McClintock Tuva or Bust Hudson-Fulton Celebration Todt Hill Carnegie Hall Tampa Bay AreaPunk subculture New York Liberty Seoul United States Census Bureau List of people from New York New York Draft Riots Dongguan Christopher Hitchens Hectare The WeekForward Magazine Rome Long Island List of United States cities by population Bacteriophage lambda U.S. News& World Report United States East Williston, New York Islandia, New York Game design List of law enforcement agencies in Long Island Microsoft Excel Sports in New York City Timeline of New York City crimes and disasters New Amsterdam Charles II of England John C. Lilly Brooklyn Islip (hamlet), New York Columbus, Ohio Broadway (New York City) Forth magazine Tucson, Arizona Irish people 1860 United States Census South Africa Austan Goolsbee US Open (tennis) EcuadorSag Harbor, New York HBO Harper's Magazine Adobe Flash Latin Kings (gang) E. B. 
White Fifth Avenue (Manhattan) Seymour, Connecticut Dutchess County, New York Government of New York City West Haven, Connecticut New York Times Laurel Hollow, New York New York Harbor Feynman sprinkler Lake Success, New York Bruce ReedNew Orleans metropolitan area The American Prospect San Jose, California Pennsylvania Station (New York City) Prospect (magazine) Times Herald-Record Area code 718 The New Republic Miami, Florida San Antonio, Texas The Jerusalem Report County (United States) Forty Thieves (New York gang) Overland (literary journal) Boston, MassachusettsEdwin G. Burrows Investigate (magazine) JSTOR Wallingford, Connecticut Silicon Valley Shoreham, New York Nicholas Metropolis Genetics (journal) Cortland County, New York Secaucus, New Jersey FermilabClaude Cohen-TannoudjiPhilipp Lenard List of Nobel laureates in Physics Barclays Center Light cone Area code 347 List of cities with most skyscrapers The EconomistClifton, New Jersey Rudolf Mssbauer The New York Review of Books Chenango County, New York Engineering Robert R. Wilson Sabbatical Horst Ludwig Strmer UrbanSacramento, area California Superfluidity Rev. Mod. Phys. The Washington Post Queens Borough Public Library South Florida metropolitan area Port Authority Trans-Hudson One-electron Ernstuniverse Ruska Community of Madrid Poland Cincinnati Northern Kentucky metropolitan area Economy of New York City Robert B. Laughlin Burial Ridge1939 New York World's Fair Bandung Caltech European American Physical ReviewHuman computer Punk rock General American Pulitzer Prize Charter school Lattingtown, New York TWA Oswego County, New York There'sEliot Spitzer Plenty of Room at the BottomMetropolitan Opera Geography of Long Island Summit, New Jersey East Haven, Connecticut MagnumSwedish Photos Cottage Marionette Theatre Metro-North Railroad Award Simon van der FreemanMeer Dyson Harrison, New Jersey Finger Lakes WNYC Tech Valley Capital Ethiopia Wikipedia Honsh John A. 
Wheeler Phil Carter Flash VideoBASIC Parity (physics) Melvin SchwartzLeo Esaki African Americans Lake Grove, New York The Strokes Tel Aviv District 1800 United States Census Ann Hulbert WheelerFeynman absorber theoryJacques Attali Montreal Metro Norval White Thomas Curtright Feynman diagrams Far Rockaway, QueensJohannes Stark Jewish quota Education in New York City Coney Island 1970 United States Census Michael Steinberger Maclean's Guido Pontecorvo Macy's Thanksgiving Day Parade Robert Coleman Richardson List of metropolitan areas by population Tim Wu Office of Scientific and Technical InformationGerald M. Rosberg Far Rockaway High School Bangalore The Middle East (magazine) Google VideosBCS theory Physics Richard Chace Tolman Ultraviolet David Edelstein Peter Parnell El Tiempo LatinoNiels BohrWhat Do You Care What Other People Think? AtheismTuvaPieter ZeemanNASA Kaplan, Inc. John Gribbin The Monthly Amazon.com Village (magazine) Slate (magazine) WKMG-TV Hal S. Jones WJXT Post-Newsweek Stations Key West Literary Seminar The Star (Bangladesh) U. S. Department of Justice Boston Phoenix Dahlia Lithwick John Alderman (publishing) Ann L. 
McDaniel Fairfax County Times Jack Shafer FT Magazine The Washington Post Writers Group Newsline (magazine) Chaos theory James Gleick Atlantic Annie Lowrey Eric Leser Steven Landsburg The Fray (Internet forum) Salon.com PrivateSeth StevensonEye Washington Post Company New Zealand Listener Dhaka Courier KSAT-TV National ReviewStephen Jay Gould Sasha Frere-Jones Jonah Weiner Wilson Quarterly Linguistics William Saletan Robert Wright (journalist)Harvard College Jacob Weisberg David Plotz Ethiopian Review Andy Bowers Foreign AffairsThe Nation Human Events E!Sharp Dana Stevens (critic) The Atlantic The New Yorker WDIV-TV Adam Kirsch Podcast Dear Prudence (advice column)Virginia Heffernan Douglas Hofstadter Timothy Noah The Liberal The Herald (Everett) Asia Pacific Management Institute Farhad Manjoo WPLG The American Spectator BBC Focus on Africa David Helvarg Donald E. Graham Fareed Zakaria MSN Queue Magazine Literary Review of CanadaForeign Policy Greater Washington Publishing Ron Rosenbaum Anne Applebaum

Vijay Ravindran Tom VanderbiltNews magazine Nina Shen Rastogi Microsoft London Review of Books European Commission Authors Guild Newsweek EUobserver Daniel Radosh New African New York (magazine)Harvard Crimson Kaplan College Paaras National Book Award North& South (magazine) Franklin Foer Environment described by three rooms connected The Big Issue Fig. 4: 16 20 Policy Review Michael Kinsley Dissent (Australian magazine) Meghan O'Rourke Autodesk Forum (Bangladesh) The Drouth CableOne David Greenberg Standpoint (magazine) × Robert PinskyExpress (newspaper) Griffith Review SCORE! Educational Centers The Spectator The Washington Post Company Antitrust Emily Bazelon Chief Executive Officer The Bulletin (Brussels weekly) Concord Law School Southern Maryland Newspapers through the middle row. Ian Bremmer The PipelineTehelka The Gazette (Maryland) Minneapolis Maisonneuve (magazine) Washington Post This (Canadian magazine) Quadrant (magazine) National Public Radio The Weekly Standard Time (magazine) John Dickerson (journalist) Variant (magazine) Canadian Dimension Eliot Porter National Observer (Australia) Jody Rosen USA MagazineIndependent station (North America)

Fig. 3: Graph for the "James Gleick" Wikipedia article with 1975 nodes and 1999 edges. The seed article is in light blue. The size of the nodes (except for the seed node) is proportional to the similarity. In red are all the nodes with similarity bigger than 0.7. There are several articles with high similarity more than one link ahead.
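The similarity score the caption refers to is, for LSA vectors, conventionally the cosine similarity; the following minimal NumPy sketch (our own helper, not code from the paper) shows how such scores between article vectors are computed:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two vectors in the LSA topic space."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical 200-dimensional LSA vectors for two articles.
rng = np.random.default_rng(42)
seed_vec = rng.standard_normal(200)
other_vec = rng.standard_normal(200)

# Identical articles score 1.0; unrelated ones score near 0.
assert np.isclose(cosine_similarity(seed_vec, seed_vec), 1.0)
assert -1.0 <= cosine_similarity(seed_vec, other_vec) <= 1.0
```

A node would be drawn in red whenever this score against the seed article's vector exceeds 0.7.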

The value-function is the function V^π : X → R defined as

    V^π(x) = E[ ∑_{t=1}^∞ γ^t r(x_t) | x_0 = x, a_t = π(x_t) ],

and plays a key role in many RL algorithms [Sze10]. When the dimension of X is significant, it is common to approximate V^π(x) by

    V^π ≈ V̂ = Φw,

where Φ is an n-by-k matrix whose columns are the basis functions used to approximate the value-function, n is the number of states, and w is a vector of dimension k. Typically, the basis functions are selected by hand, for example, by using polynomials or radial basis functions. Since choosing the right functions can be difficult, Mahadevan and Maggioni [Mah07] proposed a framework where these basis functions are learned from the topology of the state space. The key idea is to represent the state space by a graph and to use the k smoothest eigenvectors of the graph laplacian, dubbed Proto-value functions, as basis functions. Given the graph that represents the state space, it is very simple to find these basis functions. As an example, consider an environment consisting of three 16 × 20 grid-like rooms connected in the middle, as shown in Fig. 4. Assuming the graph is stored in G, the following code⁷ computes the eigenvectors of the laplacian:

L = nx.laplacian(G, sorted(G.nodes()))
evalues, evec = np.linalg.eigh(L)

Figure 5 shows⁸ the second to fourth eigenvectors. Since in general value-functions associated to this environment will exhibit a fast change rate close to the rooms' boundaries, these eigenvectors provide an efficient approximation basis.

Fig. 5: Second to fourth eigenvectors of the laplacian of the three rooms graph. Note how the eigendecomposition automatically captures the structure of the environment.

Visualizing the LSA Space

We believe that being able to work in a vector space will allow us to use a series of RL techniques that would otherwise not be available to us. For example, when using Proto-value functions, it is possible to use the Nyström approximation to estimate the value of an eigenvector for out-of-sample states [Mah06]; this is only possible if states can be represented as points belonging to a Euclidean space.

How can we embed an entity in Euclidean space? In the previous section we showed that LSA can effectively compute the similarity between documents. We can take this concept one step forward and use LSA not only for computing similarities, but also for embedding documents in Euclidean space.

To evaluate the soundness of this idea, we perform an exploratory analysis of the Simple English Wikipedia LSA space. In order to be able to visualize the vectors, we use ISOMAP [Ten00] to reduce the dimension of the LSA vectors from 200 to 3 (we use the ISOMAP implementation provided by scikit-learn [Ped11]). We show a typical result in Fig. 6, where each point represents the LSA embedding of an article in R³, and a line between two points represents a link between two articles. We can see how the points close to the "Water" article are, in effect, semantically related ("Fresh water", "Lake", "Snow", etc.).

7. We assume that the standard import numpy as np and import networkx as nx statements were previously executed.
8. The eigenvectors are reshaped from vectors of dimension 3 × 16 × 20 = 960 to a matrix of size 16-by-60. To get meaningful results, it is necessary to build the laplacian using the nodes in the grid in a row-major order. This is why the nx.laplacian function is called with sorted(G.nodes()) as the second parameter.
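The laplacian eigenvector computation above can also be sketched without networkx; the grid_laplacian helper below is ours (not the paper's code) and builds L = D - A for a single rows-by-cols room in the row-major node order the paper requires, then recovers the smoothest eigenvectors with np.linalg.eigh:

```python
import numpy as np

def grid_laplacian(rows, cols):
    """Combinatorial laplacian L = D - A of a rows-by-cols grid graph.

    Nodes are numbered row-major, matching the sorted(G.nodes())
    ordering used when calling nx.laplacian in the text.
    """
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:               # edge to the right neighbor
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < rows:               # edge to the neighbor below
                A[i, i + cols] = A[i + cols, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

L = grid_laplacian(4, 5)
evalues, evec = np.linalg.eigh(L)  # eigenvalues in ascending order

# A connected graph's laplacian has a single zero eigenvalue with a
# constant eigenvector; columns 1, 2, ... of evec are the smoothest
# nontrivial eigenvectors -- the proto-value basis functions.
assert abs(evalues[0]) < 1e-8 and evalues[1] > 1e-8
```

Stacking three such rooms and adding the connecting edges would reproduce the environment of Fig. 4; the eigendecomposition step is identical.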

This result confirms that the LSA representation is not only useful for computing similarities between documents, but is also an effective mechanism for embedding the information entities into a Euclidean space. This result encourages us to propose the use of the LSA representation in the definition of the state. Once again we emphasize that since Gensim vectors are NumPy arrays, we can use its output as an input to scikit-learn without any effort.

Fig. 6: ISOMAP projection of the LSA space. Each point represents the LSA vector of a Simple English Wikipedia article projected onto R³ using ISOMAP. A line is added if there is a link between the corresponding articles. The figure shows a close-up around the "Water" article. We can observe that this point is close to points associated with articles of similar semantics.

Conclusions

We have presented an example where we use different elements of the scientific Python ecosystem to solve a research problem. Since we use libraries where NumPy arrays are the standard vector/matrix format, the integration among these components is transparent. We believe that this work is a good success story that validates Python as a viable scientific programming language.

Our work shows that in many cases it is advantageous to use general-purpose languages, like Python, for scientific computing. Although some computational parts of this work might be somewhat simpler to implement in a domain specific language,⁹ the breadth of tasks that we work with could make it hard to integrate all of the parts using a domain specific language.

Acknowledgment

This work was partially supported by AFOSR grant FA9550-09-1-0465.

REFERENCES

[Tan09] B. Tanner and A. White. RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments. Journal of Machine Learning Research, 10(Sep):2133-2136, 2009.
[Hag08] A. Hagberg, D. Schult and P. Swart. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy 2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds), Pasadena, CA, pp. 11-15, Aug 2008.
[Ped11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[Reh10] R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, May 2010.
[Sze10] C. Szepesvári. Algorithms for Reinforcement Learning. San Rafael, CA, Morgan and Claypool Publishers, 2010.
[Sut98] R.S. Sutton and A.G. Barto. Reinforcement Learning. Cambridge, Massachusetts, The MIT Press, 1998.
[Mah07] S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169-2231, 2007.
[Man08] C.D. Manning, P. Raghavan and H. Schütze. An Introduction to Information Retrieval. Cambridge, England, Cambridge University Press, 2008.
[Ten00] J.B. Tenenbaum, V. de Silva and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319-2323, 2000.
[Mah06] S. Mahadevan, M. Maggioni, K. Ferguson and S. Osentoski. Learning representation and control in continuous Markov decision processes. National Conference on Artificial Intelligence, 2006.
[Dee90] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407, 1990.

9. Examples of such languages are MATLAB, Octave, SciLab, etc.

16 PROC. OF THE 11th PYTHON IN SCIENCE CONF. (SCIPY 2012)

Total Recall: flmake and the Quest for Reproducibility

Anthony Scopatz‡∗


Abstract—FLASH is a high-performance computing (HPC) multi-physics code which is used to perform astrophysical and high-energy density physics simulations. To run a FLASH simulation, the user must go through three basic steps: setup, build, and execution. Canonically, each of these tasks is independently handled by the user. However, with the recent advent of flmake - a Python workflow management utility for FLASH - such tasks may now be performed in a fully reproducible way. To achieve such reproducibility a number of developments and abstractions were needed, some only enabled by Python. These methods are widely applicable outside of FLASH. The process of writing flmake opens many questions as to what precisely is meant by reproducibility in computational science. While posed here, many of these questions remain unanswered.

Index Terms—FLASH, reproducibility

* Corresponding author: scopatz@flash.uchicago.edu
‡ The FLASH Center for Computational Science,

Copyright © 2012 Anthony Scopatz. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

FLASH is a high-performance computing (HPC) multi-physics code which is used to perform astrophysical and high-energy density physics simulations [FLASH]. It runs on the full range of systems from laptops to workstations to 100,000-processor supercomputers, such as the Blue Gene/P at Argonne National Laboratory.

Historically, FLASH was born from a collection of unconnected legacy codes written primarily in Fortran and merged into a single project. Over the past 13 years major sections have been rewritten in other languages. For instance, I/O is now implemented in C. However building, testing, and documentation are all performed in Python.

FLASH has a unique architecture which compiles simulation-specific executables for each new type of run. This is aided by an object-oriented-esque inheritance model that is implemented by inspecting the file system directory tree. This allows FLASH to compile to faster machine code than a compile-once strategy. However it also places a greater importance on the Python build system.

To run a FLASH simulation, the user must go through three basic steps: setup, build, and execution. Canonically, each of these tasks is independently handled by the user. However with the recent advent of flmake - a Python workflow management utility for FLASH - such tasks may now be performed in a repeatable way [FLMAKE].

Previous workflow management tools have been written for FLASH. (For example, the "Milad system" was implemented entirely in Makefiles.) However, none of the prior attempts have placed reproducibility as their primary concern. This is in part because fully capturing the setup metadata required alterations to the build system.

The development of flmake started by rewriting the existing build system to allow FLASH to be run outside of the mainline subversion repository. It separates out a project (or simulation) directory independent of the FLASH source directory. This directory is typically under its own version control.

For each of the important tasks (setup, build, run, etc.), a sidecar metadata description file is either initialized or modified. This is a simple dictionary-of-dictionaries JSON file which stores the environment of the system and the state of the code when each flmake command is run. This metadata includes the version information of both the FLASH mainline and project repositories. However, it also may include all local modifications since the last commit. A patch is automatically generated using standard POSIX utilities and stored directly in the description.

Along with universally unique identifiers, logging, and Python run control files, the flmake utility may use the description files to fully reproduce a simulation by re-executing each command in its original state. While flmake reproduce makes a useful debugging tool, it fundamentally increases the scientific merit of FLASH simulations.

The methods described herein may be used whenever source code itself is distributed. While this is true for FLASH (uncommon amongst compiled codes), most Python packages also distribute their source. Therefore the same reproducibility strategy is applicable and highly recommended for Python simulation codes. Thus flmake shows that reproducibility - which is notably absent from most computational science projects - is easily attainable using only version control, Python standard library modules, and ever-present command line utilities.

New Workflow Features

As with many predictive science codes, managing FLASH simulations may be a tedious task for both new and experienced users. The flmake command line utility eases the simulation burden and shortens the development cycle by providing a modular tool which implements many common elements of a FLASH workflow. At each stage this tool captures necessary metadata about the task which it is performing. Thus flmake encapsulates the following operations:

• setup/configuration,
• building,
• execution,
• logging,

• analysis & post-processing,
• and others.

It is highly recommended that both novice and advanced users adopt flmake as it enables reproducible research while simultaneously making FLASH easier to use. This is accomplished by a few key abstractions from previous mechanisms used to set up, build, and execute FLASH. The implementation of these abstractions are critical flmake features and are discussed below. Namely they are the separation of project directories, a searchable source path, logging, dynamic run control, and persisted metadata descriptions.

Independent Project Directories

Without flmake, FLASH must be set up and built from within the FLASH source directory (FLASH_SRC_DIR) using the setup script and make [GMAKE]. While this is sufficient for single runs, such a strategy fails to separate projects and simulation campaigns from the source code. Moreover, keeping simulations next to the source makes it difficult to track local modifications independent of the mainline code development.

Because of these difficulties in running suites of simulations from within FLASH_SRC_DIR, flmake is intended to be run external to the FLASH source directory. This is known as the project directory. The project directory should be managed by its own version control system. By doing so, all of the project-specific files are encapsulated in a repository whose history is independent from the main FLASH source. Here this directory is called proj/ though in practice it takes the name of the simulation campaign. This directory may be located anywhere on the user's file system.

Source & Project Paths Searching

After creating a project directory, the simulation source files must be assembled using the flmake setup command. This is analogous to executing the traditional setup script. For example, to run the classic Sedov problem:

~/proj $ flmake setup Sedov -auto
[snip]
SUCCESS
~/proj $ ls
flash_desc.json setup/

This command creates symbolic links to the FLASH source files in the setup/ directory. Using the normal FLASH setup script, all of these files must live within ${FLASH_SRC_DIR}/source/. However, the flmake setup command searches additional paths to find potential source files. By default if there is a local source/ directory in the project directory then this is searched first for any potential FLASH units. The structure of this directory mirrors the layout found in ${FLASH_SRC_DIR}/source/. Thus if the user wanted to write a new or override an existing driver unit, they could place all of the relevant files in ~/proj/source/Driver/. Units found in the project source directory take precedence over units with the same name in the FLASH source directory.

The most commonly overridden units, however, are simulations. Yet specific simulations live somewhat deep in the file system hierarchy as they reside within source/Simulation/SimulationMain/. To make accessing simulations easier, a local project simulations/ directory is searched first for any possible simulations. Thus simulations/ effectively aliases source/Simulation/SimulationMain/. Continuing with the previous Sedov example, the following directories are searched in order of precedence for simulation units, if they exist:

1) ~/proj/simulations/Sedov/
2) ~/proj/source/Simulation/SimulationMain/Sedov/
3) ${FLASH_SRC_DIR}/source/Simulation/SimulationMain/Sedov/

Therefore, it is common for a project directory to have the following structure if the project requires many modifications to FLASH that are - at least in the short term - inappropriate for mainline inclusion:

~/proj $ ls
flash_desc.json setup/ simulations/ source/

Logging

In many ways computational simulation is more akin to experimental science than theoretical science. Simulations are executed to test the system at hand in analogy to how physical experiments probe the natural world. Therefore, it is useful for computational scientists to adopt the time-tested strategy of keeping a lab notebook, or its electronic analogy.

Various examples of virtual lab notebooks exist [VLABNB] as a way of storing information about how an experiment was conducted. The resultant data is often stored in conjunction with the notebook. Arguably the corollary concept in software development is logging. Unfortunately, most simulation science makes use of neither lab notebooks nor logging. Rather than using an external rich- or web-client, flmake makes use of the built-in Python logger.

Every flmake command has the ability to log a message. This follows the -m convention from version control systems. These messages and associated metadata are stored in a flash.log file in the project directory.

Not every command uses logging; for trivial commands which do not change state (such as listing or diffing) log entries are not needed. However for more serious commands (such as building) logging is a critical component. Understanding that many users cannot be bothered to create meaningful log messages at each step, sensible default messages are automatically generated. Still, it is highly recommended that the user provide more detailed messages as needed. E.g.:

~/proj $ flmake -m "Run with 600 J laser" run -n 10

The flmake log command may then be used to display past log messages:

~/proj $ flmake log -n1
Run id: b2907415
Run dir: run-b2907415
Command: run
User: scopatz
Date: Mon Mar 26 14:20:46 2012
Log id: 6b9e1a0f-cfdc-418f-8c50-87f66a63ca82

Run with 600 J laser
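The logging machinery described above can be approximated with the standard library alone; the sketch below is a hypothetical reconstruction of the pattern (uuid log ids, automatic user capture, a flash.log file), not flmake's actual implementation:

```python
import getpass
import json
import logging
import uuid

def log_command(message, command, logfile="flash.log"):
    """Append a log entry, with metadata, to a flash.log-style file.

    The fields mirror those shown by `flmake log` (user, command,
    log id); the JSON-lines format here is illustrative only.
    """
    try:
        user = getpass.getuser()
    except Exception:
        user = "unknown"
    entry = {
        "log_id": str(uuid.uuid4()),
        "user": user,
        "command": command,
        "message": message,
    }
    logger = logging.getLogger("flash")
    if not logger.handlers:                 # configure the file sink once
        logger.addHandler(logging.FileHandler(logfile))
        logger.setLevel(logging.INFO)
    logger.info(json.dumps(entry))
    return entry

entry = log_command("Run with 600 J laser", "run")
assert entry["command"] == "run"
```

Because the built-in logger handles formatting and file handling, a tool like this gets durable, timestampable entries essentially for free.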

The flash.log file should be added to the version control of the project. Entries in this file are not typically deleted.

Dynamic Run Control

Many aspects of FLASH are declared in a static way. Such declarations happen mainly at setup and runtime. For certain build and run operations several parameters may need to be altered in a consistent way to actually have the desired effect. Such repetition can become tedious and usually leads to less readable inputs. To make the user input more concise and expressive, flmake introduces a run control flashrc.py file in the project directory. This is a Python module which is executed, if it exists, in an empty namespace whenever flmake is called. The flmake commands may then choose to access specific data in this file. Please refer to individual command documentation for an explanation of whether and how the run control file is used.

The most important example of using flashrc.py is that the run and restart commands will update the flash.par file with values from a parameters dictionary (or a function which returns a dictionary).

Initial flash.par

order = 3
slopeLimiter = "minmod"
charLimiting = .true.
RiemannSolver = "hll"

Run Control flashrc.py

parameters = {"slopeLimiter": "mc",
              "use_flattening": False}

Final flash.par

RiemannSolver = "hll"
charLimiting = .true.
order = 3
slopeLimiter = "mc"
use_flattening = .false.

Description Sidecar Files

As a final step, the setup command generates a flash_desc.json file in the project directory. This is the description file for the FLASH simulation which is currently being worked on. This description is a sidecar file whose purpose is to store the following metadata at the execution of each flmake command:

• the environment,
• the version of both project and FLASH source repositories,
• local source code modifications (diffs),
• the run control files (see above),
• run ids and history,
• and FLASH binary modification times.

Thus the description file is meant to be a full picture of the way the FLASH code was generated, compiled, and executed. Total reproducibility of a FLASH simulation is based on having a well-formed description file.

The contents of this file are essentially a persisted dictionary which contains the information listed above. The top level keys include setup, build, run, and merge. Each of these keys gets added when the corresponding flmake command is called. Note that restart alters the run value and does not generate its own top-level key.

During setup and build, flash_desc.json is modified in the project directory. However, each run receives a copy of this file in the run directory with the run information added. Restarts and merges inherit from the file in the previous run directory.

These sidecar files enable the flmake reproduce command, which is capable of recreating a FLASH simulation from only the flash_desc.json file and the associated source and project repositories. This is useful for testing and verification of the same simulation across multiple different machines and platforms. It is generally not recommended that users place this file under version control as it may change often and significantly.

Example Workflow

The fundamental flmake abstractions have now been explained above. A typical flmake workflow which sets up, builds, runs, restarts, and merges a fork of a Sedov simulation is now demonstrated. First, construct the project repository:

~ $ mkdir my_sedov
~ $ cd my_sedov/
~/my_sedov $ mkdir simulations/
~/my_sedov $ cp -r ${FLASH_SRC_DIR}/source/\
    Simulation/SimulationMain/Sedov simulations/
~/my_sedov $ # edit the simulation
~/my_sedov $ nano simulations/Sedov/\
    Simulation_init.F90
~/my_sedov $ git init .
~/my_sedov $ git add .
~/my_sedov $ git commit -m "My Sedov project"

Next, create and run the simulation:

~/my_sedov $ flmake setup -auto Sedov
~/my_sedov $ flmake build -j 20
~/my_sedov $ flmake -m "First run of my Sedov" \
    run -n 10
~/my_sedov $ flmake -m "Oops, it died." restart \
    run-5a4f619e/ -n 10
~/my_sedov $ flmake -m "Merging my first run." \
    merge run-fc6c9029 first_run
~/my_sedov $ flmake clean 1

Why Reproducibility is Important

True to its part of speech, much of "scientific computing" has the trappings of science in that it is code produced to solve problems in (big-"S") Science. However, the process by which said programs are written is not typically itself subject to the rigors of the scientific method. The vaunted method contains components of prediction, experimentation, duplication, analysis, and openness [GODFREY-SMITH]. While software engineers often engage in such activities when programming, scientific developers usually forgo these methods, often to their detriment [WILSON].

Whatever the reason for this may be - ignorance, sloth, or other deadly sins - the impetus for adopting modern software development practices only increases every year. The evolution of tools such as version control and environment-capturing mechanisms (such as virtual machines/hypervisors) enables researchers to more easily retain information about software during and after production. Furthermore, the apparent end of silicon-based Moore's Law has necessitated a move to more exotic architectures and increased parallelism to see further speed increases [MIMS]. This implies that code that runs on machines now may not be able to run on future processors without significant refactoring.

Therefore the scientific computing landscape is such that there are presently both the tools and the need to have fully reproducible simulations. However, most scientists choose not to utilize these technologies. This is akin to a chemist not keeping a lab notebook. The lack of reproducibility means that many solutions to science problems garnered through computational means are relegated to the realm of technical achievements. Irreproducible results may be novel and interesting but they are not science. Unlike the current paradigm of computing-about-science, or periscientific computing, reproducibility is a keystone of diacomputational science (computing-throughout-science).

In periscientific computing there may exist a partition between expert software developers and expert scientists, each of whom must learn to partially speak the language of the other camp. Alternatively, when expert software engineers are not available, canonically ill-equipped scientists perform only the bare minimum development to solve computing problems. On the other hand, in diacomputational science, software exists as a substrate on top of which science and engineering problems are solved. Whether theoretical, simulation, or experimental problems are at hand, the scientist has a working knowledge of the computing tools available and an understanding of how to use them responsibly. While the level of education required for diacomputational science may seem extreme in a constantly changing software ecosystem, this is in fact no greater than what is currently expected from scientists with regard to statistics [WILSON].

With the extreme cases illustrated above, there are some notable exceptions. The first is that there are researchers who are cognizant and respectful of these reproducibility issues. The efforts of these scientists help paint a less dire picture than the one framed here. The second exception is that while reproducibility is a key feature of fundamental science it is not the only one.

The flmake reproduce command re-executes the setup, build, and run commands originally executed. It has the following usage string:

flmake reproduce [options]

For each command, reproduction works by cloning both the source and project repositories, at the point in history when they were run, into temporary directories. Then any local modifications which were present (and not under version control) are loaded from the description file and applied to the cloned repository. It then copies the original run control file to the cloned repositories and performs any command-specific modifications needed. Finally, it executes the appropriate command from the cloned repository using the original arguments provided on the command line. Figure 1 presents a flow sheet of this process.

Fig. 1: The reproduce command workflow.

Thus flmake reproduce recreates the original simulation using the original commands (and not the versions currently present). The reproduce command has the following limitations:

1) Source directory must be version controlled,
2) Project directory must be version controlled,
3) The FLASH run must depend on only the runtime param-
For example, eters file, the FLASH executable and FLASH datafiles, openness is another point whereby the statement "If a result is 4) and the FLASH executable must not be modified between not produced openly then it is not science" holds. Open access to build and run steps. results - itself is a hotly contested issue [VRIEZE] - is certainly a component of computational science. Though having open and available code is likely critical for pure science, it often lies outside The above restrictions enforce that the run is not considered the scope of normal research practice. This is for a variety of reproducible if at any point FLASH depends on externalities or reasons, including the fear that releasing code too early or at all alterations not tracked by version control. Critical to this process will negatively impact personal publication records. Tackling the are version control abstractions and the capability to execute openness issue must be left for another paper. historical commands. These will be discussed in the following In summary, reproducibility is important because without it subsections. any results generated are periscientific. To achieve diacompu- tational science there exist computational tools to aid in this Meta-Version Control endeavor, as in analogue science there are physical solutions. Every user and developer tends towards one version control system Though it is not the only criticism to be levied against modern or another. The mainline FLASH development team operates in research practices, irreproducibility is one that affects computation subversion [SVN] though individual developers may prefer git acutely and uniquely as compared to other spheres. [GIT] or mercurial [HG]. As mentioned previously, some users do not employ any source control management software. 
In the case where the user lacks a sophisticated version control The Reproduce Command system, it is still possible to obtain reproducibility if a clean The flmake reproduce command is the key feature enabling directory tree of a recent release is available. This clean tree must the total reproducibility of a FLASH simulation. This takes a be stored in a known place, typically the .clean/ subdirectory of description file (e.g. flash_desc.json) and implicitly the the FLASH_SRC_DIR. This is known as the ‘release’ versioning FLASH source and project repositories and replays the setup, system and is managed entirely by flmake. 20 PROC. OF THE 11th PYTHON IN SCIENCE CONF. (SCIPY 2012)

To realize reproducibility in this environment, it is neces- every relevant local change to the repository prior to running a sary for the reproduce command to abstract core source control simulation. What is more is that the flmake implementation of management features away from the underlying technology (or these abstractions is only a handful of functions. These total less absence of technology). For flmake, the following operations than 225 lines of code in Python. Though small, this capability is define version control in the context of reproducibility: vital to the reproduce command functioning as intended.

info, • checkout or clone, Command Time Machine • diff, • Another principal feature of flmake reproducibility is its ability to and patch. • execute historical versions of the key commands (setup, build, and run) as reincarnated by the meta-version control. This is akin to The info operation provides version control information that the bootstrapping problem whereby all of the instruction needed points to the current state of the repository. For all source control to reproduce a command are contained in the original information management schemes this includes a unique string id for the provided. Without this capability, the most current versions of the versioning type (e.g. ‘svn’ for subversion). In centralized version flmake commands would be acting on historical versions of the control this contains the repository version number, while for for repository. While such a situation would be a large leap forward distributed systems info will return the branch name and the hash for the reproducibility of FLASH simulations, it falls well short of the current HEAD. In the release system, info simply returns of total reproducibility. In practice, therefore, historical flmake the release version number. The info data that is found is then commands acting on historical source are needed. This maybe be stored in the description file for later use. termed the ‘command time machine,’ though it only travels into The checkout (or sometimes clone) operation is effectively the the past. inverse operation to info. This operation takes a point in history, as The implementation of the command time machine requires described by the data garnered from info, and makes a temporary the highly dynamic nature of Python, a bit of namespace slight- copy of the whole repository at this point. Thus no matter what of-hand, and relative imports. 
First note that module and package evolution the code has undergone since the description file was which are executing the flmake reproduce command may not be written, checkout rolls back the source to its previous incarnation. deleted from the sys.modules cache. (Such a removal would For centralized version control this operation copies the existing cause sudden and terrifying runtime failures.) This effectively tree, reverts it to a clean working tree of HEAD, and performs means that everything under the flash package name may not a reverse merge on all commits from HEAD to the historical be modified. target. For distributed systems this clones the current repository, Nominally, the historical version of the package would be checkouts or updates to the historical position, and does a hard under the flash namespace as well. However, the name flash reset to clean extraneous files. The release system is easiest in that is only given at install time. Inside of the source directory, checkout simply copies over the clean subdirectory. This operation the package is located in tools/python/. This allows the is performed for the setup, build, and run commands at reproduce current reproduce command to add the checked out and patched time. {temp-flash-src-dir}/tools/ directory to the front of The diff operation may seem less than fundamental to version sys.path for setup, build, and run. Then the historical flmake control. Here however, diff is used to capture local modifications may be imported via python.flmake because python/ is a to the working trees of the source and project directories. This subdirectory of {temp-flash-src-dir}/tools/. diffing is in place as a fail-safe against uncommitted changes. For centralized and distributed systems, diffing is performed Modules inside of python or flmake are guaranteed to im- through the selfsame command name. 
In the release system (where port other modules in their own package because of the exclusive committing is impossible), diffing takes on the heavy lifting not use of relative imports. This ensures that the old commands import provided by a more advanced system. Thus for the release system old commands rather then mistakenly importing newer iterations. diff is performed via the posix diff tool with the recursive Once the historical command is obtained, it is ex- switch between the FLASH_SRC_DIR and the clean copy. The ecuted with the original arguments from the descrip- diff operation is executed when the commands are originally run. tion file. After execution, the temporary source direc- The resultant diff string is stored in the description file, along with tory {temp-flash-src-dir}/tools/ is removed from the corresponding info. sys.path. Furthermore, any module whose name starts with The inverse operation to diff is patch. This is used at reproduce python is also deleted from sys.modules. This cleans the time after checkout to restore the working trees of the temporary environment for the next historical command to be run in its own repositories to the same state they were in at the original execution temporal context. of setup, build, and run. While each source control management In effect, the current version of flmake is located in the system has its own patching mechanism, the output of diff always flmake namespace and should remain untouched while the re- returns a string which is compatible with the posix patch utility. produce command is running. Simultaneously, the historic flmake Therefore, for all systems the patch program is used. commands are given the namespace python. The time value of The above illustrates how version control abstraction may python changes with each command reproduced but is fully in- be used to define a set of meta-operations which capture all dependent from the current flmake code. 
This method of renaming versioning information provided. This even included the case a package namespace on the file system allows for one version of where no formal version control system was used. It also covers flmake to supervise the execution of another in a manner relevant the case of the ‘forgetful’ user who may not have committed to reproducibility. TOTAL RECALL: FLMAKE AND THE QUEST FOR REPRODUCIBILITY 21
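The 'release' versioning scheme described above can be illustrated with a short sketch. The class below is a toy model, not flmake's actual implementation: it emulates the info, checkout, and diff operations on top of a pristine .clean/ tree, using Python's difflib in place of the recursive posix diff.

```python
import difflib
import os
import shutil

class ReleaseVCS:
    """Toy model of the 'release' versioning scheme: a pristine source
    tree lives in .clean/ under the source directory, and version-control
    operations are emulated on top of it.  (Illustrative sketch only,
    not flmake's code.)"""

    def __init__(self, src_dir, version):
        self.src_dir = src_dir
        self.clean = os.path.join(src_dir, ".clean")
        self.version = version

    def info(self):
        # The release scheme knows only its release version number.
        return {"vcs": "release", "version": self.version}

    def checkout(self, target_dir):
        # 'Rolling back' history is just copying the pristine tree.
        shutil.copytree(self.clean, target_dir)

    def diff(self, relpath):
        # Unified diff of a working file against its pristine counterpart,
        # standing in for a recursive diff of the two trees.  The result
        # is a patch-compatible string, as the text requires.
        with open(os.path.join(self.clean, relpath)) as f:
            old = f.readlines()
        with open(os.path.join(self.src_dir, relpath)) as f:
            new = f.readlines()
        return "".join(difflib.unified_diff(old, new, relpath, relpath))
```

Applying a patch built from diff()'s output after checkout() would restore local modifications, mirroring the reproduce-time flow described above.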

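The namespace sleight-of-hand behind the command time machine can also be sketched. The helper below is a toy, not flmake's code, and its names (run_historical, and the package/command arguments) are hypothetical: it imports a command from a checked-out historical tree, runs it, and then scrubs sys.path and sys.modules so the next reproduced command gets its own temporal context.

```python
import importlib
import sys

def run_historical(tools_dir, package, command, *args):
    """Execute `package.command` from a historical source tree rooted at
    `tools_dir`, then clean up so later checkouts re-import fresh code.
    Toy sketch of the pattern described in the text."""
    sys.path.insert(0, tools_dir)
    importlib.invalidate_caches()
    try:
        mod = importlib.import_module(package)
        return getattr(mod, command)(*args)
    finally:
        # Remove the temporary tree from the search path ...
        sys.path.remove(tools_dir)
        # ... and purge its modules so the next historical command is
        # imported from its own checkout rather than this cached one.
        top = package.split(".")[0]
        stale = [n for n in sys.modules
                 if n == top or n.startswith(top + ".")]
        for name in stale:
            del sys.modules[name]
```

The finally-block cleanup corresponds to the text's requirement that modules named under the historical namespace be deleted from sys.modules after each reproduced command.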
A Note on Replication

A weaker form of reproducibility is known as replication [SCHMIDT]. Replication is the process of recreating a result when "you take all the same data and all the same tools" [GRAHAM] which were used in the original determination. Replication is a weaker determination than reproduction because, at minimum, the original scientist should be able to replicate their own work. Without replication, the same code executed twice will produce distinct results. In this case no trust may be placed in the conclusions whatsoever.

Much as version control has given developers greater control over reproducibility, other modern tools are powerful instruments of replicability. Foremost among these are hypervisors. The ease of use and ubiquity of virtual machines (VMs) in the software ecosystem allows for the total capture and persistence of the environment in which any computation was performed. Such environments may be hosted and shared with collaborators, editors, reviewers, or the public at large. If the original analysis was performed in a VM context, shared, and rerun by other scientists, then this is replicability. Such a strategy has been proposed by C. T. Brown as a stop-gap measure until diacomputational science is realized [BROWN].

However, as Brown admits (see comments), the delineation between replication and reproduction is fuzzy. Consider these questions which have no clear answers:

- Are bit-identical results needed for replication?
- How much of the environment must be reinstated for replication versus reproduction?
- How much of the hardware and software stack must be recreated?
- What precisely is meant by 'the environment' and how large is it?
- For codes depending on stochastic processes, is reusing the same random seed replication or reproduction?

Without justifiable answers to the above, ad hoc definitions have governed the use of replicability and reproducibility. Yet to the quantitatively minded, an I-know-reproducibility-when-I-see-it approach falls short. Thus the science of science, at least in the computational sphere, has much work remaining.

Even with the reproduction/replication dilemma, the flmake reproduce command is a reproducibility tool. This is because it takes the opposite approach to Brown's VM-based replication. Though the environment is captured within the description file, flmake reproduce does not attempt to recreate this original environment at all. The previous environment information is simply there for posterity, helping to uncover any discrepancies which may arise. User-specific settings on the reproducing machine are maintained. This includes, but is not limited to, which compiler is used.

The claim that Brown's work and flmake reproduce represent paragons of replicability and reproducibility, respectively, may be easily challenged. The author, like Brown himself, does not presuppose to have all - or even partially satisfactory - answers. What is presented here is an attempt to frame the discussion and bound the option space of possible meanings for these terms. Doing so with concrete code examples is preferable to debating this issue in the abstract.

Conclusions & Future Work

By capturing source code and the environment at key stages - setup, build, and run - FLASH simulations may be fully reproduced in the future. Doing so required a wrapper utility called flmake. The writing of this tool involved an overhaul of the existing system. Though portions of flmake took inspiration from previous systems, none were as comprehensive. Additionally, to the author's knowledge, no previous system included a mechanism to non-destructively execute previous command incarnations similar to flmake reproduce.

The creation of flmake itself was done as an exercise in reproducibility. What has been shown here is that it is indeed possible to increase the merit of simulation science through a relatively small, though thoughtful, amount of code. It is highly encouraged that the methods described here be copied by other software-in-science projects.*

Moreover, in the process of determining what flmake should be, several fundamental questions about reproducibility itself were raised. These point to systemic issues within the realm of computational science. With the increasing importance of computing, soon science as a whole will also be forced to reconcile these reproducibility concerns. Unfortunately, there does not appear to be an obvious and present solution to the problems posed.

As with any software development project, there are further improvements and expansions that will continue to be added to flmake. More broadly, the questions posed by reproducibility will be the subject of future work on this project and others. Additional issues (such as openness) will also figure into subsequent attempts to bring about a global state of diacomputational science.

* Please contact the author if you require aid in any reproducibility endeavours.

Acknowledgements

Dr. Milad Fatenejad provided a superb sounding board in the conception of the flmake utility and aided in outlining the constraints of reproducibility.

The software used in this work was in part developed by the DOE NNSA-ASC OASCR Flash Center at the University of Chicago.

REFERENCES

[BROWN] C. Titus Brown, "Our approach to replication in computational science," Living in an Ivory Basement, April 2012, http://ivory.idyll.org/blog/replication-i.html.
[FLASH] FLASH Center for Computational Science, "FLASH User's Guide, Version 4.0-beta," University of Chicago, February 2012, http://flash.uchicago.edu/site/flashcode/user_support/flash4b_ug.pdf.
[FLMAKE] A. Scopatz, "flmake: the flash workflow utility," The University of Chicago, June 2012, http://flash.uchicago.edu/site/flashcode/user_support/tools4b/usersguide/flmake/index.html.
[GIT] Scott Chacon, "Pro Git," Apress, 2009. DOI: 10.1007/978-1-4302-1834-0.
[GMAKE] Free Software Foundation, "The GNU Make Manual for version 3.82," 2010, http://www.gnu.org/software/make/.
[GODFREY-SMITH] Peter Godfrey-Smith, "Theory and Reality: An Introduction to the Philosophy of Science," University of Chicago Press, 2003. ISBN 0-226-30063-3.
[GRAHAM] Jim Graham, "What is 'Reproducibility,' Anyway?," Scimatic, April 2010, http://www.scimatic.com/node/361.

[HG] Bryan O'Sullivan, "Mercurial: The Definitive Guide," O'Reilly Media, Inc., 2009.
[MIMS] C. Mims, "Moore's Law Over, Supercomputing 'In Triage,' Says Expert," Technology Review, MIT, May 2012, http://www.technologyreview.com/view/427891/moores-law-over-supercomputing-in-triage-says/.
[SCHMIDT] Gavin A. Schmidt, "On replication," RealClimate, February 2009, http://www.realclimate.org/index.php/archives/2009/02/on-replication/langswitch_lang/in/.
[SVN] Ben Collins-Sussman, Brian W. Fitzpatrick, C. Michael Pilato, "Version Control with Subversion: For Subversion 1.7," O'Reilly, 2011.
[VLABNB] M. Rubacha, A. K. Rattan, S. C. Hosselet, "A Review of Electronic Laboratory Notebooks Available in the Market Today," Journal of Laboratory Automation, 16(1):90-98, 2011. DOI: 10.1016/j.jala.2009.01.002. PMID 21609689.
[VRIEZE] Jop de Vrieze, "Thousands of Scientists Vow to Boycott Elsevier to Protest Journal Prices," Science Insider, February 2012.
[WILSON] G. V. Wilson, "Where's the real bottleneck in scientific computing?" Am Sci. 2005;94:5.

PROC. OF THE 11th PYTHON IN SCIENCE CONF. (SCIPY 2012)

Python's Role in VisIt

Cyrus Harrison∗‡, Harinarayan Krishnan§

Abstract—VisIt is an open source, turnkey application for scientific data analysis and visualization that runs on a wide variety of platforms, from desktops to petascale class supercomputers. VisIt's core software infrastructure is written in C++; however, Python plays a vital role in enabling custom workflows. Recent work has extended Python's use in VisIt beyond scripting, enabling custom Python UIs and Python filters for low-level data manipulation. The ultimate goal of this work is to evolve Python into a true peer to our core C++ plugin infrastructure. This paper provides an overview of Python's role in VisIt with a focus on use cases of scripted rendering, data analysis, and custom application development.

Index Terms—visualization, hpc, python

∗ Corresponding author: [email protected]
‡ Lawrence Livermore National Laboratory
§ Lawrence Berkeley National Laboratory

Copyright © 2012 Cyrus Harrison et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

VisIt [VisIt05], like EnSight [EnSight09] and ParaView [ParaView05], is an application designed for post processing of mesh based scientific data. VisIt's core infrastructure is written in C++ and it uses VTK [VTK96] for its underlying mesh data model. Its distributed-memory parallel architecture is tailored to process domain decomposed meshes created by simulations on large-scale HPC clusters.

Early in development, the VisIt team adopted Python as the foundation of VisIt's primary scripting interface. The scripting interface is available from both a standard Python interpreter and a custom command line client. The interface provides access to all features available through VisIt's GUI. It also includes support for macro recording of GUI actions to Python snippets and full control of windowless batch processing.

While Python has always played an important scripting role in VisIt, two recent development efforts have greatly expanded VisIt's Python capabilities:

1) We now support custom UI development using Qt via PySide [PySide]. This allows users to embed VisIt's visualization windows into their own Python applications. This provides a path to extend VisIt's existing GUI and for rapid development of streamlined UIs for specific use cases.

2) We recently enhanced VisIt by embedding Python interpreters into our data flow network pipelines. This provides fine grained access, allowing users to write custom algorithms in Python that manipulate mesh data via VTK's Python wrappers and leverage packages such as NumPy [NumPy] and SciPy [SciPy]. Current support includes the ability to create derived mesh quantities and execute data summarization operations.

This paper provides an overview of how VisIt leverages Python in its software architecture, outlines these two recent Python feature enhancements, and introduces several examples of use cases enabled by Python.

Python Integration Overview

VisIt employs a client-server architecture composed of several interacting software components:

- A viewer process coordinates the state of the system and provides the visualization windows used to display data.
- A set of client processes, including a Qt-based GUI and a Python-based command line interface (CLI), are used to set up plots and direct visualization operations.
- A parallel compute engine executes the visualization pipelines. This component employs a data flow network design and uses MPI for communication in distributed-memory parallel environments.

Client and viewer processes are typically run on a desktop machine and connect to a parallel compute engine running remotely on an HPC cluster. For smaller data sets, a local serial or parallel compute engine is also commonly used.

Fig. 1: Python integration with VisIt's components.

Figure 1 outlines how Python is integrated into VisIt's components. VisIt both extends and embeds Python. State control of the viewer is provided by a Python Client Interface, available as a Python/C extension module. This interface is outlined in the Python Client Interface section, and extensions to support custom UIs written in Python are described in the Custom Python UIs section. Direct access to low-level mesh data structures is provided by a Python Filter Runtime, embedded in VisIt's compute engine processes. This runtime is described in the Python Filter Runtime section.

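VisIt's components stay synchronized through shared state objects communicated between processes. The toy class below illustrates the publish/subscribe pattern those state objects rely on; it is purely a sketch of the idea in Python, not VisIt's actual (C++) state-object implementation.

```python
class StateObject:
    """Toy publish/subscribe state object.  Components register callbacks
    to hear about state changes; mutating the state publishes a snapshot
    to every subscriber.  (Illustrative sketch only.)"""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._subscribers = []

    def subscribe(self, callback):
        # A component registers interest in this piece of state.
        self._subscribers.append(callback)

    def notify(self):
        # Publish a snapshot of the current state to all subscribers.
        for cb in self._subscribers:
            cb(dict(self._fields))

    def update(self, **changes):
        # Mutating the state triggers a publish, keeping components
        # (viewer, GUI, CLI) in sync.
        self._fields.update(changes)
        self.notify()
```

In VisIt itself these objects travel over sockets between the clients, viewer, and compute engine; the sketch only shows the notification pattern.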
Python Client Interface

VisIt clients interact with the viewer process to control the state of visualization windows and data processing pipelines. Internally the system uses a collection of state objects that rely on a publish/subscribe design pattern for communication among components. These state objects are wrapped by a Python/C extension module to expose a Python state control API. The function calls are typically imperative: Add a new plot, Find the maximum value of a scalar field, etc. The client API is documented extensively in the VisIt Python Interface Manual [VisItPyRef]. To introduce the API in this paper we provide a simple example script, Listing 1, that demonstrates VisIt's five primary visualization building blocks:

- Databases: File readers and data sources.
- Plots: Data set renderers.
- Operators: Filters implementing data set transformations.
- Expressions: Framework enabling the creation of derived quantities from existing mesh fields.
- Queries: Data summarization operations.

Listing 1: Trace streamlines along the gradient of a scalar field.

    # Open an example file
    OpenDatabase("noise.silo")
    # Create a plot of the scalar field 'hardyglobal'
    AddPlot("Pseudocolor", "hardyglobal")
    # Slice the volume to show only three
    # external faces.
    AddOperator("ThreeSlice")
    tatts = ThreeSliceAttributes()
    tatts.x = -10
    tatts.y = -10
    tatts.z = -10
    SetOperatorOptions(tatts)
    DrawPlots()
    # Find the maximum value of the field 'hardyglobal'
    Query("Max")
    val = GetQueryOutputValue()
    print "Max value of 'hardyglobal' = ", val
    # Create a streamline plot that follows
    # the gradient of 'hardyglobal'
    DefineVectorExpression("g", "gradient(hardyglobal)")
    AddPlot("Streamline", "g")
    satts = StreamlineAttributes()
    satts.sourceType = satts.SpecifiedBox
    satts.sampleDensity0 = 7
    satts.sampleDensity1 = 7
    satts.sampleDensity2 = 7
    satts.coloringMethod = satts.ColorBySeedPointID
    SetPlotOptions(satts)
    DrawPlots()

Fig. 2: Pseudocolor and Streamline plots setup using the script in Listing 1.

In this example, the Silo database reader is automatically selected to read meshes from the input file 'noise.silo'. A Pseudocolor plot is created to display the scalar field named 'hardyglobal'. The mesh is transformed by a ThreeSlice operator to limit the volume displayed by the Pseudocolor plot to three external faces. We use a query to obtain and print the maximum value of the 'hardyglobal' field. An expression is defined to extract the gradient of the 'hardyglobal' scalar field. Finally, this gradient vector is used as the input field for a second plot, which traces streamlines. Figure 2 shows the resulting visualization, which includes both the Pseudocolor and Streamline plots.

Accessing the Python Client Interface

For convenience, you can access the client interface from a custom binary or a standalone Python interpreter.

VisIt provides a command line interface (CLI) binary that embeds a Python interpreter and automatically imports the client interface module. There are several ways to access this binary:

- From VisIt's GUI, you can start a CLI instance from the "Launch CLI" entry in the "Options" menu.
- Invoking VisIt from the command line with the -cli option starts the CLI and launches a connected viewer process:

      > visit -cli

- For batch processing, the -nowin option launches the viewer in an offscreen mode, and you can select a Python script file to run using the -s option:

      > visit -cli -nowin -s <script_file>

You can also import the interface into a standalone Python interpreter and use the module to launch and control a new instance of VisIt. Listing 2 provides example code for this use case. The core implementation of the VisIt module is a Python/C extension module, so normal caveats for binary compatibility with your Python interpreter apply.

The features of the VisIt interface are dependent on the version The ability to reuse VisIt’s existing set of GUI widgets. • of VisIt selected, so the import process is broken into two steps. The ability to utilize renderers as Qt widgets allows VisIt’s First, a small front end module is imported. This module allows visualization windows to be embedded in custom PySide GUIs you to select the options used to launch VisIt. Examples include: and other third party applications. Re-using VisIt’s existing generic using -nowin mode for the viewer process, selecting a specific widget toolset, which provides functionally such as remote filesys- version of VisIt, -v 2.5.1, etc. After these options are set tem browsing and a visualization pipeline editor, allows custom the Launch() method creates the appropriate Visit components. applications to incorporate advanced features with little difficulty. During the launch, the interfaces to the available state objects are One important note, a significant number of changes went enumerated and dynamically imported into the visit module. into adding Python UI support into VisIt. Traditionally, VisIt uses a component-based architecture where the Python command Listing 2: Launch and control VisIt from a standalone Python inter- line interface, the graphical user interface, and the viewer exist preter. as separate applications that communicate over sockets. Adding import sys import os Python UI functionality required these three separate components from os.path import join as pjoin to work together as single unified application. This required vpath="path/to/visit///" components that once communicated only over sockets to also # or for an OSX bundle version be able to directly interact with each other. 
Care is needed when # "path/to/VisIt.app/Contents/Resources//" vpath= pjoin(vpath,"lib","site-packages") sharing data in this new scenario, we are still refactoring parts of sys.path.insert(0,vpath) VisIt to better support embedded use cases. import visit To introduce VisIt’s Python UI interface, we start with Listing visit.Launch() # use the interface 3, which provides a simple PySide visualization application that visit.OpenDatabase("noise.silo") utilizes VisIt under the hood. We then describe two complex visit.AddPlot("Pseudocolor","hardyglobal") applications that use VisIt’s Python UI interface with several embedded renderer windows.

Macro Recording

VisIt's GUI provides a Commands window that allows you to record GUI actions into short Python snippets. While the client interface supports standard Python introspection methods (dir(), help(), etc.), the Commands window provides a powerful learning tool for VisIt's Python API. You can access this window from the "Commands" entry in the "Options" menu. From this window you can record your actions into one of several source scratch pads and convert common actions into macros that can be run using the Macros window.

Custom Python UIs

VisIt provides 100+ database readers, 60+ operators, and over 20 different plots. This toolset makes it a robust application well suited to analyze problem sets from a wide variety of scientific domains. However, in many cases users would like to utilize only a specific subset of VisIt's features, and understanding the intricacies of a large general purpose tool can be a daunting task. For example, climate scientists require specialized functionality such as viewing information on Lat/Long grids bundled with computations of zonal averages, whereas scientists in the fusion energy science community require visualizations of interactions between magnetic and particle velocity fields within a tokamak simulation. To make it easier to target specific user communities, we extended VisIt with the ability to create custom UIs in Python. Since we have an investment in our existing Qt user interface, we chose PySide, an LGPL Python Qt wrapper, as our primary Python UI framework. Leveraging our existing Python Client Interface along with new PySide support allows us to easily and quickly create custom user interfaces that provide specialized analysis routines and directly target the core needs of specific user communities. Using Python allows us to do this in a fraction of the time it would take using our C++ APIs.

VisIt provides two major components to its Python UI interface:

• The ability to embed VisIt's render windows.
• The ability to reuse VisIt's existing GUI widgets.

Listing 3: Custom application that animates an Isosurface with a sweep across Isovalues.

class IsosurfaceWindow(QWidget):
    def __init__(self):
        super(IsosurfaceWindow, self).__init__()
        self.__init_widgets()
        # Setup our example plot.
        OpenDatabase("noise.silo")
        AddPlot("Pseudocolor", "hardyglobal")
        AddOperator("Isosurface")
        self.update_isovalue(1.0)
        DrawPlots()
    def __init_widgets(self):
        # Create Qt layouts and widgets.
        vlout = QVBoxLayout(self)
        glout = QGridLayout()
        self.title = QLabel("Iso Contour Sweep Example")
        self.title.setFont(QFont("Arial", 20, bold=True))
        self.sweep = QPushButton("Sweep")
        self.lbound = QLineEdit("1.0")
        self.ubound = QLineEdit("99.0")
        self.step = QLineEdit("2.0")
        self.current = QLabel("Current%=")
        f = QFont("Arial", bold=True, italic=True)
        self.current.setFont(f)
        self.rwindow = pyside_support.GetRenderWindow(1)
        # Add title and main render window.
        vlout.addWidget(self.title)
        vlout.addWidget(self.rwindow, 10)
        glout.addWidget(self.current, 1, 3)
        # Add sweep controls.
        glout.addWidget(QLabel("Lower%"), 2, 1)
        glout.addWidget(QLabel("Upper%"), 2, 2)
        glout.addWidget(QLabel("Step%"), 2, 3)
        glout.addWidget(self.lbound, 3, 1)
        glout.addWidget(self.ubound, 3, 2)
        glout.addWidget(self.step, 3, 3)
        glout.addWidget(self.sweep, 4, 3)
        vlout.addLayout(glout, 1)
        self.sweep.clicked.connect(self.exe_sweep)
        self.resize(600, 600)
    def update_isovalue(self, perc):
        # Change the % value used by
        # the isosurface operator.
        iatts = IsosurfaceAttributes()
        iatts.contourMethod = iatts.Percent
        iatts.contourPercent = (perc)
        SetOperatorOptions(iatts)
        txt = "Current%=" + " %0.2f" % perc
        self.current.setText(txt)
    def exe_sweep(self):
        # Sweep % value according to
        # the GUI inputs.
        lbv = float(self.lbound.text())
        ubv = float(self.ubound.text())
        stpv = float(self.step.text())
        v = lbv
        while v < ubv:
            self.update_isovalue(v)
            v += stpv

# Create and show our custom window.
main = IsosurfaceWindow()
main.show()
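The stepping logic inside exe_sweep can be exercised independently of Qt and VisIt. A minimal sketch of the same loop as a plain generator (the helper name is hypothetical, not part of VisIt's API):

```python
def sweep_values(lower, upper, step):
    """Yield the sequence of isovalue percentages the Sweep
    button would visit (mirrors the loop in exe_sweep)."""
    v = lower
    while v < upper:
        yield v
        v += step

# With the listing's defaults (1.0 to 99.0 in steps of 2.0),
# the sweep visits 1.0, 3.0, ..., 97.0.
values = list(sweep_values(1.0, 99.0, 2.0))
```

Separating the sweep arithmetic from the GUI callback makes it trivial to unit-test the animation schedule before wiring it to update_isovalue.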

In this example, a VisIt render window is embedded in a QWidget to provide a Pseudocolor view of an Isosurface of the scalar field 'hardyglobal'. We create a set of UI controls that allow the user to select values that control a sweep animation across a range of Isovalues. The sweep button initiates the animation. To run this example, the -pysideviewer flag is passed to VisIt at startup to select a unified viewer and CLI process:

> visit -cli -pysideviewer

This example was written to work as a standalone script to illustrate the use of the PySide API for this paper. For most custom applications, developers are better served by using QtDesigner for UI design, in lieu of hand coding the layout of UI elements. Listing 4 provides a small example showing how to load a QtDesigner UI file using PySide.

Listing 4: Loading a custom UI file created with Qt Designer.

from PySide.QtUiTools import *

# example slot
def on_my_button_click():
    print "myButton was clicked"

# Load a UI file created with QtDesigner
loader = QUiLoader()
uifile = QFile("custom_widget.ui")
uifile.open(QFile.ReadOnly)
main = loader.load(uifile)
# Use a string name to locate
# objects from the Qt UI file.
button = main.findChild(QPushButton, "myButton")
# After loading the UI, we want to
# connect Qt slots to Python functions
button.clicked.connect(on_my_button_click)
main.show()

Fig. 3: Climate Skins for the Global Cloud Resolving Model Viewer.

Advanced Custom Python UI Examples

To provide more context for VisIt's Python UI interface, we now discuss two applications that leverage this new infrastructure: the Global Cloud Resolving Model (GCRM) [GCRM] Viewer and Ultra Visualization - Climate Data Analysis Tools (UV-CDAT) [UVCDAT].

VisIt users in the climate community involved with the global cloud resolving model project (GCRM) mainly required a custom NetCDF reader and a small subset of domain specific plots and operations. Their goal for climate analysis was to quickly visualize models generated from simulations, and perform specialized analysis on these models. Figure 3 shows two customized skins for the GCRM community developed in QtDesigner and loaded using PySide from VisIt's Python UI client. The customized skins embed VisIt rendering windows and reuse several of VisIt's GUI widgets. We also wrote several new analysis routines in Python for custom visualization and analysis tasks targeted for the climate community. This included providing Lat/Long grids with continental outlines and computing zonal means. Zonal mean computation was achieved by computing averages across the latitudes for each layer of elevation for a given slice in the direction of the longitude.

UV-CDAT is a multi-institutional project geared towards addressing the visualization needs of climate scientists around the world. Unlike the GCRM project, which was targeted towards one specific group and file format, for UV-CDAT all of VisIt's functionality needs to be exposed and embedded alongside several other visualization applications. The goal of UV-CDAT is to bring together all the visualization and analysis routines provided within several major visualization frameworks inside one application. This marks one of the first instances where several separate fully-featured visualization packages, including VisIt, ParaView, DV3D, and VisTrails, all function as part of one unified application. Figure 4 shows an example of using VisIt plots, along with plots from several other packages, within UV-CDAT. The core UV-CDAT application utilizes PyQt [PyQt] as its central interface and Python as the intermediate bridge between the visualization applications. The infrastructure changes made to VisIt to support custom Python UIs via PySide also allowed us to easily interface with PyQt. Apart from creating PyQt wrappers for the project, we also made significant investments in working out how to effectively share resources created within Python using NumPy & VTK Python data objects.

Fig. 4: Example showing integration of VisIt's components in UV-CDAT.

Python Filter Runtime

The Python Client Interface allows users to assemble visualization pipelines using VisIt's existing building blocks. While VisIt provides a wide range of filters, there are of course applications that require special purpose algorithms or need direct access to low-level mesh data structures. VisIt's Database, Operator, and Plot primitives are extendable via a C++ plugin infrastructure. This infrastructure allows new instances of these building blocks to be developed against an installed version of VisIt, without access to VisIt's full source tree. In contrast, creating new Expression and Query primitives in C++ currently requires VisIt's full source tree. To provide more flexibility for custom work flows and special purpose algorithms, we extended our data flow network pipelines with a Python Filter Runtime. This extension provides two important benefits:

• Enables runtime prototyping/modification of filters.
• Reduces development time for special purpose/one-off filters.

To implement this runtime, each MPI process in VisIt's compute engine embeds a Python interpreter. The interpreter coordinates with the rest of the pipeline using Python/C wrappers for existing pipeline control data structures. These data structures also allow requests for pipeline optimizations, for example a request to generate ghost zones. VisIt's pipelines use VTK mesh data structures internally, allowing us to pass VTK objects zero-copy between C++ and the Python interpreter using Kitware's existing VTK Python wrapper module. Python instances of VTK data arrays can also be wrapped zero-copy into ndarrays, opening up access to the wide range of algorithms available in NumPy and SciPy.

To create a custom filter, the user writes a Python script that implements a class that extends a base filter class for the desired VisIt building block. The base filter classes mirror VisIt's existing C++ class hierarchy. The exact execution pattern varies according to the selected building block; however, they loosely adhere to the following basic data-parallel execution pattern:

• Submit requests for pipeline constraints or optimizations.
• Initialize the filter before parallel execution.
• Process mesh data sets in parallel on all MPI tasks.
• Run a post-execute method for cleanup and/or summarization.

To support the implementation of distributed-memory algorithms, the Python Filter Runtime provides a simple Python MPI wrapper module, named mpicom. This module includes support for collective and point-to-point messages. The interface provided by mpicom is quite simple, and is not as optimized or extensive as other Python MPI interface modules, such as mpi4py [Mpi4Py]. We would like to eventually adopt mpi4py for communication, either directly or as a lower-level interface below the existing mpicom API.

VisIt's Expression and Query filters are the first constructs exposed by the Python Filter Runtime. These primitives were selected because they are not currently extensible via our C++ plugin infrastructure. Python Expressions and Queries can be invoked from VisIt's GUI or the Python Client Interface. To introduce these filters, this paper will outline a simple Python Query example and discuss how a Python Expression was used to research a new OpenCL Expression Framework.

Listing 5: Python Query filter that calculates the average of a cell centered scalar field.

class CellAverageQuery(SimplePythonQuery):
    def __init__(self):
        # basic initialization
        super(CellAverageQuery, self).__init__()
        self.name = "Cell Average Query"
        self.description = "Calculating scalar average."
    def pre_execute(self):
        # called just prior to main execution
        self.local_ncells = 0
        self.local_sum = 0.0
    def execute_chunk(self, ds_in, domain_id):
        # called per mesh chunk assigned to
        # the local MPI task.
        ncells = ds_in.GetNumberOfCells()
        if ncells == 0:
            return
        vname = self.input_var_names[0]
        varray = ds_in.GetCellData().GetArray(vname)
        self.local_ncells += ncells
        for i in range(ncells):
            self.local_sum += varray.GetTuple1(i)
    def post_execute(self):
        # called after all mesh chunks on all
        # processors have been processed.
        tot_ncells = mpicom.sum(self.local_ncells)
        tot_sum = mpicom.sum(self.local_sum)
        avg = tot_sum / float(tot_ncells)
        if mpicom.rank() == 0:
            vname = self.input_var_names[0]
            msg = "Average value of %s = %s"
            msg = msg % (vname, str(avg))
            self.set_result_text(msg)
            self.set_result_value(avg)

# Tell the Python Filter Runtime which class to use
# as the Query filter.
py_filter = CellAverageQuery
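The pre_execute / execute_chunk / post_execute lifecycle used by Listing 5 can be emulated serially, without VisIt or MPI. The sketch below stands in a trivial single-task stub for the mpicom reductions; all names here are illustrative, not VisIt's actual runtime:

```python
class FakeMpicom:
    """Single-task stand-in for VisIt's mpicom wrapper:
    with one task, a sum reduction is the identity."""
    def sum(self, v):
        return v
    def rank(self):
        return 0

mpicom = FakeMpicom()

class CellAverage:
    """Serial emulation of the data-parallel query lifecycle."""
    def pre_execute(self):
        # reset per-task accumulators
        self.local_ncells = 0
        self.local_sum = 0.0
    def execute_chunk(self, values):
        # 'values' stands in for one mesh chunk's cell data
        self.local_ncells += len(values)
        self.local_sum += sum(values)
    def post_execute(self):
        # reduce across tasks, then compute the average
        tot_n = mpicom.sum(self.local_ncells)
        tot_s = mpicom.sum(self.local_sum)
        return tot_s / float(tot_n)

q = CellAverage()
q.pre_execute()
for chunk in ([1.0, 2.0, 3.0], [4.0]):
    q.execute_chunk(chunk)
avg = q.post_execute()  # (1+2+3+4)/4 = 2.5
```

Structuring a filter this way keeps the numerical logic testable on a workstation before it is dropped into the parallel runtime.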

Listing 6: Python Client Interface code to invoke the Cell Average Python Query on an example data set.

# Open an example data set.
OpenDatabase("multi_rect3d.silo")
# Create a plot to query
AddPlot("Pseudocolor", "d")
DrawPlots()
# Execute the Python query
PythonQuery(file="listing_5_cell_average.vpq",
            vars=["default"])

Listing 5 provides an example Python Query script, and Listing 6 provides example host code that can be used to invoke the Python Query from VisIt's Python Client Interface. In this example, the pre_execute method initializes a cell counter and a variable to hold the sum of all scalar values provided by the host MPI task. After initialization, the execute_chunk method is called for each mesh chunk assigned to the host MPI task. execute_chunk examines these meshes via VTK's Python wrapper interface, obtaining the number of cells and the values from a cell centered scalar field. After all chunks have been processed by the execute_chunk method on all MPI tasks, the post_execute method is called. This method uses MPI reductions to obtain the aggregate number of cells and total scalar value sum. It then calculates the average value and sets an output message and result value on the root MPI process.

Using a Python Expression to host a new OpenCL Expression Framework

The HPC compute landscape is quickly evolving towards accelerators and many-core CPUs. The complexity of porting existing codes to the new programming models supporting these architectures is a looming concern. We have an active research effort exploring OpenCL [OpenCL] for visualization and analysis applications on GPU clusters. This research is joint work with Maysam Moussalem and Paul Navrátil at the Texas Advanced Computing Center, and Ming Jiang at Lawrence Livermore National Laboratory.

One nice feature of OpenCL is that it provides runtime kernel compilation. This opens up the possibility of assembling custom kernels that dynamically encapsulate multiple steps of a visualization pipeline into a single GPU kernel. A subset of our OpenCL research effort is focused on exploring this concept, with the goal of creating a framework that uses OpenCL as a backend for user defined expressions.

For productivity reasons we chose Python to prototype this framework. We dropped this framework into VisIt's existing data parallel infrastructure using a Python Expression. This allowed us to test the viability of our framework on large data sets in a distributed-memory parallel setting. Rapid development and testing of this framework leveraged the following Python modules:

• PLY [PLY] was used to parse our expression language grammar. PLY provides an easy to use Python lex/yacc implementation that allowed us to implement a front-end parser and integrate it with the Python modules used to generate and execute OpenCL kernels.
• PyOpenCL [PyOpenCL] was used to interface with OpenCL and launch GPU kernels. PyOpenCL provides a wonderful interface to OpenCL that saved us an untold amount of time over the OpenCL C-API. PyOpenCL also uses ndarrays for data transfer between the CPU and GPU, and this was a great fit because we can easily access our data arrays as ndarrays using the VTK Python wrapper module.

We are in the process of conducting performance studies and writing a paper with the full details of the framework. For this paper we provide a high-level execution overview and a few performance highlights.

Execution Overview: An input expression, defining a new mesh field, is parsed by a PLY front-end and translated into a data flow network specification. The data flow network specification is passed to a Python data flow module that dynamically generates a single OpenCL kernel for the operation. By dispatching a single kernel that encapsulates several pipeline steps to the GPU, the framework mitigates CPU-GPU I/O bottlenecks and boosts performance over both existing CPU pipelines and a naive dispatch of several small GPU kernels.

Performance Highlights:

• Demonstrated speedup of up to ~20x vs an equivalent VisIt CPU expression, including transfer of data arrays to and from the GPU.
• Demonstrated use in a distributed-memory parallel setting, processing a 24 billion zone rectilinear mesh using 256 GPUs on 128 nodes of LLNL's Edge cluster.

Python, PLY, PyOpenCL, and VisIt's Python Expression capability allowed us to create and test this framework with a much faster turn around time than would have been possible using C/C++ APIs. Also, since the bulk of the processing was executed on GPUs, we were able to demonstrate impressive speedups.

Conclusion

In this paper we have presented an overview of the various roles that Python plays in VisIt's software infrastructure and a few examples of visualization and analysis use cases enabled by Python in VisIt. Python has long been an asset to VisIt as the foundation of VisIt's scripting language. We have recently extended our infrastructure to enable custom application development and low-level mesh processing algorithms in Python.

For future work, we are refactoring VisIt's component infrastructure to better support unified process Python UI clients. We also hope to provide more example scripts to help developers bootstrap custom Python applications that embed VisIt. We plan to extend our Python Filter Runtime to allow users to write new Databases and Operators in Python. We would also like to provide new base classes for Python Queries and Expressions that encapsulate the VTK to ndarray wrapping process, allowing users to write streamlined scripts using NumPy.

For more detailed info on VisIt and its Python interfaces, we recommend: the VisIt Website [VisItWeb], the VisIt Users' Wiki [VisItWiki], VisIt's user and developer mailing lists, and the VisIt Python Client reference manual [VisItPyRef].

Acknowledgments

This work was performed under the auspices of the U.S. DOE by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-CONF-564292.

REFERENCES

[VisIt05] Childs, H. et al., 2005. A Contract Based System For Large Data Visualization. VIS '05: Proceedings of the conference on Visualization '05.
[ParaView05] Ahrens, J. et al., 2005. Visualization in the ParaView Framework. The Visualization Handbook, 162-170.
[EnSight09] EnSight User Manual. Computational Engineering International, Inc. Dec 2009.
[OpenCL] Khronos Group, OpenCL parallel programming framework. http://www.khronos.org/opencl/
[PLY] Beazley, D., Python Lex and Yacc. http://www.dabeaz.com/ply/
[NumPy] Oliphant, T., NumPy Python Module. http://numpy.scipy.org
[SciPy] Scientific Tools for Python. http://www.scipy.org
[PyOpenCL] Klöckner, A., Python OpenCL Module. http://mathema.tician.de/software/pyopencl
[PySide] PySide Python Bindings for Qt. http://www.pyside.org/
[Mpi4Py] Dalcin, L., mpi4py: MPI for Python. http://mpi4py.googlecode.com/
[VTK96] Schroeder, W. et al., 1996. The design and implementation of an object-oriented toolkit for 3D graphics and visualization. VIS '96: Proceedings of the 7th conference on Visualization '96.
[VisItPyRef] Whitlock, B. et al. VisIt Python Reference Manual. http://portal.nersc.gov/svn/visit/trunk/releases/2.3.0/VisItPythonManual.pdf
[PyQt] PyQt Python Bindings for Qt. http://www.riverbankcomputing.co.uk/software/pyqt/
[UVCDAT] Ultra Visualization - Climate Data Analysis Tools. http://uv-cdat.llnl.gov
[GCRM] Global Cloud Resolving Model. http://kiwi.atmos.colostate.edu/gcrm/
[VisItWiki] VisIt Users' Wiki. http://www.visitusers.org/
[VisItWeb] VisIt Website. https://wci.llnl.gov/codes/visit/

PythonTeX: Fast Access to Python from within LaTeX

Geoffrey M. Poore‡∗


Abstract—PythonTeX is a new LaTeX package that provides access to the full power of Python from within LaTeX documents. It allows Python code entered within a LaTeX document to be executed, and provides access to the output. PythonTeX also provides syntax highlighting for any language supported by the Pygments highlighting engine.

PythonTeX is fast and user-friendly. Python code is separated into user-defined sessions. Each session is only executed when its code is modified. When code is executed, sessions run in parallel. The contents of stdout and stderr are synchronized with the LaTeX document, so that printed content is easily accessible and error messages have meaningful line numbering.

PythonTeX simplifies scientific document creation with LaTeX. Plots can be created with matplotlib and then customized in place. Calculations can be performed and automatically typeset with NumPy. SymPy can be used to automatically create mathematical tables and step-by-step mathematical derivations.

Index Terms—LaTeX, document preparation, document automation, matplotlib, NumPy, SymPy, Pygments

∗ Corresponding author: [email protected]
‡ Union University

Copyright © 2012 Geoffrey M. Poore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Scientific documents and presentations are often created with the LaTeX document preparation system. Though some LaTeX tools exist for creating figures and performing calculations, external scientific software is typically required for these tasks. This can result in an inefficient workflow. Every time a calculation requires modification or a figure needs tweaking, the user must switch between LaTeX and the scientific software. The user must locate the code that created the calculation or figure, modify it, and execute it before returning to LaTeX.

One way to streamline this process is to include non-LaTeX code within LaTeX documents, with a means to execute this code and access the output. That approach has connections to Knuth's concept of literate programming, in which code and its documentation are combined in a single document [Knuth]. The noweb literate programming tool extended Knuth's work to additional document formats and arbitrary programming languages [Ramsey]. Sweave subsequently built on noweb by allowing the output of individual chunks of R code to be accessed within the document [Leisch]. This made possible dynamic reports that are reproducible since they contain the code that generated their results. As such, Sweave and similar tools represent an additional, complementary approach to reproducibility compared to makefile-based approaches [Schwab].

Several methods of including executable code in LaTeX documents ultimately function as preprocessors or templating systems. A document might contain a mix of LaTeX and code, and the preprocessor replaces the code with its output. The original document would not be valid LaTeX; only the preprocessed document would be. Sweave, knitr [Xie], the Python-based Pweave [Pastell], and template libraries such as Mako [MK] function in this manner. More recently, the IPython notebook has provided an interactive browser-based interface in which text, code, and code output may be interspersed [IPY]. Since the notebook can be exported as LaTeX, it functions similarly to the preprocessor-style approach.

The preprocessor/templating-style approach has a significant advantage. All of the examples mentioned above are compatible with multiple document formats, not just LaTeX. This is particularly true in the case of templating libraries. One significant drawback is that the line numbers of the preprocessed document, which LaTeX receives, do not correspond to those of the original document. This makes it difficult to debug LaTeX errors, particularly in longer documents. It also breaks standard LaTeX tools such as forward and inverse search between a document and its PDF (or other) output; only Sweave and knitr have systems to work around this. An additional issue is that it is difficult for LaTeX code to interact with code in other languages, when the code in other languages has already been executed and removed before LaTeX runs.

In an alternate approach to including executable code in LaTeX documents, the original document is valid LaTeX, containing code wrapped in special commands and environments. The code is extracted by LaTeX itself during compilation, then executed and replaced by its output. Such approaches with Python go back to at least 2007, with Martin R. Ehmsen's python.sty style file [Ehmsen]. Since 2008, SageTeX has provided access to the Sage system from within LaTeX [Drake]. Because Sage is largely based on Python, it also provides Python access. SympyTeX (2009) is based on SageTeX [Molteno]. Though SympyTeX is primarily intended for accessing the SymPy library for symbolic mathematics [SymPy], it provides general access to Python. Since these packages begin with a valid LaTeX document, they automatically work with standard LaTeX editing tools and also allow LaTeX code to interact with Python.

Python.sty, SageTeX, and SympyTeX illustrate the potential of a close Python-LaTeX integration. At the same time, they leave much of the possible power of the Python-LaTeX combination untapped. Python.sty requires that all Python code be executed every time the document is compiled. SageTeX and SympyTeX separate code execution from document compilation, but because all code is executed in a single session, everything must be executed whenever anything changes. None of these packages provides comprehensive syntax highlighting. SageTeX and SympyTeX do not provide access to stdout or stderr. They do synchronize error messages with the document, but synchronization is performed by executing a try/except statement on every line of the user's code. This reduces performance and fails in the case of syntax errors.

PythonTeX is a new LaTeX package that provides access to Python from within LaTeX documents. It emphasizes performance and usability.

• Python-generated content is always saved, so that the LaTeX document can be compiled without running Python.
• Python code is divided into user-defined sessions. Each session is only executed when it is modified. When code is executed, sessions run in parallel.
• Both stdout and stderr are easily accessible.
• All Python error messages are synchronized with the LaTeX document, so that error line numbers correctly correspond to the document.
• Code may be typeset with highlighting provided by Pygments [Pyg]—this includes any language supported by Pygments, not just Python. Unicode is supported.
• Native Python code is fully supported, including imports from __future__. No changes to Python code are required to make it compatible with PythonTeX.

While PythonTeX lacks the rapid interactivity of the IPython notebook, as a LaTeX package it offers much tighter Python-LaTeX integration. It also provides greater control over what is displayed (code, stdout, or stderr) and allows executable code to be included inline within normal text.

This paper presents the main features of PythonTeX and considers several examples. It also briefly discusses the internal workings of the package.

PythonTeX environments and commands

PythonTeX provides four LaTeX environments and four LaTeX commands for accessing Python. These environments and commands save code to an external file and then bring back the output once the code has been processed by PythonTeX.

The code environment simply executes the code it contains. By default, any printed content is brought in immediately after the end of the environment and interpreted as LaTeX code. For example, the LaTeX code

\begin{pycode}
myvar = 123
print('Greetings from Python!')
\end{pycode}

creates a variable myvar and prints a string, and the printed content is automatically included in the document:

Greetings from Python!

The block environment executes its contents and also typesets it. By default, the typeset code is highlighted using Pygments. Reusing the Python code from the previous example,

\begin{pyblock}
myvar = 123
print('Greetings from Python!')
\end{pyblock}

creates

myvar = 123
print('Greetings from Python!')

The printed content is not automatically included. Typically, the user wouldn't want the printed content immediately after the typeset code—explanation of the code, or just some space, might be desirable before showing the output. Two equivalent commands are provided for including the printed content generated by a block environment: \printpythontex and \stdoutpythontex. These bring in any printed content created by the most recent PythonTeX environment and interpret it as LaTeX code. Both commands also take an optional argument to bring in content as verbatim text. For example, \printpythontex[v] brings in the content in a verbatim form suitable for inline use, while \printpythontex[verb] brings in the content as a verbatim block.

All code entered within code and block environments is executed within the same Python session (unless the user specifies otherwise, as discussed below). This means that there is continuity among environments. For example, since myvar has already been created, it can now be modified:

\begin{pycode}
myvar += 4
print('myvar = ' + str(myvar))
\end{pycode}

This produces

myvar = 127

The verb environment typesets its contents, without executing it. This is convenient for simply typesetting Python code. Since the verb environment has a parallel construction to the code and block environments, it can also be useful for temporarily disabling the execution of some code. Thus

\begin{pyverb}
myvar = 123
print('Greetings from Python!')
\end{pyverb}

results in the typeset content

myvar = 123
print('Greetings from Python!')

without any code actually being executed.

The final environment is different. The console environment emulates a Python interactive session, using Python's code module. Each line within the environment is treated as input to an interactive interpreter. The LaTeX code

\begin{pyconsole}
myvar = 123
myvar
print('Greetings from Python!')
\end{pyconsole}

creates

>>> myvar = 123
>>> myvar
123
>>> print('Greetings from Python!')
Greetings from Python!

PythonTeX provides options for showing and customizing a banner at the beginning of console environments. The content of all console environments is executed within a single Python session, providing continuity, unless the user specifies otherwise.

While the PythonTeX environments are useful for executing and typesetting large blocks of code, the PythonTeX commands are intended for inline use. Command names are based on abbreviations of environment names. The code command simply executes its contents. For example, \pyc{myvar = 123}. Again, any printed content is automatically included by default. The block command typesets and executes the code, but does not automatically include printed content (\printpythontex is required). Thus, \pyb{myvar = 123} would typeset

myvar = 123

in a form suitable for inline use, in addition to executing the code. The verb command only typesets its contents. The command \pyv{myvar = 123} would produce

myvar = 123

without executing anything. If Pygments highlighting for inline code snippets is not desired, it may be turned off.

The final inline command, \py, is different. It provides a simple way to typeset variable values or to evaluate short pieces of code and typeset the result. For example, \py{myvar} accesses the previously created variable myvar and brings in a string representation: 123. Similarly, \py{2**8 + 1} converts its argument to a string and returns 257.

It might seem that the effect of \py could be achieved using \pyc combined with print. But \py has significant advantages. First, it requires only a single external file per document for bringing in content, while print requires an external file for each environment and command in which it is used. This is discussed in greater detail in the discussion of PythonTeX's internals. Second, the way in which \py converts its argument to a valid LaTeX string can be specified by the user. This can save typing when several conversions or formatting operations are needed. The examples below using SymPy illustrate this approach.

All of the examples of inline commands shown above use opening and closing curly brackets to delimit the code. This system breaks down if the code itself contains an unmatched curly bracket. Thus, all inline commands also accept arbitrary matched characters as delimiters. This is similar to the behavior of LaTeX's \verb macro. For example, \pyc!myvar = 123! and \pyc#myvar = 123# are valid. No such consideration is required for environments, since they are delimited by \begin and \end commands.

Options: Sessions and Fancy Verbatims

PythonTeX commands and environments take optional arguments. These determine the session in which the code is executed and provide additional formatting options.

By default, all code and block content is executed within a single Python session, and all console content is executed within a separate session. In many cases, such behavior is desired because of the continuity it provides. At times, however, it may be useful to isolate some independent code in its own session. A long calculation could be placed in its own session, so that it only runs when its code is modified, independently of other code.

PythonTeX provides such functionality through user-defined sessions. All commands and environments take a session name as an optional argument. For example, \pyc[slowsession]{myvar = 123} and

\begin{pycode}[slowsession]
myvar = 123
print('Greetings from Python!')
\end{pycode}

Each session is only executed when its code has changed, and sessions run in parallel (via Python's multiprocessing package), so careful use of sessions can significantly increase performance.

All PythonTeX environments also accept a second optional argument. This consists of settings for the LaTeX fancyvrb (Fancy Verbatims) package [FV], which PythonTeX uses for typesetting code. These settings allow customization of the code's appearance. For example, a block of code may be surrounded by a colored frame, with a title. Or line numbers may be included.

Plotting with matplotlib

The PythonTeX commands and environments can greatly simplify the creation of scientific documents and presentations. One example is the inclusion of plots created with matplotlib [MPL].

All of the commands and environments discussed above begin with the prefix py. PythonTeX provides a parallel set of commands and environments that begin with the prefix pylab. These behave identically to their py counterparts, except that matplotlib's pylab module is automatically imported via from pylab import *. The pylab commands and environments can make it easier to keep track of code dependencies and separate content that would otherwise require explicit sessions; the default pylab session is separate from the default py session.

Combining PythonTeX with matplotlib significantly simplifies plotting. The commands for creating a plot may be included directly within the LaTeX source, and the plot may be edited in place to get the appearance just right. Matplotlib's LaTeX option may be used to keep fonts consistent between the plot and the document. The code below illustrates this approach. Notice that the plot is created in its own session, to increase performance.

\begin{pylabcode}[plotsession]
rc('text', usetex=True)
rc('font', **{'family':'serif', 'serif':['Times']})
rc('font', size=10.0)
rc('legend', fontsize=10.0)
x = linspace(0, 3*pi)
figure(figsize=(3.25, 2))
plot(x, sin(x), label='$\sin(x)$')
plot(x, sin(x)**2, label='$\sin^2(x)$',
     linestyle='dashed')
xlabel(r'$x$-axis')
ylabel(r'$y$-axis')
xticks(arange(0, 4*pi, pi), ('$0$',
       '$\pi$', '$2\pi$', '$3\pi$'))
axis([0, 3*pi, -1, 1])
legend(loc='lower right')
savefig('myplot.pdf', bbox_inches='tight')
\end{pylabcode}

The plot may be brought in and positioned using the standard LaTeX commands:

\begin{figure}
\centering
\includegraphics{myplot}
\caption{\label{fig:matplotlib} A plot
created with PythonTeX.}
\end{figure}

The end result is shown in Figure 1.

PYTHONTEX: FAST ACCESS TO PYTHON FROM WITHIN LATEX
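The session mechanism described under "Options: Sessions and Fancy Verbatims" can be pictured with a small stand-alone sketch. This is not PythonTeX's actual implementation; the helper `run_session`, the session dictionary, and the stored hashes below are all illustrative assumptions. The idea is simply to hash each session's code, re-execute only the sessions whose hash has changed, and distribute that work with `multiprocessing.Pool`:

```python
import hashlib
from multiprocessing import Pool

def run_session(item):
    """Execute one session's code in its own namespace (illustrative helper)."""
    name, code = item
    namespace = {}
    exec(code, namespace)
    return name

# Hypothetical sessions: name -> Python code gathered from the document.
sessions = {
    'default': "myvar = 123",
    'slowsession': "total = sum(range(10**6))",
}

# Hashes recorded on the previous run; 'slowsession' has no recorded hash,
# so it counts as changed and must be re-executed.
old_hashes = {'default': hashlib.sha1(b"myvar = 123").hexdigest()}

changed = {name: code for name, code in sessions.items()
           if hashlib.sha1(code.encode()).hexdigest() != old_hashes.get(name)}

if __name__ == '__main__':
    # Only the changed sessions are executed, in parallel worker processes.
    with Pool() as pool:
        finished = pool.map(run_session, sorted(changed.items()))
    print(finished)  # -> ['slowsession']
```

Because unchanged sessions are skipped entirely, placing a slow calculation in its own session means it is paid for only when its code is edited.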

PROC. OF THE 11th PYTHON IN SCIENCE CONF. (SCIPY 2012)

Fig. 1: A matplotlib plot created with PythonTeX. (The figure shows $\sin(x)$ and $\sin^2(x)$ plotted for $x$ from $0$ to $3\pi$.)

Solving equations with NumPy

PythonTeX didn't require any special modifications to the Python code in the previous example with matplotlib. The code that created the plot was the same as it would have been had an external script been used to generate the plot. In some situations, however, it can be beneficial to acknowledge the LaTeX context of the Python code. This may be illustrated by solving an equation with NumPy [NP].

Perhaps the most obvious way to solve an equation using PythonTeX is to separate the Python solving from the LaTeX typesetting. Consider finding the roots of a polynomial using NumPy.

\begin{pylabcode}
coeff = [4, 2, -4]
r = roots(coeff)
\end{pylabcode}

The roots of $4x^2+2x-4=0$ are
$\pylab{r[0]}$ and $\pylab{r[1]}$.

This yields

The roots of $4x^2+2x-4=0$ are $-1.2807764064$ and $0.780776406404$.

Such an approach works, but the code must be modified significantly whenever the polynomial changes. A more sophisticated approach automatically generates the LaTeX code, and perhaps rounds the roots as well, for an arbitrary polynomial.

\begin{pylabcode}
coeff = [4, 2, -4]
# Build a string containing equation
eq = ''
for n, c in enumerate(coeff):
    if n == 0 or str(c).startswith('-'):
        eq += str(c)
    else:
        eq += '+' + str(c)
    if len(coeff) - n - 1 == 1:
        eq += 'x'
    elif len(coeff) - n - 1 > 1:
        eq += 'x^' + str(len(coeff) - n - 1)
eq += '=0'
# Get roots and format for LaTeX
r = ['{0:+.3f}'.format(root)
     for root in roots(coeff)]
latex_roots = ','.join(r)
\end{pylabcode}

The roots of $\pylab{eq}$ are $[\pylab{latex_roots}]$.

This yields

The roots of $4x^2+2x-4=0$ are $[-1.281, +0.781]$.

The automated generation of LaTeX code on the Python side begins to demonstrate the full power of PythonTeX.

Solving equations with SymPy

Several examples with SymPy further illustrate the potential of Python-generated LaTeX code [SymPy].

To simplify SymPy use, PythonTeX provides a set of commands and environments that begin with the prefix sympy. These are identical to their py counterparts, except that SymPy is automatically imported via from sympy import *.

SymPy is ideal for PythonTeX use, because its LatexPrinter class and the associated latex() function provide LaTeX representations of objects. For example, returning to solving the same polynomial,

\begin{sympycode}
x = symbols('x')
myeq = Eq(4*x**2 + 2*x - 4)
print('The roots of the equation ')
print(latex(myeq, mode='inline'))
print(' are ')
print(latex(solve(myeq), mode='inline'))
\end{sympycode}

creates

The roots of the equation $4x^2+2x-4=0$ are $\left[-\frac{1}{4}\sqrt{17}-\frac{1}{4},\ -\frac{1}{4}+\frac{1}{4}\sqrt{17}\right]$

Notice that the printed content appears as a single uninterrupted line, even though it was produced by multiple prints. This is because the printed content is interpreted as LaTeX code, and in LaTeX an empty line is required to end a paragraph.

The \sympy command provides an alternative to printing. While the \py and \pylab commands attempt to convert their arguments directly to a string, the \sympy command converts its argument using SymPy's LatexPrinter class. Thus, the output from the last example could also have been produced using

\begin{sympycode}
x = symbols('x')
myeq = Eq(4*x**2 + 2*x - 4)
\end{sympycode}

The roots of the equation $\sympy{myeq}$
are $\sympy{solve(myeq)}$.

The \sympy command uses a special interface to the LatexPrinter class, to allow for context-dependent LatexPrinter settings. PythonTeX includes a utilities class, and an instance of this class called pytex is created within each PythonTeX session. The formatter() method of this class is responsible for converting objects into strings for \py, \pylab, and \sympy. In the case of SymPy, pytex.formatter() provides an interface to LatexPrinter, with provision for context-dependent customization. In LaTeX, there are four possible math styles: displaystyle (regular equations), textstyle (inline), scriptstyle (superscripts and subscripts), and scriptscriptstyle (superscripts and subscripts of superscripts and subscripts).
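The equation-building logic used in the NumPy example above can also be exercised outside of LaTeX as an ordinary Python function, which makes it easy to test. This is a sketch: the function name `latex_poly_eq` and the hard-coded root values are ours (standing in for NumPy's `roots()` so the snippet needs only the standard library), not part of the paper's code.

```python
def latex_poly_eq(coeff):
    """Build a LaTeX string such as '4x^2+2x-4=0' from polynomial coefficients."""
    eq = ''
    for n, c in enumerate(coeff):
        # Leading and negative coefficients carry their own sign;
        # other coefficients need an explicit '+'.
        if n == 0 or str(c).startswith('-'):
            eq += str(c)
        else:
            eq += '+' + str(c)
        power = len(coeff) - n - 1
        if power == 1:
            eq += 'x'
        elif power > 1:
            eq += 'x^' + str(power)
    return eq + '=0'

# Roots of 4x^2 + 2x - 4 = 0, rounded with the paper's '{0:+.3f}' format.
roots = [-1.2807764064044151, 0.7807764064044151]
latex_roots = ','.join('{0:+.3f}'.format(r) for r in roots)

print(latex_poly_eq([4, 2, -4]))  # -> 4x^2+2x-4=0
print(latex_roots)                # -> -1.281,+0.781
```

Note that, like the original, this naive builder does not suppress zero or unit coefficients; it is meant to show the string-assembly idea, not to be a general LaTeX pretty-printer.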

Separate LatexPrinter settings may be specified for each of these styles individually, using a command of the form

pytex.set_sympy_latex(style, **kwargs)

For example, by default \sympy is set to create normal-sized matrices in displaystyle and small matrices elsewhere. Thus, the following code

\begin{sympycode}
m = Matrix([[1,0], [0,1]])
\end{sympycode}

The matrix in inline is small: $\sympy{m}$

The matrix in an equation is of normal size:
\[ \sympy{m} \]

produces

The matrix in inline is small: $\left(\begin{smallmatrix}1&0\\0&1\end{smallmatrix}\right)$

The matrix in an equation is of normal size:
\[ \begin{pmatrix}1&0\\0&1\end{pmatrix} \]

As another example, consider customizing the appearance of inverse trigonometric functions based on their context.

\begin{sympycode}
x = symbols('x')
sineq = Eq(asin(x/2)-pi/3)
pytex.set_sympy_latex('display',
    inv_trig_style='power')
pytex.set_sympy_latex('text',
    inv_trig_style='full')
\end{sympycode}

Inline: $\sympy{sineq}$

Equation: \[ \sympy{sineq} \]

This creates

Inline: $\arcsin\left(\frac{1}{2}x\right) - \frac{1}{3}\pi = 0$

Equation: \[ \sin^{-1}\left(\frac{1}{2}x\right) - \frac{1}{3}\pi = 0 \]

Notice that in both examples above, the \sympy command is simply used; no information about context must be passed to Python. On the Python side, the context-dependent LatexPrinter settings are used to determine whether the LaTeX representation of some object is context-dependent. If not, Python creates a single LaTeX representation of the object and returns that. If the LaTeX representation is context-dependent, then Python returns multiple LaTeX representations, wrapped in LaTeX's \mathchoice macro. The \mathchoice macro takes four arguments, one for each of the four LaTeX math styles display, text, script, and scriptscript. The correct argument is typeset by LaTeX based on the current math style.

Step-by-step derivations with SymPy

With SymPy's LaTeX functionality, it is simple to automate tasks that could otherwise be tedious. Instead of manually typing step-by-step mathematical solutions, or copying them from an external program, the user can generate them automatically from within LaTeX.

\begin{sympycode}
x, y = symbols('x, y')
f = x + sin(y)
step1 = Integral(f, x, y)
step2 = Integral(Integral(f, x).doit(), y)
step3 = step2.doit()
\end{sympycode}

\begin{align*}
\sympy{step1} &= \sympy{step2} \\
&= \sympy{step3}
\end{align*}

This produces

\begin{align*}
\iint x + \sin(y)\,dx\,dy &= \int \frac{1}{2}x^2 + x\sin(y)\,dy \\
&= \frac{1}{2}x^2 y - x\cos(y)
\end{align*}

Automated mathematical tables with SymPy

The creation of mathematical tables is another traditionally tedious task that may be automated with PythonTeX and SymPy. Consider the following code, which automatically creates a small integral and derivative table.

\begin{sympycode}
x = symbols('x')
funcs = ['sin(x)', 'cos(x)',
         'sinh(x)', 'cosh(x)']
ops = ['Integral', 'Derivative']
print('\\begin{align*}')
for func in funcs:
    for op in ops:
        obj = eval(op + '(' + func + ', x)')
        left = latex(obj)
        right = latex(obj.doit())
        if op != ops[-1]:
            print(left + '&=' + right + '&')
        else:
            print(left + '&=' + right + r'\\')
print('\\end{align*}')
\end{sympycode}

This produces

\begin{align*}
\int \sin(x)\,dx &= -\cos(x) & \frac{\partial}{\partial x}\sin(x) &= \cos(x) \\
\int \cos(x)\,dx &= \sin(x) & \frac{\partial}{\partial x}\cos(x) &= -\sin(x) \\
\int \sinh(x)\,dx &= \cosh(x) & \frac{\partial}{\partial x}\sinh(x) &= \cosh(x) \\
\int \cosh(x)\,dx &= \sinh(x) & \frac{\partial}{\partial x}\cosh(x) &= \sinh(x)
\end{align*}

This code could easily be modified to generate a page or more of integrals and derivatives by simply adding additional function names to the funcs list.

Debugging and access to stderr

PythonTeX commands and environments save the Python code they contain to an external file, where it is processed by PythonTeX. When the Python code is executed, errors may occur. The line numbers for these errors do not correspond to the document line numbers, because only the Python code contained in the document is executed; the LaTeX code is not present. Furthermore, the error line numbers do not correspond to the line numbers that would be obtained by only counting the Python code in the document, because PythonTeX must execute some boilerplate management code in addition to the user's code. This presents a challenge for debugging.

[...] that was executed may be shown, or simply a name based on the session ("errorsession.py" in this case), or the more generic "" or ""
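The bookkeeping that this line-number mismatch requires can be pictured in plain Python. The sketch below is purely illustrative and is not PythonTeX's actual implementation: we assume the generated script records, for each chunk of user code, the script line at which the chunk starts and the document line of its environment, so a traceback line number in the script can be translated back to a document line.

```python
# Hypothetical bookkeeping: each user-code chunk in the generated script is a
# pair (script line where the chunk starts, document line of its environment).
# Boilerplate/management code occupies the script lines before the first chunk.
chunks = [
    (5, 12),   # first pycode environment: script line 5 <-> document line 12
    (9, 40),   # second environment starts at script line 9, document line 40
    (15, 73),
]

def script_to_document_line(script_line, chunks):
    """Map a traceback line in the generated script to a document line.

    Returns None for errors raised inside boilerplate/management code."""
    doc_line = None
    for start, env_line in chunks:
        if script_line >= start:
            # The offset within the chunk carries over to the document.
            doc_line = env_line + (script_line - start)
        else:
            break
    return doc_line

print(script_to_document_line(10, chunks))  # error on script line 10 -> 41
print(script_to_document_line(3, chunks))   # inside boilerplate -> None
```

A real implementation would also have to account for code split across several commands in the same session, but the translation step itself reduces to this kind of offset arithmetic.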
