Introduction to Python for Econometrics, Statistics and Data Analysis


3rd Edition, 1st Revision
Kevin Sheppard
University of Oxford
Monday 9th September, 2019

©2019 Kevin Sheppard

Changes since the Third Edition

• Verified that all code and examples work correctly against 2019 versions of modules. The notable packages and their versions are:
  – Python 3.7 (Preferred version)
  – NumPy: 1.16
  – SciPy: 1.3
  – pandas: 0.25
  – matplotlib: 3.1
• Python 2.7 support has been officially dropped, although most examples continue to work with 2.7. Do not use Python 2.7 in 2019 for numerical code.
• Small typo fixes, thanks to Marton Huebler.
• Fixed direct download of FRED data due to API changes, thanks to Jesper Termansen.
• Thanks to Bill Tubbs for a detailed read and multiple typo reports.
• Updated for changes in line profiler (see Ch. 24).
• Updated deprecations in pandas.
• Removed hold from plotting chapter since this is no longer required.
• Thanks to Gen Li for multiple typo reports.
• Tested all code on Python 3.6. Code has been tested against the current set of modules installed by conda as of February 2018. The notable packages and their versions are:
  – NumPy: 1.13
  – pandas: 0.22

Notes to the 3rd Edition

This edition includes the following changes from the second edition (August 2014):

• Rewritten installation section focused exclusively on using Continuum’s Anaconda.
• Python 3.5 is the default version of Python instead of 2.7. Python 3.5 (or newer) is well supported by the Python packages required to analyze data and perform statistical analysis, and brings some new useful features, such as a new operator for matrix multiplication (@).
• Removed distinction between integers and longs in built-in data types chapter. This distinction is only relevant for Python 2.7.
• dot has been removed from most examples and replaced with @ to produce more readable code.
• Split Cython and Numba into separate chapters to highlight the improved capabilities of Numba.
• Verified all code working on current versions of core libraries using Python 3.5.
• pandas
  – Updated syntax of pandas functions such as resample.
  – Added pandas Categorical.
  – Expanded coverage of pandas groupby.
  – Expanded coverage of date and time data types and functions.
• New chapter introducing statsmodels, a package that facilitates statistical analysis of data. statsmodels includes regression analysis, Generalized Linear Models (GLM) and time-series analysis using ARIMA models.

Changes since the Second Edition

• Fixed typos reported by a reader – thanks to Ilya Sorvachev.
• Code verified against Anaconda 2.0.1.
• Added diagnostic tools and a simple method to use external code in the Cython section.
• Updated the Numba section to reflect recent changes.
• Fixed some typos in the chapter on Performance and Optimization.
• Added examples of joblib and IPython’s cluster to the chapter on running code in parallel.
• New chapter introducing object-oriented programming as a method to provide structure and organization to related code.
• Added seaborn to the recommended package list, and included it by default in the graphics chapter.
• Based on experience teaching Python to economics students, the recommended installation has been simplified by removing the suggestion to use virtual environments. The discussion of virtual environments has been moved to the appendix.
• Rewrote parts of the pandas chapter.
• Changed the Anaconda install to use both create and install, which shows how to install additional packages.
• Fixed some missing packages in the direct install.
• Changed the configuration of IPython to reflect best practices.
• Added subsection covering IPython profiles.
• Small section about Spyder as a good starting IDE.

Notes to the 2nd Edition

This edition includes the following changes from the first edition (March 2012):

• The preferred installation method is now Continuum Analytics’ Anaconda. Anaconda is a complete scientific stack and is available for all major platforms.
• New chapter on pandas. pandas provides a simple but powerful tool to manage data and perform preliminary analysis. It also greatly simplifies importing and exporting data.
• New chapter on advanced selection of elements from an array.
• Numba provides just-in-time compilation for numeric Python code which often produces large performance gains when pure NumPy solutions are not available (e.g. looping code).
• Dictionary, set and tuple comprehensions.
• Numerous typos fixed.
• All code has been verified working against Anaconda 1.7.0.

Contents

1 Introduction
  1.1 Background
  1.2 Conventions
  1.3 Important Components of the Python Scientific Stack
  1.4 Setup
  1.5 Using Python
  1.6 Exercises
  1.A Additional Installation Issues

2 Python 2.7 vs. 3 (and the rest)
  2.1 Python 2.7 vs. 3.x
  2.2 Intel Math Kernel Library and AMD’s GPUOpen Libraries
  2.3 Other Variants
  2.A Relevant Differences between Python 2.7 and 3

3 Built-in Data Types
  3.1 Variable Names
  3.2 Core Native Data Types
  3.3 Additional Container Data Types in the Standard Library
  3.4 Python and Memory Management
  3.5 Exercises

4 Arrays and Matrices
  4.1 Array
  4.2 Matrix
  4.3 1-dimensional Arrays
  4.4 2-dimensional Arrays
  4.5 Multidimensional Arrays
  4.6 Concatenation
  4.7 Accessing Elements of an Array
  4.8 Slicing and Memory Management
  4.9 import and Modules
  4.10 Calling Functions
  4.11 Exercises

5 Basic Math
  5.1 Operators
  5.2 Broadcasting
  5.3 Addition (+) and Subtraction (-)
  5.4 Multiplication (*)
  5.5 Matrix Multiplication (@)
  5.6 Array and Matrix Division (/)
  5.7 Exponentiation (**)
  5.8 Parentheses
  5.9 Transpose
  5.10 Operator Precedence
  5.11 Exercises

6 Basic Functions and Numerical Indexing
  6.1 Generating Arrays and Matrices
  6.2 Rounding
  6.3 Mathematics
  6.4 Complex Values
  6.5 Set Functions
  6.6 Sorting and Extreme Values
  6.7 Nan Functions
  6.8 Functions and Methods/Properties
  6.9 Exercises

7 Special Arrays
  7.1 Exercises

8 Array and Matrix Functions
  8.1 Views
  8.2 Shape Information and Transformation
  8.3 Linear Algebra Functions
  8.4 Exercises

9 Importing and Exporting Data
  9.1 Importing Data using pandas
  9.2 Importing Data without pandas
  9.3 Saving or Exporting Data using pandas
  9.4 Saving or Exporting Data without pandas
  9.5 Exercises

10 Inf, NaN and Numeric Limits
  10.1 inf and NaN
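
The notes to the 3rd edition say that dot has been replaced with @ in most examples. As a minimal sketch of what that change looks like, assuming NumPy is installed, the two forms can be compared as follows. The arrays x and y and their values are illustrative assumptions and are not taken from the book.

```python
import numpy as np

# Illustrative 2x2 arrays; the values are assumptions chosen for this example.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[0.0, 1.0], [1.0, 0.0]])

# Matrix multiplication with the @ operator (requires Python 3.5 or newer).
z_at = x @ y

# The same matrix product written with the older function-call form.
z_dot = np.dot(x, y)

# Element-by-element multiplication uses *, which is a different operation.
z_elem = x * y

# True when both spellings of the matrix product give identical results.
same_product = np.array_equal(z_at, z_dot)
```

For 2-dimensional arrays the two spellings give the same product, so the choice is mainly about readability: a chain such as x @ y @ x reads left to right, while the dot form nests function calls.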