Proceedings of the 8th Python in Science Conference (SciPy 2009)

Fast numerical computations with Cython

Dag Sverre Seljebotn ([email protected]) – University of Oslo^{a,b,c}, Norway

^a Institute of Theoretical Astrophysics, University of Oslo, P.O. Box 1029 Blindern, N-0315 Oslo, Norway
^b Department of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway
^c Centre of Mathematics for Applications, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway

Cython has recently gained popularity as a tool for conveniently performing numerical computations in the Python environment, as well as mixing efficient calls to natively compiled libraries with Python code. We discuss Cython's features for fast array access in detail through examples and benchmarks. We also consider using Cython to call natively compiled scientific libraries, as well as using Cython in parallel computations. We conclude with a note on possible directions for future Cython development.

Introduction

Python has in many fields become a popular choice for scientific computation and visualization. Being designed as a general-purpose scripting language without a specific target audience in mind, it tends to scale well as simple experiments grow to complex applications. From a numerical perspective, Python and associated libraries can be regarded mainly as a convenient shell around computational cores written in natively compiled languages, such as C, C++ and Fortran. For instance, the Python-specific SciPy [SciPy] library contains over 200 000 lines of C++, 60 000 lines of C, and 75 000 lines of Fortran, compared to about 70 000 lines of Python code.

There are several good reasons for such a workflow. First, if the underlying compiled library is usable in its own right, and also has end-users writing code in MATLAB, C++ or Fortran, it may make little sense to tie it too strongly to the Python environment. In such cases, writing the computational cores in a compiled language and using a Python wrapper to direct the computations can be the ideal workflow. Secondly, as we will see, the Python interpreter is too slow to be usable for writing low-level numerical loops. This is particularly a problem for computations which cannot be expressed as operations on entire arrays.

Cython is a programming language based on Python, with additional syntax for optional static type declarations. The Cython compiler is able to translate Cython code into C code making use of the CPython C API [CPyAPI], which can in turn be compiled into a module loadable into any CPython session. The end result can perhaps be described as a language which allows one to use Python and C interchangeably in the same code. This has two important applications. First, it is useful for creating Python wrappers around natively compiled code, in particular in situations where one does not want a 1:1 mapping between the library API and the Python API, but rather a higher-level, Pythonic wrapper. Secondly, it allows incrementally speeding up Python code. One can start out with a simple Python prototype, then proceed to incrementally add type information and C-level optimization strategies in the few locations that really matter.

While being a superset of Python is a goal for Cython, there are currently a few incompatibilities and unsupported constructs. The most important of these are inner functions and generators (closure support).

In this paper we will discuss Cython from a numerical computation perspective. Code is provided for illustration purposes and the syntax is not explained in full; for a detailed introduction to Cython we refer to [Tutorial] and [Docs]. [Wilbers] compares Cython with similar tools ([f2py], [Weave], [Instant] and [Psyco]). The comparison is for speeding up a particular numerical loop, and both speed and usability are discussed. In that comparison, Cython achieves a running time 1.6 times that of the Fortran implementation. We note that had the arrays been declared as contiguous at compile-time, this would have been reduced to 1.3 times the time of Fortran. [Ramach] is a similar set of benchmarks, which compare Pyrex and other tools with a pure Python/NumPy implementation. Cython is based on [Pyrex] and the same results should apply, the main difference being that Cython has friendlier syntax for accessing NumPy arrays efficiently.

Fast array access

Fast array access, added to the Cython language by D. S. Seljebotn and R. W. Bradshaw in 2008, was an important improvement in convenience for numerical users. The work is based on PEP 3118 [PEP3118], which defines a C API for direct access to the array data of Python objects acting as array data containers.^1

^1 PEP 3118 is only available on Python 2.6 and greater; therefore a backwards-compatibility mechanism is also provided to emulate the protocol on older Python versions. This mechanism is also used in the case of NumPy arrays, which do not yet support PEP 3118, even on Python 2.6.

Cython is able to treat most of the NumPy array data types as corresponding native C types. Since Cython 0.11.2, complex floating point types are supported, either through the C99 complex types or through Cython's own implementation. Record arrays are mapped to arrays of C structs for efficient access. Some data types are not supported, such as string/unicode arrays, arrays with non-native endianness, and boolean arrays. The latter can however be treated as 8-bit integer arrays in Cython. [Tutorial] contains further details.
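As an illustration of the last point, one can reinterpret a boolean array as 8-bit integers on the NumPy side and pass that view to a typed Cython function. The function below is our own minimal sketch, not code from the paper:

    cimport numpy as np

    def count_true(np.ndarray[np.uint8_t, ndim=1] mask):
        # Count nonzero entries in a plain typed loop.
        cdef Py_ssize_t i, n = 0
        for i in range(mask.shape[0]):
            if mask[i]:
                n += 1
        return n

It could be called as count_true(bool_arr.view(np.uint8)), which reinterprets the boolean data as bytes without copying.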


To discuss this feature we will start with the example of naive matrix multiplication. Beginning with a pure Python implementation, we will incrementally add optimizations. The benchmarks should help Cython users decide how far one wants to go in other cases. For C = AB the computation is

    C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

where n is the number of columns in A and rows in B. A basic implementation in pure Python looks like this:

    def matmul(A, B, out):
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                s = 0
                for k in range(A.shape[1]):
                    s += A[i, k] * B[k, j]
                out[i,j] = s

For clarity of exposition, this skips the details of sanity checking the arguments. In a real setting one should probably also automatically allocate out if it is not provided by the caller.

Simply compiling this in Cython results in about a 1.15x speedup over Python. This minor speedup is due to the compiled C code being faster than Python's bytecode interpreter. The generated C code still uses the Python C API, so that e.g. the array lookup A[i, k] translates into C code very similar to:

    tmp = PyTuple_New(2);
    if (!tmp) { err_lineno = 21; goto error; }
    Py_INCREF(i);
    PyTuple_SET_ITEM(tmp, 0, i);
    Py_INCREF(k);
    PyTuple_SET_ITEM(tmp, 1, k);
    A_ik = PyObject_GetItem(A, tmp);
    if (!A_ik) { err_lineno = 21; goto error; }
    Py_DECREF(tmp);

The result is a pointer to a Python object, which is further processed with PyNumber_Multiply and so on. To get any real speedup, types must be added:

    import numpy as np
    cimport numpy as np

    ctypedef np.float64_t dtype_t

    def matmul(np.ndarray[dtype_t, ndim=2] A,
               np.ndarray[dtype_t, ndim=2] B,
               np.ndarray[dtype_t, ndim=2] out=None):
        cdef Py_ssize_t i, j, k
        cdef dtype_t s
        if A is None or B is None:
            raise ValueError("Input matrix cannot be None")
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                s = 0
                for k in range(A.shape[1]):
                    s += A[i, k] * B[k, j]
                out[i,j] = s

In our benchmarks this results in a speedup of between 180 and 190 times over pure Python. The exact factor varies depending on the size of the matrices. If they are not small enough to be kept in the CPU cache, the data must be transported repeatedly over the memory bus. This is close to equally expensive for Python and Cython and thus tends to slightly diminish any other effects. Table 1 has the complete benchmarks, with one in-cache benchmark and one out-of-cache benchmark in every case.
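A hypothetical call site could look as follows; the module name matmul_typed is a placeholder of ours, and the caller is responsible for allocating out with the required 64-bit floating point dtype:

    import numpy as np
    from matmul_typed import matmul  # hypothetical compiled module name

    A = np.random.rand(80, 80)
    B = np.random.rand(80, 80)
    out = np.empty((A.shape[0], B.shape[1]), dtype=np.float64)
    matmul(A, B, out)
    assert np.allclose(out, np.dot(A, B))  # sanity check against NumPy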
Note however that the speedup does not come without some costs. First, the routine is now only usable for 64-bit floating point. Arrays containing any other data type will result in a ValueError being raised. Second, it is necessary to ensure that typed variables containing Python objects are not None. Failing to do so can result in a crash or data corruption if None is passed to the routine.

The generated C source for the array lookup A[i, k] now looks like this:

    tmp_i = i; tmp_k = k;
    if (tmp_i < 0) tmp_i += A_shape_0;
    if (tmp_i < 0 || tmp_i >= A_shape_0) {
        PyErr_Format(<...>);
        err_lineno = 33; goto error;
    }
    if (tmp_k < 0) tmp_k += A_shape_1;
    if (tmp_k < 0 || tmp_k >= A_shape_1) {
        PyErr_Format(<...>);
        err_lineno = 33; goto error;
    }
    A_ik = *(dtype_t*)(A_data + tmp_i * A_stride_0
                       + tmp_k * A_stride_1);

This is a lot faster because there are no API calls in a normal situation, and access of the data happens directly at the underlying memory location. The initial conditional tests are there for two reasons. First, an if-test is needed to support negative indices. With the usual Python semantics, A[-1, -1] should refer to the lower-right corner of the matrix. Second, it is necessary to raise an exception if an index is out of bounds.

Such if-tests can bring a large speed penalty, especially in the middle of the computational loop. It is therefore possible to instruct Cython to turn off these features through compiler directives. The following code disables support for negative indices (wraparound) and bounds checking:

    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def matmul(np.ndarray[dtype_t, ndim=2] A,
               np.ndarray[dtype_t, ndim=2] B,
               np.ndarray[dtype_t, ndim=2] out=None):
        <...>

This removes all the if-tests from the generated code. The resulting benchmarks indicate around 800 times speedup at this point in the in-cache situation, and 700 times out-of-cache. Only disabling one of either wraparound or boundscheck will not have a significant impact because there will still be if-tests left.

One trade-off is that should one access the arrays out of bounds, one will get data corruption or a program crash. The normal procedure is to leave bounds checking on until one is completely sure that the code is correct, then turn it off. In this case we are not using negative indices, and it is an easy decision to turn them off.

Even if we were, it would be faster to manually add code to calculate the corresponding positive index. wraparound is mainly enabled by default to reduce the number of surprises for casual Cython users, and one should rarely leave it on.

In addition to the per-function level used here, compiler directives can also be specified globally for the source file or for individual code blocks. See [directives] for further information.
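As an illustration (our own sketch, not from the paper's benchmark code), a file-level directive comment and a directive used as a context manager might look like this; the row_sum function is a made-up example:

    # cython: boundscheck=False
    # File-level directive comment: applies to the whole module.
    cimport cython
    cimport numpy as np

    def row_sum(np.ndarray[np.float64_t, ndim=2] A):
        cdef Py_ssize_t i, j
        cdef np.float64_t s = 0
        # Directive as a context manager: applies only to this block.
        with cython.wraparound(False):
            for i in range(A.shape[0]):
                for j in range(A.shape[1]):
                    s += A[i, j]
        return s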
    Matrix size                  80x80   1500x1500
    (units: MFLOPS)
    Optimal layout:
      Python                      0.94        0.98
      Cython                      1.08        1.12
      Added types                  179         177
      boundscheck/wraparound       770         692
      mode="c"/mode="fortran"      981         787
      BLAS ddot (ATLAS)           1282         911
      Intel C                     2560        1022
      gfortran A^T B              1113         854
      Intel Fortran A^T B         2833        1023
      NumPy dot                   3656        4757
    Worst-case layout:
      Python                      0.94        0.97
      boundscheck/wraparound       847         175
      BLAS ddot (ATLAS)            910         183
      gfortran A B^T               861          94
      Intel Fortran A B^T          731          94

    Table 1: Matrix multiplication benchmarks on an Intel Xeon 3.2 GHz, 2 MB cache, SSE2, in MFLOPS. The smaller data set fits in cache, while the larger does not. Keep in mind that different implementations have different constant-time overhead, which e.g. explains why NumPy dot does better for the larger data set.

Caring about memory layout

Both C/C++ and Fortran assume that arrays are stored as one contiguous chunk in memory. NumPy arrays depart from this tradition and allow for arbitrarily strided arrays. Consider the example of B = A[::-2,:]. That is, let B be the array consisting of every other row in A, in reverse order. In many environments, the need for representing B contiguously in memory mandates that a copy is made in such situations. NumPy supports a wider range of array memory layouts and can in this situation construct B as a new view to the same data that A refers to. The benefit of the NumPy approach is that it is more flexible and allows avoiding copying of data. This is especially important if one has huge data sets where the main memory might only be able to hold one copy at a time. With choice does however come responsibility. In order to gain top performance with numerical computations it is in general crucial to pay attention to memory layout.

In the example of matrix multiplication, the first matrix is accessed row-by-row and the second column-by-column. The first matrix should thus be stored with contiguous rows, "C contiguous", while the second should be stored with contiguous columns, "Fortran contiguous". This will keep the distance between subsequent memory accesses as small as possible. In an out-of-cache situation, our fastest matmul routine so far does around 700 times better than pure Python when presented with matrices with optimal layout, but only around 180 times better with the worst-case layout. See table 1.

The second issue concerning memory layout is that a price is paid for the generality of the NumPy approach: in order to address arbitrarily strided arrays, an extra integer multiplication operation must be done per access. In the matmul implementation above, A[i, k] was translated to the following C code:

    A_ik = *(dtype_t*)(A_data + i * A_stride_0
                       + k * A_stride_1);

By telling Cython at compile-time that the arrays are contiguous, it is possible to drop the innermost stride multiplication. This is done by using the "mode" argument to the array type:

    def matmul(np.ndarray[dtype_t, ndim=2, mode="c"] A,
               np.ndarray[dtype_t, ndim=2, mode="fortran"] B,
               np.ndarray[dtype_t, ndim=2] out=None):
        <...>

The in-cache benchmark now indicates a 780x speedup over pure Python. The out-of-cache improvement is smaller but still noticeable. Note that no restrictions are put on the out argument; doing so did not lead to any significant speedup, as out is not accessed in the inner loop.

The catch is, of course, that the routine will now reject arrays which do not satisfy the requirements. This happens by raising a ValueError. One can make sure that arrays are allocated using the right layout by passing the "order" argument to most NumPy array constructor functions, as well as to the copy() method of NumPy arrays. Furthermore, if an array A is C-contiguous, then the transpose, A.T, will be a Fortran-contiguous view of the same data.

The number of memory accesses when multiplying two n×n matrices scales as O(n^3), while copying the matrices scales as O(n^2). One would therefore expect, given that enough memory is available, that making a temporary copy pays off once n passes a certain threshold. In this case benchmarks indicate that the threshold is indeed very low (in the range of n = 10) and one would typically copy in all situations. The NumPy functions ascontiguousarray and asfortranarray are helpful in such situations.
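The following short NumPy session (our own illustration, using standard NumPy functions) shows these layout controls in action:

    import numpy as np

    A = np.zeros((1000, 1000))             # C contiguous by default
    B = np.zeros((1000, 1000), order="F")  # Fortran contiguous

    # A.T is a Fortran-contiguous view of A; no data is copied.
    assert A.T.flags["F_CONTIGUOUS"]

    # Force a particular layout, copying only when necessary.
    C = np.ascontiguousarray(B)   # C-contiguous copy of B
    D = np.asfortranarray(A)      # Fortran-contiguous copy of A
    E = B.copy(order="C")         # explicit copy with a requested order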


Calling an external library

The real-world use case for Cython is to speed up custom numerical loops for which there are no prior implementations available. For a simple example like matrix multiplication, going with existing implementations is always better. For instance, NumPy's dot function is about 6 times faster than our fastest Cython implementation, since it uses smarter algorithms. Under the hood, dot makes a call to the dgemm function in the Basic Linear Algebra Subprograms API ([BLAS], [ATLAS]).^2

^2 BLAS is an API with many implementations; the benchmarks in this paper are based on the open-source ATLAS implementation, custom-compiled on the host by Sage [Sage].

One advantage of Cython is how easy it is to call native code. Indeed, for many, this is the entire point of using Cython. We will demonstrate these features by calling BLAS for the inner products only, rather than for the whole matrix multiplication. This allows us to stick with the naive matrix multiplication algorithm, and also demonstrates how to mix Cython code and the use of external libraries. The BLAS API must first be declared to Cython:

    cdef extern from "cblas.h":
        double ddot "cblas_ddot"(int N,
                                 double *X, int incX,
                                 double *Y, int incY)

The need to re-declare functions which are already declared in C is unfortunate and an area of possible improvement for Cython. Only the declarations that are actually used need to be provided. Note also the use of C pointers to represent arrays. BLAS also accepts strided arrays and expects the strides passed in the incX and incY arguments. Other C APIs will often require a contiguous array where the stride is fixed to one array element; in the previous section it was discussed how one can ensure that arrays are contiguous.

The matrix multiplication can now be performed like this:

    ctypedef np.float64_t dtype_t

    def matmul(np.ndarray[dtype_t, ndim=2] A,
               np.ndarray[dtype_t, ndim=2] B,
               np.ndarray[dtype_t, ndim=2] out):
        cdef Py_ssize_t i, j
        cdef np.ndarray[dtype_t, ndim=1] A_row, B_col
        for i in range(A.shape[0]):
            A_row = A[i,:]
            for j in range(B.shape[1]):
                B_col = B[:, j]
                out[i,j] = ddot(
                    A_row.shape[0],
                    <dtype_t*>A_row.data,
                    A_row.strides[0] // sizeof(dtype_t),
                    <dtype_t*>B_col.data,
                    B_col.strides[0] // sizeof(dtype_t))

This demonstrates how NumPy array data can be passed to C code. Note that NumPy strides are in number of bytes, while BLAS expects them in number of elements. Also, because the array variables are typed, the "data" attributes are C pointers rather than Python buffer objects.

Unfortunately, this results in a slowdown for moderate n. This is due to the slice operations. An operation like A_row = A[i,:] is a Python operation and results in Python call overhead and the construction of several new objects. Cython is unlikely to ever optimize slicing of np.ndarray variables because it should remain possible to use subclasses of ndarray with a different slicing behaviour.^3 The new memory view type, discussed below, represents a future solution to this problem. Another solution is to do the slice calculations manually and use C pointer arithmetic:

    out[i,j] = ddot(
        A.shape[1],
        <dtype_t*>(A.data + i*A.strides[0]),
        A.strides[1] // sizeof(dtype_t),
        <dtype_t*>(B.data + j*B.strides[1]),
        B.strides[0] // sizeof(dtype_t))

This version leads to over 1300 times speedup over pure Python in the optimal, in-cache situation. This is due to BLAS using the SSE2 instruction set, which enables doing two double-precision floating point multiplications in one CPU instruction. When non-contiguous arrays are used the performance drops to 970 times that of pure Python (in-cache), as SSE2 can no longer be used.

^3 This principle has of course already been violated, as one could change the behaviour of element indexing in a subclass as well. The policy is however not to go further in this direction. Note also that only indexing with typed integer variables is optimized; A[i, some_untyped_var] is not optimized, as the latter index could e.g. point to a Python slice object.

For more advanced array data passing, Cython makes it easy to make use of NumPy's C API. Consider for instance a custom C library which returns a pointer to some statically allocated data, which one would like to view as a NumPy array. Using Cython and the NumPy C API this is easily achieved:

    cimport numpy as np

    cdef extern from "mylib.h":
        cdef int get_my_data(double** out_data, int* out_size)

    def my_data_as_ndarray():
        cdef np.npy_intp* shape = [0]
        cdef int arr_length
        cdef double* arr_ptr
        if get_my_data(&arr_ptr, &arr_length) != 0:
            raise RuntimeError("get_my_data failed")
        shape[0] = arr_length
        return np.PyArray_SimpleNewFromData(1, shape,
                                            np.NPY_DOUBLE, arr_ptr)

SSE and vectorizing C compilers

What about using SSE directly in a Cython program? One possibility is to use the SSE API of the C compiler. The details vary according to the C compiler used, but most C compilers offer a set of special functions which correspond directly to SSE instructions. These can be used as any other C function, also from Cython code. In general, such code tends to be somewhat more complicated, as the first element in every loop must be treated as a special case (in case the elements involved are not aligned on 128-bit boundaries in memory, as required by SSE). We have not included code or benchmarks for this approach.
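Purely as an illustration of what such intrinsics-based code can look like (this sketch is ours, is not benchmarked, and assumes contiguous, 16-byte-aligned input), the SSE2 intrinsics from emmintrin.h can be declared and called from Cython like ordinary C functions:

    cdef extern from "emmintrin.h":
        ctypedef struct __m128d:
            pass
        __m128d _mm_setzero_pd()
        __m128d _mm_load_pd(double* p)
        __m128d _mm_mul_pd(__m128d a, __m128d b)
        __m128d _mm_add_pd(__m128d a, __m128d b)
        void _mm_storeu_pd(double* p, __m128d a)

    cdef double dot_sse2(double* x, double* y, Py_ssize_t n):
        # Inner product of two contiguous, 16-byte-aligned double arrays.
        cdef __m128d acc = _mm_setzero_pd()
        cdef double tmp[2]
        cdef double s
        cdef Py_ssize_t i = 0
        while i + 2 <= n:
            # Multiply and accumulate two doubles per instruction.
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_load_pd(x + i),
                                             _mm_load_pd(y + i)))
            i += 2
        _mm_storeu_pd(tmp, acc)
        s = tmp[0] + tmp[1]
        while i < n:       # scalar tail for odd n
            s += x[i] * y[i]
            i += 1
        return s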


Another popular approach to SSE is to use a "vectorizing" C compiler. For instance both Intel's C compiler, [ICC], and the GNU C compiler, [GCC], can recognize certain loops as being fit for SSE optimization (and other related optimizations). Note that this is a non-trivial task, as multiple loop iterations are combined and the loop must typically be studied as a whole. Unfortunately, neither ICC v. 11 nor GCC v. 4.3.3 managed to vectorize the kind of code Cython outputs by default. After some manual code massaging we managed to have ICC compile a vectorized version, which is included in the benchmarks. We did not manage to get GCC to vectorize the kind of sum-reduce loop used above.

It appears that Cython has some way to go to be able to benefit from vectorizing C compilers. Improving Cython so that the generated C code is more easily vectorizable should be possible, but has not been attempted thus far. Another related area of possible improvement is to support generating code containing the C99 restrict modifier, which can be used to provide guarantees that arrays do not overlap. GCC (but not ICC) needs this to be present to be able to perform vectorization.

Linear time: Comparisons with NumPy

We turn to some examples with linear running time. In all cases the computation can easily be expressed in terms of operations on whole arrays, allowing comparison with NumPy.

First, we consider finding elements with a given value in a one-dimensional array. This operation can be performed in NumPy as:

    haystack = get_array_data()
    result = np.nonzero(haystack == 20)

This results in an array of indices, listing every element equal to 20. If the goal is simply to extract the first such element, one can instead use a very simple loop in Cython, as sketched below. This avoids constructing a temporary array and the result array. A simple Cython loop then performed about five times faster than the NumPy code in our benchmarks. The point here is merely that Cython allows easily writing code specifically tailored for the situation at hand, which sometimes can bring speed benefits.
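Such a loop might look like the following sketch (ours, not the exact code used in the benchmarks); a nogil variant of the same function reappears in the section on parallel computation:

    cimport numpy as np

    def find_first(np.ndarray[np.int64_t] haystack, np.int64_t needle):
        # Return the index of the first occurrence of needle, or -1.
        cdef Py_ssize_t i
        for i in range(haystack.shape[0]):
            if haystack[i] == needle:
                return i
        return -1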
Another example is that of operating on a set of array elements matching some filter. For instance, consider transforming all 2D points within a given distance from zero:

    Point_dtype = np.dtype([('x', np.float64),
                            ('y', np.float64)])
    points = load_point_data(filename, Point_dtype)
    radius = 1.2
    tmp = points['x']**2
    tmp += points['y']**2
    pointset = tmp < radius**2
    points['x'][pointset] *= 0.5
    points['y'][pointset] *= 0.3

This code uses a number of temporary arrays to perform the calculation. In an in-cache situation, the overhead of constructing the temporary Python arrays becomes noticeable. In an out-of-cache situation, the data has to be transported many times over the memory bus. The situation is worsened by using a record array (as more data is transported over the bus in total, and less cache becomes available). Using separate arrays for x and y results in a small speedup; both array layouts are included in the benchmarks.

A Cython loop is able to do the operation in a single pass, so that the data is only transported once:

    cdef packed struct Point:
        np.float64_t x, y

    def transform_within_circle(np.ndarray[Point] points,
                                np.float64_t radius):
        cdef Py_ssize_t i
        cdef Point p
        cdef np.float64_t radius_sq = radius**2
        for i in range(points.shape[0]):
            p = points[i]
            if p.x**2 + p.y**2 < radius_sq:
                p.x *= 0.5
                p.y *= 0.3
                points[i] = p

This is 10 times faster than the NumPy code for a large data set, due to the heavily reduced memory bus traffic. NumPy also uses twice as much memory due to the temporaries. If the data set is several gigabytes, then the additional memory used by NumPy could mean the difference between swapping and not swapping to disk. For Cython, operating on separate x and y arrays is slightly slower. See table 2.

Finally, it would have been possible to separate the filter and the transform by passing a callback to be called in each iteration of the loop. By making use of Cython extension type classes, which have faster method dispatch than Python classes, the penalty of such an approach is only around 20-25%; a sketch follows table 2. [Tutorial] demonstrates such callbacks.

    Million elements processed per second
    (2 × 10^7 element test set)    Records   Separate
    Python loop                      0.028      0.069
    NumPy                            9.5        10
    Cython plain                     95         79
    Cython optimized                 110        100
    Cython w/callback                79         73

    Table 2: Benchmarks for operating on points within a circle. The optimized Cython version and the callback version both have boundscheck/wraparound turned off and mode='c' specified. All benchmarks are on an Intel Xeon 3.2 GHz, 2 MB cache. All points were within the circle in the test data set.
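The callback variant mentioned above could be structured roughly as follows. This is our own sketch of the pattern, reusing the Point struct and cimports from the listing above; [Tutorial] shows the approach in full:

    cdef class PointTransform:
        # Extension type: cdef methods are dispatched much faster
        # than Python-level methods.
        cdef void apply(self, Point* p):
            p.x *= 0.5
            p.y *= 0.3

    def transform_within_circle_cb(np.ndarray[Point] points,
                                   np.float64_t radius,
                                   PointTransform callback):
        cdef Py_ssize_t i
        cdef Point p
        cdef np.float64_t radius_sq = radius**2
        for i in range(points.shape[0]):
            p = points[i]
            if p.x**2 + p.y**2 < radius_sq:
                callback.apply(&p)     # fast cdef method call
                points[i] = p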


Parallel computation

When discussing parallel computation there is an important distinction between shared-memory models and message passing models. We start by discussing the shared-memory case. A common approach in parallel numerical code is to use OpenMP. While not difficult to support in principle, OpenMP is currently not available in Cython. Instead, Python threads must be used. This comes with some problems, but they can be worked around.

Threading is a problem with CPython because every operation involving a Python object must be done while holding the Global Interpreter Lock (GIL). The result is that pure Python scripts are typically unable to utilize more than one CPU core, even if many threads are used. It should be noted that for many computational scripts this does not matter. If the bulk of the computation happens in wrapped, native code (like in the case of NumPy or SciPy) then the GIL is typically released during the computation. For Cython the situation is worse. Once inside Cython code, the GIL is by default held until one returns to the Python caller. The effect is that threads don't switch at all. Whereas a pure Python script will tend to switch between threads on a single CPU core, a Cython program will by default tend to not switch threads at all.

The solution is to use Cython language constructs to manually release the GIL. One can then achieve proper multi-threading on many cores. The catch is that no Python operations are allowed when the GIL is released; for instance, all variables used must be typed with a C type. Optimized NumPy array lookups are allowed. The Cython compiler will help enforce these rules. Example:

    @cython.boundscheck(False)
    def find_first(np.ndarray[np.int64_t] haystack, np.int64_t needle):
        cdef Py_ssize_t i, ret = -1
        with nogil:
            for i from 0 <= i < haystack.shape[0]:
                if haystack[i] == needle:
                    ret = i; break
        return ret

Without nogil, invocations from separate threads would be serialized. Returning the result is a Python operation, so that has to be put outside of the nogil block. Furthermore, boundscheck must be turned off, as raising an IndexError would require the GIL.^4 [Docs] contains further information on the various primitives for managing the GIL (search for "nogil").

^4 Finally, a different for loop syntax must be used, but this restriction will disappear in Cython 0.12.
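To actually use several cores, one would drive such a function from ordinary Python threads, for instance splitting the array between workers. The driver below is our own sketch and assumes find_first has been compiled into a hypothetical module named search:

    import threading
    import numpy as np
    from search import find_first   # hypothetical compiled module

    haystack = np.random.randint(0, 1000, size=10**7).astype(np.int64)
    nthreads = 4
    chunks = np.array_split(haystack, nthreads)
    results = [None] * nthreads

    def worker(idx):
        # Each call releases the GIL inside the nogil block, so the
        # chunks are searched concurrently on separate cores.
        # (Returned indices are relative to each chunk.)
        results[idx] = find_first(chunks[idx], 20)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()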
The message passing case is much simpler. Several Python interpreters are launched, each in its own process, so that the GIL is not an issue. A popular approach is mpi4py [mpi4py], together with an MPI implementation (such as Open MPI [OpenMPI]). mpi4py very conveniently allows passing full Python objects between computational nodes through Python pickling.^5 It is also possible to efficiently pass NumPy arrays. mpi4py is itself written in Cython, and ships with the Cython compile-time definitions necessary to communicate directly with the underlying MPI C API. It is thus possible to use both interchangeably:

    from mpi4py import MPI
    from mpi4py cimport MPI
    from mpi4py cimport mpi_c

    cdef MPI.Comm comm = MPI.COMM_WORLD

    if comm.Get_rank() == 0:
        # High-level send of Python object
        comm.send({'a': any_python_object, 'b': other}, to=1)
        for <...a lot of small C-typed messages...>:
            # Fast, low-level send of typed C data
            mpi_c.MPI_Send(<...>, comm.ob_mpi)
    elif ...

This is useful e.g. in situations where one wants to pass typed Cython variables and does not want to bother with conversion back and forth to Python objects. It also avoids the overhead of the Python call to mpi4py (although in practice there are likely to be other, larger bottlenecks). While contrived, this example should demonstrate some of the flexibility of Cython with regard to native libraries. mpi4py can be used for sending higher-level data or a few big messages, while the C API can be used for sending C data or many small messages.

^5 In addition to normal Python classes, Cython supports a type of more efficient classes known as "extension types". For efficiency reasons these need explicit implementations provided for pickling and unpickling. See "Pickling and unpickling extension types" in [CPyPickle].

The same principle applies to multi-threaded code: it is possible, with some care, to start the threads through the Python API and then switch to e.g. native OS thread mutexes where any Python overhead would become too large. The resulting code would however be platform-specific, as Windows and Unix-based systems have separate threading APIs.

Conclusions and future work

While the examples shown have been simple and in part contrived, they explore fundamental properties of loops and arrays that should apply in almost any real-world computation. For computations that can only be expressed as for-loops, and which are not available in a standard library, Cython should be a strong candidate. Certainly, for anything but very small amounts of data, a Python loop is unviable. The choice stands between Cython and other natively compiled technologies. Cython may not automatically produce quite as optimized code as e.g. Fortran, but we believe it is fast enough to still be attractive in many cases because of the high similarity with Python. With Cython and NumPy, copying of non-contiguous arrays is always explicit, which can be a huge advantage compared to some other technologies (like Fortran) when dealing with very large data sets.


For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.

Cython remains in active development. Because of the simple principles involved, new features are often easy to add, and are often the result of personal itch-scratching. Sometimes the experience has been that it is quicker to add a feature to Cython than to repeatedly write code to work around an issue. Some highlights of current development:

• Support for function-by-function profiling through the Python cProfile module was added in 0.11.3 (see the sketch after this list).

• Inner functions (closures) are maturing and will be released soon.

• Cython benefited from two Google Summer of Code [GSoC] projects over the summer of 2009, which will result in better support for calling C++ and Fortran code.
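Enabling the profiling support is a one-line directive; the sketch below follows the Cython documentation, with the function itself being a made-up example:

    # cython: profile=True
    # With this directive enabled, calls into this module's functions
    # are visible to Python's cProfile.

    def slow_part(n):
        cdef long i, s = 0
        for i in range(n):
            s += i % 7
        return s

The compiled module can then be profiled from Python in the usual way:

    import cProfile
    import my_module  # hypothetical compiled Cython module
    cProfile.run("my_module.slow_part(10**7)")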
In Python 2.6 return sqrt(x**2 + y**2) and greater, any Python object can export any array- This would translate into a loop over the elements of like data to natively compiled code (like Cython code) x and y. Both a naive translation to C code, SSE in an efficient and standardized manner. In some re- code, GPU code, and use of BLAS or various C++ spects this represents adoption of some NumPy func- linear algebra libraries could eventually be supported. tionality into the Python core. With this trend, it Whether Cython will actually move in this direction seems reasonable that Cython should provide good remains an open question. For now we simply note mechanisms for working with PEP 3118 buffers inde- that even the most naive implementation of the above pendently of NumPy. Incidentally, this will also pro- would lead to at least a 4 times speedup for large arrays vide a nice unified interface for interacting with C and over the corresponding NumPy expression; again due Fortran arrays in various formats. Unlike NumPy, to NumPy’s need for repeatedly transporting the data PEP 3118 buffers also supports pointer indirection- over the memory bus. style arrays, sometimes used in C libraries. With this new feature, the matrix multiplication rou- tine could have been declared as: Acknowledgments def matmul(double[:,:] A, double[:,:] B, DSS wish to thank Stefan Behnel and Robert Brad- double[:,:] out): shaw for enthusiastically sharing their Cython knowl- <...> edge, and his advisor Hans Kristian Eriksen for helpful advice on writing this article. Cython is heavily based

The efficient NumPy array access was added to the language as the result of financial support by the Google Summer of Code program and Enthought Inc. The Centre of Mathematics for Applications at the University of Oslo and the Python Software Foundation provided funding to attend the SciPy '09 conference.

References

[Python] G. van Rossum et al., The Python language, http://python.org.
[NumPy] T. Oliphant, NumPy, http://numpy.scipy.org.
[SciPy] E. Jones, T. Oliphant, P. Peterson, SciPy, http://scipy.org.
[f2py] P. Peterson, f2py: Fortran to Python interface generator, http://scipy.org/F2py.
[Weave] E. Jones, Weave: Tools for inlining C/C++ in Python, http://scipy.org/Weave.
[Instant] M. Westlie, K. A. Mardal, M. S. Alnæs, Instant: Inlining of C/C++ in Python, http://fenics.org/instant.
[Psyco] A. Rigo, Psyco: Python Specializing Compiler, http://psyco.sourceforge.net.
[Sage] W. Stein et al., Sage Mathematics Software, http://sagemath.org/.
[BLAS] L. S. Blackford, J. Demmel, et al., An Updated Set of Basic Linear Algebra Subprograms (BLAS), ACM Trans. Math. Soft., 28(2), pp. 135-151, 2002. http://www.netlib.org/blas/
[ATLAS] C. Whaley and A. Petitet, Minimizing development and maintenance costs in supporting persistently optimized BLAS, Software: Practice and Experience, 35(2), pp. 101-121, 2005. http://math-atlas.sourceforge.net/
[GCC] The GNU C Compiler, http://gcc.gnu.org/.
[ICC] The Intel C Compiler, http://software.intel.com/en-us/intel-compilers/.
[mpi4py] L. Dalcin et al., mpi4py: MPI bindings for Python, http://code.google.com/p/mpi4py/.
[OpenMPI] Open MPI, http://open-mpi.org.
[Wilbers] I. M. Wilbers, H. P. Langtangen, Å. Ødegård, Using Cython to Speed up Numerical Python Programs, Proceedings of MekIT'09, 2009.
[Ramach] P. Ramachandran et al., Performance of NumPy, http://www.scipy.org/PerformancePython.
[Tutorial] S. Behnel, R. W. Bradshaw, D. S. Seljebotn, Cython Tutorial, Proceedings of the 8th Python in Science Conference, 2009.
[Cython] G. Ewing, R. W. Bradshaw, S. Behnel, D. S. Seljebotn, et al., The Cython compiler, http://cython.org.
[Docs] The Cython documentation, http://docs.cython.org.
[Pyrex] G. Ewing, Pyrex: C-Extensions for Python, http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/.
[directives] Cython compiler directives, http://wiki.cython.org/enhancements/compilerdirectives.
[CPyAPI] Python/C API Reference Manual, http://docs.python.org/c-api/.
[CPyPickle] The Python pickle module, http://docs.python.org/library/pickle.html.
[PEP3118] T. Oliphant, C. Banks, PEP 3118 - Revising the buffer protocol, http://www.python.org/dev/peps/pep-3118.
[GSoC] The Google Summer of Code Program, http://code.google.com/soc.
