Performance Analysis and Benchmarking of Python Workloads
Arthur Crapé
Student number: 01502848
Supervisor: Prof. dr. ir. Lieven Eeckhout

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering
Academic year 2019-2020

Preface

It is remarkable how engineering is not confined to its own field. Rather, it shows the student the importance of knowledge. Five years ago, I took on the challenge of getting through university, but little did I know that this was only part of the puzzle. Somehow, the mathematics, the physics and the computer science made me realize how fascinating, riveting and compelling today's world is. They showed me that a degree is not a one-way ticket to success and that the road to success is never-ending. This master's dissertation might put an end to five inspiring, insightful and exciting years of engineering, but it is only the start of the bigger picture. Success, however, is achieved by your own rules, and those achievements should be cherished, appreciated and acknowledged.

For this, I would like to express my deepest gratitude to my supervisor, Prof. Dr. Ir. Lieven Eeckhout. The weekly meetings, practical suggestions and helpful advice were instrumental to the realisation of this dissertation. Thanks should also go to Dr. Ir. Almutaz Adileh for his help at the beginning of this academic year. Last but not least, I cannot begin to express my thanks to my family and friends, who have supported me throughout the years and have proven, one by one, to be invaluable and irreplaceable. Thank you.
Permission of Use and Content

"The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation."

Arthur Crapé, May 27, 2020

Performance Analysis and Benchmarking of Python Workloads
Arthur Crapé
Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering
Academic year 2019-2020
Supervisor: Prof. Dr. Ir. L. Eeckhout
Faculty of Engineering and Architecture, Ghent University

Abstract

As Python becomes increasingly popular, several alternatives to the standard Python implementation, CPython, have been proposed. Numerous studies have benchmarked these approaches, but they either contain too few or too many benchmarks, do not represent current industrial applications, or draw conclusions by focusing solely on execution time. This thesis identifies the main shortcomings of current Python benchmark systems, presents a thorough clarification of the underlying meaning of the principal components of Python implementations, reports a scientifically grounded quantitative performance analysis, and provides a framework to identify the most representative workloads from a sizeable set of benchmarks. Additionally, we apply the framework to a specific use case, the Python JIT compiler implementation PyPy. We rectify the speedup reported by the PyPy Speed Center and find that a select number of benchmarks suffice to draw conclusions similar to those obtained with the entire benchmarking suite.
Although this thesis applies the framework primarily to CPython versus PyPy, the findings and recommendations can be generalized to other Python implementations.

Index Terms: Python, interpreters, PyPy, JIT, benchmarking, clustering

Performance Analysis and Benchmarking of Python Workloads
Arthur Crapé
Supervisor: Lieven Eeckhout

I. INTRODUCTION

Python is a recent programming language that focuses more on ease and speed of development than other programming languages such as Java or C do (https://www.python.org/). It is considered more productive than other languages because it uses less code and comes with several useful libraries for, e.g., machine learning (ML) and web development.

To effectively run a program written in Python, one needs a Python implementation. Various implementations exist, ranging from the default Python interpreter, CPython, to JIT compilers such as PyPy and Numba, to static compilers such as Cython. It is apparent that both the user and the developer of said implementations should have a good understanding of the various implementations across the Python ecosystem.

II. BACKGROUND AND PROBLEM STATEMENT

A. Python Ecosystem

Contrary to a compiler, an interpreter, in its simplest form, executes the generated processor-comprehensible code directly and does not produce an executable that can be distributed. The default implementation of Python is CPython, an interpreter-based implementation written in C, which can be found on the main Python website.

Next to interpreters, Python can also be implemented using a Just-In-Time (JIT) compiler such as PyPy (https://www.pypy.org/), which compiles a program, or certain parts thereof, at runtime. At the heart of PyPy lies a hot-loop identifier, which detects whether a certain part of the code is frequently executed, as dynamically compiling code generally takes a long time. We refer to the full dissertation for a complete overview of the Python ecosystem.

B. Previous Work

The benchmark suite most often referred to is the official Python Performance Benchmark Suite [1]. Firstly, its benchmarks do not seem well grounded: according to the Python and PyPy speed centers (https://speed.python.org/ and https://speed.pypy.org/), the benchmarks run for at most a few seconds, and most of them for only a couple of milliseconds, a time frame in which a JIT implementation cannot show its strengths.

Secondly, the PyPerformance Benchmark Suite uses an arbitrarily selected number of warm-ups for all benchmarks. Benchmarking frameworks should not use an arbitrarily selected number of iterations, must include benchmarks that take more than a few milliseconds, and should use a statistically rigorous method. Benchmark suites should also not include too many benchmarks, in order for the suite to remain useful.

Furthermore, none of the mentioned benchmark suites focus on trending applications of the Python language, namely ML. This raises the question of whether a reported speedup for a given implementation is reliable at all, as the main application, ML, is not included in the measurements.

Finally, the mentioned suites focus only on execution time. Although Redondo et al. [2] also consider memory usage, further research should go into understanding the properties and characteristics of other metrics.

III. GOAL

This thesis is built on four different research parts. For this, a benchmarking framework was set up to benchmark Python workloads in a statistically rigorous way, similarly to Eeckhout et al. Finally, we investigate C-extensions and conclude that more research is needed on this matter.

As this research is a master's dissertation, it was decided to focus only on the aforementioned points in the context of the standard implementation, CPython, versus the JIT compiler PyPy. It is apparent that these procedures can be carried forward to other implementations.

IV. KEY FINDINGS

For all of our results, we normalize every metric by the number of instructions; otherwise, we are unable to draw any meaningful conclusions. For PCA, we also scale our data so that all normalized metrics reside in the [0, 1] interval.

For CPython, no important differences are noted when comparing startup and steady-state PC values. On the contrary, when comparing startup and steady-state for PyPy, we find that several PCs experience different influences from the metrics they are built of, as shown in Figure 1. This is due to the different underlying behaviour of pure interpretation versus JIT compilation. Alongside this, we find that PyPy requires more PCs in order to explain as much variance in the data
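The hot-loop identifier mentioned in the background section can be illustrated with a toy counter-based scheme: each loop increments a counter, and once the counter crosses a threshold the loop is handed to a (here merely simulated) compiler. This is only a sketch of the general idea; the threshold value and the "compile" step are invented for illustration, and PyPy's real tracing JIT is far more involved.

```python
HOT_THRESHOLD = 1000  # invented threshold; PyPy tunes this internally


class HotLoopDetector:
    """Toy model of a counter-based hot-loop identifier: interpret a
    loop until it proves hot, then switch to a 'compiled' version."""

    def __init__(self):
        self.counters = {}
        self.compiled = {}

    def run_loop(self, loop_id, body):
        if loop_id in self.compiled:
            return self.compiled[loop_id]()  # fast (compiled) path
        self.counters[loop_id] = self.counters.get(loop_id, 0) + 1
        if self.counters[loop_id] >= HOT_THRESHOLD:
            # Simulate JIT compilation: cache the body as the
            # 'compiled' entry point for future executions.
            self.compiled[loop_id] = body
        return body()  # interpreted path


detector = HotLoopDetector()
for _ in range(1500):
    detector.run_loop("loop0", lambda: 2 + 2)
print("loop0 compiled:", "loop0" in detector.compiled)  # prints: loop0 compiled: True
```

The design point this toy captures is the one the text makes: compilation is expensive, so it only pays off for code that runs often enough to amortize the cost.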
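The statistically rigorous measurement methodology argued for above can be sketched in a few lines: discard a number of warm-up iterations so that a JIT has reached steady state, then report a mean with a confidence interval instead of a single timing. This is an illustrative sketch, not the dissertation's actual framework; the workload, iteration counts, and normal approximation are placeholder choices.

```python
import statistics
import time


def measure(workload, warmups=5, timed_runs=30):
    """Time a callable, discarding warm-up iterations so that a JIT
    (e.g. PyPy's tracing compiler) can reach steady state first."""
    for _ in range(warmups):       # warm-up: results discarded
        workload()
    samples = []
    for _ in range(timed_runs):    # timed steady-state runs
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    # 95% confidence interval half-width under a normal approximation
    half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width


def bench():                       # placeholder workload
    return sum(i * i for i in range(10_000))


mean, ci = measure(bench)
print(f"{mean * 1e3:.3f} ms +/- {ci * 1e3:.3f} ms")
```

Reporting the interval alongside the mean makes it visible when two implementations are statistically indistinguishable, which a single number hides.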
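The preprocessing described in the key findings, dividing every raw metric by the benchmark's instruction count and then scaling the result into the [0, 1] interval before PCA, can be sketched as follows. The metric names and counter values below are invented for illustration.

```python
def normalize_per_instruction(raw_metrics, instruction_count):
    """Divide every raw event count by the number of executed
    instructions, making benchmarks of different lengths comparable."""
    return {name: value / instruction_count
            for name, value in raw_metrics.items()}


def min_max_scale(column):
    """Scale a list of values into the [0, 1] interval, as is
    commonly done before applying PCA."""
    lo, hi = min(column), max(column)
    if hi == lo:                   # constant column: map to zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical per-benchmark hardware counters (invented numbers):
# (raw event counts, executed instruction count)
benchmarks = {
    "bench_a": ({"cache_misses": 1_200, "branch_misses": 400}, 50_000),
    "bench_b": ({"cache_misses": 9_000, "branch_misses": 700}, 200_000),
    "bench_c": ({"cache_misses": 300, "branch_misses": 90}, 10_000),
}
normalized = {name: normalize_per_instruction(metrics, count)
              for name, (metrics, count) in benchmarks.items()}
for metric in ("cache_misses", "branch_misses"):
    column = [normalized[b][metric] for b in benchmarks]
    scaled = min_max_scale(column)
    assert min(scaled) == 0.0 and max(scaled) == 1.0
    print(metric, [round(v, 3) for v in scaled])
```

Without the per-instruction normalization, long-running benchmarks would dominate every metric simply by executing more instructions, which is precisely the distortion the text warns against.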