Performance Analysis and Benchmarking of Python Workloads

Arthur Crapé Student number: 01502848

Supervisor: Prof. dr. ir. Lieven Eeckhout

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Academic year 2019-2020


Preface

It is remarkable how engineering is not confined to its own field. Rather, it shows the student the importance of knowledge. Five years ago, I took on the challenge of getting through university, but little did I know that was only part of the puzzle. Somehow, the mathematics, the physics and the computer science made me realize how fascinating, riveting and compelling today's world is. They showed me that a degree is not a one-way ticket to success and that the road to success is never-ending. This master's dissertation might put an end to five inspiring, insightful and exciting years of engineering, but it is only the start of the bigger picture.

However, success is achieved by your own rules and those achievements should be cherished, appreciated and acknowledged. For this, I would like to express my deepest gratitude to my supervisor, Prof. Dr. Ir. Lieven Eeckhout. The weekly meetings, practical suggestions and the helpful advice were instrumental for the realisation of this dissertation. Thanks should also go to Dr. Ir. Almutaz Adileh for his help at the beginning of this academic year.

Last but not least, I cannot begin to express my thanks to my family and friends, who have supported me throughout the years and have proven, one by one, to be invaluable and irreplaceable.

Thank you.

Permission of Use and Content

“The author(s) gives (give) permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.”

Arthur Crapé, May 27, 2020

Performance Analysis and Benchmarking of Python Workloads

Arthur Crapé

Master’s dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Academic year 2019-2020

Supervisor: Prof. Dr. Ir. L. Eeckhout

Faculty of Engineering and Architecture Ghent University

Abstract

As Python is becoming increasingly popular, several alternatives to the standard Python implementation, called CPython, have been proposed. Numerous studies exist regarding the benchmarking of these approaches, but they either contain too few benchmarks or too many, do not represent current industrial applications, or try to draw conclusions by focusing solely on the time metric. This thesis identifies the main shortcomings of current Python benchmark systems, presents a thorough clarification of the underlying meaning of the principal components of Python implementations, reports a scientifically based quantitative performance analysis and provides a framework to identify the most representative workloads from a sizeable set of benchmarks. Additionally, we apply the framework to a specific use-case, the Python JIT implementation called PyPy. We rectify the speedup reported by the PyPy Speed Center and find that a select number of benchmarks provides adequate results in order to draw similar conclusions as when using the entire benchmarking suite. Although this thesis applies the framework primarily to CPython versus PyPy, it is noted that the findings can be applied, and the recommendations found in this thesis can be generalized, to other Python implementations.

Index Terms

Python, interpreters, PyPy, JIT, benchmarking, clustering

Performance Analysis and Benchmarking of Python Workloads

Arthur Crapé

Supervisor: Lieven Eeckhout

Abstract—As Python is becoming increasingly popular, several alternatives to the standard Python implementation called CPython have been proposed. Numerous studies exist regarding the benchmarking of these approaches, but they either contain too few benchmarks or too many, do not represent current industrial applications, or try to draw conclusions by focusing solely on the time metric. This thesis identifies the main shortcomings of current Python benchmark systems, presents a thorough clarification of the underlying meaning of the principal components of Python implementations, reports a scientifically based quantitative performance analysis and provides a framework to identify the most representative workloads from a sizeable set of benchmarks. Additionally, we apply the framework to a specific use-case, the Python JIT compiler implementation called PyPy. We rectify the speedup reported by the PyPy Speed Center and find that a select number of benchmarks provides adequate results in order to draw similar conclusions as when using the entire benchmarking suite. Although this thesis applies the framework primarily to CPython versus PyPy, it is noted that the findings can be applied, and the recommendations found in this thesis can be generalized, to other Python implementations.

Index Terms—Python, interpreters, PyPy, JIT, benchmarking, clustering

I. INTRODUCTION

Python is a recent programming language1 that focuses more on the ease and speed of developing than other programming languages like Java or C do. It is considered more productive than other languages due to the fact that it uses less code and comes with several useful libraries for, e.g., machine learning (ML) and web development.

To effectively run a program written in Python, one needs a Python implementation. Various implementations exist, ranging from the default Python interpreter called CPython to JIT compilers like PyPy and even to static compilers such as Cython. It is apparent that both the user and the developer of said implementations should have a good understanding of the various implementations across the Python Ecosystem.

II. BACKGROUND AND PROBLEM STATEMENT

A. Python Ecosystem

Contrary to compilers, an interpreter, in its simplest form, executes the generated processor-comprehensible code directly and does not produce an executable that can be distributed. The default implementation of Python is CPython, an interpreter-based implementation written in C, which can be found on the main website of Python.

Next to interpreters, Python can also be implemented using a Just-In-Time compiler (JIT) such as PyPy2, which compiles a program or certain parts thereof at runtime. At the heart of PyPy a hot-loop identifier is located, which detects whether or not a certain part of the code is frequently used, as dynamically compiling code generally takes a long time. We refer to the full dissertation for a complete overview of the Python Ecosystem.

B. Previous Work

The main benchmark suite that is often referred to is the official Python Performance Benchmark Suite [1]. Firstly, the benchmarks from this suite do not seem well grounded. According to the Python and PyPy speed centers3, the benchmarks run only for at most a few seconds, where most of them only run for a couple of milliseconds, a time frame in which a JIT implementation cannot show its strengths. Secondly, the PyPerformance Benchmark Suite implementation uses an arbitrarily selected number of warm-ups for all benchmarks. Benchmarking frameworks should not use an arbitrarily selected number of iterations, must include benchmarks that take more than a few milliseconds and should use a statistically rigorous method. Benchmark suites should also not include too many benchmarks in order for the suite to remain useful.

Furthermore, none of the mentioned benchmark suites focus on trending applications of the Python language, namely ML. This gives rise to questions such as whether or not a given speedup for a given implementation is even reliable at all, as the main application, ML for that matter, is not included in the measurements. Finally, the mentioned suites also only focus on the execution time. Although Redondo et al. [2] also focus on memory usage, further research should go into understanding the properties and characteristics of other metrics.

III. GOAL

This thesis is built on four different research parts. For this, a benchmarking framework was set up to benchmark Python workloads in a statistically rigorous way, similarly to Eeckhout et al. [3]. Alongside this, 117 benchmarks were selected from a range of different benchmarking suites.

First, Principal Component Analysis (PCA) was applied to the results of the benchmarking framework, similar to Eeckhout et al. [4]. Here, we get a better understanding of the underlying meaning of the different principal components (PCs) by comparing the PCs of the different configurations. An example of such a configuration is CPython in steady-state.

The second part focuses on a quantitative analysis of CPython and PyPy. We also disprove the use of the geometric mean for comparing workloads using different Python implementations, which is done on the official PyPy Speed Center.

Next, we investigate how conclusions differ when using a subset of the benchmarks. For this, we cluster workloads according to two different definitions of speedup. The first speedup is defined as the execution time of the base environment (CPython), divided by the execution time of the alternative environment (PyPy). The second speedup is defined as the execution time when using the startup methodology, divided by the execution time when using the steady-state methodology.

Finally, we investigate C-Extensions and conclude more research is needed on this matter.

As this research is a master's dissertation, it was decided to only focus on the aforementioned points in the context of the standard implementation called CPython versus the JIT compiler called PyPy. It is apparent that these procedures can be carried forward to other implementations.

IV. KEY FINDINGS

We mention that for all of our results, we normalize every metric by the number of instructions. Otherwise, we are unable to draw any meaningful conclusions. For PCA, we also scale our data so that all of our normalized metrics reside in the [0, 1] interval.

For CPython, no important differences are noted when comparing startup and steady-state PC values. On the contrary, upon comparing startup and steady-state for PyPy, we find that several PCs experience different influences from the metrics they are built of, as shown in Figure 1. This is due to the different underlying behaviour of pure interpretation versus JIT compilation.

Fig. 1: The PC values for all benchmarks in the PyPy environment. (a) Startup methodology and (b) Steady-state methodology. Some components have inverted behaviour with respect to the startup components. Differences in magnitudes of some metrics such as time, context-switches, page-faults and branch-misses are visible and suggest a different behaviour between startup and steady-state.

Alongside this, we find that PyPy requires more PCs in order to explain as much variance in the data as CPython. More specifically, for PyPy, at least five dimensions are needed to explain at least 90% of the data, whilst only four dimensions are needed for CPython, as shown in Figure 2.

Fig. 2: The percentage of explained variance for startup and steady-state methodologies for different environments. All benchmarks are used for this graph.

We disprove the use of the geometric mean for comparing workloads run using different Python implementations and show using Figure 3 that different conclusions are found when using the harmonic average than when using the geometric mean.

Fig. 3: S-Curve for the PyPerformance Benchmark Suite using the startup methodology. We find that PyPy is slower than CPython with a harmonic average of 0.77. This is not clear from the geometric average of 1.64, indicating that PyPy would be faster.

For the third objective of this thesis, we take a closer look at how the results differ when using a subset of the benchmarks. For each cluster we take the benchmark closest to the cluster center. We then calculate the weighted harmonic average over all the closest benchmarks, with the weights equal to the number of benchmarks in the corresponding cluster. We note that we are satisfied with a weighted average within a 2% range of the average over all benchmarks.

First, we define the speedup as the execution time of the base environment (CPython), divided by the execution time of the alternative environment (PyPy). As can be seen in Figure 4, upon clustering startup and steady-state using this definition, we find that 60 benchmarks suffice, rather than the entire set of benchmarks.

Fig. 4: The difference of the harmonic average of the speedups of the benchmarks closest to the cluster centers and the harmonic average of the speedups over all benchmarks. Speedups are defined as execution time of CPython divided by execution time of PyPy. Both dimensions are equal to four. 60 benchmarks are needed in order for the weighted average to be within a 2% range of the average over all benchmarks.

As is clear from Figure 5, upon using the speedup defined as the execution time using the startup methodology, divided by the execution time of the steady-state methodology, 30 benchmarks suffice, rather than the entire set of benchmarks.

Fig. 5: The difference of the harmonic average of the speedups of the benchmarks closest to the cluster centers and the harmonic average of the speedups over all benchmarks. Speedups are defined as execution time in startup divided by execution time in steady-state. CPython's dimension is four, PyPy's dimension is five. 30 benchmarks are needed in order for the weighted average to converge within a 2% range of the average over all benchmarks.

Alongside this, we also found that the geometric mean overestimates the speedup and report that the chosen benchmarks are selected in a fair way from the different benchmark suites.

Finally, we report our findings regarding the MLPerf benchmark suite. In order to use benchmarks from this suite, several ML libraries that are written in C, such as Tensorflow or scikit-learn, have to be installed on the Python implementations. This is why only one MLPerf benchmark was evaluated, more specifically the gnmt benchmark. Here we found that switching Python implementations for ML is not useful for this workload, as only 5% of the benchmark time is spent in the Python code.

In order to evaluate the time taken for the Python code to talk to its C-Extension, we created three different kinds of benchmarks. In one of these benchmarks, PyPy takes longer to talk to C per iteration than CPython, in another benchmark it is the other way around and in the third case, they are equally fast. This indicates that C-Extensions play an important role in the evaluation of Python implementations and, more specifically, for the usage of the implementations and the benchmarking thereof in ML.

V. CONCLUSION

We got a better understanding of the meaning of different principal components for different Python environments and identified similarities and dissimilarities between the PCs of the startup and the steady-state methodologies for both environments. We got a better understanding of the behaviour of the PCs when plotted versus other PCs and found that PyPy requires more PCs for explaining as much variance as CPython.

We found that the geometric mean is not a good fit for benchmarks in Python environments and thus disprove the use of the geometric average for the PyPerformance Benchmark Suite and the different Python speed centers. We confirm that for pure Python implementations, PyPy is faster than CPython. Alongside this, we found that for CPython, the startup methodology barely differs from its steady-state methodology. On the contrary, for PyPy, several differences are noted. This is due to the fact that a JIT compiler behaves differently compared to an interpreter.

However, due to problems regarding C-Extensions that also arise when attempting to run MLPerf benchmarks, PyPy is not widely adopted as a reference implementation. Improving on this matter is the key for future versions of PyPy and, in general, future versions of Python implementations.

We clustered benchmarks according to two definitions of speedup and found for both definitions an optimal set of parameters for clustering the benchmark results. Using these parameters, we identified a set of benchmarks to approximate the results of the entire list of benchmarks.

All of the results mentioned in this thesis are generated using an automatic benchmarking suite, designed to run Python benchmarks using different Python implementations, that benchmarks Python workloads in both startup and steady-state for different metrics. It also generates various graphs and an easy-to-use API to quickly gain more insight into the benchmarking results.

1 The Python Programming Language can be found at: https://www.python.org/
2 PyPy, an alternative Python implementation using a JIT: https://www.pypy.org/
3 The Python and PyPy speed centers can be found at: https://speed.python.org/ and https://speed.pypy.org/

REFERENCES

[1] V. Stinner, "Pyperformance benchmark suite," 2012, https://pyperformance.readthedocs.io/index.html.
[2] J. M. Redondo and F. Ortin, "A comprehensive evaluation of common Python implementations," IEEE Software, vol. 32, no. 4, pp. 76-84, July 2015.
[3] A. Georges, D. Buytaert, and L. Eeckhout, "Statistically rigorous Java performance evaluation," in Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications (OOPSLA '07), New York, NY, USA: Association for Computing Machinery, 2007, pp. 57-76. Available: https://doi.org/10.1145/1297027.1297033
[4] L. Eeckhout, H. Vandierendonck, and K. De Bosschere, "Quantifying the impact of input data sets on program behavior and its applications," Journal of Instruction-Level Parallelism, vol. 5, pp. 1-33, 2003.

Contents

Preface

Permission of Use and Content

Abstract

Extended Abstract

Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
  1.1 Thesis Context
  1.2 Problem Statement
  1.3 Thesis Goal
  1.4 Key Findings
  1.5 Thesis Overview

2 Python Ecosystem
  2.1 CPython
  2.2 PyPy
  2.3 Numba
  2.4 Elaboration on C-Extensions
  2.5 Cython
  2.6 Additional Implementations

3 Benchmarks and Profiling
  3.1 Profiling Methods
    3.1.1 Time Library and datetime
    3.1.2 Profile, cProfile, pycallgraph
    3.1.3 Intel VTune and perf Analysis Tools
    3.1.4 Perflib
  3.2 Benchmarks
    3.2.1 Microbenchmarks
    3.2.2 PyPerformance Benchmark Suite
    3.2.3 Benchmark Suite
    3.2.4 Computer Benchmarks Game
    3.2.5 MLPerf
    3.2.6 Custom Algorithm Implementations
  3.3 Set-up

4 Microbenchmarks
  4.1 LOOP and FUNC
  4.2 MEM

5 Benchmarking Framework
  5.1 Start-up Performance
    5.1.1 Discarding first measurement
    5.1.2 Implementation Specifics
    5.1.3 Implementation Remarks
    5.1.4 Storage
  5.2 Steady-State Performance
    5.2.1 Implementation Specifics
    5.2.2 Implementation Remarks
    5.2.3 Proposed Solutions
    5.2.4 Storage
  5.3 PCA and K-Means Clustering
    5.3.1 PCA and Alternatives
    5.3.2 K-Means and Alternatives
    5.3.3 Implementation Specifics

6 Results
  6.1 Principal Component Analysis
    6.1.1 Principal Component Meanings
    6.1.2 Principal Component Selection
    6.1.3 Principal Component Comparison
  6.2 Performance Analysis
    6.2.1 Average Selection
    6.2.2 Quantitative Analysis of Speedup Averages
  6.3 Clustering
    6.3.1 CPython versus PyPy
    6.3.2 Startup versus Steady-State
    6.3.3 Specific Workloads
  6.4 C-Extensions

7 Conclusion

Bibliography

A Benchmark List

List of Figures

1.1 The TIOBE index indicating the popularity of programming languages [6].

3.1 A visualisation of the execution times of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies.
3.2 A visualisation of the number of instructions of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies.
3.3 A visualisation of the number of branch misses of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.
3.4 A visualisation of the number of LLC misses of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.
3.5 A visualisation of the number of context switches of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.
3.6 A visualisation of the execution times of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies.
3.7 A visualisation of the number of instructions of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies.
3.8 A visualisation of the number of branch misses of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.
3.9 A visualisation of the number of LLC misses of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.
3.10 A visualisation of the number of context switches of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

4.1 The average CPU time per process for the LOOP microbenchmark. For the blue bars, the number of iterations within the loop equals 10M. For the red bars, that number is 100M. Results are averaged over 27 executions. The JIT compilers PyPy and Numba are faster than CPython. IronPython is slower due to a high startup time. Cython is faster, but needs annotations in order to work optimally.
4.2 The average CPU time per process for the FUNC microbenchmark. For the blue bars, the number of iterations within the loop is 10M. For the red bars, that number is 100M. Results are averaged over 27 executions. PyPy is able to identify hot-loops faster.
4.3 The average CPU time per process for the microbenchmark LOOP with compute-intensive loops. We notice the execution time of CPython with respect to C increasing with the number of iterations. Numba has an initially high constant compilation overhead, which in turn has a lower influence on the execution time at a larger number of iterations. This is because at that moment, the program is running more application code, rather than compilation code. PyPy resembles CPython at a lower number of iterations and Numba and C at a higher number of iterations.
4.4 The average CPU time per process for the MEM microbenchmark. Results are averaged over 27 executions.

5.1 A visual explanation of the startup methodology, which measures the startup time of both the Python invocation and one benchmark run. Alongside the execution time, it is possible to measure other metrics as well.
5.2 A visualisation of the storage system of the startup methodology. For every benchmark, we store a 2D matrix. Every row of such a matrix is an iteration of the startup methodology containing the different metrics. The CI in the algorithm is calculated only using the execution time column. The number of iterations i is specific to the benchmark.
5.3 A visual explanation of the steady-state methodology. It is not possible to measure any metrics other than the execution time.
5.4 The complex solution to the problem of steady-state visualized. It is not an optimal solution since the CoV is not calculated over the last four benchmarks. For any metric, M_ij is the average of that metric over the subsequent runs. m_i is the average over the last four M values in the i-th iteration.
5.5 The proposed solution to the problem of steady-state visualized. It is not an optimal solution since the CoV is not calculated over the last four benchmarks. For any metric, M_ij is the average of that metric over the subsequent runs. The number of inner iterations is determined from Algorithm 2.
5.6 A visual explanation of the storage system of the steady-state methodology. In part 1, for every benchmark, we store a 2D matrix. Every row of such a matrix is an iteration of the steady-state methodology containing the different execution times. The number of inner loops is calculated using the CoV. In part 2, for every benchmark, we store the different metrics averaged over the number of inner iterations j for that outer iteration i. The number of iterations i is specific to the benchmark. The CI is calculated using the execution times from part 1.

6.1 The PC values for all benchmarks in the CPython environment. (a) Startup methodology and (b) Steady-state methodology, which has values very similar to the startup PC values. PC 6 and 7 are the only components that behave differently.
6.2 The PC values for all benchmarks in the PyPy environment. (a) Startup methodology and (b) Steady-state methodology, which either has similar values or values with inverted behaviour (due to the linear relation of the components) compared to the startup PC values. A difference in magnitude for some metrics is also visible, which suggests a different behaviour between startup and steady-state.
6.3 The percentage of explained variance for startup and steady-state methodologies for different environments. All benchmarks are used for this graph.
6.4 An overview of components 1 through 6 for both CPython and PyPy for both the startup and the steady-state methodologies.
6.5 A visualisation of PC 1 versus PC 2 in a scatter plot for the startup methodology. The environments are distinguishable.
6.6 A visualisation of PC 3 versus PC 5 in a scatter plot for the startup methodology. The linear behaviour of CPython is clearly visible.
6.7 A visualisation of PC 1 versus PC 5 in a scatter plot for the steady-state methodology. The environments are distinguishable.
6.8 A visualisation of PC 2 versus PC 3 in a scatter plot for all the configurations. CPython has similar results for both startup and steady-state ( and red, respectively). PyPy is distinguishable from CPython.
6.9 Histograms of the speedup of all benchmarks of PyPy with regard to CPython. (a) Startup methodology and (b) Steady-state methodology. Both histograms are asymmetric and skewed to the right (long tail). It seems that they both follow a log-normal distribution.
6.10 S-Curves for the PyPerformance Benchmark Suite. (a) The startup methodology. PyPy is slower than CPython (harmonic average of 0.77). This is not clear from the geometric average of 1.64, indicating that PyPy would be faster. (b) The steady-state methodology. PyPy is faster than CPython (harmonic average of 1.11).
6.11 S-Curves for the speedups of all the benchmarks from startup to steady-state. (a) The CPython environment. (b) The PyPy environment.
6.12 The difference of the harmonic/geometric average of the speedups of the benchmarks closest to the cluster centers and the harmonic/geometric average of the speedups over all benchmarks. Speedups are defined as execution time of CPython divided by execution time of PyPy. Both dimensions are equal to four. 60 benchmarks are needed in order for the weighted average to be within a 2% range of the average over all benchmarks. (a) Harmonic average and (b) Geometric average.
6.13 The difference of the harmonic/geometric average of the speedups of the benchmarks closest to the cluster centers and the harmonic/geometric average of the speedups over all benchmarks. Speedups are defined as execution time in startup divided by execution time in steady-state. CPython's dimension is four, PyPy's dimension is five. 30 benchmarks are needed in order for the weighted average to converge within a 2% range of the average over all benchmarks. (a) Harmonic average and (b) Geometric average.
6.14 Bar plots of the different clustering configurations. Fairness is observed between the different benchmark suites, although the discrete algorithms seem underrepresented. However, it is the smallest suite with only 6 workloads.
6.15 The MLogistic example of scikit-learn using the MNIST dataset, adapted to be a benchmark evaluated using the startup methodology. PyPy is faster.
6.16 The Isotonic Regression model from scikit-learn as a benchmark evaluated using the startup methodology. PyPy is slower.
6.17 A custom C-Extension benchmark that calculates primes, evaluated using the startup methodology. CPython and PyPy behave the same way, apart from an offset equal to the difference of the startup times of the different VMs.
6.18 A custom C-Extension benchmark that calculates primes, evaluated using the steady-state methodology. CPython and PyPy are equal since in steady-state the difference in startup time of the Python implementations is not included in the measurements.

List of Tables

3.1 An overview of the number of benchmarks, selected from the different benchmarking suites.

5.1 Time taken to run benchmark bm_graham_imports.py upon clearing page cache buffers and its subsequent runs. This shows how imports affect the execution time of subsequent runs.
5.2 A list of all the metrics measured in this thesis. The perf event counters can be found at [20]. The PMUs for the Intel Cores can be found at the end of the Intel Developer Manual at [21].

6.1 Harmonic averages of the speedup of PyPy with respect to CPython for the metric time for the different benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.
6.2 Geometric averages of the speedup of PyPy with respect to CPython for the metric time for the different subsets of benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.
6.3 Harmonic averages of the speedup of steady-state with respect to startup for the metric time for the different benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.
6.4 Geometric averages of the speedup of steady-state with respect to startup for the metric time for the different benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.
6.5 Linear approximations to the custom C-Extension benchmarks, calculated using the first values, i.e. i = 1, 2, where i is the parameter of the specific benchmark. For Logistic Regression, CPython has a smaller slope and thus executes faster when adding more iterations. For Isotonic Regression, the reverse is true: PyPy has a smaller slope and thus executes faster when adding more iterations. For the Prime Calculator, both environments have a similar slope and are equally fast.

A.1 All benchmarks used in this thesis. CBG = Computer Benchmarks Game, DA = Discrete Algorithms, CP = CPython, PP = PyPy, SU = Startup, SS = Steady-State. Values in the final four columns are the absolute execution timings.

List of Abbreviations

CI    Confidence Interval
CoV   Coefficient of Variance
FA    Factor Analysis
GC    Garbage Collection
GCE   Geometric Cluster Error
HCE   Harmonic Cluster Error
JIT   Just-In-Time
LLC   Last-Level Cache
ML    Machine Learning
PC    Principal Component
PCA   Principal Component Analysis
VM    Virtual Machine

Chapter 1

Introduction

Python [5] is a recent programming language that focuses more on the ease and speed of developing than other programming languages like Java or C do. According to the TIOBE index [6], a popularity metric of programming languages, Python has had a steady rise in popularity in the past two years, which can be seen in Figure 1.1. Already in 2016, Python was used in 21% of Facebook's code base [7], while Netflix uses Python for its Personalized Machine Learning [8]. In general, Python is more productive because it uses less code than Java or C for the same program and comes with several useful libraries, such as Tensorflow and Scikit-Learn for machine learning, Django for web development, and others for pure mathematics.

Before diving into the technical details, we need a broader understanding of the Python Ecosystem. That is, what essentially is Python, what are Python implementations, how do these differ from one another, and how does Python differ from other languages such as C?

1.1 Thesis Context

Python, in essence, is simply a set of instructions that can be used to write a program. To effectively run that program, one needs a so-called Python Implementation, which could be an interpreter that knows the specifics of the platform on which the code is run. We immediately note that there are different ways of dealing with this approach, such as compiling a superset of the Python language, cfr. Cython [13]. Contrary to compilers, an interpreter, in its simplest form, executes the generated processor-comprehensible code directly and does not produce an executable that can be distributed.

Figure 1.1: The TIOBE index indicating the popularity of programming languages [6].

The default implementation of Python is CPython, an interpreter-based implementation written in C, which can be found on the main website of Python1. Generally speaking, CPython analyzes the Python code and executes the generated bytecode directly after creating various representations such as an Abstract Syntax Tree, a Symbol Table and more [9].

Next to interpreters, there exist other kinds of Python implementations, ranging from Just-In-Time compilers (JIT) to static compilers. A JIT compiler compiles a program or certain parts thereof at runtime, and generally is faster and uses less memory than non-JIT implementations [10]. An example of such a Python implementation is PyPy, a JIT written in RPython, which is a subset of the Python language and a translation and support framework for producing implementations of dynamic languages [11].

In order to run C code from Python, one would need to write a C-Extension in C, wrap that into a Python function and install that compiled C-Extension on the used Python implementation. This is somewhat tedious work and does not always produce the expected results with respect to the time for executing said extension. The Cython language2 is a superset of the Python language that additionally supports calling C functions and declaring C types on variables and class attributes. Associated with the Cython language, there is a static compiler called Cython that compiles this code into an executable [13]. In order to really make use of Cython, one needs to heavily annotate the Python code, which could reduce the programmer's productivity.

1 The Python Programming Language can be found at: https://www.python.org/
2 The optimising static compiler and programming language Cython: https://cython.org/

There exist several other Python implementations. Although we will not discuss them in-depth in this introduction, we include two more relevant frameworks. We first mention Jython3, a Java implementation of Python that uses a JIT and is made for easier interaction with Java [14]. Secondly, we bring up IronPython4, a C# implementation of Python for the .NET framework.

1.2 Problem Statement

It is clear that there exist several different approaches to running the Python language and that both the user and the developer of said implementations should have a good understanding of them in order to make a reasoned and informed decision. Even more, there should exist a way of evaluating said implementations using a specific set of benchmarks, determined in a scientifically based way. It is important to note that this, in a way, has already been done before. In order to completely understand the need for this thesis, we need to take a closer look at the different benchmark suites and inspect where the specific suites fall short.

Coming out of the PyCon 2011 Virtual Machine and language summits, it was commonly agreed that CPython, PyPy, IronPython and Jython should strive to move to a common set of benchmarks and a single performance-oriented site [25]. Results of the evaluation of these benchmarks can be found on the corresponding speed centers. However, upon visiting the different sites, it is unclear how those results are produced. The proposed benchmarks usually only run for at most a second and questions arise whether or not benchmarking was done according to a statistically rigorous method [1]. In all, it can at least be noted that the way of benchmarking in this suite should be reviewed.

The benchmarks used in this benchmark suite stem from the PyPerformance Benchmark Suite [27]. Upon further inspection of these benchmarks, it was concluded that they needed improvement. On the one hand, they are rather small benchmarks that take at most a few seconds to run, at least if not using the right parameters. On the other hand, Python applications have changed over the past years, where the emphasis now lies on machine learning using libraries such as Theano, Keras, scikit-learn and Pybrain. It is important that benchmarks related to this field of study are also included. This can be linked to one of the reasons why selecting a reference workload is hard, as the workload is constantly changing, as mentioned in the book Computer Architecture Performance Evaluation Methods [2].

3 Jython, a Java implementation of Python: https://www.jython.org/
4 IronPython, an open source C# implementation of Python for .NET frameworks: https://ironpython.net/

After deeper inspection of this benchmark suite, it was found that the number of warm-up iterations is chosen arbitrarily. This is once more a motive to apply a statistically rigorous method. Ismail et al. conducted research regarding quantitative overhead analysis for Python [24] and they too use the PyPerformance Benchmark Suite. They propose optimizations for Python by inspecting C, together with several metrics other than time. This gives rise to questions such as whether or not the results from the conducted research are viable, since the benchmark results of the PyPerformance Benchmark Suite might not be as one would expect.

Several papers have already benchmarked different Python implementations, one of which is by J. M. Redondo et al. [23]. A similar remark can be made here in that this suite contains a lot of rather small benchmarks and that ML is not mentioned anywhere. In this case, as they present a big benchmark suite (a combined running time of 1162 hours), this is somewhat compensated. It has to be noted that this makes the benchmarking suite less usable as it takes too long to get results. Whereas the PyPerformance Benchmark Suite had too short benchmarks, we note here that there should not be too many benchmarks in a useful suite.

Another remark is that none of the mentioned suites report more than the execution time and, in some cases, memory usage. Further research should go into understanding the properties and characteristics of other metrics. One final note is that none of the mentioned suites benchmark C-Extensions. This is a grey zone and should not be neglected.

In general, a Python implementation benchmark suite should contain enough benchmarks to cover the current Python applications, but not too many in order to remain usable. The specific benchmarks should be selected out of a big set of benchmarks, which this thesis will attempt to tackle using a clustering algorithm. Not only time, but other metrics such as branch-misses and page-faults should be reported. All benchmarking should be done in a correct and reliable way by splitting up the evaluation in both start-up and steady-state performance [1].
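As an illustration of what such a statistically rigorous treatment looks like in practice, the sketch below computes a 95% confidence interval for the mean execution time over repeated runs. It is only a sketch with hypothetical timings, not the algorithm used by the framework in Chapter 5.

import numpy as np
from scipy import stats

# Hypothetical execution times (in seconds) of repeated startup runs of one benchmark.
times = np.array([1.92, 2.05, 1.98, 2.11, 1.95, 2.02, 2.08, 1.99])

mean = times.mean()
# Half-width of a 95% confidence interval based on the Student t distribution.
half_width = stats.t.ppf(0.975, df=len(times) - 1) * stats.sem(times)
print(f"mean = {mean:.3f} s, 95% CI = [{mean - half_width:.3f}, {mean + half_width:.3f}]")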

1.3 Thesis Goal

Having mentioned the different approaches in the Python Ecosystem and the shortcomings in the current benchmark suites, the motive behind this thesis can now be understood more clearly. We propose a benchmarking methodology that automatically runs benchmarks in both a startup and a steady-state version while taking several metrics into account. On the premise that these metrics are highly correlated, Principal Components (PCs) can be selected using Principal Component Analysis (PCA) [4]. Intuitively, when the execution time increases, the number of branch-misses, or another metric such as the number of cache-misses for that matter, will probably increase as well. Subsequently, we apply a clustering algorithm in order to identify which kinds of benchmarks produce similar results.

First of all, it would be interesting to get a better understanding of how metrics influence the specific PCA components. That way, we can identify how benchmarks behave using said components.

Secondly, it would be interesting to analyse the performance of a Python implementation and compare that to official speedup reports, such as those of the Python Speed Centers. It is possible that they produce correct conclusions, but this should at least be verified. At the core of this part lies the way of comparing different benchmark suites, which should be thoroughly described and evaluated.

Thirdly, it would be helpful to note whether or not a benchmark suite with specifically selected benchmarks produces the same conclusions as the same benchmark suite without post-processing on the benchmarks. If that is the case, it is not necessary to run all of the original benchmarks in the suite. Rather, a specific subset of the benchmarks would suffice.

Fourthly, it is interesting to explore C-Extensions more in-depth. That way, we can get a better understanding of this kind of benchmark.

Lastly, it would be interesting to apply the generated practices to a specific use-case and inspect that specific Python implementation somewhat deeper. As this research is a master's thesis, it was decided to only focus on the aforementioned points in the context of the standard implementation called CPython versus the JIT compiler called PyPy. It is apparent that these procedures can be carried forward to other implementations. In that case, different conclusions will be observed, but the way of getting the results to draw conclusions from will be similar.
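The sketch below illustrates the processing pipeline just described, assuming a matrix with one row per benchmark and one column per metric, already normalized by the instruction count. The benchmark count, metric count and speedup values are placeholders, and the code uses standard scikit-learn calls rather than the thesis's actual framework.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
metrics = rng.random((117, 10))        # placeholder: 117 benchmarks x 10 normalized metrics
speedups = rng.uniform(0.5, 5.0, 117)  # placeholder: per-benchmark speedups (CPython time / PyPy time)

scaled = MinMaxScaler().fit_transform(metrics)    # rescale every metric to the [0, 1] interval
pcs = PCA(n_components=4).fit_transform(scaled)   # keep enough PCs to explain most of the variance

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pcs)

# For every cluster, keep only the benchmark closest to the cluster center and
# weight it by the cluster size, then summarize with a weighted harmonic mean.
representatives, weights = [], []
for c in range(kmeans.n_clusters):
    members = np.flatnonzero(kmeans.labels_ == c)
    distances = np.linalg.norm(pcs[members] - kmeans.cluster_centers_[c], axis=1)
    representatives.append(members[np.argmin(distances)])
    weights.append(len(members))

weights = np.asarray(weights, dtype=float)
approx = weights.sum() / np.sum(weights / speedups[representatives])
full = len(speedups) / np.sum(1.0 / speedups)
print(f"weighted harmonic mean over representatives: {approx:.3f} (all benchmarks: {full:.3f})")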

1.4 Key Findings

We start off by tackling the first of four objectives. For CPython, no important differences are noted when comparing startup and steady-state PC values. On the contrary, upon comparing startup and steady-state for PyPy, we find that several PCs experience different influences from the metrics they are built of. This is due to the different underlying behaviour of pure interpretation versus JIT compilation. Also, some components when using the steady-state methodology and PyPy show an inverse behaviour with respect to the startup methodology of PyPy. For example, if a PC is highly influenced by a high number of cache-misses in startup, then the same PC is highly influenced by a low number of cache-misses in steady-state. This holds for all benchmarks and for subsets of the benchmarks. An elaborated answer to the meaning of the different components can be found further down this thesis.

Alongside this, we find that PyPy requires more PCs in order to explain as much variance in the data as CPython. More specifically, for PyPy, at least five dimensions are needed to explain at least 90% of the data, whilst only four dimensions are needed for CPython.

Lastly, we got a better understanding of the underlying behaviour of the different PCs. For example, the first PC focuses primarily on differentiating between CPython and PyPy, while the third component focuses on distinguishing PyPy's startup configuration from the other configurations.

We continue with the second objective of this thesis. We disprove the use of the geometric mean for comparing workloads run using different Python implementations. We compare our scientifically based speedups to the speedups reported in the PyPy Speed Center and PyPerformance Benchmark Suite and prove that different conclusions are found when using the harmonic average.

For the third objective of this thesis, we take a closer look at how the results differ when using a subset of the benchmarks. In this thesis, we cluster benchmarks according to two different definitions of speedup. First, we define the speedup as the execution time of the base environment (CPython), divided by the execution time of the alternative environment (PyPy). Upon clustering startup and steady-state using this definition, we find that 60 benchmarks suffice, rather than the entire set of benchmarks. Secondly, we cluster again, but now using the speedup defined as the execution time using the startup methodology, divided by the execution time of the steady-state methodology. Upon clustering CPython and PyPy using this definition, we find that 30 benchmarks suffice, rather than the entire set of benchmarks. Alongside this, we also found that the geometric mean overestimates the speedup and report that the chosen benchmarks are selected in a fair way from the different benchmark suites.

For the final objective of this thesis, we investigate C-Extensions more in-depth. We first report our findings regarding the MLPerf benchmark suite. In order to use benchmarks from this suite, several ML libraries have to be installed on the Python implementations, such as Tensorflow or scikit-learn. This is necessary as these libraries are written in C and need to be connected to the implementation, a process also known as using so-called C-Extensions. As is mentioned on the official website of PyPy, C-Extensions to this day are a problem for PyPy. This last note is the reason why only one MLPerf benchmark was evaluated, more specifically the gnmt benchmark.
We found that only 5% of the time running the code was spent in the Python implementation glue code. The other 95% of the benchmark was running optimized C code for ML. This indicates that, at first sight, switching Python implementations for ML is not useful for this workload, as only 5% of the benchmark time is spent in the Python code. However, it is interesting to further investigate the time taken to translate from the Python code to the C code. That is, how fast can the application talk to its C part from its Python part and vice versa. We created three different kinds of benchmarks and measured the time spent in the ML code per iteration. In one of these benchmarks, PyPy takes longer to talk to C per iteration than CPython, in another benchmark it is the other way around and in the third case, they are equally fast. This indicates that C-Extensions play an important role in the evaluation of Python implementations and, more specifically, for the usage of the implementations and the benchmarking thereof in ML.
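To make the contrast between the geometric and the harmonic average mentioned above concrete, the toy calculation below uses hypothetical speedup values (not the thesis's measurements): a single workload with a very large speedup pulls the geometric mean above 1 even though most workloads slow down, while the harmonic mean stays below 1.

import numpy as np

# Hypothetical PyPy-over-CPython speedups for five workloads.
speedups = np.array([0.5, 0.6, 0.7, 0.8, 40.0])

geometric = np.exp(np.mean(np.log(speedups)))
harmonic = len(speedups) / np.sum(1.0 / speedups)

print(f"geometric mean: {geometric:.2f}")  # about 1.5, suggesting an overall speedup
print(f"harmonic mean:  {harmonic:.2f}")   # about 0.8, suggesting an overall slowdown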

1.5 Thesis Overview

We conclude this introduction with a brief overview of the remainder of this thesis. Chapter 2 describes the Python Ecosystem. There, we investigate the different Python implementations and get a better understanding of C-Extensions. Chapter 3 focuses on the different profiling methods and benchmarks used in this thesis. We provide some visualisations of the raw data gathered using the benchmarking framework. Chapter 4 investigates three small microbenchmarks in order to get a better understanding of the different Python implementations. Chapter 5 discusses the structure of the benchmarking framework built in this thesis. It describes the startup and steady-state methodologies, as well as PCA and K-Means clustering. Chapter 6 discusses the results of this thesis in-depth. There, we explain the meaning of the different PCs, give a performance analysis of the different environments and cluster the benchmarks according to different definitions of speedup. Finally, it also discusses C-Extensions and the need for further research on this topic. Chapter 7 concludes this thesis.

Chapter 2

Python Ecosystem

We briefly mentioned the concept of running the Python language through either interpreters, dynamic compilers or static compilers. There exist other methods, such as compiling C code translated from Python code. In this chapter, we go over some of the more common and lesser known methods for running a program written in Python. This thesis primarily investigates the benchmarking of Python implementations and the processing of the results thereof. Consequently, it is outside the scope of this thesis to focus on all the different steps and inner workings within the different implementations. We provide a brief and concise overview of the implementations without going into too much detail.

2.1 CPython

The default implementation of Python is CPython, an interpreter-based implementation written in C, which can be found on the main website of Python1. Generally speaking, CPython analyzes the Python code and executes the generated bytecode directly after creating various representations such as an Abstract Syntax Tree, a Symbol Table and more [9]. Other Python implementations either build further upon the basis of CPython (cfr. Unladen Swallow, Pyston, Pyjion, etc.) or build an entirely new framework from scratch (cfr. PyPy, Jython, IronPython, etc.) [19].
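As a small, self-contained illustration of these intermediate representations (not taken from the thesis), the snippet below uses the standard library to show the abstract syntax tree, the symbol table and the bytecode that CPython builds for a trivial function.

import ast
import dis
import symtable

source = "def add(a, b):\n    return a + b\n"

tree = ast.parse(source)                             # abstract syntax tree
table = symtable.symtable(source, "<demo>", "exec")  # symbol table
code = compile(source, "<demo>", "exec")             # code object containing the bytecode

print(ast.dump(tree))
print(table.get_identifiers())
dis.dis(code)                                        # the bytecode CPython interprets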

1 The Python Programming Language can be found at: https://www.python.org/

2.2 PyPy

Next to interpreters, there exist other kinds of Python implementations, ranging from JIT compilers to static compilers. A JIT compiler compiles a program or certain parts thereof at runtime, and generally is faster and uses less memory than non-JIT implementations [10]. An example of such a Python implementation is PyPy, a JIT compiler written in RPython, which is a subset of the Python language and a translation and support framework for producing JIT implementations of dynamic languages [11].

PyPy is an example of a JIT implementation of the Python language and is proposed to be used as a complete replacement of the CPython interpreter. Typically, a JIT compiler interprets the code and simultaneously compiles the code in a dynamic way, thus at runtime. The compilation process can either be started immediately when running a workload, or after a certain period of time. The latter method is the case for PyPy. At the heart of PyPy a hot-loop identifier is present, which detects whether or not a certain part of the code is hot. Once identified, PyPy compiles that function or loop dynamically. From that point on, it runs this dynamically compiled code. As we will see in Chapter 4 regarding microbenchmarks, in a short program PyPy behaves like CPython, while in a longer version, i.e., for a larger number of iterations, PyPy behaves like C.
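A minimal loop kernel in the spirit of the LOOP microbenchmark from Chapter 4 (a sketch, not the thesis's benchmark code) makes this behaviour easy to observe: the same file can be run unchanged under CPython and PyPy, and once the loop has executed often enough, PyPy's hot-loop identifier replaces it with dynamically compiled code.

import time

def loop(n):
    # A simple hot loop: repeated integer arithmetic inside one function.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
loop(10_000_000)
print(f"elapsed: {time.perf_counter() - start:.3f} s")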

2.3 Numba

Numba is closely related to PyPy in the sense that it, too, is a JIT compiler. It differs from PyPy, however, in that it requires annotations from the programmer in order to know which part of the code needs to be compiled at runtime [12]. As already mentioned, PyPy finds that out by itself by checking for so-called hot-loops, while Numba immediately compiles the code that is instructed by the programmer to be compiled. In the chapter about microbenchmarks, we verify the expectation that for a smaller number of iterations, the compilation overhead of Numba will be significant. However, as this overhead does not increase for a larger number of iterations, the influence of this overhead decreases and approaches zero. Numba-compiled numerical algorithms can approach the speed of C or Fortran, as will be shown for a few microbenchmarks in Chapter 4. It was decided to only examine Numba in this thesis for a few micro-benchmarks.
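A minimal sketch of the kind of annotation Numba expects is shown below (assuming the numba package is installed); the explicit decorator is the essential difference with PyPy, which detects hot loops on its own.

from numba import njit

@njit
def loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

loop(10)                 # the first call triggers compilation (the constant overhead)
print(loop(10_000_000))  # later calls run the compiled machine code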

2.4 Elaboration on C-Extensions

In order to run C code from Python, one would need to write a C-Extension in C, wrap that into a Python function and install that compiled C-Extension on the used Python implementation. We mentioned in the section on PyPy that PyPy can be seen as a replacement for the CPython interpreter. This, however, is not the case for code using C-Extensions [32]. In order to better understand this, we need a deeper understanding of one of the differences between PyPy and CPython, namely Garbage Collection (GC). We will first go over the basic idea of PyObjects and C-Extensions and will then give a brief summary of CPython's and PyPy's GC mechanisms, which in turn will give us more insight into the problem regarding C-Extensions.

In CPython, Python objects are represented as PyObject* [9, 32]. As CPython is Python implemented in C, one can think of such objects as the result of calling malloc() in C, which creates a block of memory that is cast to a PyObject*. The address of that block is never changed and the memory management structure thereof is based on the well-known reference counting method. In order to operate on those objects, one needs to call the appropriate API functions, for example PyNumber_Add() to add two objects together. Up to here, PyPy does not differ a lot from CPython, as the PyPy objects are subclasses of the RPython W_Root class and are operated on by calling methods on the interpreter itself. C-Extensions, such as libraries written in C (for example Numpy or Tensorflow), need to be compiled into the CPython implementation using the Python.h header file. PyPy also has a replacement of Python.h called cpyext. The difference is that cpyext does not always work well: in some cases the performance is poor, and in other cases it does not work at all. In order to understand this problem better, we need to take a closer look at the differences in GC between CPython and PyPy.

Contrary to CPython, which employs reference counting GC, PyPy uses a so-called generational GC, an optimized method that separates memory into subspaces based on object age [24]. It is also called an efficient tracing GC. It uses a subspace specifically for young objects called the nursery and another subspace for older objects. If an object survives long enough, it is moved from the nursery to the older subspace. This is more efficient because it relies on the fact that most objects in a program die young. A quick GC can collect the younger objects in the nursery, while a slower GC, which runs less frequently, runs in the older subspace. This effectively means that the addresses of the PyObjects need to be changed. Subsequently, in PyPy, we need a way to represent PyObjects with a fixed address when we pass them to C-Extensions, but they also need to be adaptable due to the GC [32].

There are other issues and slowdowns regarding PyPy, but according to a recent talk by Victor Stinner at EuroPython, the issues regarding C-Extensions are the main reason PyPy is not widely adopted [19]. However, the main argument goes that PyPy currently is more efficient, while CPython is showing its age.
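A small CPython-only illustration of the reference counting described above (not taken from the thesis): sys.getrefcount exposes the reference count stored in every PyObject, including the temporary reference created by the call itself.

import sys

obj = []
print(sys.getrefcount(obj))   # e.g. 2: the local name plus the argument passed to the call

alias = obj                   # a second name now refers to the same object
print(sys.getrefcount(obj))   # one higher than before

del alias                     # dropping the alias decrements the count again
print(sys.getrefcount(obj))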

2.5 Cython

Writing a C-Extension can be considered tedious and does not always produce the expected results with respect to the time for executing said extension. The Cython language is a superset of the Python language that additionally supports calling C functions and declaring C types on variables and class attributes. Associated with the Cython language, there is a static compiler called Cython that compiles this code into an executable [13]. In order to really make use of Cython, one needs to heavily annotate the Python code, which could reduce the programmer’s productivity. It was decided to only examine Cython in this thesis for a few micro-benchmarks.

2.6 Additional Implementations

There exist several other Python implementations, which we will not discuss in depth in this master’s thesis as they are out of scope. We include them for completeness, but only mention those that are still in development. Jython, a JIT made for easier interaction with Java, is a Java implementation of Python [14]. Python programs are typically 2-10x shorter than the equivalent Java program, which translates directly to increased programmer productivity. Another is IronPython, a C# implementation of Python for the .NET framework; it is a JIT that compiles Python source code to the Dynamic Language Runtime, then further to Common Language Runtime byte code, which is in turn compiled to native machine code [15]. It thus lets Python run on the Microsoft Common Language Runtime. Shed Skin2 compiles Python to C++ to speed up the execution of computation-intensive Python workloads. Although it can then compile that C++ code into an executable, which can be run as a standalone program, it only works for programs written in a restricted subset of Python. As it is still in an early stage of development, it is not further investigated in this thesis. We do note, however, that we use benchmarks from this project in our framework. HotPy and Pyston are also mentioned, but are still in development. MicroPython is a C implementation of Python, optimized to run on microcontrollers [16]. A variant of MicroPython is CircuitPython, a beginner-friendly, open-source version of Python for tiny, inexpensive computers called microcontrollers [17]. Finally, we mention Nuitka3, a Python compiler written in Python. It is said to be a good replacement for the Python interpreter and compiles every construct that all relevant CPython versions offer.

2Shed Skin, an experimental Python-to-C++ compiler: https://shedskin.readthedocs.io/en/latest/documentation.html#introduction.

3Nuitka, a Python compiler written in Python: http://nuitka.net/pages/overview.html

Chapter 3

Benchmarks and Profiling

This chapter focuses on the different profiling methods that can be used for benchmarking Python implementations. Afterwards, we give a brief description of all the benchmarks used in this thesis. Finally, we discuss the set-up used, from the machine to the operating system the benchmarks were run on.

3.1 Profiling Methods

The benchmarking framework proposed in this thesis utilises the profiling tool perf. Nevertheless, there are various ways of profiling a program. In this section, we go over a few of the more commonly known profilers and timing methods in order to justify the use of perf. We note that even though we eventually chose perf, a different tool could be used instead.

3.1.1 Time Library and datetime

The built-in timeit1 function of Python can be used to time small bits of Python code. As we wanted to benchmark code in a separate instance of a Python Virtual Machine (VM), we moved on to other timing methods. As a second, but equally important remark, the timeit function does not suffice in this case because we want to measure more than only the metric time. The built-in datetime library2 is similar and was used for a part of the steady-state methodology, as will be explained later in this thesis.
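For reference, the sketch below (a hypothetical example, not one of the benchmarks used in this thesis) shows the two built-in options mentioned above: timeit for timing a small snippet, and datetime for wall-clock timing of a single run inside the same Python VM.

    import timeit
    from datetime import datetime

    # timeit: runs the snippet many times and returns the total time in seconds.
    snippet_total = timeit.timeit("sum(range(1000))", number=10_000)

    # datetime: coarse-grained timing of a single run within the current VM,
    # as used for part of the steady-state methodology.
    start = datetime.now()
    result = sum(i * i for i in range(1_000_000))
    elapsed_seconds = (datetime.now() - start).total_seconds()

    print(f"timeit total: {snippet_total:.3f} s, single run: {elapsed_seconds:.3f} s")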

1The timeit function of Python: https://docs.python.org/2/library/timeit.html
2The datetime library of Python: https://docs.python.org/3/library/datetime.html

3.1.2 Profile, cProfile, pycallgraph

Profile and cProfile3 are two built-in Python profilers. Similarly to pycallgraph [34], they measure the timings of specific functions in a program. That way, one can inspect and discover the bottlenecks residing in said program. We used cProfile in a first brief exploration of the MLPerf benchmarks, but decided not to use this method for our benchmarking framework since analyzing bottlenecks is not the main goal of this thesis. We did, however, use cProfile to identify the time spent in glue code in the MLPerf benchmark.
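As an illustration, the following minimal sketch (with a hypothetical workload, not one of the MLPerf benchmarks) shows how cProfile exposes per-function timings.

    import cProfile
    import pstats

    def workload():
        # Hypothetical stand-in for a benchmark whose bottlenecks we want to inspect.
        return sorted(str(i)[::-1] for i in range(200_000))

    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()

    # Report the ten functions with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)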

3.1.3 Intel VTune and perf Analysis Tools

Intel VTune4 and perf5 roughly operate in the same way. They profile a benchmark in a statistical way, are called sampling profilers and come with support for hardware events on several architectures. In other words, these tools use hardware performance counters. Both tools are suitable for what this thesis aims to do. For the microbenchmarks in the exploration phase during the first months of this thesis, the Intel VTune profiler was used. The perf tool was eventually used for the creation of the benchmark suite, as it was deemed easier to use. We note that since perf monitors programs, in other words Python invocations, we cannot monitor specific iterations of a benchmark within an invocation. For that, other methods were found, which will be discussed later in this thesis.
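As a rough illustration, the sketch below shows how a whole Python invocation could be measured with perf stat from a driver script. It is a hypothetical example, not the thesis framework itself: the event list is an assumption (the framework measures the larger set of Table 5.2) and counter availability depends on the machine.

    import subprocess

    EVENTS = "instructions,branch-misses,LLC-load-misses,context-switches,page-faults"

    def run_with_perf(python_binary, benchmark_script):
        # perf stat -x, prints one CSV-like line per event on stderr.
        cmd = ["perf", "stat", "-x", ",", "-e", EVENTS, python_binary, benchmark_script]
        completed = subprocess.run(cmd, capture_output=True, text=True)
        counts = {}
        for line in completed.stderr.splitlines():
            fields = line.split(",")
            if len(fields) >= 3 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return counts

    # Example usage (paths are placeholders):
    # print(run_with_perf("pypy3", "benchmarks/bm_nbody.py"))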

3.1.4 Perflib

Perflib6, written by John Emmons, enables the user to access perf results through a Python API. Although this is a very interesting library, at the time of conducting the research it was not possible to install Perflib for PyPy due to compilation problems regarding C-Extensions. Without a Python API, however, we are unable to measure individual iterations of a benchmark within the same Python VM invocation, i.e., when measuring steady-state. As we will see later on in this thesis, we found a way around this problem by solely using the perf tool.

3The Profile and cProfile profilers of Python: https://docs.python.org/2/library/profile.html
4The Intel VTune profiler: https://software.intel.com/en-us/vtune
5The Perf profiling tool by Brendan Gregg: http://www.brendangregg.com/perf.html
6Perflib, a Python library for accessing CPU performance counters on Linux, written by John Emmons: https://github.com/jremmons/perflib

3.2 Benchmarks

In this section we delve deeper into the benchmarks used for this thesis. We go over several existing benchmark suites and discuss some self-made workloads. We also discuss the MLPerf benchmarks and how we incorporated such workloads in this thesis. We note that a detailed overview of the final set of benchmarks along with their input parameters can be found in Appendix A. We refer to Table 3.1 for an overview of the number of benchmarks. Since the table in the Appendix is hard to interpret, we added visualisations of the raw data. Figure 3.1 visualizes the average execution time of the sorted workloads. Figure 3.2 visualizes the number of instructions of the sorted workloads. Figures 3.3, 3.4 and 3.5 do the same for the number of branch misses, number of Last-Level Cache (LLC) misses and number of context switches, respectively, all normalised to the number of instructions. The execution times of the different workloads vary between 0.5 and 350 seconds, with only a select few having an execution time below 5 seconds. The number of branch misses, normalised by the number of instructions, varies between 0.0001 and 0.1. The number of LLC misses, normalised by the number of instructions, varies from 0.001 to 0.15.

Subset                        Number of Benchmarks
PyPerformance                 29
Shed Skin                     42
Computer Benchmarks Game      15
C-Extensions & ML             25
Custom discrete algorithms     6
Total                        117

Table 3.1: An overview of the number of benchmarks, selected from the different benchmarking suites.

Figure 3.1: A visualisation of the execution times of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies.

Figure 3.2: A visualisation of the number of instructions of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies.

Figure 3.3: A visualisation of the number of branch misses of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

Figure 3.4: A visualisation of the number of LLC misses of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

Figure 3.5: A visualisation of the number of context switches of the sorted workloads, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

These figures serve purely to visualise the raw data. Since the workloads are sorted for every curve separately, we cannot compare individual workloads across curves. For this, we generated similar figures, now sorted only according to the CPython startup value. The results are shown in Figures 3.6, 3.7, 3.8, 3.9 and 3.10. We find that, in some cases, PyPy is slower than CPython. In most cases, PyPy has an execution time lower than that of CPython and in some cases it is equally fast. In the next chapters, we investigate when this is the case. Finally, we notice that PyPy generally executes fewer instructions than CPython and that, in most cases, it has a higher number of cache misses per instruction.

Figure 3.6: A visualisation of the execution times of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies.

Figure 3.7: A visualisation of the number of instructions of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies.

Figure 3.8: A visualisation of the number of branch misses of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

Figure 3.9: A visualisation of the number of LLC misses of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

Figure 3.10: A visualisation of the number of context switches of the workloads sorted by the execution time of CPython startup, executed using CPython and PyPy using both the startup and steady-state methodologies, normalised by the number of instructions.

3.2.1 Microbenchmarks

In order to get familiar with the general problem and project, three microbenchmarks were first explored, called LOOP, FUNC and MEM. The first simply iterates over a set of instructions such as additions and multiplications. The second microbenchmark is very similar, but iterates over function calls. The third is a more memory-intensive benchmark that investigates what happens to the benchmark results when we overload memory with more than the caches can handle.
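The following minimal sketch (hypothetical code, not the exact microbenchmarks used in this thesis; sizes and constants are assumptions) illustrates the structure of the three workloads.

    def loop_bench(iterations):
        # LOOP: a simple loop of short additions and multiplications.
        total = 0
        for i in range(iterations):
            total += i * 3 + 1
        return total

    def _work(i):
        return i * 3 + 1

    def func_bench(iterations):
        # FUNC: the same arithmetic, but behind a function call per iteration.
        total = 0
        for i in range(iterations):
            total += _work(i)
        return total

    def mem_bench(iterations, size=4_000_000):
        # MEM: touch a list that is larger than the caches, with a large stride
        # so that accesses keep missing in the cache hierarchy.
        data = list(range(size))
        total = 0
        for i in range(iterations):
            total += data[(i * 997) % size]
        return total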

3.2.2 PyPerformance Benchmark Suite

Most of the benchmarks from the PyPerformance Benchmark Suite7 are included in the final list of benchmarks. Some were disregarded as they gave rise to problems during translation to Python 3. Although most of these benchmarks focus on only a specific algorithm and do not seem to reflect industry workloads, they were included because they also appear in other research related to this topic.

3.2.3 Shed Skin Benchmark Suite

Some programs from Redondo et al. were also considered, for example the benchmarks from the experimental compiler called Shed Skin8, which can translate pure, but implicitly statically typed Python (2.4-2.6) programs into optimized C++ [23]. The Shed Skin documentation notes that the compiler is currently mostly useful to compile smaller programs that do not make extensive use of dynamic Python features or standard or external libraries. As it is still in early development, the Shed Skin compiler itself was not taken into account in this thesis.

3.2.4 Computer Benchmarks Game

The Computer Benchmarks Game9 is a project set up with the intention of comparing which programming language is fastest. It does so by implementing specific algorithms in different languages. From this project, we took a few of the Python implementations to use in this research.

7The PyPerformance Benchmark Suite: https://pyperformance.readthedocs.io/
8The Python to C++ compiler Shed Skin, along with their benchmark suite: https://shedskin.readthedocs.io/en/latest/
9The Computer Benchmarks Game: https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html

3.2.5 MLPerf

None of the aforementioned benchmark suites include ML benchmarks, although it is a topic that should not be neglected, as it is one of the main reasons Python is used more and more frequently. For this reason, we explored the MLPerf benchmark suite10, which is specifically aimed at measuring training and inference performance of ML hardware, software, and services using fair and useful benchmarks. Since the MLPerf benchmark suite provides industry-standard workloads for artificial intelligence chips, an attempt was made to add its ML workloads to our list of benchmarks. In order to make these benchmarks run, the installation of Tensorflow11 is necessary in almost all cases. We explicitly call this the installation of Tensorflow on the Python implementation, since Tensorflow is not written in Python. In fact, it is an optimized library written in C and, since it is not included in the standard Python language, it needs to be hooked to the Python implementation. However, at the time of writing this dissertation, it is not possible to install Tensorflow on PyPy. Therefore, we were unable to run most of the benchmarks from MLPerf in our framework. Instead, we hypothesised that it could be interesting to keep the library calls themselves as they are in CPython, but run the code connecting these calls, which we call glue code, in PyPy. As the code could not easily be split up, we were unable to verify our assumption this way. Instead, we identified the time spent in Tensorflow upon running a benchmark and found that, for the gnmt translation workload, around 95% of the time was spent in the C code. This disproves our assumption. In what follows, we investigate the time it takes to translate from Python to C code, that is, how the Python implementation talks to the C code of the library. This is called the use of C-Extensions and will be discussed further in this thesis.

3.2.6 Custom Algorithm Implementations

A few benchmarks related to the convex hull problem, created in the context of the Discrete Algorithms course at Ghent University, were included, as well as some basic ML algorithms using the scikit-learn library12. Finally, some initial benchmarks related to C-Extensions were explored.

10The MLPerf Benchmark Suite: https://mlperf.org/
11The machine learning library Tensorflow: https://www.tensorflow.org/
12The ML Python library scikit-learn: https://scikit-learn.org/stable/

3.3 Set-up

The machine used in this thesis is an HP Envy Notebook. This particular notebook has two Intel Core i7-6500U CPUs with a base frequency of 2.50 GHz and 12 GB of RAM at 1600 MHz. Although the microbenchmarks from Chapter 4 were run on Windows 10, the framework created in this thesis was developed on a dual-boot of Ubuntu 16.04.6 LTS (Xenial Xerus). Since the framework uses Linux-specific commands, it is not portable to other operating systems.

Chapter 4

Microbenchmarks

In order to understand and verify performance results from previous research, a few microbenchmarks were explored first. All microbenchmarks from this chapter were profiled using the VTune profiler and were run 27 times, which was the default number of iterations, in order to have consistent results. The results from this chapter were gathered during the initial exploratory phase of this thesis. The Confidence Intervals (CIs) of these microbenchmarks were too small to be visible and, as such, they are not shown.

4.1 LOOP and FUNC

The first microbenchmark investigated is called LOOP, in which a simple loop containing short calculations, such as additions and multiplications, is run for a certain number of iterations. The number of iterations in the loop is either 10M or 100M and the respective results of the LOOP benchmark are shown in Figure 4.1. For reference, the absolute timings of the LOOP benchmark for CPython were around 5 s for 10M iterations and 45 s for 100M iterations. The second benchmark is called FUNC, in which a certain compute-intensive function is called multiple times in a loop. Again, the number of iterations in the loop is either 10M or 100M. The results of the FUNC benchmark are reported in Figure 4.2. These microbenchmark results are normalized to the average CPU time of the corresponding CPython implementation.

[Chart: Avg CPU Time / Process — LOOP; y-axis: execution time relative to CPython; bars for CPython, PyPy, IronPython, Numba, CythonConv and CythonReg at 10M and 100M iterations.]

Figure 4.1: The average CPU time per process for the LOOP microbenchmark. For the blue bars, the number of iterations within the loop equals 10M. For the red bars, that number is 100M. Results are averaged over 27 executions. The JIT compilers PyPy and Numba are faster than CPython. IronPython is slower due to a high startup time. Cython is faster, but needs annotations in order to work optimally.

[Chart: Avg CPU Time / Process — FUNC; y-axis: execution time relative to CPython; bars for CPython, PyPy, IronPython, Numba, CythonConv and CythonReg at 10M and 100M iterations.]

Figure 4.2: The average CPU time per process for the FUNC microbenchmark. For the blue bars, the number of iterations within the loop is 10M. For the red bars, that number is 100M. Results are averaged over 27 executions. PyPy is able to identify hot-loops faster.

IronPython is slower because it targets the .NET framework and has a relatively large start-up time. Consequently, IronPython was not considered in further phases of this thesis. Both Cython Converted and Cython Regular, where annotations are added or not added, respectively, are faster than CPython. Not shown in these figures is that Converted Cython approaches the speed of C for this microbenchmark. This shows that it is necessary to annotate the entire code in order to reach the full effect of Cython, and it is precisely why Cython was only explored during the exploration phase of this thesis: annotating the entire code is a time-consuming process. For these first two figures, we find that the JIT compilers PyPy and Numba are faster for the compute-intensive microbenchmarks. We note that PyPy is faster for 10M FUNC than for 10M LOOP. This indicates that PyPy identifies a function call as hot code faster than it does simple calculations. This also gave rise to an investigation of the behaviour of the timings when increasing the number of iterations. The result can be seen in Figure 4.3, where results are normalized to the corresponding C implementation.

[Chart: Avg CPU Time / Process — LOOP; y-axis: slowdown w.r.t. the execution time of C; x-axis: number of iterations (10K to 1B); curves for C, CPython, PyPy and Numba.]

Figure 4.3: The average CPU time per process for the microbenchmark LOOP with compute-intensive loops. We notice that the execution time of CPython relative to C increases with the number of iterations. Numba has an initial high constant compilation overhead, which has less influence on the execution time at a larger number of iterations, because at that point the program is running more application code rather than compilation code. PyPy resembles CPython at a lower number of iterations and resembles Numba and C at higher numbers of iterations.

Noteworthy is the initial compilation overhead of Numba. As described before, Numba compiles the program immediately, which can result in worse performance. When the number of iterations is increased, this fixed compilation time has less and less influence on the total execution time. Eventually, the execution time of Numba closely resembles that of C. CPython’s execution time increases with the number of iterations, which is as expected as it just interprets at run-time. It is expected that with enough iterations, CPython will hit a limit and eventually flatten, an effect that is visible on the right side of the figure. Finally, PyPy keeps a relatively steady execution time at smaller iteration counts, which closely resembles CPython. This is because at that point in time, the aforementioned hot-loop is not yet identified. At around 10M iterations, the execution time starts to drop. This is a turning point where PyPy starts benefiting from the compilation. At that point, it has identified that the loop effectively is a hot-loop, after which it compiles that loop and then starts using the compiled binary instead of interpreting. The execution time then closely resembles that of C and Numba, because then they all use compiled binaries.

4.2 MEM

Finally, a more memory-intensive microbenchmark called MEM was considered, which is shown in Figure 4.4. Due to time constraints, Numba was again not investigated here, as annotating every separate benchmark would be too time-consuming. We include this graph to indicate that, even for microbenchmarks, a single specific speedup cannot be given. Sometimes the speedup of PyPy versus CPython is 2×, for other workloads it is 10× and for yet others it is even higher.

[Chart: Avg CPU Time / Process — MEM; y-axis: slowdown w.r.t. the execution time of C; x-axis: number of iterations (10, 1K, 100K); results for C, CPython and PyPy.]

Figure 4.4: The average CPU time per process for the MEM microbenchmark. Results are averaged over 27 executions.

Chapter 5

Benchmarking Framework

This chapter focuses on the implementation of the benchmarking framework. It serves as a basis for all results gathered in the next chapter. We describe the evaluation of workloads using the startup and steady-state methods and illustrate how PCA and clustering are integrated into the framework. It is important to highlight the need for both startup and steady-state in a rigorous way. This is often overlooked and should not be neglected when benchmarking different workloads. Eeckhout et al. show that frequently used methodologies can lead to incorrect or misleading conclusions [1]. They mention that many of the issues identified in their research apply to other programming languages, and thus also hold for Python workloads.

5.1 Start-up Performance

We start by describing the details of the startup evaluation method. We explain our decisions regarding cache clearing, the stopping criteria of the algorithm and how results are stored in memory.

5.1.1 Discarding first measurement

It is important to note that different measurements of one benchmark should be independent [1]. In practice, however, this is not necessarily the case, as the execution time of importing external dynamic libraries changes after the first iteration of measurements. Imported dynamic libraries are kept in the page cache (also known as disk cache), resulting in faster imports in subsequent invocations. To show this in practice, a specific benchmark was modified so that it imports a number

of libraries such as Matplotlib1, which initializes the whole library upon importing. In order to ensure that no libraries were loaded upfront, i.e., before the first invocation in this experiment, the cache buffers were cleared using the command 5.1:

echo "echo 1 > /proc/sys/vm/drop_caches" | sudo sh        (5.1)

As can be seen in Table 5.1, the first VM invocation takes significantly longer to import all the libraries. Subsequent invocations already have the libraries loaded in the page caches and take less time to execute. Based on these measurements it was decided that, in order to have meaningful results, the page caches should be cleared before every VM invocation. In particular, we do not assume any prior library loading before executing workloads. This might seem counter-intuitive at first, but we keep in mind that library loading also affects other benchmark evaluations and thus, in a way, makes even different benchmarks dependent on one another, as a library can be imported in more than one workload.

bm_graham_imports.py
Run    Time (s)
1      9.88
2      3.66
3      3.64
4      3.65
5      3.65

Table 5.1: Time taken to run the benchmark bm_graham_imports.py after clearing the page cache buffers, together with the times of its subsequent runs. This shows how imports affect the execution time of subsequent runs.

5.1.2 Implementation Specifics

The startup evaluation method as described in Statistically Rigorous Benchmarking was used to guarantee reliable results [1]. The complete methodology is described in Algorithm 1. We refer to the Statistically Rigorous Benchmarking paper if any of the calculations seem puzzling. Figure 5.1 gives a visual explanation of the algorithm.

1Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python: https://matplotlib.org/

[Diagram: Startup Procedure — every iteration i is a fresh Python VM invocation containing one benchmark run, yielding time t_i; iterations continue until the CI over {t_1, ..., t_i} is at most 5% around the average, or i > 30.]

Figure 5.1: A visual explanation of the startup methodology, which measures the startup time of both the Python invocation and one benchmark run. Alongside the execution time, it is possible to measure other metrics as well.

Algorithm 1: Startup Procedure
input : perf command to run the benchmark once
output: all measurements of all iterations
begin
    timings ← []
    all_metrics ← [][]
    for i ← 1, 30 do
        clear caches
        all_metrics[i] ← run(cmd)
        timings[i] ← all_metrics[i][0]
        n ← length(timings)
        m ← mean(timings)
        c ← t_{1−α/2; n−1} · s / √n
        if c / m < 0.05 then
            break
        end
    end
end

Imagine an empty 2D matrix. For every iteration i in Algorithm 1, we run a workload, measure all of the different metrics and store them in the ith row of the matrix. Assume that the first column of that matrix contains the execution times of the subsequent runs. Then, at the end of every iteration i, we calculate the 95% Confidence Interval (CI) around the average of all the execution times so far. If that CI is at most a small percentage around the average, we are sure to have run our benchmark often enough to obtain reliable results. To limit the overall measurement time, we decided to set this percentage to 5%; ideally, it should be around 1% or 2%. For the same reason, we set the limit on the maximum number of iterations to 30.
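A minimal sketch of this procedure is given below. It is not the thesis framework itself but an illustration under a few assumptions: measure_once() stands for a single perf-measured VM invocation returning a dictionary of counters with a 'time' entry, the confidence interval uses scipy, and clearing the page cache requires sudo as in command (5.1).

    import subprocess
    from statistics import mean, stdev
    from scipy.stats import t

    def drop_caches():
        # Same effect as command (5.1): clear the page cache before every invocation.
        subprocess.run('echo "echo 1 > /proc/sys/vm/drop_caches" | sudo sh',
                       shell=True, check=True)

    def startup_measurements(measure_once, threshold=0.05, max_iterations=30):
        all_metrics, timings = [], []
        for _ in range(max_iterations):
            drop_caches()
            metrics = measure_once()          # hypothetical: dict of perf counters
            all_metrics.append(metrics)
            timings.append(metrics["time"])
            n = len(timings)
            if n >= 2:
                # 95% CI half-width: t_{1-alpha/2; n-1} * s / sqrt(n)
                half_width = t.ppf(0.975, n - 1) * stdev(timings) / n ** 0.5
                if half_width / mean(timings) < threshold:
                    break
        return all_metrics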

5.1.3 Implementation Remarks

First, we note that for every iteration i of the startup methodology, we store the measurements of all the metrics. For every benchmark run, this corresponds to the ith row in the matrix. Although this seems trivial, it is important, since this is not possible in the steady-state methodology. A second remark concerns which metric to use to calculate the CI. Although we measure different kinds of metrics, we only take the CI of the metric time into account to decide whether or not to stop the algorithm. Other metrics can also be used for this purpose, but since that would increase the execution time too much, this was left out. Thirdly, we highlight once more that every iteration includes the clearing of the caches so that subsequent iterations are independent. We also note that this makes sure that different workloads are independent as well, since the loading of libraries affects every subsequent workload, not only every subsequent iteration.

5.1.4 Storage

For the sake of completeness, we mention that for every evaluation, all results are saved in a JSON format. By all results we mean: for all benchmarks, all the metrics of all the iterations are stored. For the startup procedure, every benchmark thus has a 2D matrix of metrics. A visualization of this storage system is shown in Figure 5.2.

[Diagram: Startup Storage — for every benchmark (BM 1, BM 2, ...), a 2D matrix with one row per iteration and one column per metric (execution time, number of instructions, ...).]

Figure 5.2: A visualisation of the storage system of the startup methodology. For every benchmark, we store a 2D matrix. Every row of such a matrix is an iteration of the startup methodology containing the different metrics. The CI in the algorithm is calculated only using the execution time column. The number of iterations i is specific to the benchmark.

5.2 Steady-State Performance

We continue with a detailed description of the steady-state procedure. We explain the stopping criteria and how we tackled the profiling within a VM for all of the metrics. Finally, we go over some problems that occurred during the implementation of this methodology and discuss how results are stored in memory.

5.2.1 Implementation Specifics

In order to guarantee reliable results in the steady-state strategy, we follow the evaluation method as described in Statistically Rigorous Benchmarking [1]. The complete methodology is described in Algorithm 2. Figure 5.3 gives a visual explanation of the algorithm. Imagine an empty 2D matrix. For every iteration i in Algorithm 2, we invoke a Python VM. Within that VM, we repeatedly run the workload until the Coefficient of Variation (CoV) over the execution times of the last four runs falls below a certain threshold (the convergence target). In our case, this threshold is set to 0.02. The execution times within the ith VM invocation correspond to the ith row of the matrix. After the inner loop of iteration i, we calculate the 95% CI around the average of the averages of the subsequent rows. If that CI is at most a small percentage around this average of averages, we are sure to have obtained enough reliable results. To limit the overall measurement time, we decided to set this percentage to 5%. Likewise, we set the limit on both the maximum number of inner and outer iterations to 30.
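The inner-loop stopping criterion can be summarised by the small sketch below (hypothetical helper code, not the thesis implementation): the coefficient of variation over the four most recent in-VM timings must drop below the convergence target of 0.02.

    from statistics import mean, stdev

    def cov_of_last_four(timings):
        # Coefficient of variation = standard deviation / mean over the window.
        window = timings[-4:]
        return stdev(window) / mean(window)

    def reached_steady_state(timings, target=0.02, window_size=4):
        return len(timings) >= window_size and cov_of_last_four(timings) < target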

5.2.2 Implementation Remarks

Firstly, in the startup methodology it was possible to measure the different metrics per iteration. For steady-state, however, it is not possible to measure any metric other than the execution time for a benchmark run within a VM invocation. The reason for this is that there does not exist a general perf API: perf, which allows the user to count events, has its scope defined outside the Python VM. As such, it is not possible to trigger the start or termination of the perf monitoring tool from inside the environment. As discussed in Chapter 3, there exists a Python API named Perflib, but since it does not work in the PyPy environment, the tool was not used. Thus, for steady-state, we have no other choice than to use the execution time in the CI calculation. Secondly, the size of the moving window is arbitrarily taken to be four. This is in order to create the steady-state effect by not including the first n − 4 measurements in the calculations, without increasing the total execution time too much.

Algorithm 2: Steady-State Procedure
input : benchmark information
output: all timings of all iterations
begin
    all_timings ← [][]
    timing_means ← []
    for i ← 1, 30 do
        start a Python environment using the shell
        cur_timings ← []
        for j ← 1, 30 do
            cur_timings[j] ← time(benchmark)
            if j > 4 then
                cov ← CoV(last four elements of cur_timings)
                if cov < 0.02 then
                    break
                end
            end
        end
        all_timings[i] ← cur_timings
        timing_means[i] ← mean(cur_timings)
        n ← length(timing_means)
        m ← mean(timing_means)
        c ← t_{1−α/2; n−1} · s / √n
        if c / m < 0.05 then
            break
        end
    end
end

[Diagram: Steady-State Procedure — every iteration i is a Python VM invocation containing benchmark runs t_i1, t_i2, ...; the inner loop stops when the CoV over the last four runs is at most 0.02 or j > 30; the outer loop stops when the CI over the per-invocation means {m_1, ..., m_i} is at most 5% around the average or i > 30.]

Figure 5.3: A visual explanation of the steady-state methodology. It is not possible to measure any metrics other than the execution time.

Thirdly, it is important to note that in this thesis it was decided to only execute one outer iteration, as the execution time of the procedure was too high. As such, in theory, we are only able to guarantee convergence within the Python VM and not over several invocations. However, the startup procedure tackled this for one execution per VM and, as such, we are satisfied with the decision.

5.2.3 Proposed Solutions

An immediate solution to this problem would be to configure the benchmarking framework as follows: first, measure one run of a workload using perf. Then run the code twice, measure using perf and take the average. Then similarly do the same for three executions, four, and so on. This would give the desired results but would increase the number of executions from N to N(N+1)/2. That is, instead of iteratively calculating the CoV by constantly adding one more iteration, we calculate it for one iteration, then rerun everything for two iterations, for three, and so on. Note that in this solution, the average time also includes the VM startup time (as well as all the benchmark runs, not only the last four), which also is not ideal. This complex solution is visualized in Figure 5.4.

[Diagram: Steady-State Procedure, complex solution — every iteration i is a Python VM invocation in which perf measures aggregates M_i1, M_i2, ...; the inner loop stops when the CoV over the last four aggregates is at most 0.02 or j > 30; the outer loop stops when the CI over {m_1, ..., m_i} is at most 5% around the average or i > 30.]

Figure 5.4: The complex solution to the problem of steady-state visualized. It is not an optimal solution since the CoV is not calculated over the last four benchmark runs. For any metric, M_ij is the average of that metric over the subsequent runs. m_i is the average over the last four M values in the ith iteration.

That is why we opted for a different method. It is more intuitive to first determine the necessary number of iterations iteratively, using solely the metric time as shown in Algorithm 2, and to then rerun all benchmarks with the number of iterations as a parameter. In this solution, measuring the execution time is done using a built-in timing library; in this thesis, we used the datetime library for this purpose. Finally, we have to average the results over the number of iterations per benchmark. This second solution is depicted in Figure 5.5. We note that in both solutions, aside from the added complexity, there is one catch: we have to add an extra parameter called iterations to every benchmark, to set the number of inner iterations from inside the benchmark. Say we set this parameter to 5 and tell perf to run the benchmark. If we now divide the total time by 5, we get the average execution time of the benchmark within the VM invocation. On the one hand this is a problem, since we do not explicitly calculate the average in steady-state (i.e., over the last four benchmark runs). On the other hand, even though this is not ideal, we determined the ideal number of warm-ups and thus were able to calculate the execution time in a correct way.
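A minimal sketch of such a parameterised benchmark is given below (hypothetical layout, not one of the actual benchmark scripts): the number of inner iterations j, determined in part 1, is passed on the command line, so that perf measures one VM invocation containing j runs and the driver divides every counter by j afterwards.

    import sys

    def run_benchmark():
        # Hypothetical stand-in for the actual workload.
        return sum(i * i for i in range(1_000_000))

    if __name__ == "__main__":
        iterations = int(sys.argv[1])   # j, determined by Algorithm 2 (part 1)
        for _ in range(iterations):
            run_benchmark()
        # The driver divides every perf counter of this invocation by `iterations`
        # to approximate the average per-run value in steady-state.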

[Diagram: Steady-State Procedure, proposed solution — every iteration i is a Python VM invocation in which the benchmark is run j times (j determined in part 1, e.g. j = 5, 7 or 6), and perf reports one aggregate measurement M_i per invocation.]

Figure 5.5: The proposed solution to the problem of steady-state visualized. It is not an optimal solution since the CoV is not calculated over the last four benchmarks. For any metric, Mij is the average of that metric over the subsequent runs. The number of inner iterations is determined from Algorithm 2.

5.2.4 Storage

For the sake of completeness, we mention that for every evaluation, all results are saved in a JSON format. The entire storage system is shown in Figure 5.6. For part 1 of steady-state, for every benchmark we have a 2D matrix of execution times. That is, for all benchmarks, for every VM invocation, a list of execution times of all iterations. In this thesis, we measure steady-state by only evaluating the first outer iteration. For part 2, for every benchmark, we have a 2D list of metrics. That is, for all benchmarks, for every VM invocation, a list of metrics each averaged over the determined number of iterations of part 1.

[Diagram: Steady-State Storage — Part 1 (determine the number of inner iterations j): for every benchmark, a 2D matrix with one row per VM invocation and one column per in-VM run, holding execution times. Part 2 (run perf for j runs in every iteration i): for every benchmark, a 2D matrix with one row per VM invocation and one column per metric (execution time, number of instructions, ...), each averaged over the j runs.]

Figure 5.6: A visual explanation of the storage system of the steady-state methodology. In part 1, for every benchmark, we store a 2D matrix. Every row of such a matrix is an iteration of the steady-state methodology containing the different execution times. The number of inner loops is calculated using the CoV. In part 2, for every benchmark, we store the different metrics averaged over the number of inner iterations j for that outer iteration i. The number of iterations i is specific to the benchmark. The CI is calculated using the execution times from part 1. 5.3 PCA and K-Means Clustering 42

5.3 PCA and K-Means Clustering

In the final section of this chapter we explain PCA and K-Means clustering and provide some possible alternatives that can be used in future research. We also discuss the implementation details of how we analysed the results of our benchmarking framework.

5.3.1 PCA and Alternatives

Principal Component Analysis (PCA) is a statistical data analysis technique that attempts to find a different representation for the data set [4]. Its two main objectives are to reduce the dimensionality of the used dataset and to identify new meaningful underlying variables [37]. It does so by defining new uncorrelated variables from the existing variables. Those new variables are called Principal Components (PCs) and are built in such a way that the first PC explains as much variance in the dataset as possible. Subsequent components account for as much of the remaining variability as possible, while being orthogonal to the other PCs. We effectively perform feature extraction in order to find variables with which clustering can be performed more easily. This comes from the hypothesis that, although several metrics are counted, most of them are highly correlated. For example, if the time goes up, the number of instructions (in general) also goes up. The same holds for the number of page faults, cache misses, and so on, with some exceptions. There exist several adaptations of the original PCA model used in this thesis, such as a more robust form of PCA [39]. A different approach would be to use Factor Analysis (FA) instead of PCA. FA differs from PCA in that there is a definite underlying model [38]. In contrast to PCA, it assumes that the original variables can be expressed as linear functions of hypothetical random variables. We note that the aforementioned models focus on linear transformations of the original variables. It is also possible to model non-linear functions, for example by using auto-encoders. Although auto-encoders can provide better explanations of the data (since they are trained to reconstruct the data), it is in general more complex to train an auto-encoder than it is to apply PCA. This simply follows from the fact that an auto-encoder needs backpropagation to train. We note that whichever model is chosen, the goal is to explain the data in such a way as to better understand what is going on and to help the clustering method that follows in reaching better results. As PCA is more intuitive, we decided to work with uncorrelated variables and chose PCA, similarly to Eeckhout et al. [4].
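As an illustration of how the number of components can be chosen from the explained variance (as is done for Figure 6.3 later in this thesis), consider the minimal sketch below; the input matrix is a random placeholder standing in for the scaled metrics of the 117 benchmarks.

    import numpy as np
    from sklearn.decomposition import PCA

    scaled_metrics = np.random.default_rng(1).random((117, 13))  # placeholder data
    pca = PCA().fit(scaled_metrics)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.argmax(cumulative >= 0.90)) + 1
    print(f"{n_components} components explain at least 90% of the variance")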

5.3.2 K-Means and Alternatives

One of the observations mentioned in the previous chapters was that most benchmark suites either contain benchmarks that run too short or contain so many benchmarks that executing the entire suite takes too long. The former was tackled by carefully selecting those benchmarks that run long enough, for example at least five seconds. Of course it has to be noted that, if the user explicitly wants shorter-running benchmarks, they should be included. The latter can be tackled by selecting only the most important benchmarks. We attempt to tackle this problem by using the unsupervised ML clustering model K-Means. We cluster the benchmarks based on similar normalised metrics and finally select only the benchmarks that are closest to the cluster means. This way, we only have to use the selected benchmarks to compare different Python implementations, which greatly reduces the evaluation time for later versions of Python implementations. We note, however, that PCA should be applied before the K-Means model. A first reason is as follows. Based on our hypothesis, we do not have to waste computation time on training our K-Means model on every metric. If we reduce the correlation between the features, we can run K-Means faster and more efficiently. This also enables us to generate useful visualizations. In addition, by applying PCA we reduce the dimensionality of the problem, effectively tackling the “Curse of Dimensionality”. A third reason is that PCA creates an orthogonal space. If this is not done, the underlying characteristics of the workloads would have too much impact to draw useful conclusions from the results.

5.3.3 Implementation Specifics

We start by listing the metrics that are measured using perf in Table 5.2. We remark that conclusions can only be drawn from the results when at least the following normalisation is performed: all metrics should be divided by the number of instructions of the specific run. Otherwise, there are too many variables influencing the metrics to correctly draw conclusions. For PCA, we also added a scaler so that all normalized metrics are normalized a second time, now to the [0, 1] interval. This way, PCA performs better on the results. The reason for this is that larger value ranges have a higher influence on PCA than smaller ranges, so it is best to remove the difference in these ranges. There are different options for scaling, such as scaling to zero mean and unit variance. However, since we prefer to keep the metrics intuitive (e.g., the number of instructions cannot be negative), we decided not to use this.

All Metrics
 #   Name (number of)
 1   execution time
 2   branch instructions
 3   branch misses
 4   LLC misses
 5   LLC references
 6   instructions
 7   context switches
 8   page faults
 9   L1 dcache load misses
10   L1 dcache loads
11   L1 dcache stores
12   LLC load misses
13   LLC loads

Table 5.2: A list of all the metrics measured in this thesis. The perf event counters can be found at [20]. The PMUs for the Intel Cores can be found at the end of the Intel Developer Manual at [21].

Selecting the optimal set {dimension, k} for K-Means is defined as follows. Every set of parameters, namely the number of PCs used in K-Means and the number of cluster means, denoted as dimension and k, respectively, leads to a specific solution. We calculate the harmonic mean of the speedup over the entire suite as well as the weighted harmonic mean of the speedup over the benchmarks closest to their respective cluster center. In this weighted average, the speedup of each selected benchmark is weighted by the number of benchmarks in its corresponding cluster. To make this more clear, say we have 100 benchmarks with a harmonic mean speedup of 5. We cluster using dimension = 5 and k = 10 and thus end up with 10 benchmarks, over which we calculate the weighted harmonic mean and find 7. We repeat this for a set of possible {dimension, k} combinations and select the combination for which the difference between the harmonic mean of the entire set and the weighted harmonic mean of the closest benchmarks is minimal.
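The analysis pipeline and the weighted average can be sketched as follows. This is an illustrative sketch under assumed data layouts (one row of raw perf counters per benchmark, with a known instruction-count column), not the exact thesis implementation.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def select_representatives(raw_metrics, instruction_column, dimension=5, k=10):
        # Normalise every metric by the instruction count of that run, rescale to
        # [0, 1], reduce dimensionality with PCA and cluster with K-Means.
        normalised = raw_metrics / raw_metrics[:, [instruction_column]]
        scaled = MinMaxScaler().fit_transform(normalised)
        components = PCA(n_components=dimension).fit_transform(scaled)
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(components)
        distances = kmeans.transform(components)        # distance to every centre
        representatives = [int(np.argmin(distances[:, c])) for c in range(k)]
        cluster_sizes = np.bincount(kmeans.labels_, minlength=k)
        return representatives, cluster_sizes            # sizes are used as weights

    def weighted_harmonic_mean(speedups, weights):
        speedups = np.asarray(speedups, dtype=float)
        weights = np.asarray(weights, dtype=float)
        return weights.sum() / (weights / speedups).sum()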

Chapter 6

Results

This chapter focuses on the results and conclusions drawn from all measurements using the workloads and methodologies mentioned in the previous chapters. Firstly, we go over the results gathered using PCA and investigate what metrics have the most influence on the individual components. As such, we also include the PC value differences between startup, steady-state and the different subsets of workloads. Next to that, we compare the different components to each other and inspect their relations. Secondly, we investigate and compare speedups from different subsets of the benchmarks and find whether or not results reported from other benchmark suites are valid. Here, we also give a summary of the quantitative results. We conclude this part with a performance analysis of CPython versus PyPy. Thirdly, we explore K-Means and how conclusions drawn from the entire set of workloads are affected when drawn from clustered results. We investigate whether or not it is necessary to run the entire list of benchmarks and attempt to find an optimal set of {dimension, k}. Finally, we explore C-Extensions more in-depth and put emphasis on the need for further research on this topic.

6.1 Principal Component Analysis

We start by investigating the underlying meaning of the different PCs. Thereafter, we provide some quantitative results regarding the different components.

6.1.1 Principal Component Meanings

We start by inspecting the influence of the different metrics on the various PC values. We can compare PC values in different ways, by selecting different kinds of subsets of the entire workload list. This includes choosing either startup, steady-state, CPython, PyPy and/or different benchmark suites. Figure 6.1 indicates the differences between startup and steady-state for CPython and Figure 6.2 presents the same for PyPy. We immediately notice that startup and steady-state are extremely similar, if not almost the same, for the CPython environment. This behaviour is as expected, since CPython interprets every next instruction; it does not differentiate between instructions that regularly return, as is the case in steady-state. This is a first indication that it might not be necessary to run both startup and steady-state for the CPython environment. Before diving into the differences between startup and steady-state for other environments, particularly PyPy, we investigate the different PC values of CPython and look at the influence of the different metrics on those components. PC 1 focuses primarily on positive cache behaviour. More specifically, benchmarks with a high number of cache misses and LLC loads will have a high first PC. It has to be noted that, due to the linear nature of a PC, i.e., a linear combination of the original metrics, this does not hold in the other direction: a CPython benchmark in startup with a high first PC does not necessarily have a high number of cache misses. Similarly, it was found that the second PC is a measure of the instruction mix of an application. More specifically, benchmarks with a high second PC are applications with a low number of jump instructions and a small number of L1 loads and stores. Finally, the third component is mostly influenced positively by time, context switches and page faults. In Figure 6.2, we notice that the PyPy environment has similar PC values. However, zooming in on the second PC of Figure 6.2, we notice different magnitudes of the metrics between the components in startup and steady-state. As the second PC is a measure of the instruction mix, this is explained by the fact that during compilation, a higher blend of instruction types is found, whereas in steady-state we always run with the same set of instructions. We identify differences in magnitude for other factor loadings as well, such as for time, context switches, page faults and branch misses. This again suggests a different behaviour between startup and steady-state. We also see that for benchmarks run using the startup methodology in the PyPy environment, more emphasis is put on the L1 cache loads and stores for the third PC. This shows that for startup and steady-state, the underlying idea behind the PC differs.

(a) The PC values for all benchmarks measured using the startup methodology.

(b) The PC values for all benchmarks measured using the steady-state methodology

Figure 6.1: The PC values for all benchmarks in the CPython Environment. (a) Startup methodology and (b) Steady-state methodology, which has values very similar to the startup PC values. PC 6 and 7 are the only components that behave differently.

(a) The PC values for all benchmarks measured using the startup methodology.

(b) The PC values for all benchmarks measured using the steady-state methodology

Figure 6.2: The PC values for all benchmarks in the PyPy Environment. (a) Startup methodology and (b) Steady-state methodology, which either has similar values or values with inverted behaviour (due to the linear relation of the components) compared to the startup PC values. A difference in magnitude for some metrics is also visible, which suggests a different behaviour between startup and steady-state.

6.1.2 Principal Component Selection

Before exploring the different components more in-depth using a quantitative analysis, we investigate the number of components needed to explain enough variance in the data. Figure 6.3 indicates that, in general, more PCs are needed to explain the same amount of variance in the PyPy environment than in the CPython environment. This is yet another confirmation of the difference in behaviour between the interpretation of CPython and the JIT compilation of PyPy. We keep in mind that CPython needs at least four dimensions to explain at least 90% of the variance, while PyPy needs at least five dimensions.

Figure 6.3: The percentage of explained variance for startup and steady-state methodologies for different environments. All benchmarks are used for this graph.

6.1.3 Principal Component Comparison

We continue with a comparison of the different PCs. Figure 6.4 visualizes the different components for CPython and PyPy in both the startup and the steady-state methodologies. PC 1 and 4 focus primarily on the differences between CPython and PyPy. Whilst PC 3 slightly distinguishes the startup methodology of PyPy from the other configurations, PC 2 and 5 do both things: they isolate PyPy from CPython and isolate the two methodologies of PyPy from one another. From the previous subsection, we know that for CPython, we need at least four PCs to explain 90% of the variance in the data; for PyPy, we need at least five PCs. As such, we do not discuss the other PCs. We do note that, except for PC 6, the startup and steady-state methodologies of CPython have almost identical behaviour. This is another confirmation that no distinction is necessary between the startup and steady-state methodologies for the CPython environment. PCA, however, attempts to distinguish CPython’s startup from CPython’s steady-state using PC 6. As can be seen in Figure 6.1, PC 6 has different factor loadings in startup than it has in steady-state. This is confirmed when looking at PC 6 in Figure 6.4, where it is clearly visible that CPython’s startup methodology is distinguished from the other configurations.

Figure 6.4: An overview of the components 1 through 6 for both CPython and PyPy for both the startup and the steady-state methodologies.

Now that we know what the underlying behaviour of the different components is, we can end this section with scatter plots comparing the different components. Figure 6.5 visualises PC 1 versus PC 2. We find that, in general, PyPy is easily distinguishable from CPython. We include a visualisation of PC 3 versus PC 5, which indicates a linear behaviour of CPython. The final visualisations in this section display PC 1 versus PC 5 and PC 2 versus PC 3. The former indicates once again that PyPy is easily distinguished from CPython. The latter also indicates that there is not much difference between startup and steady-state for CPython. Notable on all scatter plots is that CPython is generally clustered in the space, while PyPy is scattered throughout it. For example, Figure 6.6 has a PyPy outlier in the far top-right corner. This is, once again, a different view on the fact that a JIT compiler behaves differently compared to an interpreter.

Figure 6.5: A visualisation of PC 1 versus PC 2 in a scatter plot for the startup methodology. The environments are distinguishable.

Figure 6.6: A visualisation of PC 3 versus PC 5 in a scatter plot for the startup methodology. The linear behaviour of CPython is clearly visible.

Figure 6.7: A visualisation of PC 1 versus PC 5 in a scatter plot for the steady-state methodology. The environments are distinguishable.

Figure 6.8: A visualisation of PC 2 versus PC 3 in a scatter plot for all the configurations. CPython has similar results for both startup and steady-state (orange and red, respectively). PyPy is distinguishable from CPython.

6.2 Performance Analysis

6.2.1 Average Selection

According to Eeckhout in his book on performance evaluation methods [2], it is best to use the harmonic average to compare benchmarks over different VMs. This is in direct contradiction with the fact that the PyPerformance Benchmark Suite reports the geometric mean over its benchmarks. While extrapolating the performance number to a full workload space is questionable for the harmonic mean, the necessary assumptions of the geometric mean are unproven and probably not valid [2]. Thus, we first investigate the validity of using the geometric mean for drawing conclusions regarding the performance of Python implementations. We do so by examining the necessary assumptions for the geometric mean. According to Mashey, speedups, defined as the execution time of the base environment (CPython) divided by the execution time of the alternative environment (PyPy), are distributed according to a log-normal distribution [3]. In particular, this means that the speedups follow an asymmetric distribution with a long tail to the right. At first sight, Figure 6.9 indicates that this is the case. However, we must verify this using a statistical normality test. Eeckhout [2] notes that this log-normal distribution over the speedups means that the elements in the population are not normally distributed, but that their logarithms (of any base) are. Based on this observation, we can test for normality on the log of the speedups instead of testing log-normality on the speedups. We used three different normality tests to verify whether or not the log of the speedups is normal. All three tests have the null hypothesis that the population is normally distributed. The first is the Shapiro-Wilk test1, which returns a p value. If that value is less than a chosen alpha level, say 0.05, the null hypothesis is rejected. D’Agostino’s K2 test2 is similar to the Shapiro-Wilk test in that it attempts to identify whether or not a given sample comes from a normally distributed population. This test is based on the skewness and kurtosis of the data. The last test is the Anderson-Darling normality test3, which rejects the null hypothesis if the test statistic is larger than the critical values of the normal distribution.

1The Shapiro-Wilk normality test from the scipy library: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html
2The D’Agostino K2 normality test from the scipy library: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
3The Anderson-Darling normality test from the scipy library: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html


Figure 6.9: Histograms of the speedup of all benchmarks of PyPy with respect to CPython, for (a) the startup methodology and (b) the steady-state methodology. Both histograms are asymmetric and skewed to the right (long tail), suggesting that they follow a log-normal distribution.

All three tests reject the null hypothesis for the speedups found in this thesis, for both the startup and the steady-state methodology. We thus conclude that the geometric average should not be used to carry out a performance analysis of Python implementations.
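The three tests can be applied to the logarithm of the speedups with only a few lines of scipy. The sketch below is illustrative, not the exact script used for this thesis; the speedups file and variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical input: one speedup per benchmark (CPython time / PyPy time).
speedups = np.loadtxt("speedups.txt")
log_speedups = np.log(speedups)  # test normality of the logs (any base works)

alpha = 0.05

# Shapiro-Wilk: reject normality of the logs if p < alpha.
_, p_shapiro = stats.shapiro(log_speedups)

# D'Agostino's K^2: also returns a p value.
_, p_dagostino = stats.normaltest(log_speedups)

# Anderson-Darling: reject if the statistic exceeds the critical value
# at the chosen significance level (5% here).
ad = stats.anderson(log_speedups, dist="norm")
idx_5pct = list(ad.significance_level).index(5.0)

print("Shapiro-Wilk rejects log-normality:", p_shapiro < alpha)
print("D'Agostino K^2 rejects log-normality:", p_dagostino < alpha)
print("Anderson-Darling rejects log-normality:",
      ad.statistic > ad.critical_values[idx_5pct])
```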

6.2.2 Quantitative Analysis of Speedup Averages

In the previous subsection, we showed that the geometric average should not be used to carry out a performance analysis of Python implementations. In what follows, we present a quantitative analysis of both the harmonic and the geometric average, for both the PyPerformance Benchmark Suite and our proposed framework. The PyPy Speed Center reports that PyPy is 4.4 times faster than CPython. This result is based purely on the geometric average over the speedups of the benchmarks. Moreover, the comparison should be made on the same machine as the one used for our framework. We therefore rerun the PyPerformance Benchmark Suite (on which the PyPy Speed Center is based) on our machine. This results in a harmonic mean of 0.95 and a geometric mean of 4.2.

The harmonic average below 1 makes sense, since these runs follow the startup methodology. Running the same benchmarks of the PyPerformance Benchmark Suite using our framework, which is based on well-grounded research regarding statistically rigorous benchmarking and which distinguishes startup from steady-state, we find the following average speedups. Regarding the geometric mean: an average speedup of 2.2 over all benchmarks, of 1.64 using the startup methodology and of 2.96 using the steady-state methodology. Regarding the harmonic mean: an average speedup of 0.91 over all benchmarks, of 0.77 using the startup methodology and of 1.11 using the steady-state methodology.

Most importantly, according to the harmonic average, PyPy is slower than CPython in the startup methodology, while in steady-state PyPy is faster. Although this is an expected result, since JIT compilation takes time during startup, it was not clear from the results reported by the PyPerformance Benchmark Suite, nor from the geometric mean in our framework. This shows that the distinction between startup and steady-state is necessary, as is the use of the harmonic average.

Tables 6.1 and 6.2 provide the harmonic and geometric averages of our benchmarking framework for the different subsets of benchmarks discussed in Chapter 3. It is apparent that steady-state is faster than startup in all cases, as is again to be expected: CPython simply runs the benchmark, whereas PyPy constantly checks for hot loops and compiles them where necessary, so in a limited timeframe PyPy cannot take full advantage of its JIT. We notice that the Shed Skin workloads have higher speedups than the other benchmark suites. Most programs in this suite are heavily mathematical and are thus ideal candidates for compilation; this shows the strength of PyPy, but not its weaknesses, which can lead to wrong conclusions. We also note that C-Extensions & ML is the only subset that reports a slower average execution time for PyPy than for CPython, and this for the startup, steady-state and overall averages alike. Even though only one subset shows this behaviour, we emphasize that this is an extremely important result: C-Extensions occur in all ML and mathematical libraries, and improving on this kind of benchmark is key for future versions of Python implementations.

Tables 6.3 and 6.4 provide the harmonic and geometric averages of the speedup of steady-state with respect to the startup methodology. In other words, these values indicate how much each Python implementation speeds up from startup to steady-state. We note that the normality tests also reject the null hypothesis for these speedups. As expected, CPython remains close to, but slightly above, 1. This is because a workload execution in CPython under the steady-state methodology does not differ from its execution under the startup methodology, except that in steady-state the Python invocation startup time is not included.
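For reference, the harmonic and geometric averages reported in the tables below can be computed directly from the per-benchmark speedups; a minimal sketch with a hypothetical speedup array:

```python
import numpy as np
from scipy.stats import gmean, hmean

# Hypothetical per-benchmark speedups of PyPy with respect to CPython.
speedups = np.array([0.8, 1.1, 2.5, 4.0, 0.6])

harmonic = hmean(speedups)    # n / sum(1 / x_i)
geometric = gmean(speedups)   # (prod x_i) ** (1 / n)

print(f"harmonic mean:  {harmonic:.2f}")
print(f"geometric mean: {geometric:.2f}")
# For positive values the geometric mean is never smaller than the harmonic
# mean, which is one way the geometric average can overstate PyPy's speedup.
```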

Harmonic Average             Startup   Steady-State   All
All                          1.22      1.60           1.38
PyPerformance                0.77      1.11           0.91
Shed Skin                    4.20      6.78           5.18
Computer Benchmarks Game     1.34      1.86           1.56
C-Extensions & ML            0.80      0.84           0.82
Custom discrete algorithms   1.08      2.09           1.43

Table 6.1: Harmonic averages of the speedup of PyPy with respect to CPython for the metric time for the different benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.

Geometric Average            Startup   Steady-State   All
All                          2.10      3.46           2.69
PyPerformance                1.64      2.96           2.20
Shed Skin                    5.07      10.47          7.28
Computer Benchmarks Game     1.63      2.30           1.94
C-Extensions & ML            0.86      0.92           0.89
Custom discrete algorithms   1.09      2.17           1.54

Table 6.2: Geometric averages of the speedup of PyPy with respect to CPython for the metric time for the different subsets of benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.

As a result, pure Python benchmarks such as those from the Shed Skin Benchmark Suite show almost no speedup from startup to steady-state in CPython: the startup time of the Python implementation is extremely low in that case and barely affects the total benchmark execution time. In contrast, for Shed Skin the execution time in PyPy is reduced the most from startup to steady-state, because PyPy is known to work very well with pure Python code. The fact that PyPy is slower than CPython in startup is only observable when using the harmonic average. This is visualised once more in Figure 6.10: in startup, the harmonic average of the speedup from CPython to PyPy is 0.77, which is smaller than 1, whereas the geometric average of the same data is 1.64, which is greater than 1.

Harmonic Average             CPython   PyPy   All
All                          1.09      1.66   1.32
PyPerformance                1.11      1.76   1.36
Shed Skin                    1.03      2.04   1.37
Computer Benchmarks Game     1.13      1.56   1.31
C-Extensions & ML            1.11      1.17   1.14
Custom discrete algorithms   1.40      2.67   1.84

Table 6.3: Harmonic averages of the speedup of steady-state with respect to startup for the metric time for the different benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.

Geometric Average            CPython   PyPy   All
All                          1.14      1.88   1.46
PyPerformance                1.17      2.12   1.58
Shed Skin                    1.08      2.23   1.55
Computer Benchmarks Game     1.20      1.70   1.43
C-Extensions & ML            1.12      1.19   1.15
Custom discrete algorithms   1.40      2.79   1.98

Table 6.4: Geometric averages of the speedup of steady-state with respect to startup for the metric time for the different benchmark suites. Also includes the speedups for all benchmark suites together. All results in this table are gathered using our framework.

As such, the geometric average is misleading and should not be used. Finally, Figure 6.11 visualises the speedup when going from the startup methodology to the steady-state methodology, for both the CPython and the PyPy environment. First, it is apparent that the CPython environment shows almost no speedup from startup to steady-state. This is as expected, since both in startup and in steady-state, CPython is simply an interpreter. Note that the number of workloads with a high speedup in the CPython environment is limited; those workloads are crypto_pyaes, kanoodle, nbody, regex_v8, reverseComplement_long, reverseComplement_short and richards. These workloads either have a very low execution time, import a relatively large number of libraries, or both. For PyPy, quite a few benchmarks reach a speedup of 3× and some even reach 5×. In conclusion, the differences between CPython and PyPy are very apparent.
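The S-curves in Figures 6.10 and 6.11 are simply the per-benchmark speedups sorted in ascending order. A minimal matplotlib sketch, assuming a hypothetical speedups file:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-benchmark speedups (e.g. CPython time / PyPy time).
speedups = np.loadtxt("speedups.txt")

ordered = np.sort(speedups)
plt.plot(range(1, len(ordered) + 1), ordered, marker=".")
plt.axhline(1.0, linestyle="--", linewidth=0.8)  # speedup of 1 = no difference
plt.xlabel("Benchmark (sorted by speedup)")
plt.ylabel("Speedup")
plt.show()
```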


Figure 6.10: S-Curves for the PyPerformance Benchmark Suite. (a) The startup methodology: PyPy is slower than CPython (harmonic average of 0.77). This is not apparent from the geometric average of 1.64, which suggests that PyPy would be faster. (b) The steady-state methodology: PyPy is faster than CPython (harmonic average of 1.11).


Figure 6.11: S-Curves for the speedups of all the benchmarks from startup to steady-state. (a) The CPython environment. (b) The PyPy environment.

6.3 Clustering

We now inspect whether we reach the same results and conclusions when using only the benchmarks closest to the cluster centers. We do so first for speedups defined as the execution time in the CPython environment divided by the execution time in the PyPy environment, for both the startup and the steady-state methodology. Secondly, we conduct a similar analysis for speedups defined as the execution time under the startup methodology divided by the execution time under the steady-state methodology, for both the CPython and the PyPy environment. As mentioned in Chapter 5, we cluster the workloads using K-Means clustering4 after preprocessing the results with PCA. In all cases, we fix the dimension and vary the number of clusters k. For every value of k, we calculate the weighted harmonic average of the speedups of the selected set of closest benchmarks, where the weights correspond to the number of benchmarks in the corresponding cluster. We call the difference between this weighted harmonic average and the harmonic average over all benchmarks the Harmonic Cluster Error (HCE); the analogous difference calculated with geometric averages is the Geometric Cluster Error (GCE). We note that the graphs in the following sections contain 234 workloads. This is explained as follows: we start with 117 benchmarks and run each of them in both CPython and PyPy, using both the startup and the steady-state methodology, which gives 468 data points. Clustering according to, for example, the first definition of speedup then retains 234 speedups: 117 obtained with the startup methodology and 117 with the steady-state methodology. We decided to cluster all speedups together, regardless of the methodology they use; this is purely a research decision. The same decision is made for the other definition of speedup.
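A minimal sketch of this clustering step and of the Harmonic Cluster Error, assuming features holds the PCA-reduced workload data (one row per workload, as a NumPy array) and speedups the corresponding per-workload speedups; both are hypothetical variable names:

```python
import numpy as np
from scipy.stats import hmean
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def harmonic_cluster_error(features, speedups, k):
    """Cluster the PCA-reduced workloads into k clusters, take the benchmark
    closest to each cluster center, and compare the weighted harmonic average
    of those representatives with the harmonic average over all benchmarks."""
    km = KMeans(n_clusters=k, random_state=0).fit(features)

    # Index of the benchmark closest to each of the k cluster centers.
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, features)

    # Weights: the number of benchmarks assigned to each cluster.
    weights = np.bincount(km.labels_, minlength=k)

    # Weighted harmonic average of the representatives' speedups.
    weighted_subset = weights.sum() / np.sum(weights / speedups[closest])
    overall = hmean(speedups)
    return abs(weighted_subset - overall)
```

Sweeping k and plotting the returned error produces curves of the kind shown in Figures 6.12 and 6.13.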

6.3.1 CPython versus PyPy

We start with the first definition of the speedup. From Section 6.1 we know that, with respect to the necessary dimension, the startup methodology does not differ from the steady-state methodology. Figure 6.12 visualises the HCE and the GCE for the two methodologies. We generated this figure as follows: we selected the benchmarks executed with the specific methodology, processed the data using PCA, applied K-Means clustering using four dimensions, and finally visualised the respective cluster errors.

4 The K-Means model from the scikit-learn library: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

This figure indicates that, for both averages, around 60 benchmarks are needed for the respective errors to drop below 2%. Should the benchmark set be used again for a different evaluation, only those 60 benchmarks are thus needed. Later in this chapter, we zoom in on which benchmarks are selected. These figures also clearly show that the geometric average in general overestimates the speedup, i.e., higher speedups are found when using the geometric average.

Figure 6.12: The difference of the harmonic/geometric average of the speedups of the benchmarks closest to the cluster centers and the harmonic/geometric average of the speedups over all benchmarks. Speedups are defined as execution time of CPython divided by execution time of PyPy. Both dimensions are equal to four. 60 benchmarks are needed in order for the weighted average to be within a 2% range of the average over all benchmarks. (a) Harmonic average and (b) Geometric average.

6.3.2 Startup versus Steady-State

Before inspecting the selected benchmarks for the determined dimension and number of clusters, we inspect the averages for the second definition of speedup. Here, we divide the execution time under the startup methodology by the execution time under the steady-state methodology, for both CPython and PyPy. From Section 6.1, we know that CPython needs at least four dimensions to explain at least 90% of the variance in the data, whilst PyPy needs at least five. Figure 6.13 visualises the HCE and the GCE for the two environments. From these visualisations we find that, in order to approach the averages over all benchmarks within a 2% range, we need only around 30 benchmarks. Again, the geometric average overestimates the speedup. Note that more benchmarks are needed when evaluating the speedup from CPython to PyPy than when evaluating the speedup from startup to steady-state. This can be understood as follows: when evaluating from startup to steady-state, one effectively measures the compiler's strength, i.e., how well the JIT compiler can compile a Python program from startup to steady-state. When evaluating from CPython to PyPy, on the other hand, one essentially measures how well a specific workload compiles; in order to get a good average, a larger number of specific workloads is therefore necessary.

Figure 6.13: The difference of the harmonic/geometric average of the speedups of the benchmarks closest to the cluster centers and the harmonic/geometric average of the speedups over all benchmarks. Speedups are defined as execution time in startup divided by execution time in steady-state. CPython's dimension is four, PyPy's dimension is five. 30 benchmarks are needed in order for the weighted average to converge within a 2% range of the average over all benchmarks. (a) Harmonic average and (b) Geometric average.

6.3.3 Specific Workloads

In summary: for the speedup from CPython to PyPy, 60 clusters are needed for both the startup and the steady-state methodology, and both methodologies' dimensions are taken to be four. For the speedup from startup to steady-state, 30 benchmarks are needed for both the CPython and the PyPy environment; the dimension for the CPython workloads is four, whereas the dimension for the workloads executed in the PyPy environment is five. We now investigate, for the configurations {dimension, k} equal to {4, 60} for startup, {4, 60} for steady-state, {4, 30} for CPython and {5, 30} for PyPy, which benchmarks are chosen and on which benchmarks the K-Means clustering focuses. We will not list all of the benchmarks for the four configurations. Instead, we finish this section with a few remarks. Firstly, for every configuration we find that the selected benchmarks are not specific to one benchmark suite; all benchmark suites contribute in a fair way, which shows that the clustering mechanism selects benchmarks evenly from all suites. Secondly, even the inner joins of the selected benchmarks, i.e., benchmarks that appear in both the startup and the steady-state list, or in both the CPython and the PyPy list, are evenly distributed. The fairness of the different configurations can be seen in the bar plots in Figure 6.14.
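The fairness check behind Figure 6.14 amounts to counting, per suite, how many of the selected benchmarks it contributes; a minimal sketch with hypothetical benchmark names and suite mapping:

```python
from collections import Counter

# Hypothetical inputs: names of the benchmarks closest to the cluster centers,
# and a mapping from benchmark name to its benchmark suite.
selected_names = ["fasta_1000000", "go_500", "voronoi_5000", "graham_20"]
suite_of = {"fasta_1000000": "CBG", "go_500": "PyPerf",
            "voronoi_5000": "Shed Skin", "graham_20": "DA"}

per_suite = Counter(suite_of[name] for name in selected_names)
for suite, count in per_suite.most_common():
    print(f"{suite}: {count} selected benchmark(s)")
```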

Figure 6.14: Bar plots of the different clustering configurations. Fairness is observed between the different benchmark suites. Although the discrete algorithms seem underrepresented, this is expected: it is the smallest suite, with only 6 workloads.

6.4 C-Extensions

To conclude this chapter, we investigate the time it takes to cross the boundary between Python and C code. More specifically, we look at the time taken for the Python implementation to talk to the C code of a library. For this, we investigate custom written C-Extensions for Python that we compile into the CPython and PyPy interpreters.

We analyze three different scenarios. The first, where CPython is faster than PyPy, is visualised in Figure 6.15. The second, where PyPy is faster than CPython, is visualised in Figure 6.16. The third, where both environments are equally fast, is visualised in Figures 6.17 and 6.18. Before going into the three scenarios, we note that all of these benchmarks scale linearly in their parameter. We can thus calculate a linear approximation to the execution time from the measurements, which allows us to compare the different Python implementations. The approximations are listed in Table 6.5.

For the Logistic Regression benchmark in Figure 6.15, we find that the slope of CPython is smaller than the slope of PyPy. This shows that in the limit, CPython is faster than PyPy for this specific workload. For the Isotonic Regression benchmark in Figure 6.16 the inverse is true: although PyPy is initially slower due to its larger constant, it is faster in the limit. Finally, we take a closer look at the Prime Calculator benchmark in startup and steady-state, a custom C-Extension written in C and compiled into the different Python implementations. Figure 6.17 shows that CPython and PyPy behave almost identically in the limit, but differ by a constant of 1.67. This is due to the difference in startup time between the CPython and PyPy VMs. This constant is smaller when running the benchmark in steady-state, as can be seen in Figure 6.18.

In summary, for Logistic Regression, CPython is faster when adding more iterations; for Isotonic Regression, the inverse is true; and for the Prime Calculator, PyPy is as fast as CPython. It makes sense that PyPy is faster for pure Python code, thanks to its JIT. However, C-Extensions are sometimes faster, sometimes slower, sometimes on par and sometimes do not work at all, cf. TensorFlow. This is because the translation through cpyext, PyPy's alternative to Python.h, is not yet optimal. We end this chapter by stressing that future work should identify why PyPy is slower for certain C-Extensions. As long as this C-Extension drawback remains, PyPy will not be adopted as a solid second reference implementation next to CPython.

Figure 6.15: The MLogistic example of scikit-learn using the MNIST dataset, adapted into a benchmark and evaluated using the startup methodology. CPython is faster.

Figure 6.16: The Isotonic Regression model from scikit-learn as a benchmark, evaluated using the startup methodology. PyPy is faster in the limit.

Figure 6.17: A custom C-Extension benchmark that calculates primes, evaluated using the startup methodology. CPython and PyPy behave the same way, apart from an offset equal to the difference in startup time of the two VMs.

Figure 6.18: A custom C-Extension benchmark that calculates primes, evaluated using the steady-state methodology. CPython and PyPy are equal since in steady-state the difference in startup time of the Python implementations is not included in the measurements.

Linear Approximations of the Execution Time of C-Extension Benchmarks
Benchmark                         CPython             PyPy
Logistic Regression (Startup)     11.30 + 27.19 × i   16.85 + 29.00 × i
Isotonic Regression (Startup)     10.16 + 12.36 × i   12.51 + 9.07 × i
Prime Calculator (Startup)        0.64 + 18.98 × i    2.31 + 19.04 × i
Prime Calculator (Steady-State)   0.16 + 19.07 × i    0.78 + 18.98 × i

Table 6.5: Linear approximations to the custom C-Extension benchmarks, calculated using the first values, i.e. i = 1, 2, where i is the parameter of the specific benchmark. For Logistic Regression, CPython has a smaller slope and thus executes faster when adding more iterations. For Isotonic Regression, the reverse is true: PyPy has a smaller slope and thus executes faster when adding more iterations. For the Prime Calculator, both environments have a similar slope and are equally fast.
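The entries in Table 6.5 follow directly from the measurements at i = 1 and i = 2: the slope is the difference of the two timings, and the intercept is what remains at i = 0. A minimal sketch, using the Logistic Regression (startup) timings from Table A.1:

```python
def linear_approximation(t1, t2):
    """Fit t(i) = a + b*i through the measurements at i = 1 and i = 2."""
    slope = t2 - t1          # b
    intercept = t1 - slope   # a, since t(1) = a + b * 1
    return round(intercept, 2), round(slope, 2)

# Logistic Regression, startup methodology (timings from Table A.1, IDs 2 and 3).
print(linear_approximation(38.49, 65.68))   # CPython: (11.3, 27.19)
print(linear_approximation(45.85, 74.85))   # PyPy:    (16.85, 29.0)
```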

Chapter 7

Conclusion

We gained a better understanding of the meaning of the different PCs for the different Python environments and identified (dis)similarities between the PCs of the startup and the steady-state methodologies for both environments. Alongside this, we described the behaviour of the PCs when plotted against other PCs and found that PyPy requires more PCs than CPython to explain the same amount of variance.

We found that the geometric mean is not a good fit for benchmarking Python environments and thus argue against the use of the geometric average in the PyPerformance Benchmark Suite and the different Python speed centers. We confirm that for pure Python code, PyPy is faster than CPython. Alongside this, we found that for CPython, the startup methodology barely differs from the steady-state methodology. For PyPy, on the contrary, several differences are noted, because a JIT compiler behaves differently from an interpreter. However, due to problems regarding C-Extensions, which also arise when attempting to run MLPerf benchmarks, PyPy is not widely adopted as a reference implementation. Improving on this matter is key for future versions of PyPy and, in general, future versions of Python implementations.

We clustered benchmarks according to two definitions of speedup and found for both definitions an optimal set of parameters for clustering the data. Using these parameters, we identified a small set of benchmarks that approximates the conclusions drawn over all workloads.

All of the results mentioned in this thesis are generated using an automatic benchmarking suite, designed to run Python benchmarks using different Python implementations, that benchmarks Python workloads in both startup and steady-state for different metrics. It also generates various graphs and provides an easy-to-use API to quickly gain more insight into the benchmarking results.

Bibliography

[1] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous java performance evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, OOPSLA '07, pages 57–76, New York, NY, USA, 2007. Association for Computing Machinery.

[2] Lieven Eeckhout. Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture, 5(1):1–145, 2010.

[3] John R. Mashey. War of the benchmark means: Time for a truce. SIGARCH Comput. Archit. News, 32(4):1–14, September 2004.

[4] Lieven Eeckhout, Hans Vandierendonck, and Koen De Bosschere. Quantifying the impact of input data sets on program behavior and its applications. Journal of Instruction-Level Parallelism, 5:1–33, 2003.

[5] Python Software Foundation. Python programming language, 2019. https://www.python.org/.

[6] TIOBE Software Quality Company. Tiobe programming languages popularity index, 2019. https://www.tiobe.com/tiobe-index/.

[7] R. Komorn. Python in production engineering, 2016. https://engineering.fb.com/production-engineering/python-in-production-engineering/.

[8] A. Ramanujam, E. Livengood. Python at Netflix, 2019. https://netflixtechblog.com/python-at-netflix-bba45dae649e.

[9] O. Ike-Nwosu. Inside the python virtual machine. https://leanpub.com/insidethepythonvirtualmachine/read.

[10] PyPy Development Team. Pypy, a python implementation. http://www.pypy.org/.

[11] The PyPy Project Revision. Rpython, subset of python programming language, 2016. https://rpython.readthedocs.io/en/latest/.

[12] Anaconda. Numba, a python implementation, 2012. http://numba.pydata.org/.

[13] Cython Development Team. Cython, a python implementation.

[14] Python Software Foundation. Jython, a python implementation. https://www.jython.org/.

[15] .NET Foundation. Ironpython, a python implementation. https://ironpython.net/.

[16] D. George. Micropython, a python implementation for microcontrollers, 2014. https://micropython.org/.

[17] D. George. Circuitpython, a beginner friendly, open source version of python for tiny, inexpensive computers called microcontrollers, 2017. https://github.com/adafruit/circuitpython.

[18] N. Bivaldi. Using pypy instead of python for speed by niklas bivald at pycon sweden, 2017. https://www.youtube.com/watch?v=1n9KMqssn54.

[19] V. Stinner. Python performance: Past, present and future at europython conference, 2019. https://www.youtube.com/watch?v=TXRPCZ7Nmh4.

[20] Brendan Gregg. Linux perf tool hardware events, 2009. http://man7.org/linux/man-pages/man2/perf_event_open.2.html.

[21] Intel. Intel® 64 and ia-32 architectures developer's manual: Vol. 3b, 2016. https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html.

[22] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, and Armin Rigo. Tracing the meta-level: Pypy's tracing jit compiler. In Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, ICOOOLPS '09, pages 18–25, New York, NY, USA, 2009. Association for Computing Machinery.

[23] Jose Manuel Redondo and Francisco Ortin. A comprehensive evaluation of common python implementations. IEEE Software, 32(4):76–84, July 2015.

[24] M. Ismail and G. E. Suh. Quantitative overhead analysis for python. In 2018 IEEE International Symposium on Workload Characterization (IISWC), pages 36–47, Sept 2018.

[25] Python Development Team. Common set of benchmarks for python implementations, 2012. https://speed.python.org/.

[26] PyPy Development Team. Common set of benchmarks for python implementations, 2012. https://speed.pypy.org/.

[27] Victor Stinner. Pyperformance benchmark suite, 2012. https://pyperformance.readthedocs.io/index.html.

[28] Shedskin Development Team. Shedskin benchmark suite, 2016. https://code.google.com/archive/p/shedskin/downloads.

[29] slm. Emptying the buffers cache, 2013. https://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system.

[30] Brendan Gregg. Emptying the buffers cache, 2009. https://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system.

[31] John Emmons. A python library for accessing cpu performance counters on linux, 2018. https://github.com/jremmons/perflib.

[32] Antonio Cuni. Inside cpyext, 2018. https://morepypy.blogspot.com/2018/09/inside-cpyext-why-emulating--c.html.

[33] Gergö Barany. Python interpreter performance deconstructed. In Proceedings of the Workshop on Dynamic Languages and Applications, Dyla'14, pages 1–9, New York, NY, USA, 2014. Association for Computing Machinery.

[34] Gerald Kaszuba. Python call graph, 2007. https://pycallgraph.readthedocs.io/en/master/.

[35] P. Mattson, V. Janapa Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, G. Wei, and C. Wu. Mlperf: An industry standard benchmark suite for machine learning performance. IEEE Micro, pages 1–1, 2020.

[36] Guido van Rossum. Gradual typing for python 3, 2015. https://www.youtube.com/watch?v=2wDvzy6Hgxg&t=16m52s.

[37] Hongchuan Yu and M. Bennamoun. 1d-pca, 2d-pca to nd-pca. In 18th International Conference on Pattern Recognition (ICPR’06), volume 4, pages 181–184, 2006.

[38] Ian Jolliffe. Principal Component Analysis, volume 25. 01 2002.

[39] Ian Jolliffe and Jorge Cadima. Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374:20150202, 04 2016.

[40] J. Choi, T. Shull, M. J. Garzaran, and J. Torrellas. Shortcut: Architectural support for fast object access in scripting languages. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 494–506, June 2017.

[41] B. Ilbeyi, C. F. Bolz-Tereick, and C. Batten. Cross-layer workload characterization of meta-tracing jit vms. In 2017 IEEE International Symposium on Workload Characterization (IISWC), pages 97–107, Oct 2017.

[42] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. SIGPLAN Not., 40(6):190–200, June 2005.

Appendix A

Benchmark List

As mentioned in Section 3.2, the proposed benchmark suite contains benchmarks from a variety of sources. Table A.1 shows all of the benchmarks used in this thesis along with the suite each benchmark comes from. If applicable, the input parameters of the benchmarks are given.

ID   Suite        Name                    Parameters   CP SU    CP SS    PP SU    PP SS
1    C-Ext & ML   MLP_MNIST               1            68.01    63.82    319.15   308.19
2    C-Ext & ML   MLogistic_MNIST         1            38.49    31.52    45.85    35.36
3    C-Ext & ML   MLogistic_MNIST         2            65.68    58.71    74.85    65.11
4    C-Ext & ML   MLogistic_MNIST         3            92.61    85.83    104.74   95.04
5    C-Ext & ML   MLogistic_MNIST         4            119.90   113.09   135.22   125.12
6    C-Ext & ML   MLogistic_MNIST         5            147.41   140.54   165.11   155.27
7    C-Ext & ML   basic_numpy             10000        6.30     4.48     9.45     5.45
8    C-Ext & ML   basic_numpy             15000        11.10    9.08     14.02    10.17
9    C-Ext & ML   basic_numpy             5000         3.32     1.91     6.24     2.67
10   C-Ext & ML   c_extension_1           *            76.87    76.32    78.56    76.83
11   C-Ext & ML   c_extension_2           1            19.62    19.22    21.35    19.75
12   C-Ext & ML   c_extension_2           2            38.59    38.29    40.39    38.73
13   C-Ext & ML   c_extension_2           3            57.67    57.34    59.43    57.76
14   C-Ext & ML   c_extension_2           4            76.68    76.34    78.50    76.79
15   C-Ext & ML   c_extension_2           5            95.79    95.41    97.56    95.92
16   C-Ext & ML   c_extension_2           6            114.72   114.42   116.58   114.94
17   C-Ext & ML   c_extension_2           7            133.93   133.47   135.58   133.97
18   C-Ext & ML   dataframe               10000        42.44    40.23    53.84    47.01
19   C-Ext & ML   dataframe               15000        91.68    85.66    101.83   100.07
20   C-Ext & ML   dataframe               5000         15.91    12.43    23.79    18.75
21   C-Ext & ML   isotonic_regression     1            22.52    16.94    21.57    13.72
22   C-Ext & ML   isotonic_regression     2            34.89    30.18    30.64    23.14
23   C-Ext & ML   isotonic_regression     3            48.26    43.92    40.35    32.78
24   C-Ext & ML   isotonic_regression     4            61.72    57.53    49.62    42.14
25   C-Ext & ML   isotonic_regression     5            75.05    70.90    58.70    51.47
26   CBG          binaryTree              17 7         33.70    34.80    6.81     4.06
27   CBG          binaryTree              21 4         106.62   109.11   16.26    13.42
28   CBG          binaryTree              7 12         44.81    46.73    8.63     5.49
29   CBG          fasta                   1000000      3.48     3.04     3.66     1.93
30   CBG          fasta                   10000000     29.28    28.90    15.98    13.95
31   CBG          fasta                   25000000     73.11    74.14    36.03    34.42
32   CBG          fasta                   5000000      15.46    14.57    9.14     7.27
33   CBG          knucleotide             1000000      4.21     3.49     5.04     2.71
34   CBG          knucleotide             10000000     31.03    33.67    20.70    16.50
35   CBG          knucleotide             25000000     83.50    85.26    45.71    42.45
36   CBG          knucleotide             5000000      4.11     3.11     5.18     2.64
37   CBG          regexRedux              10000000     16.63    18.52    11.14    7.65
38   CBG          regexRedux              5000000      5.09     4.69     5.37     2.19
39   CBG          reverseComplement_long  100000000    19.31    7.23     25.48    10.29
40   CBG          reverseComplement_short 100000000    13.11    3.34     16.59    2.37
41   DA           graham                  19           7.47     5.60     7.64     3.18
42   DA           graham                  20           16.13    12.27    13.35    4.71
43   DA           graham                  21           34.85    26.87    26.25    10.76
44   DA           quickHull               19           6.00     3.96     6.53     1.17
45   DA           quickHull               20           12.09    8.13     11.30    4.80
46   DA           quickHull               21           24.80    16.65    22.44    10.49
47   PyPerf       crypto_pyaes            *            0.81     0.35     2.47     0.64
48   PyPerf       deltablue               125000       30.44    29.83    139.41   143.76
49   PyPerf       deltablue               25000        4.58     4.13     7.15     5.27
50   PyPerf       deltablue               50000        9.56     9.28     23.90    21.91
51   PyPerf       deltablue               75000        15.73    15.00    51.99    51.21
52   PyPerf       fannkuch                10           9.23     8.87     3.87     2.13
53   PyPerf       fannkuch                11           106.39   111.96   21.21    19.49
54   PyPerf       go                      300          11.18    10.62    8.71     4.62
55   PyPerf       go                      500          40.69    40.32    20.23    14.68
56   PyPerf       go                      700          88.62    87.97    38.77    29.49
57   PyPerf       mdp                     *            5.62     5.02     9.14     6.55
58   PyPerf       nbody                   *            0.59     0.16     2.36     0.48
59   PyPerf       nqueens                 10           16.44    19.73    5.37     3.86
60   PyPerf       pidigits                10000        7.74     7.27     8.19     6.42
61   PyPerf       pidigits                20000        32.20    31.88    31.39    29.34
62   PyPerf       pidigits                30000        75.21    75.89    69.05    66.79
63   PyPerf       pidigits                5000         2.25     1.85     3.63     1.91
64   PyPerf       pyflate                 *            1.92     1.43     3.29     0.69
65   PyPerf       raytrace                300 300      9.00     8.42     2.90     0.51
66   PyPerf       raytrace                400 400      15.46    17.21    2.95     0.87
67   PyPerf       raytrace                500 500      23.86    26.01    3.13     1.01
68   PyPerf       raytrace                600 600      34.29    33.49    3.24     1.16
69   PyPerf       raytrace                700 700      47.47    45.91    3.37     1.25
70   PyPerf       regex_v8                *            0.79     0.26     2.57     0.15
71   PyPerf       richards                *            0.71     0.32     2.44     0.33
72   PyPerf       spectral_norm           1000         13.99    13.57    2.83     1.07
73   PyPerf       spectral_norm           1500         30.69    30.18    3.37     1.61
74   PyPerf       spectral_norm           2000         54.65    60.80    4.30     2.40
75   PyPerf       spectral_norm           5000         345.32   364.18   13.66    11.93
76   Shed Skin    ant                     200          11.55    10.97    4.11     2.02
77   Shed Skin    ant                     300          25.13    25.45    5.64     3.59
78   Shed Skin    ant                     350          34.50    33.93    6.68     4.69
79   Shed Skin    connect4                6 6          19.02    18.49    3.37     1.52
80   Shed Skin    connect4                6 7          23.07    22.99    3.77     1.93
81   Shed Skin    connect4                7 7          79.48    78.84    7.20     5.31
82   Shed Skin    convexHull              10000        13.27    13.13    3.31     1.16
83   Shed Skin    convexHull              20000        39.79    42.37    4.24     2.11
84   Shed Skin    dijkstra                5000         12.80    13.33    4.38     1.38
85   Shed Skin    dijkstra                8000         36.87    33.97    6.30     4.15
86   Shed Skin    genetic                 200          16.56    16.85    5.38     2.35
87   Shed Skin    genetic                 300          24.27    24.61    6.43     3.33
88   Shed Skin    kanoodle                *            14.84    0.88     3.10     0.77
89   Shed Skin    kmeans                  100000       11.42    9.09     3.45     0.74
90   Shed Skin    kmeans                  500000       41.12    38.09    4.81     2.75
91   Shed Skin    mandelbrot              1000 1000    11.04    10.62    2.88     0.95
92   Shed Skin    mandelbrot              2000 1000    36.92    36.42    3.60     1.87
93   Shed Skin    mandelbrot              3000 250     21.95    22.11    3.05     1.22
94   Shed Skin    mao                     10           29.81    31.14    5.08     2.02
95   Shed Skin    mao                     8            19.49    18.79    4.31     0.69
96   Shed Skin    mao                     9            24.09    23.73    4.71     0.82
97   Shed Skin    oliva                   1000 1000    7.54     7.07     2.98     0.98
98   Shed Skin    oliva                   2000 2000    28.26    27.80    3.84     1.48
99   Shed Skin    oliva                   3000 3000    64.27    65.90    5.97     3.44
100  Shed Skin    oliva                   4000 4000    113.57   114.04   9.38     6.69
101  Shed Skin    path_tracing            20 100 100   28.82    28.48    3.80     1.59
102  Shed Skin    path_tracing            3 500 500    108.91   108.38   6.00     3.65
103  Shed Skin    path_tracing            5 320 240    55.32    54.08    4.48     2.28
104  Shed Skin    pygmy                   120 120      17.92    17.58    3.82     1.10
105  Shed Skin    pygmy                   150 150      27.47    28.53    3.81     1.19
106  Shed Skin    pygmy                   200 200      47.87    47.37    4.04     1.35
107  Shed Skin    sieveOfAtkin            100000000    15.92    15.84    9.56     6.84
108  Shed Skin    sieveOfAtkin            250000000    39.50    39.32    21.50    17.43
109  Shed Skin    sieveOfAtkin            500000000    80.26    80.72    41.24    34.37
110  Shed Skin    sieveOfEratostenes      100000000    7.37     7.02     4.35     2.82
111  Shed Skin    sieveOfEratostenes      250000000    17.98    17.76    7.61     5.95
112  Shed Skin    sieveOfEratostenes      500000000    35.28    38.18    14.04    11.91
113  Shed Skin    tictactoe               100          58.00    62.43    27.96    24.21
114  Shed Skin    voronoi                 10000        51.60    53.87    3.79     1.56
115  Shed Skin    voronoi                 2000         10.70    10.27    3.08     0.92
116  Shed Skin    voronoi                 5000         25.90    27.10    3.31     1.13
117  Shed Skin    yopyra                  *            14.11    15.11    2.74     0.62

Table A.1: All benchmarks used in this thesis. CBG = Computer Benchmarks Game, DA = Discrete Algorithms, CP = CPython, PP = PyPy, SU = Startup, SS = Steady-State. Values in the final four columns are the absolute execution timings.
