A High-Performance, Parallel Virtual Machine for Python

Research Collection Doctoral Thesis A High-Performance, Parallel Virtual Machine for Python Author(s): Meier, Remigius Publication Date: 2019 Permanent Link: https://doi.org/10.3929/ethz-b-000371019 Rights / License: Creative Commons Attribution-ShareAlike 4.0 International This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library Diss. ETH No. 26030 AHIGH-PERFORMANCE, PARALLELVIRTUALMACHINE FORPYTHON A dissertation submitted to attain the degree of DOCTOROFSCIENCES of ETHZURICH (Dr. sc. ETH Zurich) presented by REMIGIUSJONASMEIER MSc ETH CS, ETH Zürich born on 02.05.1988 citizen of Gempen SO accepted on the recommendation of Prof. Dr. Thomas R. Gross Prof. Dr. Dr. h.c. Hanspeter Mössenböck Prof. Dr. Martin Schulz 2019 Remigius Jonas Meier: A High-Performance, Parallel Virtual Machine for Python, © 2019 ACKNOWLEDGMENTS I would rst like to thank my advisor, Professor Thomas Gross, who gave me the opportunity to conduct this research. His guidance and support throughout the years at the Laboratory for Software Technology promoted my growth not only as a researcher, but as a person. I am deeply grateful for everything that I have learned from him. My heartfelt gratitude goes to the external co-examiners, Professor Mössenböck and Professor Schulz. From them, I received many comments that helped to greatly improve this dissertation, and I appreciate that they took the time to travel to Zürich for my defense. Many thanks to the people of the PyPy project. Back in 2012, it was a project proposal from Armin Rigo that sparked my interest in the topic of this dissertation. Since then, I was fortunate to work together with Armin on many occasions. I thank him for his invaluable support, his ideas, and his contributions throughout this endeavor. To my fellow PhD students at ETH Zürich, many thanks for the great time and all the support. Due to our all-too-common, deep, and intense discussions — and my inability to withdraw from them early enough — I probably spent the most time over the years with the LLT group members Aristeidis, Luca, and Michael. It was fun. Nobody is more important to me than my friends and, especially, my family. It is because of their support that I managed to overcome all obstacles during my PhD and life in general. I thank them especially for their patience during the months of writing this dissertation and hope to nally see them all more often again. Thank you very much, everyone! Remigius Jonas Meier Zürich, September 3, 2019 iii CONTENTS acknowledgments iii abstract ix zusammenfassung xi 1 introduction 1 1.1 Thesis Statement . 2 1.2 Overview and Contributions . 3 2 problem statement 5 2.1 Python’s Concurrent Execution Model . 6 2.1.1 Serialization Semantics . 7 2.1.2 Sequential Consistency . 9 2.1.3 Atomic Operations . 10 2.2 Challenges for an Ecient, Parallel Virtual Machine . 11 2.2.1 Implementation Languages . 11 2.2.2 Fine-Grained Locking . 13 2.2.3 Just-in-Time Compilation and Concurrency . 16 2.3 Summary . 20 2.4 Related Work . 21 3 approach 23 3.1 The RPython Approach . 23 3.1.1 Specifying Sequential Semantics . 24 3.1.2 Specifying Concurrency Semantics . 26 3.1.3 Observations . 28 3.2 Parallel RPython ...................... 30 3.2.1 The Execution Model . 30 3.2.2 Specifying Concurrency Semantics with Parallel RPython ....................... 33 3.3 The Design of a Concurrency Control Component . 36 3.3.1 Parallel Execution for Quanta . 37 3.3.2 A Parallel Implementation . 40 3.3.3 External Code . 42 3.3.4 External Code and the Concurrency Control Com- ponent . 46 3.4 Summary . 48 3.5 Related Work . 49 v vi contents 4 realization of parallel rpython 51 4.1 Challenges for an stm Component . 51 4.2 RPython’s jit ......................... 54 4.3 Design of an stm System . 56 4.3.1 Terminology . 56 4.3.2 Validation . 57 4.3.3 Aborting . 58 4.3.4 Atomic Commit and Parallel Worlds . 58 4.3.5 Key Design Decisions . 59 4.4 Ecient Parallel Worlds . 60 4.4.1 State-Sharing Worlds . 60 4.4.2 Page Sharing . 63 4.5 Page Management . 65 4.5.1 Challenges . 65 4.5.2 On-Demand Page Mapping . 68 4.5.3 Page Reconstruction . 70 4.5.4 World Reconciliation . 72 4.5.5 Ecient Bookkeeping . 75 4.6 Related Work . 76 5 case study: pypy 79 5.1 Parallelization of PyPy .................... 79 5.2 The Impact of Conicts . 81 5.3 Inline Caching . 82 5.4 Built-in Data Types . 84 5.5 Findings . 85 6 evaluation 87 6.1 Python vm Representatives . 88 6.2 Benchmarks . 89 6.3 Host Environment . 91 6.4 Methodology . 92 6.5 Speculation Overhead . 93 6.6 Performance Classication . 95 6.6.1 Parallel Performance Levels . 97 6.7 Comparison with Increased Workloads . 101 6.7.1 Scalability Comparison . 104 6.7.2 Scalability and the World Limit . 106 6.8 Summary . 110 7 conclusions 113 7.1 Implications . 115 contents vii 7.1.1 Implications for Python . 117 7.2 Closing Remarks . 118 a appendix 119 a.1 Jython Atomicity Violation . 119 a.2 IronPython Atomicity Violation . 119 bibliography 120 ABSTRACT Today’s hardware is increasingly parallel, and modern programming languages must thus allow a programmer to use this parallelism eectively. For languages that depend on a virtual machine (vm), it is the responsibility of the vm to execute code eciently and in parallel. In the case of Python, a popular dynamic language, several vms exist, but none of them can deliver high performance and parallel execution at the same time. The reason for the lack of such a high-performance, parallel vm for Py- thon lies with Python’s concurrency semantics, which is based on a strong memory model and atomic operations on large entities. Such a concurrency semantics does not map well to modern hardware architectures, which is why parallel vms must emulate this semantics under parallel execution through expensive synchronization. This dissertation introduces a new approach to the design and construc- tion of high-performance, parallel vms with a concurrency semantics such as Python’s. The introduction of large-scale atomicity to the implementation language of a vm lets a vm developer specify the concurrency semantics independently from the vm’s synchronization mechanism. Thereby, the used synchronization mechanism becomes an exchangeable vm component. For the high-performance execution of Python code, just-in-time compilation is an essential concern. Unfortunately, Python’s strong memory model inhibits basic compiler optimizations under concurrency. Hence, to allow a compiler to optimize eectively, the concept of Parallel Worlds is introduced. Parallel Worlds transparently isolate concurrent computations from each other, and thereby allow for eective optimizations under the assumption of no concurrency. The transparent isolation of Parallel Worlds is supported by a special-purpose software transactional memory system (stm). Apart from isolation, this stm is the key enabler for the ecient parallel execution of Python code. Parallel execution builds on the speculative execution capability of the stm. The product of this dissertation is PyPy-stm, a high-performance, parallel vm for Python. With PyPy-stm, multi-threaded Python programs can take advantage of the parallelism in modern hardware. On a set of benchmark programs, PyPy-stm outperforms established Python vms such as CPython, ix x contents Jython, IronPython, and PyPy. Compared with PyPy, the current top-of- class in program performance, PyPy-stm achieves speedups in the range of 1:5 to 8:0× with 8 threads available. These results conrm the viability of the approach and show that PyPy-stm deserves the designation as a high-performance, parallel vm for Python. ZUSAMMENFASSUNG Die heutigen Rechnerarchitekturen arbeiten zunehmend parallel, weshalb es moderne Programmiersprachen den Programmiererinnen und Program- mierern ermöglichen sollten, diese Parallelität eektiv zu nutzen. Bei Spra- chen, die auf eine virtuelle Maschine (vm) setzen, ist es die Aufgabe dieser vm, Programme ezient und parallel auszuführen. Im Falle von Python, einer weit verbreiteten dynamischen Programmiersprache, gibt es zwar bereits verschiedene vms, von diesen kann aber keine zugleich parallele Ausführung und hohe Ausführungsgeschwindigkeit bieten. Der Grund für das Fehlen einer solchen performanten, parallelen vm für Python liegt in der Nebenläugkeitssemantik von Python, welche auf einem strikten Speichermodell und atomaren Operationen auf grossen Entitäten basiert. Eine solche Nebenläugkeitssemantik lässt sich nicht direkt auf modernen Rechnerarchitekturen umsetzen, weshalb parallele vms diese Semantik bei nebenläuger Ausführung durch aufwändige Synchronisation genau nachbilden müssen. Diese Dissertation stellt einen neuen Ansatz für das Design und die Konstruktion von performanten, parallelen vms mit einer Nebenläugkeits- semantik wie der von Python vor. Die Einführung von ächendeckender Atomizität in der Implementierungssprache einer vm ermöglicht es bei der vm-Entwicklung die Semantik der Nebenläugkeit unabhängig des Synchro- nisationsmechanismus der vm zu denieren. Der verwendete Mechanismus wird dabei zu einer austauschbaren Komponente der vm. Eine wesentliche Bedeutung für die hohe Ausführungsgeschwindigkeit von Python-Programmen kommt der Just-in-Time-Kompilierung zu. Doch das strikte Speichermodell von Python verhindert bei nebenläuger Aus- führung grundlegende Optimierungen des Kompilierers. Um Kompilierern trotzdem das wirksame Optimieren unter diesen Bedingungen zu ermögli- chen, wird das Konzept der Parallelwelten eingeführt. Parallelwelten isolieren nebenläuge Berechnungen transparent von- einander. So bleiben Optimierungen wirksam, da sie unter der Annahme arbeiten können, dass keine nebenläuge Ausführung stattndet. Die trans- parente Isolierung von Parallelwelten wird durch einen softwarebasierten transaktionalen Speicher (stm) unterstützt. Neben der Isolation ist dieser stm der Schlüssel zur ezienten parallelen Ausführung von Python- xi xii contents Programmen. Parallele Ausführung wird durch die Fähigkeit des stms zu spekulativer Ausführung ermöglicht. Das Ergebnis dieser Arbeit ist PyPy-stm, eine performante, parallele vm für Python.

A High-Performance, Parallel Virtual Machine for Python

Analysis of GPU-Libraries for Rapid Prototyping Database Operations

Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++

Scalable and High Performance MPI Design for Very Large

DNA Scaffolds Enable Efficient and Tunable Functionalization of Biomaterials for Immune Cell Modulation

A Cilk Implementation of LTE Base-Station Up- Link on the Tilepro64 Processor

Library for Handling Asynchronous Events in C++

Intel Threading Building Blocks

Lecture 15-16: Parallel Programming with Cilk

SYCL – a Gentle Introduction

High Performance Integration of Data Parallel File Systems and Computing

Trisycl Open Source C++17 & Openmp-Based Opencl SYCL Prototype

Optimizing Apache Spark* to Maximize Workload Throughput