Quantitative Overhead Analysis for Python

Mohamed Ismail and G. Edward Suh Cornell University Ithaca, NY, USA {mii5, gs272}@cornell.edu

Abstract—Dynamic programming languages such as Python are becoming increasingly popular, yet they often show a significant performance slowdown compared to static languages such as C. This paper provides a detailed quantitative analysis of the overhead in Python without and with just-in-time (JIT) compilation. The study identifies a new major source of overhead, C function calls, for the Python interpreter. Additionally, we study the interaction of the run-time with the underlying processor hardware and find that the performance of Python with JIT depends heavily on the cache hierarchy and memory system. We find that proper nursery sizing is necessary for each application to optimize the trade-off between cache performance and garbage collection overhead. Although our study focuses on Python, we show that our key findings can also apply to other dynamic languages such as Javascript.

I. INTRODUCTION

As software becomes more complex and the costs of developing and maintaining code increase, dynamic programming languages are becoming more desirable alternatives to traditional static languages. Dynamic languages allow programmers to express more functionality with less code. In addition, run-time checks and memory management are built-in, limiting the possibility of low-level program bugs such as buffer overflows. Dynamic languages such as Javascript, Python, PHP, and Ruby consistently rank in the top ten most popular languages across multiple metrics [1]–[3]. These dynamic languages are increasingly utilized in production environments in order to bring new features quickly.

Unfortunately, programs written in a dynamic language often execute significantly slower than an equivalent program written in a static language, sometimes by orders of magnitude. Therefore, the performance overhead represents a major cost of using dynamic languages in high-performance applications. Ideally, companies with enough time and resources may rewrite performance-critical portions of code in faster static languages when they are mature. For example, Twitter initially used Ruby on Rails to build their infrastructure and reported a 3x reduction in search latencies after rewriting it in Java [4]. However, porting code is an expensive proposition, and just-in-time (JIT) compilation is often used as a lower-cost alternative to improve the performance of dynamic language programs.

In this paper, we provide a quantitative study on the sources of performance overhead in Python, a popular dynamic language. The overhead of a dynamic language can come from multiple aspects of the language design space. This study explores three different aspects of the overhead to provide a more comprehensive view. First, at the language level, some features of the dynamic language may lead to inherent inefficiency compared to static languages. Second, a language run-time also adds overhead to dynamic languages compared to statically compiled code. We break down Python execution time into language and run-time components as well as core computations to understand overhead sources. Finally, at the hardware level, we study how the dynamic language features impact microarchitecture-level performance through instruction-level parallelism, branch prediction, and memory access characteristics. We compare CPython [5], an interpreter-only design, with PyPy [6], a JIT-based design, to understand the microarchitecture-level differences between the run-time implementations.

The study is broken into two main parts. In the first part, we look at the language and run-time features of Python to understand which aspects of the language and run-time add additional overhead compared to C, the baseline static language. By annotating instructions at the interpreter level, we can generate breakdowns for a large number of benchmarks. In addition to the sources of overhead already identified by previous work, we find that C function calls represent a major source of overhead that has not been previously identified.

In the second part, we study the interaction of the run-time with the underlying processor microarchitecture. We find that both CPython and PyPy exhibit low instruction-level parallelism. Using PyPy with JIT helps decrease sensitivity to branch predictor accuracy, but increases sensitivity to cache and memory configurations. In particular, we find that the generational garbage collection used in PyPy introduces an inherent trade-off between cache performance and garbage collection overhead. Frequent allocation of objects in dynamic languages increases pressure on the memory hierarchy. However, increasing the garbage collection frequency to improve cache performance can lead to high garbage collection overhead. Our study shows that the optimal nursery size depends upon application characteristics as well as run-time and cache configurations. If the nursery is sized considering the trade-off between cache performance and garbage collection overhead, then there can be significant improvements in program performance.

While we focus primarily on Python for most of our studies, we believe that the main results from our studies are applicable to other dynamic languages as well. For a subset of our findings, we show that the main lessons also apply to V8 [7], a high-performance run-time for Javascript.

The major contributions of this paper include:
1) A comprehensive breakdown study of the CPython interpreter execution time for a large number of benchmarks.
2) Microarchitectural parameter sweeps to better understand which aspects of hardware designs affect the performance of both the interpreter-only CPython and PyPy with and without JIT.
3) Analysis of the trade-off of cache performance and garbage collection time for PyPy.

We gain the following new insights regarding the opportunities to improve the performance of Python and other dynamic languages:
1) We find that C function calls represent a major source of overhead not previously identified.
2) Our microarchitectural study shows that dynamic languages exhibit low instruction-level parallelism, but the presence of JIT lowers sensitivity to branch predictor accuracy and increases sensitivity to memory system performance.
3) We find that nursery sizing has a large impact on dynamic language performance and needs to be done in an application-specific manner, considering the trade-off between cache performance and garbage collection overhead, for the best result.

Our paper is organized as follows. Section II introduces background on run-time designs. Section III explains our experimental setup. Section IV discusses our study on the sources of overhead for Python. Section V analyzes the interaction between the run-time and the underlying hardware. Section VI discusses related work, and Section VII concludes the paper.

II. BACKGROUND ON RUN-TIME DESIGN

In this section, we present some background on essential aspects of dynamic languages. Unlike static languages, most dynamic languages translate source code at run-time to an intermediate representation. The language run-time interprets the intermediate representation to execute the program. Since interpretation is slow, just-in-time compilation can be used to further compile the intermediate representation to machine code. Regardless of the execution strategy, automatic memory management ensures that memory is allocated for objects as needed without explicit calls by the programmer. Garbage collection amortizes the cost of freeing memory for dead objects.

A. Basic Interpreter Design

A basic interpreter reads a series of bytecodes and executes the necessary actions for each bytecode. Figure 1 shows the stages of interpreter execution for CPython, which is implemented in C as a stack-based virtual machine (VM) with some enhancements to support language-specific constructs. The VM uses opptr to index into the bytecode array (co_code) and retrieve the appropriate bytecode. The bytecode is then decoded using a switch-case construct. Data is read from the stack or other storage variables. The operation specified by the bytecode is then executed using the read data as operands. Some error-checking code ensures that the execution completed successfully. Finally, data is written back to the stack or other storage variables. opptr is updated and the process repeats until the program completes.

Fig. 1: Overview of the CPython virtual machine architecture (Dispatch, Decode, Read, Execute, Error Check, and Write stages operating on the bytecode array co_code, the VM stack, and per-frame state such as globals, locals, co_name, and co_consts).
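To make the stages in Figure 1 concrete, the following is a minimal sketch of a switch-based dispatch loop in C. It is not CPython's actual main loop (which lives in ceval.c and handles hundreds of opcodes, reference counting, and richer error handling); the opcode names, frame layout, and two-byte instruction format are illustrative assumptions only.

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcodes and VM state; CPython's real loop is far richer. */
enum { OP_LOAD_CONST, OP_BINARY_ADD, OP_RETURN };

typedef struct {
    const uint8_t *co_code;   /* bytecode array     */
    const long    *co_consts; /* constant pool      */
    long           stack[64]; /* VM operand stack   */
    int            sp;        /* top-of-stack index */
} Frame;

static long run(Frame *f)
{
    int opptr = 0;                            /* dispatch: bytecode index      */
    for (;;) {
        uint8_t op  = f->co_code[opptr++];    /* fetch the next bytecode       */
        uint8_t arg = f->co_code[opptr++];    /* one-byte argument             */
        switch (op) {                         /* decode                        */
        case OP_LOAD_CONST:                   /* const load: read and push     */
            f->stack[f->sp++] = f->co_consts[arg];
            break;
        case OP_BINARY_ADD: {                 /* read, execute, error check, write */
            long rhs = f->stack[--f->sp];
            long lhs = f->stack[--f->sp];
            if ((rhs > 0 && lhs > LONG_MAX - rhs) ||
                (rhs < 0 && lhs < LONG_MIN - rhs)) {   /* error check: overflow */
                fprintf(stderr, "overflow\n");
                return -1;
            }
            f->stack[f->sp++] = lhs + rhs;
            break;
        }
        case OP_RETURN:
            return f->stack[--f->sp];
        default:
            return -1;                        /* unknown opcode                */
        }
    }
}

int main(void)
{
    const uint8_t code[]   = { OP_LOAD_CONST, 0, OP_LOAD_CONST, 1,
                               OP_BINARY_ADD, 0, OP_RETURN, 0 };
    const long    consts[] = { 40, 2 };
    Frame f = { code, consts, {0}, 0 };
    printf("%ld\n", run(&f));                 /* prints 42 */
    return 0;
}

Even in this stripped-down form, every bytecode pays for the fetch and decode steps, the stack traffic, and the error check surrounding a single useful ALU operation, which is exactly the kind of overhead quantified in Section IV.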
B. Just-in-Time Compilation

Just-in-time compilation can optimize run-time performance by converting interpreted bytecode to machine code. In addition, run-time information about object types and values can be used to perform additional optimizations that cannot be done ahead-of-time. Running the just-in-time compiler during run-time is relatively expensive, and the cost of compilation must be amortized by the performance improvement in the compiled code. For this reason, JIT compilers focus on frequently executed code, such as frequently executed loops or functions. As shown in Figure 2, counters are used to track the number of times that loops or functions execute. Once a counter reaches a threshold, the loop or function is considered a good candidate for compilation. An additional profiling stage collects information for the compiler optimizations. The code is then compiled and the machine code is executed in place of the interpreted bytecode.

To generate optimized code, the compiler makes assumptions about variable types and values, and it inserts guards to check whether those assumptions are valid during the execution. If there is a failed guard, the compiled state is rolled back to a valid interpreted state and the bytecode interpreter continues execution. This is called deoptimization and is a relatively expensive operation that could affect overall performance if it occurs too frequently. Some additional steps can be added to the JIT process to better handle guard failures and optimize a portion of a function or loop if a certain guard continues to fail.

Fig. 2: Steps in just-in-time compilation (1. bytecode interpreter execution, 2. profiling once a loop or function counter passes a threshold, 3. compilation, 4. compiled code execution, 5. deoptimization and fallback to the interpreter on a guard failure).

C. Generational Garbage Collection

In a language with automatic memory management, garbage collection is often used to free memory from objects that are no longer in use. The process of determining which objects are live and which are not incurs non-trivial performance overhead. In order to amortize this overhead, garbage collection should be run at infrequent intervals.

To determine which objects are live, the garbage collector will start from a set of root pointers and follow them. This step is often called tracing. It will then follow additional pointers that it encounters in the process. Objects with pointers pointing to them are live, while objects without any valid pointers pointing to them are dead and can be collected. In a managed language, the run-time has knowledge of all pointers in the system, unlike low-level languages such as C or C++, where any variable can be dynamically cast to represent a pointer [8]. Once the live and dead objects are differentiated, the garbage collector will free the memory corresponding to dead objects.

Generational garbage collection [9] is an optimized form of garbage collection that is used in many high-performance implementations of modern languages. The memory is separated into subspaces based on object age, and different garbage collection algorithms can run on the different subspaces. In the simplest implementation of generational garbage collection, there is one subspace for young objects, sometimes called a nursery, and another subspace for old objects. Objects are allocated in the nursery and are moved to the old subspace if they survive long enough. Figure 3 shows how generational garbage collection works.

Efficient generational garbage collection relies on the assumption that most objects in a program die young. Therefore, a copying garbage collector [10] can efficiently move a small number of surviving objects from the nursery to the old subspace. Once an object is in the old subspace, a slower garbage collector, such as a mark-sweep [11] collector, can run less frequently. This can be extended to any number of subspaces based on age. Real implementations of generational garbage collection add variations to this general scheme. For example, the PyPy collector runs the mark-sweep collector incrementally in the old subspace [12].

Fig. 3: Graphical depiction of how generational garbage collection works (surviving objects are copied from the young space to the old space, while a full GC frees dead objects in the old space).
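The sketch below illustrates the nursery idea with a bump-pointer allocator in C. It is a simplified model under stated assumptions, not PyPy's actual collector: the nursery size constant, the alignment rule, and the stubbed-out minor collection are all hypothetical.

#include <stddef.h>
#include <string.h>

#define NURSERY_SIZE ((size_t)4 << 20)   /* assumed 4 MB nursery, a tunable knob */

static char  nursery[NURSERY_SIZE];
static char *bump = nursery;             /* next free byte in the nursery        */

/* In a real collector this would trace roots and copy survivors to the old
 * space; here it simply resets the bump pointer to keep the sketch minimal. */
static void minor_collect(void) { bump = nursery; }

/* Allocation is just a pointer bump, so new objects land in whatever cache
 * lines the nursery currently occupies. A larger nursery means fewer minor
 * collections but touches more distinct lines between collections. */
static void *gc_alloc(size_t nbytes)
{
    nbytes = (nbytes + 7u) & ~(size_t)7u;            /* 8-byte alignment        */
    if (nbytes > NURSERY_SIZE)
        return NULL;                                 /* oversized: not handled  */
    if (bump + nbytes > nursery + NURSERY_SIZE)
        minor_collect();                             /* nursery full: minor GC  */
    void *obj = bump;
    bump += nbytes;
    memset(obj, 0, nbytes);                          /* initialization writes   */
    return obj;
}

int main(void)
{
    for (long i = 0; i < 1000000; i++) {
        long *p = gc_alloc(sizeof(long));            /* many short-lived objects */
        *p = i;
    }
    return 0;
}

Because allocation walks linearly through a contiguous region, a nursery that fits in the last-level cache keeps these initializing writes cache-resident, while a larger nursery reduces how often minor_collect must run; Section V-B measures this trade-off.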
III. EXPERIMENTAL SETUP

We run our experiments on an infrastructure based on Pin [13]. The Pin framework allows us to instrument the various run-times at both the instruction level and the function level without having to modify the source code and without affecting the instructions executed by the program. We develop Pin tools to capture dynamic instruction counts and other statistics needed for our analysis.

To get cycle count estimates for a variety of memory hierarchies and core configurations, we interface our Pin tools with ZSim [14], a fast x86-64 simulator built on Pin. We run ZSim with a configuration that mimics an Intel Skylake processor. The details of the configuration are shown in Table I. We use an out-of-order core model (OOO) for most of our experiments. For the sources of overhead experiments, we use the simple core model to be able to accurately map individual instructions to their cycle contributions. We assume that each of the four physical cores has one-quarter of the 8MB shared L3 cache available for use. We use DRAMSim2 [15] integrated with ZSim to model DDR4-2400 memory.

TABLE I: ZSim configuration.
Core: 4-way OOO, 16B fetch, 3.40 GHz; 2-level 2-bit branch predictor with 2048x18b L1 and 16384x2b L2 tables; 224-entry ROB, 72-entry load queue, 56-entry store queue
L1I: 64 kB, 8-way, 4-cycle latency
L1D: 64 kB, 8-way, 4-cycle latency
L2: 256 kB, 4-way, 12-cycle latency
L3: 2 MB, 16-way, 42-cycle latency
Memory: 16 GB DDR4-2400, 173-cycle latency

For run-times, we use CPython 2.7.10 as our Python interpreter, built with the standard compiler optimization flags (-O3), and PyPy 5.3.1 as our JIT-based run-time for Python. We use 48 benchmarks gathered from the official Python performance benchmark suite [16] and from the PyPy benchmark suite. The designers of the official Python performance benchmark suite say that it focuses on real-world benchmarks, using whole applications when possible, rather than synthetic benchmarks. We warm up each benchmark by running it 2 times, followed by running it 3 times for evaluation.

For some experiments, we additionally use Google V8 4.2.0, a popular high-performance Javascript run-time that uses JIT compilation. For test programs, we use 37 benchmarks from JetStream [17], which "combines a variety of JavaScript benchmarks, covering a variety of advanced workloads and programming techniques," including SunSpider, Octane, and LLVM. We run each benchmark 3 times for evaluation.

IV. SOURCES OF OVERHEAD

In this section, we perform a quantitative study on the overhead of Python compared to static languages such as C. We first categorize various sources of overhead, describe our methodology to experimentally measure a breakdown of Python execution time, and discuss the main findings. The results in this section are reported for CPython [5], the official Python interpreter. We also explore how some of the findings are applicable to both PyPy and V8.

A. Overhead Breakdown

Table II shows the overhead sources that we identify in this study through a careful review of language features as well as the CPython source code. The overhead categories can be placed into three groups. The language features of Python may incur overhead because they either do not exist in a static language or require additional dynamic operations. The interpreter itself also adds additional performance overhead that compiled code would not have. A majority of the features have been previously identified and evaluated either directly or indirectly (e.g. through an optimization). We list references in the table to the previous work which evaluates the features. In addition, we have identified three new overhead categories not evaluated in previous work. The different components and overhead categories are described further in the following subsections.
TABLE II: Sources of performance overhead for Python.

Additional Language Features:
- Error check: check for overflow, out-of-bounds, and other errors (NEW)
- Garbage collection: automatically freeing unused memory (studied in [18], [19])
- Rich control flow: support for more condition cases and control structures (studied in [19], [20])

Dynamic Language Features:
- Type check: checking variable type to determine the operation (studied in [18], [21])
- Boxing/unboxing: wrapping or unwrapping integer or float types (studied in [18], [21])
- Name resolution: looking up a variable in a map (studied in [20])
- Function resolution: dereferencing function pointers to perform an operation (studied in [20])
- Function setup/cleanup: setting up for a function call and cleaning up when finished (studied in [18]–[20])

Interpreter Operations:
- Dispatch: reading and decoding a bytecode instruction (studied in [20], [22])
- Stack: reading, writing, and managing the VM stack (studied in [20], [21])
- Const load: reading constants (studied in [20])
- Object allocation: inefficient deallocation followed by allocation of objects (studied in [19])
- Reg transfer: calculating the address of VM storage (NEW)
- C function call: following the C calling convention in the interpreter (NEW)

1) Additional Language Features: This category consists of language features that do not exist in static languages such as C. The error check overhead comes from run-time checks that Python performs to guarantee safety and robustness. After an operation, Python performs checks such as an overflow check on int types and bound checks on list types. The garbage collection overhead comes from operations for run-time garbage collection, such as maintaining reference counters and freeing memory. The rich control flow overhead results from checking various conditions in the case of richer evaluation of conditions, or from managing the block stack in the case of support for more control structures.

2) Dynamic Language Features: This category captures language features that exist in C but require additional run-time operations in Python. A majority of these features are managed statically in C at compile time. Setting up a function call and cleaning up on a return is done dynamically in C through the calling convention, but it requires significantly more computation in Python. The overheads in this group would still be present even if Python programs were compiled ahead-of-time because the compiler lacks the necessary run-time information. Python uses dynamic typing, so the types of variables and where they are allocated are not known until run-time. Python cannot resolve types of variables statically because they are not explicitly given in the program. In addition, variables with unknown types cannot be allocated statically, so the locations of variables are only known dynamically.

The type check overhead relates to all checks the interpreter must perform to determine the type of a variable. In Python, there is usually a check for the variable type before an operation is performed on the object. The boxing and unboxing overhead relates to reading integer and float primitive values from the object and writing these primitive values back to the object. These primitives would normally be stored in machine registers for a C program, but are represented as objects with type information in Python. For example, in an add operation, the values of the two variables to be added will be read from the corresponding objects. The sum will be computed and will be written back to another object representing the sum.

The name resolution overhead relates to looking up the variable pointer in a map by using the variable name as the key. Types of global variables are not known and they can be created and destroyed dynamically, so Python uses maps to store pointers to the variables. The function resolution overhead relates to the dereferencing of function pointers. Functions in Python are first-class objects that can be created and destroyed, so Python stores function pointers for common operations related to an object.

The function setup/cleanup overhead relates to setting up a call to a function and cleaning up on a return. In order to set up a call, Python needs to determine the function type (both Python and C functions are supported). If it is a C function, then the inputs are passed in through the C extension interface and the output needs to be returned. If it is a Python function, an execution frame for the function needs to be allocated. Functions that take variable arguments require special attention. Once the function returns, Python needs to deallocate the frame and pass the return value to the caller.
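As a rough illustration of what the type check, boxing/unboxing, error check, and object allocation categories mean at the machine level, the C sketch below spells out the work hidden behind a single dynamic add. The Obj layout and helper names are hypothetical and much simpler than CPython's PyObject machinery.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical boxed object; CPython's object headers carry more fields. */
typedef enum { T_INT, T_FLOAT } Type;
typedef struct { Type type; long ival; double fval; } Obj;

static Obj *box_int(long v)                 /* object allocation + boxing  */
{
    Obj *o = malloc(sizeof *o);
    o->type = T_INT;
    o->ival = v;
    return o;
}

/* Everything a dynamic "a + b" pays for: type checks, unboxing, the add
 * itself, an overflow (error) check, and boxing of the result. In C, the
 * same statement compiles to one add on values already held in registers. */
static Obj *dynamic_add(const Obj *a, const Obj *b)
{
    if (a->type != T_INT || b->type != T_INT)    /* type check     */
        return NULL;                             /* fallback path  */
    long x = a->ival, y = b->ival;               /* unboxing       */
    if ((y > 0 && x > LONG_MAX - y) ||
        (y < 0 && x < LONG_MIN - y))             /* error check    */
        return NULL;
    return box_int(x + y);                       /* execute + box  */
}

int main(void)
{
    Obj *a = box_int(2), *b = box_int(3);
    Obj *c = dynamic_add(a, b);
    printf("%ld\n", c->ival);                    /* prints 5       */
    free(a); free(b); free(c);
    return 0;
}

The same source-level statement in C needs none of this bookkeeping, which is why these categories appear as pure overhead relative to the static baseline.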
3) Interpreter Operations: In addition to the categories relating to language features, there are categories related to the overhead of running the interpreter itself. These relate to the cost of emulating a virtual machine on a physical machine. The dispatch overhead relates to reading the bytecode and decoding it to perform the correct operation. This includes the execution of the dispatch loop and a switch statement for decoding.

CPython is a stack-based virtual machine. The stack overhead relates to operations for managing the stack. Operations read from the stack and write to the stack. The stack is local storage for the VM, similar to the register file for the CPU. The stack is not meant to store program state, but acts as local storage for intermediate values. There are some bytecodes for explicitly managing the stack, such as DUP_TOP, which duplicates the top entry. In addition to the stack, there are data structures which store constant values. The const load overhead is the overhead of loading constants onto the stack. Constants are stored in the co_consts array. The values first need to be loaded onto the stack before they are used by other bytecodes.

In the interpreter implementation, there are certain objects that could be reused but are instead deallocated and reallocated. The object allocation overhead captures the case where an object is deallocated and then reallocated. For example, most method frames are allocated during execution of the method and deallocated when it finishes. In addition, arithmetic operations take operands from the stack and generate a new value. When the operation completes, the original operands are deallocated and a new object is allocated for the value.

Since CPython is written in C, there may be additional inefficiencies introduced by the compiler. The C function call overhead captures the additional cost of setting up and cleaning up C functions in the interpreter. This includes the cost of creating and destroying stack frames and performing the call. The use of C functions to write well-refactored code results in many function calls per bytecode instruction. These calls cannot be inlined in most cases because function pointers are used.

When reading a VM data structure, such as the stack, the CPU will first load the address of the data structure into machine registers. Then it will compute the effective address of the Python variable (e.g. the top of the stack). Finally, it will load the Python variable into a machine register. This additional step of finding the data structure of the VM and loading it into machine registers is categorized as reg transfer.
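The reg transfer and C function call categories can be illustrated with a small C fragment in the same spirit. The frame layout and handler signature below are hypothetical, but the pattern of loading the VM stack pointer before every access and reaching handlers through function pointers (which prevents inlining) mirrors the behavior described above.

#include <stdio.h>

/* Hypothetical frame layout; CPython's frame object differs. */
typedef struct {
    long *stack_pointer;                  /* VM operand stack top */
} Frame;

typedef long (*binop_fn)(long, long);     /* handler reached via a function pointer */

static long add_op(long a, long b) { return a + b; }

/* Each bytecode performs several real C calls that cannot be inlined because
 * the target is only known through a pointer, and each VM stack access first
 * loads the frame and stack pointer into registers before touching the Python
 * value (the "reg transfer" step). */
static void binary_op(Frame *f, binop_fn op)
{
    long *sp = f->stack_pointer;   /* reg transfer: load VM stack pointer        */
    long rhs = *--sp;              /* reg transfer + stack read                  */
    long lhs = *--sp;
    *sp++    = op(lhs, rhs);       /* indirect C function call (frame setup,     */
    f->stack_pointer = sp;         /* argument passing, return) per bytecode     */
}

int main(void)
{
    long stack[4] = {0};
    Frame f = { stack };
    *f.stack_pointer++ = 40;       /* push operands */
    *f.stack_pointer++ = 2;
    binary_op(&f, add_op);
    printf("%ld\n", f.stack_pointer[-1]);   /* prints 42 */
    return 0;
}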
B. Analysis Approach

Once we define our overhead categories, our goal is to develop an analysis tool that can return the contribution of each category to the overall execution time. Our approach takes advantage of the fact that the Python program is running on a statically-compiled interpreter. We annotate each instruction of the interpreter with a category label. When running a program with our annotated interpreter, we can generate a breakdown of the time spent in each category for any Python program with no additional effort.

Our annotations must cover the execution of the whole Python program and not just the sources of overhead. Some instructions can be directly annotated with the overhead categories summarized in Table II. Other instructions are needed to execute the program and cannot be annotated with an overhead category. For example, a Python BINARY_ADD bytecode has overheads associated with type checking, unboxing, error checking, etc., but also performs an ALU add operation between the two variables. We annotate the instructions needed to execute the program with an execute label. Our analysis breakdown includes the contribution of the execute category in addition to the overhead categories.

Annotating each static instruction alone cannot provide an accurate breakdown. There are cases where a function's annotation depends on the calling function. For example, CPython uses the same dictionary lookup function both for looking up a variable in a global map and for performing a lookup operation on a map data structure used in the Python program. In the first case, the function should be annotated with the name resolution overhead category. In the other case, the function should be annotated with the execute category. We consider the call sites of these functions to support different labels.

An alternative analysis approach to quantifying the different overheads would be to start with the C code of a program and transform it into a Python program while iteratively introducing the necessary language features and implementation details. At each step, we would quantify the slowdown of introducing the additional feature. Based on this, we could better understand the performance gap between Python and C for a given program. However, this is very tedious and cannot be generalized across many programs. Similarly, we could start with a Python program and introduce more static features into the language to eliminate the dynamic run-time overhead. This would also be tedious and hard to generalize across many programs.

1) Gathering Statistics with Pin: In order to implement the analysis method, we use Pin [13] to instrument the CPython interpreter. To make our analysis more flexible, we write a Pin tool to export essential run-time statistics and perform a post-processing step to generate our breakdown. In our Pin tool, we export statistics for some of the functions in the interpreter at the instruction granularity. For these functions, we export the total execution time of each static instruction at the given PC value.

If a function can be labeled by a single category, then we export statistics at the function granularity to limit the size of the statistics files and to make annotations more feasible. In addition to the function name and total execution time, we also export the origin PC. The origin PC is the most recent PC in the call trace that belongs to a function that we are annotating at an instruction granularity.

Some categories relating to patterns in the assembly code can be automatically annotated by the Pin tool. For example, we can directly use the Pin tool to identify and categorize the assembly instructions relating to the C function call and reg transfer overheads. We export the total execution time for these categories.

2) Cycle Count Estimates Using ZSim: While instruction count may be a good first-order estimate, it does not capture microarchitectural aspects such as memory latency and branches. We interface our Pin tool with the ZSim simulator (which is also a Pin tool) to get modeled cycle counts in addition to the instruction counts. ZSim has an out-of-order core model that can model an out-of-order pipeline as well as branch mispredicts and cache misses. However, attributing cycle counts to a single instruction becomes challenging for an out-of-order core because the latency of an instruction in the pipeline can be affected by other instructions also in the pipeline.

Instead, we use the simple core model and the number of cycles each instruction takes to execute. In the simple core model, instruction latency is only affected by misses in the instruction and data caches. Otherwise, an instruction takes a single cycle. This gives us a better first-order estimate of the sources of overhead than dynamic instruction counts alone.
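A sketch of how such exported statistics can be folded into a per-category breakdown is shown below. The record and annotation formats are stand-ins invented for illustration, not the authors' actual Pin tool output; the real flow maps PCs to interpreter source lines first, as described in the post-processing step that follows.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical categories and formats for illustration only. */
enum { CAT_EXECUTE, CAT_DISPATCH, CAT_TYPE_CHECK, CAT_C_FUNC_CALL, CAT_COUNT };

typedef struct { uint64_t pc_lo, pc_hi; int category; } Annotation;
typedef struct { uint64_t pc; uint64_t cycles; } Sample;

/* Attribute each exported (PC, cycles) record to the category of the
 * interpreter code it falls in; unannotated PCs default to "execute". */
static void accumulate(const Annotation *ann, int n_ann,
                       const Sample *s, int n_s, uint64_t out[CAT_COUNT])
{
    for (int i = 0; i < n_s; i++) {
        int cat = CAT_EXECUTE;
        for (int j = 0; j < n_ann; j++)
            if (s[i].pc >= ann[j].pc_lo && s[i].pc < ann[j].pc_hi) {
                cat = ann[j].category;
                break;
            }
        out[cat] += s[i].cycles;
    }
}

int main(void)
{
    Annotation ann[] = { {0x1000, 0x1040, CAT_DISPATCH},
                         {0x1040, 0x1080, CAT_TYPE_CHECK} };
    Sample     smp[] = { {0x1004, 12}, {0x1048, 7}, {0x2000, 30} };
    uint64_t   total[CAT_COUNT] = {0};
    accumulate(ann, 2, smp, 3, total);
    printf("dispatch=%llu type_check=%llu execute=%llu\n",
           (unsigned long long)total[CAT_DISPATCH],
           (unsigned long long)total[CAT_TYPE_CHECK],
           (unsigned long long)total[CAT_EXECUTE]);
    return 0;
}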

3) Post-Processing: During post-processing, we use source line numbers in the interpreter for annotation and map PC values in the exported statistics file to source line numbers. For the functions that we annotate at the instruction granularity, we annotate each line of the CPython source code with a category. For the functions that we annotate at the function granularity, we can either annotate only by the name of the function or by both the name of the function and the origin PC if the function category is caller-specific. We need to compile CPython with the (-g) flag to be able to match source lines with PC values. After running the post-processing, we get a breakdown of categories and the execution time (in CPU cycles) for those categories.

We only need to annotate the CPython interpreter once and not for each Python program. Since all Python programs run on the interpreter, the PC values and source line number mappings for the interpreter remain the same and the annotations can be reused. As a result, we can analyze and compare a large number of Python programs with the same set of annotations.

C. Experimental Results

1) Execution Time Breakdown: Figure 4(a) shows the contributions of language features (both additional and dynamic) to the execution time. The overheads come from many categories, all adding up to a significant portion of the total execution time. Among these categories, name resolution and function setup/cleanup overheads dominate with 9.1% and 4.8% average overhead respectively. To reduce the impact of function setup/cleanup, we could inline Python functions. For name resolution, we can cache variable look-ups [20].

Figure 4(b) shows the contributions of interpreter operations to the execution time. We find that C function calls and dispatch are major contributors overall, with 18.4% and 14.2% average overhead. In previous work, dispatch has been repeatedly identified as a high source of overhead [22], [23]. However, C function calls have not been identified as a major source of overhead in the context of interpreters.

Fig. 4: Overhead breakdown for CPython as a percent of total execution time per benchmark and on average: (a) language features, (b) interpreter operations.

Some work has focused on indirect branches and calls to improve BTB performance in interpreters [24], [25]. Our analysis shows that indirect calls (but not indirect branches) account for an average of 11.9% and up to 19.0% of the C function call overhead, representing an average of 1.9% and up to 4.1% of the overall execution time. Therefore, other aspects of the C function call overhead that account for a larger portion of the execution time, such as setting up and destroying the stack frame, should also be studied and optimized.

On average, the identified overheads account for 64.9% of the overall execution time. The remaining 35.1% is used for the execution of the program. Therefore, there is at least a 2.8x increase in execution time on average moving from a C-like program to a Python program running on CPython due to language and interpreter overheads. In reality, a program written in C can run one or two orders of magnitude faster than the same program written in Python [19] because the C compiler can further optimize the program using static information about types and the memory layout of objects.

During execution, the programs spend an average of 7.0% of their overall execution time in C library code. However, benchmarks such as pickle_dict, pickle_list, regex_dna, regex_effbot, regex_v8, unpickle, and unpickle_list spend more than 64% of their time in C library code. As a result, there is very little contribution from the overhead categories. C function call overhead exists and is still significant even in the C library code.

2) Applicability to Other Run-times and Languages: Since C function call overhead can be automatically annotated by the Pin tool by detecting the instructions related to the calling convention, we extend our tool to analyze the C function call overhead of PyPy and V8. Figure 5 and Figure 6 show that this overhead is significant in these run-times as well, with 7.5% and 5.6% average overhead for PyPy and V8 respectively. The JIT compilation helps reduce some of the overhead through inlining methods and generating traces, so the contribution is less than in the interpreter. Based on our results, we believe that optimizing C function call overhead is important for achieving good performance in dynamic languages in general.

Fig. 5: C function call overhead for PyPy (percent of total execution time per benchmark).

Fig. 6: C function call overhead for V8 (percent of total execution time per JetStream benchmark).

V. HARDWARE INTERACTION

So far, we have not considered how the underlying hardware affects program behavior. In this section, we explore how the run-times interact with the underlying hardware. We first study the sensitivity of run-time performance to various microarchitectural parameters. We find that PyPy performance is sensitive to cache hierarchy and memory system parameters. Since memory management is a key contributor to cache performance, we then study the interaction of memory management with the underlying hardware.

A. Microarchitecture Parameter Sweeps

In this section, we explore the sensitivity of run-time performance to various microarchitectural parameters. We run the benchmarks on CPython and PyPy with and without JIT to see if there are differences in the sensitivity between an interpreter-based run-time and a run-time that additionally uses JIT compilation. Figure 7 shows how the CPI (cycles per instruction) changes as we sweep various microarchitecture parameters. Here, the average CPI numbers across all benchmarks are shown. The PyPy with JIT execution is additionally broken down into different phases of execution by annotating PyPy at the function granularity using Pin. For the issue width sweep, we set the fetch width to be large to prevent it from becoming a bottleneck. The fetch width sweep results are not shown but show a similar trend as issue width.

The results show that the performance of both CPython and PyPy is relatively insensitive to processor fetch and issue widths, suggesting that there is low instruction-level parallelism. The branch results indicate that merely increasing the branch table size does not improve branch prediction accuracy enough to impact performance. However, when the table is too small and prediction accuracy suffers, the interpreter-based run-times suffer more than a run-time with JIT. This indicates that JIT helps lower sensitivity to branch prediction accuracy.

On the other hand, cache and memory parameters have significant impacts on the performance of PyPy with JIT. This indicates that the JIT significantly increases pressure on the memory hierarchy. In particular, the performance depends heavily on cache sizes. Interestingly, the same programs running on the CPython interpreter and PyPy without JIT do not require a large cache. This indicates that the working set of an application itself is not large. Therefore, there is no fundamental reason why the bytecode interpreter and compiled code phases of PyPy with JIT require a large cache to run efficiently. In addition, the CPI for PyPy with JIT is greater than the CPI for CPython and PyPy without JIT. This indicates that while the JIT lowers the number of instructions executed, each instruction takes more cycles to execute due to longer average memory access latency. This is further shown by the sensitivity of PyPy with JIT to memory latency and bandwidth.

The cache line size sweep shows that PyPy with JIT benefits more from larger cache line sizes, while the interpreter run-times do not. After a closer study, the need for a large cache and large cache lines appears to come from the interaction of the memory management system with the caches. This observation introduces an interesting opportunity for performance optimization and is discussed in more detail in the subsequent section.

Figure 8 shows the results of the microarchitecture parameter sweep when the overall CPI is shown for a few of the benchmarks. The general trend is the same as in the previous figure. Yet, this figure shows that the performance impacts of microarchitecture parameter changes depend on individual application characteristics. Note that the benchmarks perform 1.4% better on average with a memory latency of 100 vs. 50 cycles. This is due to a timing anomaly in the out-of-order instruction scheduling. The sweeps for V8, another JIT-based run-time, are shown in Figure 9. They show trends similar to PyPy with JIT, indicating that the memory management interaction is important for other JIT-based run-times as well.

Fig. 7: CPI with microarchitecture parameter sweeps: (a) issue width, (b) branch table size (relative to baseline), (c) cache size, (d) cache line size (B), (e) memory latency (CPU cycles), (f) memory bandwidth (MBps). A line is shown for each run-time (CPython, PyPy without JIT, PyPy) as well as phases in PyPy execution (overall, bytecode interpreter, garbage collection, JIT compiled code).

Fig. 8: CPI with microarchitecture parameter sweeps: (a) issue width, (b) branch table size (relative to baseline), (c) cache size, (d) cache line size (B), (e) memory latency (CPU cycles), (f) memory bandwidth (MBps). Each bar shows the overall CPI for one benchmark running on PyPy.

Fig. 9: CPI with microarchitecture parameter sweeps: (a) issue width, (b) branch table size (relative to baseline), (c) cache size, (d) cache line size (B), (e) memory latency (CPU cycles), (f) memory bandwidth (MBps). The line shows the average CPI for V8.

B. Memory Management Interaction

After further study of the cache interactions in PyPy, we found that the memory management system contributes to the sensitivity in cache performance. Proper sizing of the nursery is essential to achieve good cache performance. However, sizing the nursery to improve cache performance may not lead to better overall program performance due to the increased overhead from garbage collection. In this section, we consider the interaction of the memory management system with the cache hierarchy in more detail.

Figure 10 shows the last-level cache (LLC) miss rate as a function of nursery size. When the nursery is smaller than the cache size (i.e. 2MB), new objects can be allocated directly in the cache and miss rates are low. Once the nursery is too large to fit in the cache, cache thrashing occurs and most object initializations miss in the cache. The miss rate increases significantly, by almost a factor of 2.4.

Fig. 10: LLC miss rate as a function of nursery size (512kB to 128MB).

Figure 11 shows the breakdown of the execution time normalized to the overall execution time of running with a nursery that is half the cache size (i.e. a 1MB nursery for a 2MB cache), averaged across all of the benchmarks. It shows that on average the increase in cache miss rate hurts overall performance for nursery sizes slightly larger than the cache size. However, as the nursery size increases, the garbage collector runs less frequently and the overhead due to garbage collection decreases. With a much larger nursery, the lower garbage collection overhead offsets the increase in execution time for the rest of the application (i.e. non-GC) due to the poor cache performance. These results suggest that sizing the nursery purely for good cache performance may not always result in good overall performance.

Fig. 11: PyPy execution breakdown (GC, non-GC, overall) for different nursery sizes.

The choice of run-time configuration and the amount of cache space also affect the performance trade-off. Figure 12 shows the average execution time of four configurations for the different nursery sizes, normalized to the 1MB nursery case. The first two configurations use an LLC size of 2MB without and with JIT. The next two configurations use PyPy with JIT with different LLC sizes (8MB is the on-chip shared L3 size for Skylake processors).

Fig. 12: Nursery sweep for PyPy with different run-time configurations and last-level cache sizes.

For PyPy without JIT, the average trend suggests that sizing the nursery for cache performance is beneficial for overall performance. As shown in Figure 13, this is due to the fact that the contribution of garbage collection to the overall execution time is small. By optimizing the program execution with JIT, the contribution of garbage collection increases by 4.6x, from 3% to 14% on average. As a result, sizing the nursery for cache performance can hurt the overall performance due to the larger relative garbage collection overhead. Note that although the garbage collection contribution changes significantly when using JIT, the absolute garbage collection time only increases by 5.4% on average when using JIT.

Fig. 13: Garbage collection time as a percent of program execution time for PyPy with and without JIT.

Figure 14 and Figure 15 show the sweeps for individual benchmarks for PyPy with and without JIT respectively. The results suggest that one sizing policy is not good for all the benchmarks and that the optimal nursery size also depends on the run-time configuration being used (i.e. with or without JIT). Some applications like eparse, which have a large garbage collection contribution for PyPy both with and without JIT, will benefit from a large nursery. Other applications like fannkuch, which have a low garbage collection contribution for PyPy both with and without JIT, will benefit from a nursery sized for good cache performance. There are also some applications like pyxl_bench which may benefit from a large nursery size for PyPy with JIT and a small nursery size for PyPy without JIT, due to the large change in the garbage collection contribution as a result of running JIT.

Fig. 14: Nursery sweep for individual benchmarks running on PyPy with JIT.

Fig. 15: Nursery sweep for individual benchmarks running on PyPy without JIT.

Figure 12 also shows that the overall cache size affects the trade-off. With larger cache sizes, a larger nursery can fit in the cache and the better cache performance contributes to better overall performance. Figure 16 shows that this trend also exists for V8, suggesting that this trade-off will be important to explore for implementations beyond PyPy.

Fig. 16: Nursery sweep for V8 with different last-level cache sizes.

Figure 17 shows that by choosing the best nursery size for each application, the normalized execution time can drop by an average of 21.4% over the baseline static allocation of half of the cache (i.e. a 1MB nursery for a 2MB cache). In comparison, simply increasing the nursery to the maximum size for all applications would only result in a 9.8% average execution time reduction. These results further suggest that nursery sizing should be done considering cache performance, run-time configuration, and application characteristics.

Fig. 17: Normalized execution time for the best nursery size per benchmark.
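The trade-off argued for here can be summarized with a toy cost model. The sketch below is purely illustrative: the constants are made up rather than measured in this paper, but it captures why the best nursery size moves with application allocation behavior and with the available cache.

#include <stdio.h>

/* Illustrative first-order model only; none of these numbers are measurements. */
#define LLC_BYTES    (2.0 * 1024 * 1024)   /* assumed last-level cache share          */
#define ALLOC_BYTES  (2.0e8)               /* total bytes a benchmark allocates       */
#define COST_PER_GC  (2.0e6)               /* assumed cycles per minor collection     */
#define MISS_PENALTY (0.35)                /* relative slowdown once nursery > LLC    */

static double estimated_cost(double nursery_bytes)
{
    double collections = ALLOC_BYTES / nursery_bytes;    /* fewer GCs as nursery grows */
    double gc_term     = collections * COST_PER_GC / 1.0e9;
    double cache_term  = (nursery_bytes > LLC_BYTES) ? MISS_PENALTY : 0.0;
    return gc_term + cache_term;                         /* arbitrary common units     */
}

int main(void)
{
    /* Sweep nursery sizes from 512 kB to 128 MB, mirroring Figure 12. */
    double best_size = 0.0, best_cost = 1.0e30;
    for (double size = 512.0 * 1024; size <= 128.0 * 1024 * 1024; size *= 2) {
        double c = estimated_cost(size);
        printf("nursery %9.0f kB -> modeled cost %.3f\n", size / 1024, c);
        if (c < best_cost) { best_cost = c; best_size = size; }
    }
    printf("best nursery under this toy model: %.0f kB\n", best_size / 1024);
    return 0;
}

With the small allocation volume assumed above, the cache-fitting nursery wins; raising ALLOC_BYTES (an allocation-heavy benchmark) shifts the optimum toward a much larger nursery, mirroring the per-benchmark differences in Figures 14 and 15.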
last level cache results in lower garbage collection overhead Some studies have directly proposed modifications to the without significant change to the application performance. In CPython interpreter to improve its performance. Cao et al. [22] our work, we argue that nursery sizing should be done in a find that Python dispatch overheads can be 25% of the execu- manner that considers application-specific characteristics, run- tion time and use pretranslation to get up to 18% improvement. time configuration, and cache performance. In some cases, a Power and Rubinsteyn [29] converts the stack-based bytecodes smaller nursery will be better while in other cases, a larger to a register-based format that exposes more possibilities for nursery could be better for overall performance. optimization. The optimizations focus on improving a specific aspect of the interpreter instead of breakdown of various VII.CONCLUSION overheads. In this paper, we perform an extensive characterization of Ilbeyi et al. [19] analyze the performance of Python in Python at various levels to provide insight into new opportuni- the context of a meta-JIT framework. They present overall ties for optimizations in dynamic languages. When looking at execution time comparisons between CPython and PyPy with the sources of overhead, we find that C function call overhead and without JIT in addition to a detailed breakdown of the is repeatedly a major source of overhead in addition to other overhead in the context of the JIT framework. Our work is sources of overhead identified by previous work. In studying complementary as we focus on execution time breakdowns the interaction of the run-time with the underlying hardware, of the interpreter and sweep microarchitectural and run-time we find that PyPy with JIT is sensitive to cache and memory parameters. Our findings on sensitivity to cache and memory parameters. Through further investigation, we find that nursery parameters and the interactions of garbage collection with the size needs to be tuned at the application-specific level, consid- cache are new insights for PyPy. ering the run-time and underlying hardware. While we focus There is more extensive work in Javascript to quantita- primarily on Python for most of our studies, we believe that tively understand the sources of overheads [30]–[36]. Some the main results from our studies can be applicable to other work provides microarchitectural characterization of Javascript dynamic languages. workloads, but they do so to study the differences in bench- VIII.ACKNOWLEDGMENTS mark suites and how well the benchmarks match real work- loads. For example, Tiwari and Solihin [30] analyze the This work was partially supported by the Office of Naval difference between the Sunspider and V8 benchmarks. Dot Research grant N00014-15-1-2175. et al. [32] identify checks as a major source of overhead in REFERENCES V8. Our findings of C function calls and sensitivity to cache and memory designs, which are generalized to V8, are not [1] Coding Dojo, “The 9 most in-demand programming languages of 2016,” 2016, http://www.codingdojo.com/blog/ discussed in the previous work. 9-most-in-demand-programming-languages-of-2016/. Our work points out C function calls as a major contributor [2] Spectrum IEEE, “The 2017 top ten programming languages,” of overhead for dynamic languages. There is much work in im- Jul 2017, https://spectrum.ieee.org/computing/software/ the-2017-top-programming-languages. 
proving the speed of indirect branches and calls in the context [3] S. O’Grady, “The RedMonk programming language rankings: of C++. Some work rely on intelligent interprocedural analysis January 2016,” Feb 2016, http://redmonk.com/sogrady/2016/02/19/ and inlining to optimize indirect calls [37]–[40]. Unlike a language-rankings-1-16/. [4] “Twitter’s shift from Ruby to Java helps it survive US election,” Nov static language, most of the interprocedural analysis techniques 2012, https://www.infoq.com/news/2012/11/twitter-ruby-to-java. would not work in the context of dynamic languages and [5] “Python,” http://www.python.org/. [6] C. F. Bolz, A. Cuni, M. Fijalkowski, and A. Rigo, “Tracing the meta- [30] D. Tiwari and Y. Solihin, “Architectural characterization and similarity level: Pypy’s tracing JIT compiler,” in Proceedings of the 4th workshop analysis of Sunspider and Google’s V8 JavaScript benchmarks,” in on the Implementation, Compilation, Optimization of Object-Oriented Proceedings of the 2012 IEEE International Symposium on Performance Languages and Programming Systems (ICOOOLPS). ACM, 2009. Analysis of Systems & Software (ISPASS). IEEE, 2012. [7] Google, “Chrome V8,” 2018, https://developers.google.com/v8/. [31] G. Southern and J. Renau, “Overhead of deoptimization checks in the [8] H.-J. Boehm and M. Weiser, “Garbage collection in an uncooperative V8 JavaScript engine,” in Proceedings of the 2016 IEEE International Software Practice & Experience environment,” , vol. 18, Sep. 1988. Symposium on Workload Characterization (IISWC). IEEE, 2016. [9] H. Lieberman and C. Hewitt, “A real-time garbage collector based on [32] G. Dot, A. Martinez, and A. Gonzalez, “Analysis and optimization the lifetimes of objects,” Communications ACM, vol. 26, Jun. 1983. of engines for dynamically typed languages,” in Proccedings of the [10] R. R. Fenichel and J. C. Yochelson, “A LISP garbage-collector for 27th International Symposium on Computer Architecture and High virtual-memory computer systems,” Commun. ACM, vol. 12, Nov. 1969. Performance Computing (SBAC-PAD). IEEE, 2015. [11] J. McCarthy, “Recursive functions of symbolic expressions and their [33] Y. Zhu, D. Richins, M. Halpern, and V. J. Reddi, “Microarchitectural im- computation by machine, part I,” Commun. ACM, vol. 3, Apr. 1960. plications of event-driven server-side web applications,” in Proceedings [12] The PyPy Project, “Garbage collection in PyPy,” 2014, https://pypy. of the 48th International Symposium on Microarchitecture (MICRO). readthedocs.io/en/release-2.4.x/garbage collection.html. ACM, 2015. [13] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, [34] T. Ogasawara, “Workload characterization of server-side javascript,” in S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized Proceedings of the 2014 IEEE International Symposium on Workload program analysis tools with dynamic instrumentation,” in Proceedings of Characterization (IISWC). IEEE, 2014. the 2005 ACM SIGPLAN Conference on Programming Language Design [35] M. Musleh and V. Pai, “Architectural characterization of client-side and Implementation (PLDI). ACM, 2005. JavaScript workloads & analysis of software optimizations,” Purdue [14] D. Sanchez and C. Kozyrakis, “Zsim: Fast and accurate microarchi- University Department of Electrical and Computer Engineering, Tech. tectural simulation of thousand-core systems,” in Proceedings of the Rep. 467, 2015, https://docs.lib.purdue.edu/ecetr/467/. 40th Annual International Symposium on Computer Architecture (ISCA). 
VII. CONCLUSION

In this paper, we perform an extensive characterization of Python at multiple levels to provide insight into new opportunities for optimization in dynamic languages. Looking at the sources of overhead, we find that C function calls are consistently a major source of overhead, in addition to the sources identified by previous work. In studying the interaction of the run-time with the underlying hardware, we find that PyPy with JIT is sensitive to cache and memory parameters. Through further investigation, we find that the nursery size needs to be tuned per application, taking the run-time configuration and the underlying hardware into account. While we focus primarily on Python for most of our studies, we believe that the main results are also applicable to other dynamic languages.

VIII. ACKNOWLEDGMENTS

This work was partially supported by the Office of Naval Research grant N00014-15-1-2175.

REFERENCES

[1] Coding Dojo, "The 9 most in-demand programming languages of 2016," 2016, http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/.
[2] IEEE Spectrum, "The 2017 top ten programming languages," Jul 2017, https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages.
[3] S. O'Grady, "The RedMonk programming language rankings: January 2016," Feb 2016, http://redmonk.com/sogrady/2016/02/19/language-rankings-1-16/.
[4] "Twitter's shift from Ruby to Java helps it survive US election," Nov 2012, https://www.infoq.com/news/2012/11/twitter-ruby-to-java.
[5] "Python," http://www.python.org/.
[6] C. F. Bolz, A. Cuni, M. Fijalkowski, and A. Rigo, "Tracing the meta-level: PyPy's tracing JIT compiler," in Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (ICOOOLPS). ACM, 2009.
[7] Google, "Chrome V8," 2018, https://developers.google.com/v8/.
[8] H.-J. Boehm and M. Weiser, "Garbage collection in an uncooperative environment," Software: Practice & Experience, vol. 18, Sep. 1988.
[9] H. Lieberman and C. Hewitt, "A real-time garbage collector based on the lifetimes of objects," Commun. ACM, vol. 26, Jun. 1983.
[10] R. R. Fenichel and J. C. Yochelson, "A LISP garbage-collector for virtual-memory computer systems," Commun. ACM, vol. 12, Nov. 1969.
[11] J. McCarthy, "Recursive functions of symbolic expressions and their computation by machine, part I," Commun. ACM, vol. 3, Apr. 1960.
[12] The PyPy Project, "Garbage collection in PyPy," 2014, https://pypy.readthedocs.io/en/release-2.4.x/garbage_collection.html.
[13] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 2005.
[14] D. Sanchez and C. Kozyrakis, "ZSim: Fast and accurate microarchitectural simulation of thousand-core systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA). ACM, 2013.
[15] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A cycle accurate memory system simulator," IEEE Computer Architecture Letters, vol. 10, Jan 2011.
[16] PyPerformance, "Python performance benchmark suite," 2017, http://pyperformance.readthedocs.io/.
[17] Browserbench, "JetStream 1.1," 2017, http://browserbench.org/JetStream/.
[18] G. Barany, "Python interpreter performance deconstructed," in Proceedings of the Workshop on Dynamic Languages and Applications (Dyla). ACM, 2014.
[19] B. Ilbeyi, C. F. Bolz-Tereick, and C. Batten, "Cross-layer workload characterization of meta-tracing JIT VMs," in Proceedings of the 2017 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2017.
[20] M. Chandra, N. Krintz, C. Cascaval, D. Edelsohn, P. Nagpurkar, and P. Wu, "Understanding the potential of interpreter-based optimizations for Python," UC Santa Barbara Computer Science, Tech. Rep. 2010-14, August 2010, https://www.cs.ucsb.edu/research/tech-reports/2010-14.
[21] S. Brunthaler, "Speculative staging for interpreter optimization," CoRR, vol. abs/1310.2300, 2013.
[22] H. Cao, N. Gu, K. Ren, and Y. Li, "Performance research and optimization on CPython's interpreter," in Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2015.
[23] C. Kim, S. Kim, H. G. Cho, D. Kim, J. Kim, Y. H. Oh, H. Jang, and J. W. Lee, "Short-circuit dispatch: Accelerating virtual machine interpreters on embedded processors," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA). IEEE, 2016.
[24] K. Casey, M. A. Ertl, and D. Gregg, "Optimizing indirect branch prediction accuracy in virtual machine interpreters," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 29, 2007.
[25] M. A. Ertl and D. Gregg, "The structure and performance of efficient interpreters," Journal of Instruction-Level Parallelism, vol. 5, 2003.
[26] R. Heynssens, "Performance analysis and benchmarking of Python, a modern scripting language," Master's thesis, Universiteit Gent, Belgium, 2014.
[27] X. Cai, H. P. Langtangen, and H. Moe, "On the performance of the Python programming language for serial and parallel scientific computations," Scientific Programming, vol. 13, 2005.
[28] H. P. Langtangen and X. Cai, "On the efficiency of Python for high-performance computing: A case study involving stencil updates for partial differential equations," in Modeling, Simulation and Optimization of Complex Processes. Springer, 2008, pp. 337–357.
[29] R. Power and A. Rubinsteyn, "How fast can we make interpreted Python?" CoRR, vol. abs/1306.6047, 2013.
[30] D. Tiwari and Y. Solihin, "Architectural characterization and similarity analysis of Sunspider and Google's V8 JavaScript benchmarks," in Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS). IEEE, 2012.
[31] G. Southern and J. Renau, "Overhead of deoptimization checks in the V8 JavaScript engine," in Proceedings of the 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2016.
[32] G. Dot, A. Martinez, and A. Gonzalez, "Analysis and optimization of engines for dynamically typed languages," in Proceedings of the 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2015.
[33] Y. Zhu, D. Richins, M. Halpern, and V. J. Reddi, "Microarchitectural implications of event-driven server-side web applications," in Proceedings of the 48th International Symposium on Microarchitecture (MICRO). ACM, 2015.
[34] T. Ogasawara, "Workload characterization of server-side JavaScript," in Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2014.
[35] M. Musleh and V. Pai, "Architectural characterization of client-side JavaScript workloads & analysis of software optimizations," Purdue University Department of Electrical and Computer Engineering, Tech. Rep. 467, 2015, https://docs.lib.purdue.edu/ecetr/467/.
[36] A. Srikanth, "Characterization and optimization of JavaScript programs for mobile systems," Ph.D. dissertation, University of Texas at Austin, May 2013.
[37] K. Hazelwood and D. Grove, "Adaptive online context-sensitive inlining," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (CGO). IEEE, 2003.
[38] D. P. Grove, "Effective interprocedural optimization of object-oriented languages," Ph.D. dissertation, University of Washington, 1998.
[39] G. Aigner and U. Hölzle, "Eliminating virtual function calls in C++ programs," in Proceedings of the 10th European Conference on Object-Oriented Programming (ECOOP). Springer-Verlag, 1996.
[40] D. F. Bacon, "Fast and effective optimization of statically typed object-oriented languages," Ph.D. dissertation, University of California, Berkeley, 1997.
[41] H. Kim, J. A. Joao, O. Mutlu, C. J. Lee, Y. N. Patt, and R. Cohn, "VPC prediction: Reducing the cost of indirect branches via hardware-based dynamic devirtualization," in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA). ACM, 2007.
[42] M. U. Farooq, L. Chen, and L. Kurian, "Value based BTB indexing for indirect jump prediction," in Proceedings of the IEEE 16th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2010.
[43] J. A. Joao, O. Mutlu, H. Kim, R. Agarwal, and Y. N. Patt, "Improving the performance of object-oriented languages with dynamic predication of indirect jumps," in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2008.
[44] K. Driesen and U. Hölzle, "Multi-stage cascaded prediction," in European Conference on Parallel Processing. Springer, 1999.
[45] P.-Y. Chang, E. Hao, and Y. N. Patt, "Target prediction for indirect jumps," in Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA). ACM, 1997.
[46] D. S. McFarlin and C. Zilles, "Bungee jumps: Accelerating indirect branches through HW/SW co-design," in Proceedings of the 48th International Symposium on Microarchitecture (MICRO). ACM, 2015.
[47] V. K. Reddy, R. K. Sawyer, and E. F. Gehringer, "A cache-pinning strategy for improving generational garbage collection," in International Conference on High-Performance Computing. Springer, 2006.
[48] P. R. Wilson, M. S. Lam, and T. G. Moher, "Caching considerations for generational garbage collection," in ACM SIGPLAN Lisp Pointers, no. 1. ACM, 1992.
[49] S. M. Blackburn, P. Cheng, and K. S. McKinley, "Myths and realities: The performance impact of garbage collection," in Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance). ACM, 2004.