Python and Parallel Programming

Shichao An
New York University
[email protected]

Abstract

This paper studies how to utilize parallelism in Python applications by comparing three different parallel programming methods. CPython's multithreading interface threading and its process-based "threading" interface multiprocessing are discussed in terms of employing built-in functionalities, and OpenMP through cython.parallel in Cython is introduced in terms of thread-based native parallelism. CPU-bound and I/O-bound programs are parallelized using these methods to demonstrate features and to compare performances. As essential supplements, a derivative CPU-bound application and a matrix multiplication program are parallelized to study the overhead and memory aspects.

1 Introduction

The difficulty involved with writing and debugging parallel code has become one obvious obstacle to developing parallel programs. With the development effort being the bottleneck, and given the prevalence of multiprocessor and multicore systems, it is a rational approach to give priority to efficient development and testing techniques and reusable parallel libraries rather than machine efficiency [1]. This has led to the increasing use of high-level languages such as Python, which supports multiprocessing and multithreading and guarantees highly efficient development.

The standard library of Python's reference implementation, CPython (2.x), provides the following built-in modules: the lower-level thread module and the higher-level threading and multiprocessing modules. The threading module provides an easy-to-use threading API built on top of the thread module, and the multiprocessing module applies a similar interface to spawning and managing subprocesses.

Most programmers will intuitively start with the default multithreading module, threading. It makes use of real system threads and provides a friendly thread management interface and synchronization mechanisms such as Lock, RLock, Condition and Semaphore. However, the Global Interpreter Lock (GIL) ensures that only one thread can execute Python code at once. This makes it easier for the interpreter to be multi-threaded, but at the expense of much of the parallelism afforded by multicore and multiprocessor machines [2].

When realizing the restrictions of the GIL and needing to make better use of the computational resources, programmers may choose multiprocessing to side-step the GIL. It gives a similar interface to threading but uses subprocesses to enable programmers to fully leverage multiple processors. However, it also brings overhead in process switching and inter-process communication.

Given the nature of the GIL and the expense of multiprocessing, native parallelism should be considered while still keeping an eye on development efficiency. Cython with OpenMP is such a solution: it seems less intuitive, but it appeals to those familiar with OpenMP, integrates easily with existing Python code, and releases the GIL for better parallel performance.

This paper provides a comprehensive study of the three approaches by comparing them under different circumstances and analyzing their features in light of the performance comparison. It aims at providing an evaluation report to help programmers better decide which programming method to use to efficiently parallelize their various applications.

2 Literature Survey

Use of Python on shared-memory multiprocessor or multicore systems is prevalent. Much of the work employs Python in the scientific computing field, e.g. parallel numerical computations. Other work discusses specific aspects in a more general field.

Masini et al. [3] present the porting process of a multicore eigensolver to Python and discuss the pros and cons of using Python in a high-performance parallel library. They focus on Python's advantages in improving code readability, flexibility and correctness compared to C and Fortran within parallel numerical computations. The work also introduces performance issues by explaining the impact of the Global Interpreter Lock. However, their work does not discuss multithreading in detail in terms of parallelism, and it does not give an alternative or a solution to the performance overhead and contention.

The parallelization experiments using multiprocessing and threading by Eli Bendersky [4] are more relevant to my work. The demonstration of parallelizing CPU-bound tasks has greatly influenced my thinking. It gives a complete benchmark of the serial, multithreading, and multiprocessing programs to demonstrate the Python thread's weakness in speeding up CPU-bound computations and the simplicity of parallelizing code using multiprocessing. Another experiment in threading also mentions the applicable scope of Python threads. However, these experiments do not present a comparison of multithreading and multiprocessing in parallelizing I/O-bound and memory-bound programs, so they do not reach a thorough conclusion.

David Beazley's articles and presentations [5][6] provide elaborate illustrations of the Global Interpreter Lock (GIL), with an exploration from Python script down to C source code. His work delves into the behind-the-scenes details of the GIL by analyzing behavior and implementations from both the interpreter and operating-system perspectives. While many aspects of his work are far beyond the scope of this paper, his focus is on the GIL's impact on Python threads doing CPU-bound and I/O-bound tasks. The study is essentially about the threading behavior and potential improvement of the GIL itself, not the more general approach of parallel programming in Python.

In other related work [7][8], the implementations simply adopt either threading or multiprocessing without comparing the two and generalizing their pros and cons. They do not discuss under what circumstances one method is applicable. My work provides an overview that encapsulates the different parallel programming approaches by discussing the advantages and drawbacks of each and identifying their applicable scopes as a result of parallelizing CPU-bound, I/O-bound and memory-bound programs.

3 Proposed Idea

Despite the fact that threading, multiprocessing and Cython/OpenMP address parallel problems at different levels of granularity, they are still often used as alternatives to each other, especially when the drawbacks of one become the bottleneck under a specific circumstance. This is largely due to their implementation discrepancies, which may in return bring improvements in either performance or development efficiency. In order to demonstrate various aspects of the three approaches, several sample programs are chosen and parallelized using these approaches. The concentration is on the parallelization of the CPU-bound and I/O-bound programs; the parallelization of the memory-bound program and the overhead analysis focuses on comparing the efficiency of the different types of supported data structures and the performance loss of the multi-threaded and multi-processing programs.

Parallelizing Python programs does not differ much from parallelizing programs in other languages. It is a common approach to identify parallelism, enforce dependencies and handle synchronization. The essential part is to think of the problem in terms of a Python programmer who wants things done in the most "intuitive way". This means it is better to use the simplest supported interfaces (packages, modules, functions and data structures) that are clearly understandable, with the least coding and debugging effort, while ensuring correctness.

When implementing a parallel version using one approach, comparable counterparts in the other approaches should be taken into account. For example, when implementing a global list with the threading approach, the correspondingly applicable data structures in the multiprocessing approach, such as Array in shared memory or ListProxy, should be considered instead of other data structures such as Queue with a different implementation logic, unless the entire parallel pattern is otherwise different. Keeping the similarity of logic in this way gives programmers a better observation of the performance.

However, the above rule has a disadvantage in that it may cause a loss in performance when implementing other approaches. In Cython, for instance, the global interpreter lock (GIL) cannot be released if Python built-ins or objects are manipulated in parallel sections, which makes it impossible to achieve native parallelism. Therefore, a better solution is to find a balance that deals with both aspects and to put the analysis of the confounding factors into an independent experiment. As a promising result, further parallelization can be performed to demonstrate the side-effects compared to the serial version, such as GIL restrictions, overhead in managing threads or processes, and contention in using locks, which will be included in the overhead analysis programs.

With the abovementioned parallelization work done, comparison of the parallelized programs makes more sense, particularly on two major criteria: development efficiency and performance. The complexity of the parallelized program, which may be reflected by many factors such as the total number of lines of code, determines the primary development effort. Compatibility is also one aspect that programmers pay much attention to in order to interface with existing applications; this generally involves the conversion of data structures under these approaches. Unlike scripting, Cython/OpenMP requires an extra setup script and a compile procedure that is platform-dependent, which also affects development efficiency through increased DevOps workload and reduced portability.

Besides development, performance is important in most cases and is obtained by benchmarking. It provides the basis on which the overhead, produced by various factors such as thread/process creation and scheduling, the GIL and the data structures, can be analyzed. By comparing the performances of the implementations of these approaches, the overall applicability of each approach can be determined, which comprises the ultimate goal of this paper.

4 Experimental Setup

4.1 Environment

The experiments are conducted on a MacBook Pro with an Intel Core i7 (2.9 GHz, 4 MB on-chip L3 cache, 2 physical cores, 4 logical cores [9]). The operating system is OS X 10.8.5 (Darwin 12.5.0, x86_64). The applications and their parallelized versions are programmed in Python 2.7.3 and Cython 0.19.2 under virtualenv 1.9.1. The GCC version is 4.2.1 (Apple Inc. build 5666), which supports the OpenMP v2.5 specification.

4.2 Experimental Applications

There are four groups of applications under study in the experiments. Each group has an original sequential program labeled serial and parallelized versions using the threading, multiprocessing, and Cython/OpenMP approaches, which are labeled mt, mp, and omp respectively. The following paragraphs in this subsection describe what task each application group performs. The parallelization procedure will be discussed in the Experiments and Discussion section.

CPU-bound. This application performs a time-consuming task that processes a freshly generated file of data. The file contains lines of origin, destination and in-between distance, represented as random integers. The task is to remove duplicate appearances of origin and destination and to fix the distance of those lines where origin and destination are reversed, so as to generate a consistent list. The program is expected to be CPU-bound, which can be verified using psutil (discussed in the next subsection).

I/O-bound. This application is a simple word parser that checks whether each possible combination of words in the input sentence or paragraph comprises an accessible title of an article on Wikipedia. It performs a large number of successive HTTP GET requests to the Wikipedia API, analyzes the returned JSON responses, and generates output listing all titles of the accessible entries.

Memory-bound. This is a trivial matrix multiplication program that manipulates large matrices, where two random matrices are multiplied. Although in practice this program is not necessarily memory-bound (otherwise it must be CPU-bound), it is assumed to be memory-bound due to the nature of matrix-vector multiplication. With its simple source code, this program is also suitable for illustrating the comparison between statically and dynamically typed code in Python and Cython.

Overhead analysis. This one is derived from the CPU-bound program but does a different task. It processes the output from the CPU-bound program and performs calculations. After that, it outputs the maximum and minimum distances in that list as well as a list of origins and destinations, each with the smallest in-between distance. As the task is not time-consuming, the parallelism in general will not gain any speedup but brings overhead. The parallelized programs can illustrate the overhead generated by thread/process management and by the different data structures supported in multiprocessing.

4.3 Profilers and Benchmarks

In the experiments, the following tools are used to profile and benchmark the programs after parallelization.

The time function of Python's time module is used for timing programs (real time). The programs are non-trivial, so it is better to measure the wall-clock time to reflect the production environment. This is the primary benchmark that comprises most of the evaluation results.

To further measure user and system times of the processes for analysis purposes, psutil is used as a general process status utility in a programmatic way [10]. It can also be used to help determine the types of programs by checking CPU usage and network connections. cProfile, the standard profiler for Python, is used to provide deterministic profiling for the programs to find bottlenecks.

As one of the major comparison criteria in the parallelization experiments, development efficiency needs manual evaluation from a different analysis perspective than performance. This means some subjectiveness may be included, such as the programming effort to ensure interface compatibility and the debugging difficulty (such as Cython's compile-time checks). However, code checking tools such as flake8 [11] and CLOC [12] will still be used to help measure development efficiency by providing substantial statistics such as McCabe (cyclomatic) complexity and lines of code.

5 Experiments and Discussion

Development efficiency and performance are the two major measures used to make an evaluation overview of the parallelizing methods.

Development efficiency is evaluated by comparing the parallelized source code with the original sequential one and measuring the following criteria:

• Number of additional necessary interfaces used to perform parallelization and to ensure correctness (thread safety);
• Effort to ensure interface compatibility with the original program;
• Extra configuration and debugging steps for the parallelized program;
• Increment in the program complexity.

Performance is evaluated from the timing perspective with the following criteria:

• Real time to run the parallelized programs, which reflects how much speedup or overhead has actually been brought;
• User/system timing of the parallelized programs, which reflects how much actual parallelism is brought.

The expected results are case dependent but can be generalized. In development efficiency, threading should always require less coding effort and offer better compatibility than the other two, while Cython/OpenMP requires the most effort and thus provides the least efficiency. In performance, multiprocessing and Cython/OpenMP generally perform better than threading in CPU-intensive tasks, but threading and Cython/OpenMP perform better in I/O-bound programs. In terms of parallelism, however, only Cython/OpenMP and multiprocessing bring native parallelism, while threading cannot due to the restriction of the GIL.

5.1 Parallelization

The parallelization is done by using each approach to implement the parallel part in each application group. This parallel part is derived from a function in the sequential program that basically works as a modularized interface. In the source code, this function is proc() (an abbreviation of "processing"), which executes the parallel tasks by processing the common input and generating the output, both of which are usually globally defined. As few lines of code as possible are used in the implementation to keep the logic simple and to ensure interface compatibility and correctness.

To create multiple threads, the threading programs always employ Thread by specifying the target and args arguments with a worker function and its parameters respectively, without subclassing the Thread class, roughly as sketched below.
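A minimal sketch of this pattern follows. The worker and the chunk division are hypothetical stand-ins, not the paper's actual proc() implementations:

    import threading

    def worker(chunk, results, index):
        # stand-in task: the real programs fix distance records or issue HTTP requests
        results[index] = sum(chunk)

    def proc(data, n):
        # divide the input into n roughly equal chunks, one per thread
        size = (len(data) + n - 1) // n
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        results = [None] * len(chunks)
        threads = [threading.Thread(target=worker, args=(chunk, results, i))
                   for i, chunk in enumerate(chunks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

Because each thread writes to a distinct index of a plain Python list, no explicit Lock is needed in this particular sketch; the GIL still serializes the actual execution of Python bytecode.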

Nearly all types of data structures in this approach will be the same as those in the sequential program, which involves Python built-ins such as list and dictionary. Queue.Queue is used to implement a multi-consumer queue for the I/O-bound programs, which adopt the thread pool pattern.

The multiprocessing approach follows a similar process creation method to threading. It involves many ready-to-use shared-memory data structures, such as Value, Array and the other types supported by the Manager object. In the I/O-bound parallelization, multiprocessing.Queue is the counterpart of Queue.Queue in the threading approach.

The Cython/OpenMP approach introduces C/C++ data types using memoryviews (CPU-bound, memory-bound and overhead) and uses prange to construct the OpenMP for loop. The GIL is released in this parallel region. Some variant versions are also created that operate on dynamic types or do not release the GIL, to study the impact of these factors.

Similar patterns are used for better observation in each application group. All approaches either divide tasks into chunks (CPU-bound, memory-bound and overhead analysis) or use the thread pool pattern (I/O-bound).

5.2 Results and Analysis

5.2.1 Performance

The performance of each approach is evaluated against all programs parallelized. Firstly, the evaluation is conducted by measuring the real times consumed by the proc() functions, which are illustrated in Figures 1-6. In these figures, the abbreviated notation in the legend has the following meanings:

serial: original sequential program
mt: parallelized version using the threading approach
mp: parallelized version using the multiprocessing approach with the default shared-memory data types
mp (manager): parallelized version using the multiprocessing approach with data types supported by Manager
omp: parallelized version using the Cython/OpenMP approach
omp (GIL): without releasing the global interpreter lock (GIL)
omp (no GIL): releasing the global interpreter lock (GIL)
omp (static type): using C/C++ data types
omp (dynamic type): using Python data types

Figure 1: CPU-bound

In the evaluation of the CPU-bound application group (Figure 1), the behaviors of the approaches are diversified. The mt approach is sensitive to the number of threads (n) and produces much overhead as n increases exponentially. Figure 2 shows the profile results of the mt approach, ordered by tottime, the total time spent in the given function (excluding time spent in calls to sub-functions). Significant time is spent in the acquire method of thread.lock objects, which is a major cause of the overhead generated during thread switching. This is essentially due to the fact that the GIL ensures that only one thread runs in the Python interpreter at once, and only the thread that has acquired the GIL may operate on Python objects or call Python/C API functions [13]. The mp approach performs better as n increases because the subprocesses can run simultaneously without being restricted by the GIL. Though overhead is produced, especially when n is large, parallelism predominates so that the speedup is obvious.

The omp (no GIL) approach of the CPU-bound group behaves somewhat opposite to mp. The true parallelism does not bring speedup because it cannot efficiently access the tg array, which is a memoryview [14] on the NumPy array ntg (the target list), when multi-threaded (this is the case in this program only, and may not apply to all CPU-bound programs). As seen in Table 1, when n (the number of threads) is 8, the total user time of cpu_bound.omp (no GIL, static type) is around 39 seconds, which indicates the average user time of each thread is around 5 seconds. This is even larger than that of the single-threaded omp (no GIL) version. The omp (GIL) approach, however, makes OpenMP aware of the GIL, which wraps all the code within the parallel for loop. This essentially limits the parallelism of OpenMP, and it always behaves like the single-threaded version whatever n is. Both omp (no GIL) and omp (GIL) operate on statically typed data, so an approximately three-fold speedup was gained.
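For reference, a minimal Cython kernel in the style of the omp variants is sketched below. The function and variable names are illustrative only (a plain reduction over a typed memoryview), not the paper's actual kernel, and the module must be compiled with OpenMP enabled (e.g. -fopenmp for GCC):

    # distances.pyx -- illustrative sketch only
    cimport cython
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def total_distance(double[:] dist):
        cdef Py_ssize_t i
        cdef double s = 0.0
        # nogil=True releases the GIL for the loop body, so the OpenMP
        # threads run natively; Cython treats s as a reduction variable
        for i in prange(dist.shape[0], nogil=True, num_threads=4):
            s += dist[i]
        return s

The omp (GIL) variants correspond to loop bodies that must hold the GIL (for example, to manipulate Python objects), which serializes the threads again, and the dynamic-type variants operate on Python objects instead of typed memoryviews.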

ncalls    tottime    percall    cumtime    percall    filename:lineno(function)
   768    16.755     0.022      16.755     0.022      method 'acquire' of 'thread.lock' objects
   128     3.805     0.030      21.515     0.168      threading.py:484(start)
   128     0.941     0.007      16.305     0.127      threading.py:234(wait)
   128     0.011     0.000       0.011     0.000      thread.start_new_thread
   256     0.010     0.000       0.010     0.000      threading.py:185(__init__)

Figure 2: cProfile result of mt approach in the CPU-bound application group
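A profile like Figure 2 can be reproduced with a few lines of cProfile and pstats; the call to proc() stands for whatever parallel section is under test:

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    proc()                     # the parallel section under test
    profiler.disable()

    # sort by the total time spent inside each function (tottime),
    # which is the ordering used in Figure 2
    pstats.Stats(profiler).sort_stats('tottime').print_stats(5)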

Figure 3: I/O-bound

In the I/O-bound application group (Figure 3), the mt and omp (GIL, dynamic) approaches share similar behaviors. For mt, when the threads are doing I/O, the GIL is always released [2]. Another observation is that mp does not gain much speedup when n is below 8, but gains roughly the same speedup as mt and omp (GIL, dynamic) when n is larger. Through observation, it is found that the subprocesses usually cannot finish their divided chunks of tasks in fairly equal time. Due to process scheduling, some processes may take an extraordinarily long time to join compared with the other processes, which becomes the bottleneck. However, when n is larger, each chunk of work is relatively smaller, and each process can finish it in a short time before being preempted by the process scheduler. The general pattern of the mp variants is sketched below.
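The sketch uses a hypothetical worker; a multiprocessing.Queue carries the results back because subprocesses do not share the parent's Python objects:

    import multiprocessing

    def worker(chunk, queue):
        # stand-in task; the real programs issue HTTP requests or fix records
        queue.put(sum(chunk))

    def proc(data, n):
        queue = multiprocessing.Queue()
        size = (len(data) + n - 1) // n
        procs = [multiprocessing.Process(target=worker,
                                         args=(data[i:i + size], queue))
                 for i in range(0, len(data), size)]
        for p in procs:
            p.start()
        # drain the queue before join() so a child never blocks on a full pipe
        results = [queue.get() for _ in procs]
        for p in procs:
            p.join()
        return results

If one chunk takes much longer than the others, its process is the last to join(), which is exactly the bottleneck described above for small n.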

Figure 4: Memory-bound (1)

Figure 5: Memory-bound (2)

The behaviors of mt and mp in the memory-bound application group (Figure 4) can be explained similarly to those in the CPU-bound group. When single-threaded, omp (no GIL, static type) and omp (GIL, static type) are also similar; the former suffers from overhead when multi-threaded. The efficient memory access to the matrices in these two approaches, via a statically typed memoryview on a 2D NumPy array, brings a predominant performance boost. The omp (GIL, dynamic type) approach is rather a counterexample that behaves similarly to mt. To further demonstrate the performance gap between the two types of data, Figure 5 specifically compares the serial (sequential) and omp approaches with different types.

Figure 6: Overhead analysis (1)

The overhead analysis group (Figures 6 and 7) focuses on the circumstances where overhead is predominant. Particularly for the mp approach, most of the overhead originates from using the shared-memory data and locks supported by the multiprocessing module. Figure 7 shows another comparison of the efficiency gap between the shared memory and manager implementations of the mp approach. A manager object returned by Manager() controls a server process which supports arbitrary object types and allows other processes to manipulate them using proxies. It is slower than shared memory, but it can be shared by processes on different computers over a network [15].

Figure 7: Overhead analysis (2)

Table 1 lists the times for all approaches under the measured application groups. The predominant user times are highlighted. The cpu_bound.mp, memory_bound.mp and cpu_bound.omp (no GIL, static type) lines reveal the native parallelism that multiprocessing and Cython/OpenMP are able to achieve. In these lines, the user times are approximately four times the real times, which generally reflects the number of logical cores on the testing machine.

5.2.2 Development Efficiency

Development efficiency is measured by evaluating the incremental development effort, which indicates how much more effort one must make to implement the parallelization compared to the sequential program. The higher the incremental development effort, the lower the development efficiency. To make the measurement more accurate in a statistical manner, a formula is defined with the following factors:

L: Total incremental lines of code (LOC), i.e. the extra LOC relative to the sequential program, measured with CLOC [12].

M: Incremental McCabe complexity. Assuming the complexity of function i is ci, the incremental complexity over the n functions is the sum of (ci − 2) for i = 1, ..., n. It is measured both automatically (with mccabe [16]) and manually.

I: Additional interfaces used. This is the count of imports of the relevant modules.

C: LOC for converting the data structures of the input and output of the function proc() so that it stays consistent with the sequential version.

E: Extra configuration LOC (referring to the Cython setup.py script).

The formula for the incremental development effort (D) is:

D = L + M + I + (C + E) × 2    (1)

For example, for cpu_bound.omp in Table 2, D = 16 + 2 + 4 + (7 + 4) × 2 = 44.

The evaluation result is shown in Table 2. The Cython/OpenMP approach generally requires more effort, mainly due to the extra configuration required by Cython and the additional C/C++ interfacing in the source code. It should be noted that Cython/OpenMP requires less development effort when implementing a trivial program such as the memory-bound one, because of the lower complexity and fewer converting lines of code; this may not apply to a production environment.

Other aspects may include debugging difficulties. It is generally harder to debug multi-threaded code than sequential code. Besides, for Cython, every time the code is changed it must be re-compiled; if the project is large, this is time-consuming compared to pure Python scripting, which is naturally friendly to the "print debugging" technique. Luckily, Cython comes with an extension for the GNU Debugger, cygdb, that helps users debug Cython code [17]. It can be used along with pdb for Python applications built with Cython extensions.
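The extra configuration counted under E is essentially a small build script of the following shape. This is a generic sketch (the module and file names are hypothetical), using the GCC OpenMP flags that match the test environment:

    # setup.py -- build the Cython/OpenMP extension in place with:
    #   python setup.py build_ext --inplace
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    extensions = [
        Extension(
            "omp_kernel",                      # hypothetical module name
            ["omp_kernel.pyx"],
            extra_compile_args=["-fopenmp"],
            extra_link_args=["-fopenmp"],
        )
    ]

    setup(ext_modules=cythonize(extensions))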

Program                                   Proc time   Real time   User time   System time   User time (workers)   System time (workers)
cpu_bound.mt                              21.1661     21.6565     19.1419     8.4130        -                     -
cpu_bound.mp                              7.1837      7.6333      0.2066      0.1090        23.8242               0.3836
cpu_bound.omp (no GIL, static type)       10.4938     10.9693     39.0557     0.1261        -                     -
cpu_bound.omp (GIL, static type)          4.115       4.5523      4.3176      0.1308        -                     -
io_bound.mt                               2.9933      3.8690      0.8993      0.1509        -                     -
io_bound.mp                               16.4212     17.2963     0.6450      0.1209        0.3836                0.1307
io_bound.omp                              2.9347      3.7645      0.8770      0.1439        -                     -
memory_bound.mt                           19.7836     20.3949     19.1920     2.6024        -                     -
memory_bound.mp                           8.2920      8.9134      0.3793      0.1062        31.9653               0.1055
memory_bound.omp (no GIL, static type)    0.3803      1.0390      0.7988      0.1024        -                     -
memory_bound.omp (GIL, static type)       4.115       4.5523      4.3176      0.1308        -                     -
overhead.mt                               0.1348      0.5210      0.2539      0.1660        -                     -
overhead.mp                               4.2435      4.7062      0.2252      0.1375        3.6614                2.5615
overhead.omp                              0.0983      0.5214      0.2232      0.1500        -                     -

Table 1: Parallelism analysis over user/sys timing when n (number of threads or processes) is 8
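The per-process user/system columns of Table 1 can be collected programmatically with psutil, roughly as sketched below; recent psutil releases spell the methods cpu_times() and children(), while older releases used get_-prefixed names:

    import os
    import psutil

    main = psutil.Process(os.getpid())
    user, system = main.cpu_times()[:2]        # the parent's own CPU times
    for child in main.children(recursive=True):
        # summing these gives the "worker processes" columns of Table 1
        c_user, c_system = child.cpu_times()[:2]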

Program               Total / Incremental LOC (L)   Incremental McCabe complexity (M)   Additional interfaces (I)   Converting LOC (C)   Extra config LOC (E)   Incremental development effort (D)
cpu_bound.serial      27                            6                                   -                           -                    -                      -
cpu_bound.mt          +22                           +3                                  2                           0                    0                      27
cpu_bound.mp          +23                           +3                                  2                           3                    0                      31
cpu_bound.omp         +16                           +2                                  4                           7                    4                      44
io_bound.serial       55                            8                                   -                           -                    -                      -
io_bound.mt           +36                           +6                                  5                           8                    0                      63
io_bound.mp           +36                           +6                                  5                           8                    0                      63
io_bound.omp          +33                           +5                                  6                           8                    4                      68
memory_bound.serial   30                            3                                   -                           -                    -                      -
memory_bound.mt       +17                           +3                                  2                           0                    0                      22
memory_bound.mp       +17                           +3                                  2                           11                   0                      44
memory_bound.omp      +10                           +0                                  3                           3                    4                      27
overhead.serial       31                            8                                   -                           -                    -                      -
overhead.mt           +27                           +3                                  2                           0                    0                      32
overhead.mp           +27                           +3                                  2                           5                    0                      42
overhead.omp          +44                           +1                                  3                           17                   5                      92

Table 2: Incremental development effort

Approach          Easiness    Performance   Versatility   Control
threading         Very easy   Low           Medium        Full
multiprocessing   Easy        Medium        High          Full
Cython/OpenMP     Hard        High          High          Partial

Table 3: Comparison overview of the approaches

6 Conclusion

A comparison overview of the three approaches is listed in Table 3. The easiness column indicates the development efficiency, and the control column indicates how much control the programmer has.

As Python modules, the threading and multiprocessing approaches are easier to use for implementing multi-threaded or multiprocessing programs. They also give programmers more high-level control due to their full-fledged APIs dedicated to handling concurrent and parallel tasks. Because of the restriction of the GIL, however, threading cannot achieve true parallelism; it mainly applies to non-CPU-intensive tasks such as I/O-bound programs that release the GIL. multiprocessing applies to more programs, but suffers much overhead due to the expense of creating and maintaining processes.

The Cython/OpenMP approach generally performs better if implemented properly, that is, releasing the GIL and using static types in the parallel region. The native thread-based parallelism and the exclusive option to harness static types make Cython/OpenMP able to achieve the expected performance gain in most cases. The disadvantage is that it requires C/C++-style coding and interfacing as well as extra setup scripts and a compile procedure, which makes it harder to deploy and maintain in the hands of pure Python developers.

In production, choosing an appropriate approach largely depends on the types of tasks to be done and the development cost to be invested in the project. Benchmarking is the most direct and persuasive way to determine the most appropriate approach for a project.

References

[1] Konrad Hinsen. Parallel Scripting with Python. Computing in Science & Engineering, 9(6):82-89, November 2007.
[2] Glossary. http://docs.python.org/2/glossary.html#term-global-interpreter-lock
[3] Stefano Masini, Paolo Bientinesi. High-Performance Parallel Computations Using Python as High-Level Language. Lecture Notes in Computer Science, 6586:541-548, 2011.
[4] Eli Bendersky. Parallelizing CPU-bound tasks with multiprocessing. http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/
[5] David Beazley. Understanding the Python GIL. http://www.dabeaz.com/python/UnderstandingGIL.pdf
[6] David Beazley. Inside the Python GIL. http://www.dabeaz.com/python/GIL.pdf
[7] J. Rudd et al. A multi-core parallelization strategy for statistical significance testing in learning classifier systems. Evolutionary Intelligence, October 2013.
[8] S. Molden. Using Python, multiprocessing and NumPy/SciPy for parallel numerical computing. http://folk.uio.no/sturlamo/python/multiprocessing-tutorial.pdf
[9] ARK — Intel Core i7-3520M Processor (4M Cache, up to 3.60 GHz). http://ark.intel.com/products/64893

[10] psutil - A cross-platform process and system utilities module for Python. https://code.google.com/p/psutil/
[11] Flake8 - flake8 2.0 documentation. https://flake8.readthedocs.org/en/2.0/
[12] CLOC - Count Lines of Code. http://cloc.sourceforge.net/
[13] Thread State and the Global Interpreter Lock. http://docs.python.org/2/c-api/init.html#threads
[14] Typed Memoryviews - Cython 0.19.2 documentation. http://docs.cython.org/src/userguide/memoryviews.html
[15] Sharing state between processes - Python v2.7.6 documentation. http://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes
[16] mccabe 0.2.1 : Python Package Index. https://pypi.python.org/pypi/mccabe
[17] Debugging your Cython program - Cython 0.19.2 documentation. http://docs.cython.org/src/userguide/debugging.html
