ISSN XXXX XXXX © 2017 IJESC

Research Article, Volume 7, Issue No. 3

Automatic Parallelization Tools: A Review

Varsha K. R
Department of CSE, RV College of Engineering, Bangalore, India

Abstract: With the advent of multi-core architectures, the concept of parallel programming is no longer restricted to supercomputers and enterprise-level applications alone. Programmers are always ready to take advantage of the available potential of programs to be parallelized. The objective of this paper is to review various tools that are available for performing automatic parallelization.

Keywords: Multi-core, DOALL, Loop-level Parallelism, concurrency, data-dependence, profiling

I. INTRODUCTION

The rise of the concept of parallel or concurrent programming has led to the development of multi-core architectures on a wider range. Programmers find parallel programs executing on multiple cores or processors to be more efficient than serial ones. But it is a tedious task to write parallel programs directly or to convert all existing serial programs to parallel ones. This limitation can be eliminated by automating the procedure: tools have been developed which accept a serial code and automatically generate a parallel code by inserting suitable parallel constructs.

A. ARCHITECTURES
According to Flynn's classification, computer architectures can be classified into four classes: SISD (Single Instruction Single Data), SIMD (Single Instruction Multiple Data), MISD (Multiple Instruction Single Data) and MIMD (Multiple Instruction Multiple Data) [1]. SISD is the normal execution procedure, with a single instruction executed on a single data stream. SIMD architectures execute a single (same) instruction on multiple data instances. MISD is rarely seen in real scenarios. MIMD executes multiple instructions on multiple data streams.

B. PARALLEL PROGRAMMING LANGUAGES
There is a wide variety of languages available for writing parallel codes. OpenMP is the most widely used parallelization construct and is available in the form of APIs; OpenMP pragmas can be inserted into existing codes with little or no modification to the source. MPI is a framework used for message-passing programs. For electronics applications, languages such as Verilog and VHDL are used for event-driven and hardware-description applications. Elixir and Erlang are examples of languages for concurrent programming. For GPU applications, the CUDA framework is used to write programs which execute in parallel on GPUs. In recent times, parallel execution can also be seen in platforms such as Apache Hadoop and Apache Spark, in the form of Map-Reduce and in-memory processing respectively.

II. OBJECTIVE

The objective of this paper is to review various tools that are available for automatic parallelization, the procedure employed by the tools, the source codes that each tool accepts, the parallel codes generated by each tool, and other distinctive features.

III. EXISTING MODELS

The general procedure for these tools is to take a sequential code as input and generate parallel code. From the sequential code, a dependency matrix or graph is constructed which stores the dependency information that exists between consecutive statements in a loop. The next step is to detect code sections which have potential parallelism. Finally, the tools generate parallel code by inserting parallel programming constructs/directives into potential DOALL loops. A loop can be considered DOALL if all its iterations can be executed concurrently without any conflicts; the loop cannot be DOALL if it has a loop-carried dependence (LCD). The tools surveyed below differ in the procedure employed to carry out each of these three phases.
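As a concrete illustration of this criterion (hand-written for this review, not produced by any of the tools surveyed below; the function and array names are arbitrary), the first loop below is DOALL and can simply be annotated with an OpenMP directive, while the second carries an LCD and must remain sequential:

    /* DOALL: each iteration writes a distinct element of a and reads
       only b and c, so all iterations may execute concurrently. */
    void doall(double *a, const double *b, const double *c, long n) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    /* Not DOALL: iteration i reads a[i-1], which iteration i-1 writes,
       i.e. a loop-carried dependence (LCD). */
    void lcd(double *a, const double *b, long n) {
        for (long i = 1; i < n; i++)
            a[i] = a[i-1] + b[i];
    }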

A. SEQUENTIAL CODE PARSING
The input code is parsed to extract the required features in a pre-defined form. There are two kinds of tools based on the way the code is parsed: compiler based and profiling based. A compiler based tool can easily support languages such as FORTRAN. A profiling based tool is preferred for C/C++ owing to the pointer-analysis limitations of compiler based tools: when it is not possible to keep track of the flow of data because of the usage of pointers, the profiling technique analyses the data dynamically at runtime.

B. DETECTION OF POTENTIAL CODE BLOCKS
Sections of code which are executed more frequently and carry no dependency among the sets executed one after the other are detected as blocks with potential to be parallelized. While checking dependency among statements, the application has to check for RAW (read after write) or true dependency, WAW (write after write) or output dependency, and WAR (write after read) or anti dependency.
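The three kinds of dependency can be seen in a small hand-written C fragment (the statement labels S1-S4 and the array names are purely illustrative):

    #include <stddef.h>

    void dependence_kinds(double *a, double *b, double *c, size_t n) {
        for (size_t i = 0; i < n; i++) {
            a[i] = b[i] + 1.0;   /* S1: writes a[i], reads b[i]           */
            c[i] = a[i] * 2.0;   /* S2: RAW (true) on a[i], written by S1 */
            b[i] = c[i] - 1.0;   /* S3: WAR (anti) on b[i], read by S1    */
            b[i] = 0.0;          /* S4: WAW (output) on b[i], from S3     */
        }
    }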

Incremental Analysis: In the incremental method, parsing is done in incremental steps instead of parsing the entire file at once. This procedure saves time because the parse structure can be analyzed while it is being built. It is equivalent to first parsing and then iterating over the file, except that parsing and iterating proceed in parallel, so that potential blocks are detected during parsing. This analysis method is used in the incremental call-graph generation scheme for IDEs.

C. GENERATION OF PARALLEL CODE
With the aid of the dependency structure created in the previous phase, potential blocks for parallelization are detected. In this step, parallelization constructs are inserted before the detected blocks, with small modifications made to a block where necessary. The scope of variables, the distribution of iterations among threads and the synchronization of threads are the important parameters when inserting parallelization directives.
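A hand-written sketch of a directive that fixes all three of these parameters explicitly is shown below; the clause choices are illustrative only and are not the output of any reviewed tool:

    /* Variable scoping (shared/private), distribution of iterations
       (schedule) and synchronization (reduction) made explicit. */
    double sum_of_squares(const double *x, long n) {
        double sum = 0.0;
        double t;
        #pragma omp parallel for shared(x) private(t) \
                schedule(static) reduction(+:sum)
        for (long i = 0; i < n; i++) {
            t = x[i] * x[i];   /* t is private: one copy per thread   */
            sum += t;          /* partial sums combined on loop exit  */
        }
        return sum;
    }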

IV. EXISTING TOOLS AND TECHNIQUES

Alchemist: Alchemist is a novel profiling system for parallelizing C codes. During the dependency-detection phase it provides a facility for manual intervention to break dependencies, but requires no assistance for the choice of code sections. It considers all programming constructs (procedures and loops among others), because potential parallelism can be extracted even from regions which are executed less frequently; this is called the generality of the tool [25].

AutoFutures: AutoFutures generates codes that are (a) recognizable, without much change from the original code, (b) built by identifying sets of recurring patterns with high recognition values, and (c) correct, i.e. the parallel codes give the same outputs as the serial codes. It provides parallelism at various levels of granularity. AutoFutures is similar to Alchemist in that the dependency graph generated also considers the distance between statements to evaluate the effectiveness gained by parallelization [36].

Capo: This tool generates FORTRAN codes with OpenMP directives and supports nested-level parallelism. It follows a four-step procedure to code analysis which combines both dependency analyses and directive insertion. The first step starts off by analyzing the outermost loop, and the second step involves identification in nested regions of the scope defined in the first level. However, the insertion of directives proceeds in the reverse direction, with second-level insertion being done first to maintain a proper state of variables before first-level insertion is done on the outer loops. It is based on CAP (Computer Aided Parallelization) Tools [29].

Cetus: Cetus is considered the most powerful and easy-to-use source-to-source compiler for C programs. It is a very fast tool for programs with smaller iteration counts in for loops and a minimal number of instructions in the loop body [4].

DProf: A profiling tool giving dynamic runtime information on the dependences that exist among the various memory accesses in a loop, and on the percentage of occurrences of these dependencies. Parallelization is done based on the information collected during execution [26].

EasyPar: EasyPar (EASY development of PARallel codes) is a unique tool which combines code development and code parallelization. The application includes an IDE which can be used for development; simultaneously, concurrency analysis takes place and gives parallelization suggestions. This tool eliminates most of the problems encountered during auto parallelization which require changes to be made to the source code. Here, the tool is interactive and hence the developer can witness the parallel constructs and make changes to the source code accordingly [24] [42].

GRED: GRED is short for Graphical Editor. It provides a visual representation of complex sections of code in GRADE environments, thus giving the user an insight into the program flow of parallel process execution [40].

ICU_PFC: A research compiler accepting FORTRAN source codes and generating parallel FORTRAN code with OpenMP directives [2].

Intel Compiler: Intel provides C++ and Fortran compilers supporting three features: the OpenMP API, auto parallelization and auto vectorization. The first two features are based on thread-level parallelism, and vectorization on instruction-level parallelism [6].

iPAT/OMP: It shows static analysis results and performance analysis data. It enables manual parallelizing and allows the user to execute parallelizing assistance functions in a commodity editor environment [14].

Kismet: The tool identifies the potential regions that can be parallelized. Along with identification, the tool determines the time reduction or speedup that can be gained by parallelizing. This upper-bound estimate is produced for a spectrum of core counts [32].

Kremlin: The tool is interactive with the user. When given a sequential code, it gives suggestions in the form of a list of code sections which can be parallelized. It considers coverage, self-parallelization and the time reduction occurring during the process, and provides the user or programmer with an opportunity to decide whether a section has to be modified according to requirements [31].

Loop Sampler: A loop-centric profile is used to detect parallelism in programs. The tool samples various loop executions and records the time spent in each of them. This is projected in a hierarchical view to provide ease of analysis [28].

Omni OpenMP Compiler: The compiler can be used for C and Fortran 77 programs and generates codes with OpenMP constructs. The tool supports distributed shared-memory systems by using cluster-enabled OpenMP [16].

OSCAR compiler: The tool was developed with a view to improving the performance, cost-effectiveness and productivity of software on SMPs. It accepts Fortran programs and checks for coarse-grained parallelism using static scheduling. The output is a parallel program in the OpenMP API [9].

Par4All: An open-source project aiming to provide a parallelization and optimization compiler. It accepts C and FORTRAN sequential programs and supports generation of OpenMP, CUDA and OpenCL parallel programs [12].

PDT-PLDT: This is more of an analysis tool used in the Eclipse environment. It supports content analysis of MPI, OpenMP and LAPI programs. The static analysis tool can detect deadlocks through barrier analysis and concurrency analysis of the programs [39].
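Several of the tools in this list are profile based (DProf and Loop Sampler above; Prospector and SD3 below). The following self-contained sketch shows the kind of bookkeeping such profilers perform: a shadow table records the last iteration that wrote each address, and a read of a value written by an earlier iteration signals a loop-carried RAW. The table size, hashing scheme and helper names (rec_read, rec_write) are illustrative assumptions and correspond to no tool's actual implementation:

    #include <stddef.h>
    #include <stdio.h>

    #define SLOTS 1024                       /* toy shadow-memory size */
    static long last_write[SLOTS];           /* last iteration writing a slot */

    static size_t slot(const void *p) { return ((size_t)p >> 3) % SLOTS; }
    static void rec_write(const void *p, long it) { last_write[slot(p)] = it; }
    static int rec_read(const void *p, long it) {
        long w = last_write[slot(p)];
        return w >= 0 && w < it;   /* written by an earlier iteration:
                                      a loop-carried RAW dependence */
    }

    int main(void) {
        double a[64] = {1.0};
        long lcd = 0;
        for (size_t s = 0; s < SLOTS; s++) last_write[s] = -1;
        for (long i = 1; i < 64; i++) {      /* the instrumented loop */
            if (rec_read(&a[i - 1], i)) lcd++;
            a[i] = a[i - 1] + 1.0;
            rec_write(&a[i], i);
        }
        printf("loop-carried RAW on %ld of 63 iterations\n", lcd);
        return 0;
    }

A production profiler instruments every load and store (via compiler instrumentation or binary translation) and must also track WAR/WAW dependences and resolve hash collisions, which is precisely where the runtime and memory overheads addressed by SD3 arise.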

PLUTO: The tool is based on the polyhedral model, which provides the abstraction needed to carry out advanced transformations on loop nests. It is used for generating parallel C codes. The transformations of the code are performed simultaneously for data locality and parallelization [13].

Polaris: A FORTRAN 77 compiler that generates parallel FORTRAN codes. Unlike most other tools, Polaris generates parallel codes in multiple passes additional to the regular passes that other tools perform. The tool performs advanced operational tasks such as array privatization, data-dependence testing, induction variable recognition, interprocedural analysis and symbolic program analysis. As the conversion is done in multiple passes, iterating over the same code, it can be time consuming [3].

Portland Group Compilers: This set of tools from Portland Group, Inc. (PGI) has compilers for FORTRAN, C and C++. These tools support OpenMP 3.1 for multi-cores and OpenACC for accelerators. The C++ compiler also includes support for CUDA extensions. The Portland set also includes OpenMP- and MPI-based parallel debuggers [22].

Prospector: A profile-based tool which uses dynamic dependency results [27].

S2P (Serial to Parallel): A static analyzer tool supporting both task-level and loop-level parallelism. An additional feature of the tool is the 'run-ahead mechanism', where the 'if' statements and the corresponding 'else' statements are executed simultaneously, thus improving the performance of the tool [35].

SD3: The major problems occurring with profiling techniques are runtime and memory overheads. This scalable dynamic data-dependence profiling tool addresses both issues: runtime overhead is reduced by parallelizing the profiling step itself, and memory overhead is reduced by exploiting stride patterns in memory accesses [34].

SequenceL: A tool to parallelize C++ source codes with no explicit manual instruction from the user [19].

SGI Origin 2000 Compilers: Used for Fortran source codes. Designed to parallelize codes with simple structures and loops using MPI statements with minimal code modification [23].

SUIF (Stanford University Intermediate Format): The tool includes a loop-level parallelizer and a locality optimizer, and can be used for C and Fortran. It proceeds by maximizing the granularity of parallelism while minimizing communication and synchronization [5]. A limitation of the tool is that generation happens in multiple passes, making the procedure time-consuming [7].

Timing Architects Compiler: A simulation-based approach to improve task allocation as well as task parallelization. It is used for embedded C programs [17].

TRACO: The tool is implemented in Python for C/C++ applications. It uses the concepts of Iteration Space Slicing and the Free Schedule Framework. Loop dependencies are represented using relations, with the Omega calculator used for analysis [18].

Vector Fabrics: Gives insight into the program which is crucial for writing highly optimized parallel code. The tool performs data-dependency analysis and converts the code into parallel code for a particular architecture [37].

VISO: This occam-based tool uses the concepts of incremental parsing, incremental dependency analysis, incremental profiling and incremental flow-graph analysis [41].

YUCCA: YUCCA (User Code Parallelization Application) is developed by KPIT Technologies Ltd. [11].

V. CHALLENGES

A few tools, such as Alchemist, require manual intervention to break dependencies and to choose parallelization parameters. Another important factor is the field of view of the tool, i.e. the constructs that the tool detects as potential blocks: tools range from those detecting only one kind of loop to those which detect all programming constructs (e.g. Alchemist). The complexity of a tool depends on this range of detection.

VI. CONCLUSION

Around 40 tools are reviewed and the various methods employed by the tools are studied. It is observed that most of the tools follow the general three-phase procedure explained above. The most important phase is the detection of potential blocks, which is also the most time consuming. It can be observed that most tools are developed for the FORTRAN language, as it is easier to understand the operational flow and the objective of the code. It can also be noticed that the most widely used parallel construct is OpenMP, because it requires very little or no modification of the existing serial code, only the insertion of a few statements. There is a wide variety of tools on the market which can be used for auto parallelization. The tool has to be chosen wisely, considering the programming language of the code to be parallelized and the required parallelization model.

VII. REFERENCES

[1]. Mandeep Kaur and Rajdeep Kaur, "A Comparative Analysis of SIMD and MIMD Architectures," International Journal of Advanced Research in Computer Science and Software Engineering. Available online at: www.ijarcsse.com

[2]. H. Kim, Y. Yoon, S. Na, and D. Han, "ICU-PFC: An Automatic Parallelizing Compiler," International Conference on High Performance Computing in the Asia-Pacific Region (HPC-ASIA), 2000.

[3]. D. A. Padua, R. Eigenmann, J. Hoeflinger, P. Petersen, P. Tu, S. Weatherford, and K. Faigin, "Polaris: A new-generation parallelizing compiler for MPPs," Technical Report CSRD-1306, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, June 1993.

[4]. C. Dave, H. Bae, S. Min, S. Lee, R. Eigenmann, and S. Midkiff, "Cetus: A Source-to-Source Compiler Infrastructure for Multicores," IEEE Computer, vol. 42, no. 12, pp. 36-42, Dec. 2009.

[5]. M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam, "Maximizing Multiprocessor Performance with the SUIF Compiler," IEEE Computer, December 1996.

[6]. Intel compiler. Available: http://software.intel.com/enus/articles/intel-compilers

[7]. D. Kwon and S. Han, "MPI Backend for an Automatic Parallelizing Compiler," 1999 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '99), Fremantle, Australia, June 1999.

[8]. C. Nugteren, H. Corporaal, and B. Mesman, "Skeleton-based Automatic Parallelization of Image Processing Algorithms for GPUs," ICSAMOS, 2011, pp. 25-32.

[9]. H. Kasahara, M. Obata, K. Ishizaka, K. Kimura, H. Kaminaga, H. Nakano, K. Nagasawa, A. Murai, H. Itagaki, and J. Shirako, "Multigrain Automatic Parallelization in Japanese Millennium Project IT21 Advanced Parallelizing Compiler," PARELEC, IEEE Computer Society, 2002, pp. 105-111.

[10]. T. Miyamoto, S. Asaka, H. Mikami, M. Mase, Y. Wada, H. Nakano, K. Kimura, and H. Kasahara, "Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API," International Symposium on Parallel and Distributed Processing with Applications (ISPA '08), 2008, pp. 600-607.

[11]. KPIT Annual Report 2012-13.

[12]. C. D. Carothers, F. Coelho, B. Creusillet, L. Daverio, S. Even, J. Gall, S. Guelton, F. Irigoin, R. Keryell, and P. Villalon, "Par4All: Auto-Parallelizing C and Fortran for the CUDA Architecture," HPC Project, Mines ParisTech/CRI, Institut TÉLÉCOM/TÉLÉCOM Bretagne/HPCAS, Rensselaer Polytechnic Institute/CS.

[13]. Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan, "A Practical Automatic Polyhedral Parallelizer and Locality Optimizer."

[14]. Makoto Ishihara, Hiroki Honda, Toshitsugu Yuba, and Mitsuhisa Sato, "Interactive Parallelizing Assistance Tool for OpenMP: iPat/OMP," Fifth European Workshop on OpenMP (EWOMP '03), Aachen, Germany, September 22-26, 2003.

[15]. Siegfried Benkner, "VFC: The Vienna Fortran Compiler," Institute for Software Technology and Parallel Systems, University of Vienna, Liechtenstein Strasse 22, A-1090 Vienna, Austria.

[16]. Kazuhiro Kusano, Shigehisa Satoh, and Mitsuhisa Sato, "Performance Evaluation of the Omni OpenMP Compiler," RWCP Tsukuba Research Center, Real World Computing Partnership, Tsukuba, Ibaraki, Japan.

[17]. Peter Grun, Nikil D. Dutt, and Alexandru Nicolau, "Memory Architecture Exploration for Programmable Embedded Systems."

[18]. Marek Palkowski, "TRACO Parallelizing Compiler."

[19]. Per Andersen, "A Parallel Compiler for SequenceL," Ph.D. dissertation, Computer Science, Texas Tech University.

[20]. Michael Frumkin, Michelle Hribar, Haoqiang Jin, Abdul Waheed, and Jerry Yan, "A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin2000," MRJ Technology Solutions, Inc., NASA Ames Research Center.

[21]. C. S. Ierotheou, S. P. Johnson, M. Cross, and P. F. Leggett, "Computer Aided Parallelisation Tools (CAPTools): Conceptual Overview and Performance on the Parallelisation of Structured Mesh Codes," Parallel Computing, vol. 22, pp. 163-195, 1996.

[22]. "PGHPF," Portland Group, http://www.pgroup.com

[23]. Michael Frumkin, Michelle Hribar, Haoqiang Jin, Abdul Waheed, and Jerry Yan, "A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin2000," MRJ Technology Solutions, Inc., NASA Ames Research Center.

[24]. Sudhakar Sah and Vinay G. Vaidya, "A Review of Parallelization Tools and Introduction to Easypar," Symbiosis Institute of Research & Innovation (SIRI), Pune, India.

[25]. G. Tournavitis and B. Franke, "Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information," Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10), New York, NY, USA, pp. 377-388, ACM, 2010.

[26]. J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry, "A scalable approach to thread-level speculation," Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00), New York, NY, USA, pp. 1-12, ACM, 2000.

[27]. Aditi Athavale, Priti Ranadive, M. N. Babu, Prasad Pawar, Sudhakar Sah, Vinay G. Vaidya, and Chaitanya Rajguru, "Automatic Sequential to Parallel Code Conversion - The S2P Tool and Performance Analysis," Journal of Computing, GSTF, Oct 2011.

[28]. T. Moseley, D. A. Connors, D. Grunwald, and R. Peri, "Identifying potential parallelism via loop-centric profiling," Proceedings of the 4th International Conference on Computing Frontiers (CF '07), New York, NY, USA, pp. 143-152, ACM, 2007.

[29]. C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang, "Software Behavior Oriented Parallelization," PLDI '07, San Diego, CA, pp. 223-234.

[30]. Y. He, C. Leiserson, and W. Leiserson, "The Cilkview Scalability Analyzer," SPAA '10: Proceedings of the Symposium on Parallelism in Algorithms and Architectures, 2010.

[31]. Kathryn S. McKinley, "Automatic and Interactive Parallelization," Ph.D. thesis, Computer Science, Rice University, Houston, Texas, 1994, pp. 12-32.

[32]. D. Jeon, S. Garcia, C. Louie, and M. B. Taylor, "Kismet: parallel speedup estimates for serial programs," Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '11), New York, NY, USA, pp. 519-536, ACM, 2011.

[33]. Prism: an analysis exploration and verification environment for software implementation and optimization on multicore architectures from CriticalBlue. http://www.criticalblue.com

[34]. Minjang Kim, "Dynamic Program Analysis Algorithms to Assist Parallelization," Ph.D. thesis proposal, Georgia Institute of Technology, 2011, pp. 5-28.

[35]. Aditi Athavale, Priti Ranadive, M. N. Babu, Prasad Pawar, Sudhakar Sah, Vinay G. Vaidya, and Chaitanya Rajguru, "Automatic Sequential to Parallel Code Conversion - The S2P Tool and Performance Analysis," Journal of Computing, GSTF, Oct 2011.

[36]. Korbinian Molitorisz, Jochen Schimmel, and Frank Otto, "Automatic Parallelization Using AutoFutures," Multicore Software Engineering, Performance, and Tools, Lecture Notes in Computer Science, vol. 7303, 2012, pp. 78-81.

[37]. vfAnalyst: Analyze your sequential C code to create an optimized parallel implementation from VectorFabrics, http://www.vectorfabrics.com/

[38]. B.Q. Warber, C.R. Brode and F. L. Orlando, DEEP: a development environment for parallel programs, Parallel Processing Symposium, IPPS/SPDP (1998) 588-593.

[39]. Beth Tibbitts, PTP - PLDT Parallel Language Development Tools Overview, Status & Plans. Technical Report by IBM, (2007)

[40]. Peter Kacsuk, Gabor Dozsa, Tibor Fadgyas, and Robert Lovas, "The GRED Graphical Editor for the GRADE Parallel Program Development Environment," MTA-SZTAKI Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary.

[41]. Muhammed S. Al-Mulhem, "Concurrent programming in VISO," Concurrency: Practice and Experience, 2000, pp. 281-288.

[42]. S. Sah and Vinay G. Vaidya, "Paradyn: A Dynamic Parallel Programming Tool," 12th International Conference on Distributed Computing and Networking (ICDCN), PhD Forum, 2010; ParMA - Parallel Programming for Multi-core Architectures. Available: http://www.parma-itea2.org
