Parallelization of Sequential Applications Using .NET Framework 4.5


Parallelization of Sequential Applications using .NET Framework 4.5
An evaluation, application and discussion about the parallel extensions of the .NET managed concurrency library.

JOHAN LITSFELDT

Master of Science Thesis (Degree project in Program System Technology)
KTH Information and Communication Technology, Stockholm, Sweden 2013
Supervisor: Mats Brorsson, Marika Engström
Examiner: Mats Brorsson
TRITA-ICT-EX-2013:160

Abstract

Modern processor construction has taken a new turn in that adding more cores to processors appears to be the norm instead of simply relying on clock speed improvements. Much of the responsibility for writing efficient applications has thus been moved from the hardware designers to the software developers. However, issues related to scalability, synchronization, data dependencies and debugging make this troublesome for developers.

By using the .NET Framework 4.5, many of the mentioned issues are alleviated through the use of the parallel extensions, including TPL, PLINQ and other constructs designed specifically for highly concurrent applications. Analysis, profiling and debugging of parallel applications have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.

In this thesis, the parallel extensions as well as explicit threading techniques are evaluated, along with a parallelization attempt on an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent, but even so the parallel extensions proved useful, as the running times of the parallelized parts were lowered by a factor of about 3.5–4.1.

Sammanfattning (Swedish abstract, translated to English)

Parallelization of Sequential Applications with .NET Framework 4.5

Modern processor design has taken a turn in that the norm now appears to be adding more cores to processors instead of relying on increases in clock speed. Much of the responsibility has thus been moved from the hardware manufacturers to software developers. Scalability, synchronization, data dependencies and debugging can, however, make this troublesome for developers.

Many of the problems mentioned above can be alleviated by using the .NET Framework 4.5 and the new parallel extensions, which include TPL, PLINQ and other concepts designed for highly concurrent applications. Analysis, profiling and debugging have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.

The parallel extensions, as well as explicit threading techniques, are evaluated in this thesis together with an attempt to parallelize an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent, but even so the parallel extensions proved useful, as the running time of the parallelized parts could be reduced by a factor of about 3.5–4.1.

Contents

List of Tables
List of Figures

I Background
  1 Introduction
    1.1 The Modern Processor
    1.2 .NET Framework 4.5 Overview
      1.2.1 Common Language Runtime
      1.2.2 The Parallel Extension
    1.3 Problem Definition
    1.4 Motivation

II Theory
  2 Modern Processor Architectures
    2.1 Instruction Handling
    2.2 Memory Principles
    2.3 Threads and Processes
    2.4 Multi Processor Systems
      2.4.1 Parallel Processor Architectures
      2.4.2 Memory Architectures
      2.4.3 Simultaneous Multi-threading
  3 Parallel Programming Techniques
    3.1 General Concepts
      3.1.1 When to go Parallel
      3.1.2 Overview of Parallelization Steps
    3.2 .NET Threading
      3.2.1 Using Threads
      3.2.2 The Thread Pool
      3.2.3 Blocking and Spinning
      3.2.4 Signaling Constructs
      3.2.5 Locking Constructs
    3.3 Task Parallel Library
      3.3.1 Task Parallelism
      3.3.2 The Parallel Class
    3.4 Parallel LINQ
    3.5 Thread-safe Data Collections
    3.6 Cancellation
    3.7 Exception Handling
  4 Patterns of Parallel Programming
    4.1 Parallel Loops
    4.2 Forking and Joining
      4.2.1 Recursive Decomposition
      4.2.2 The Parent/Child Relation
    4.3 Aggregation and Reduction
      4.3.1 Map/Reduce
    4.4 Futures and Continuation Tasks
    4.5 Producer/Consumer
      4.5.1 Pipelines
    4.6 Asynchronous Programming
      4.6.1 The async and await Modifiers
    4.7 Passing Data
  5 Analysis, Profiling and Debugging
    5.1 Application Analysis
    5.2 Visual Studio 2012
      5.2.1 Debugging Tools
      5.2.2 Concurrency Visualizer
    5.3 Common Performance Sinks
      5.3.1 Load Balancing
      5.3.2 Data Dependencies
      5.3.3 Processor Oversubscription
      5.3.4 Profiling

III Implementation
  6 Pre-study
    6.1 The Local Traffic Prescription (LTF)
    6.2 Application Design Overview
    6.3 Choice of Method
    6.4 Application Profile
      6.4.1 Hardware
      6.4.2 Performance Overview
      6.4.3 Method Calls
  7 Parallel Database Concepts
    7.1 I/O Parallelism
    7.2 Interquery Parallelism
    7.3 Intraquery Parallelism
    7.4 Query Optimization
  8 Solution Design
    8.1 Problem Decomposition
    8.2 Applying Patterns
      8.2.1 Explicit Threading Approach
      8.2.2 Thread Pool Queuing Approach
      8.2.3 TPL Approaches
      8.2.4 PLINQ Approach

IV Execution
  9 Results
    9.1 Performance Survey
    9.2 Discussion
      9.2.1 Implementation Analysis
      9.2.2 Parallel Extensions Design Analysis
    9.3 Conclusion
    9.4 Future Work

V Appendices
  A Code Snippets
    A.1 The ReadData Method
    A.2 The ThreadObject Class
  B Raw Data
    B.1 Different Technique Measurements
    B.2 Varying the Thread Count using Explicit Threading
    B.3 Varying the Core Count using TPL

Bibliography

List of Tables

2.1 Specifications for some common CPU architectures. [1]
2.2 Typical specifications for different memory units (as of 2013). [2]
2.3 Specifications for some general processor classifications.
3.1 Typical overheads for threads in .NET. [3]
3.2 Typical overheads for signaling constructs. [4]
3.3 Properties and typical overheads for locking constructs. [4]
6.1 Processor specifications of the hardware used for evaluating the application.
6.2 Other hardware specifications of the hardware used for evaluating the application.
6.3 Methods with respective exclusive instrumentation percentages.
6.4 Methods with respective CPU exclusive samplings.
9.1 Methods with the highest elapsed exclusive time spent executing for the sequential and TPL-based approaches.
9.2 Perceived difficulty and applicability levels of implementing the techniques of parallelization.
B.1 Running times measured using the different techniques of parallelization.
B.2 Running time measurements using explicit threading with different numbers of threads.
B.3 Running time measurements using TPL with different numbers of cores.

List of Figures

1.1 Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to a great extent in this thesis. [3]
1.2 A high level flow graph of typical parallel execution in the .NET Framework. [5]
2.1 Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]
3.1 Thread 1 steals work from thread 2 since both its local- as well ...
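The abstract and the contents above refer to the Task Parallel Library (section 3.3, including the Parallel class) and to Parallel LINQ (section 3.4) without showing any code on this page. The following is a minimal, illustrative C# sketch of what those two constructs look like; it is not code taken from the thesis, and the ProcessRecord method and the records array are hypothetical placeholders.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelExtensionsSketch
{
    // Hypothetical CPU-bound work item; stands in for the thesis's real workload.
    static int ProcessRecord(int record)
    {
        return record * record;
    }

    static void Main()
    {
        int[] records = Enumerable.Range(0, 1000).ToArray();

        // TPL data parallelism: Parallel.ForEach partitions the input and runs
        // the body on thread pool worker threads.
        Parallel.ForEach(records, record =>
        {
            ProcessRecord(record);
        });

        // PLINQ: AsParallel() turns an ordinary LINQ query into a parallel one;
        // output order is not guaranteed unless AsOrdered() is added.
        int[] results = records.AsParallel()
                               .Select(r => ProcessRecord(r))
                               .ToArray();

        Console.WriteLine("Processed {0} records.", results.Length);
    }
}
```

Parallel.ForEach hides the partitioning and thread management, while AsParallel() lets the runtime choose a degree of parallelism for the query; these are the kinds of constructs chapters 3 and 4 of the thesis evaluate.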
Recommended publications
  • Technical Report Aaron Councilman
    Extensible Parallel Programming in ableC
    Aaron Councilman, Department of Computer Science and Engineering, University of Minnesota, Twin Cities. May 23, 2019

    1 Introduction

    There are many different manners of parallelizing code, and many different languages that provide such features. Different types of computations are best suited by different types of parallelism. Simply whether a computation is compute bound or I/O bound determines whether the computation will benefit from being run with more threads than the machine has cores, and other properties of a computation will similarly affect how it performs when run in parallel. Thus, to provide parallel programmers the ability to deliver the best performance for their programs, the ability to choose the parallel programming abstractions they use is important. The ability to combine these abstractions however they need is also important, since different parts of a program will have different performance properties, and therefore may perform best using different abstractions. Unfortunately, parallel programming languages are often designed monolithically, built as an entire language with a specific set of features. Because of this, a programmer's choice of parallel programming abstractions is generally limited to the choice of the language to use. Beyond limiting the available abstractions, this also means that the choice of abstractions must be made ahead of time, since any attempt to change the parallel programming language at a later time is likely to be prohibitive, as it may require rewriting large portions of the codebase, if not the entire codebase. Extensible programming languages can offer a solution to these problems. With an extensible compiler, the programmer chooses a base programming language and can then select the set of "extensions" for that language that best fit their needs.
  • Intel® Cilk™ Plus
    Overview: Programming Environment for Intel® Xeon Phi™ Coprocessor. One Source Base, Tuned to Many Targets. [Diagram: source code, compilers, libraries and parallel models targeting multicore CPUs, the many-core Intel® MIC architecture, and multicore/many-core clusters.] Copyright © 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

    Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 (phase, product, feature, benefit):
    • Intel® Advisor XE (Studio products only): threading design assistant; simplifies, demystifies, and speeds parallel application design.
    • Build: Intel® Composer XE: C/C++ and Fortran compilers, Intel® Threading Building Blocks, Intel® Cilk™ Plus, Intel® Integrated Performance Primitives, Intel® Math Kernel Library; an enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core.
    • Build: Intel® MPI Library: high performance message passing (MPI) library; enables scalability, interconnect independence, runtime fabric selection, and application tuning capability.
    • Verify & Tune: Intel® VTune™ Amplifier XE: performance profiler for optimizing application performance and scalability; removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks.
    • Verify & Tune: Intel® Inspector XE: memory and threading dynamic analysis for code quality, plus static analysis for code quality; increases productivity and code quality, lowers cost, finds memory, threading and security defects.
  • Part V Some Broad Topic
    Part V: Some Broad Topics

    About This Presentation: This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami. First edition released Spring 2005; revised Spring 2006, Fall 2008, Fall 2010, Winter 2013, Winter 2014, Winter 2016, Winter 2019*, Winter 2021* (*Chapters 17-18 only).

    V Some Broad Topics. Study topics that cut across all architectural classes:
    • Mapping computations onto processors (scheduling)
    • Ensuring that I/O can keep up with other subsystems
    • Storage, system, software, and reliability issues
    Topics in this part: Chapter 17, Emulation and Scheduling; Chapter 18, Data Storage, Input, and Output; Chapter 19, Reliable Parallel Processing; Chapter 20, System and Software Issues.

    17 Emulation and Scheduling. Mapping an architecture or task system onto an architecture:
    • Learn how to achieve algorithm portability via emulation
    • Become familiar with task scheduling in parallel systems
    Topics in this chapter: 17.1 Emulations Among Architectures; 17.2 Distributed Shared Memory; 17.3 The Task Scheduling Problem; 17.4 A Class of Scheduling Algorithms; 17.5 Some Useful Bounds for Scheduling; 17.6 Load Balancing and Dataflow Systems.

    17.1 Emulations Among Architectures. Need for scheduling: a. ...
  • An Introduction to Parallel Programming
    In Praise of An Introduction to Parallel Programming

    With the coming of multicore processors and the cloud, parallel computing is most certainly not a niche area off in a corner of the computing world. Parallelism has become central to the efficient use of resources, and this new textbook by Peter Pacheco will go a long way toward introducing students early in their academic careers to both the art and practice of parallel computing.
    Duncan Buell, Department of Computer Science and Engineering, University of South Carolina

    An Introduction to Parallel Programming illustrates fundamental programming principles in the increasingly important area of shared-memory programming using Pthreads and OpenMP and distributed-memory programming using MPI. More important, it emphasizes good programming practices by indicating potential performance pitfalls. These topics are presented in the context of a variety of disciplines, including computer science, physics, and mathematics. The chapters include numerous programming exercises that range from easy to very challenging. This is an ideal book for students or professionals looking to learn parallel programming skills or to refresh their knowledge.
    Leigh Little, Department of Computational Science, The College at Brockport, The State University of New York

    An Introduction to Parallel Programming is a well-written, comprehensive book on the field of parallel computing. Students and practitioners alike will appreciate the relevant, up-to-date information. Peter Pacheco's very accessible writing style, combined with numerous interesting examples, keeps the reader's attention. In a field that races forward at a dizzying pace, this book hangs on for the wild ride covering the ins and outs of parallel hardware and software.
  • Lecture Notes: Intel Xeon Phi Coprocessor - An Overview
    C-DAC Four Days Technology Workshop on Hybrid Computing: Coprocessors/Accelerators, Power-Aware Computing, Performance of Application Kernels (hyPACK-2013). Mode 3: Intel Xeon Phi Coprocessors. Lecture Notes: Intel Xeon Phi Coprocessor, An Overview. Venue: CMSD, UoHYD; Date: October 15-18, 2013.

    Lecture outline (an overview of the programming environment on the Intel Xeon Phi); the following topics will be discussed:
    • Understanding the Intel Xeon Phi coprocessor architecture
    • Programming on the Intel Xeon Phi coprocessor
    • Performance issues on the Intel Xeon Phi coprocessor

    Part I, background: the Intel Xeon host, multi-core, and systems with devices.

    Programming paradigms and challenges; large scale data computing, current trends; observations on how to run programs faster. Computing has moved from the era of sequential, single-core processing to multi-to-many core and heterogeneous computing. Running programs faster requires fast access to data (fetch/store), a fast processor (compute), and more memory to manage data.

    Multi-threaded processing using Hyper-Threading Technology: the time taken to process n threads on a single processor is significantly more than on a single-processor system with HT technology enabled. [Figure: plain multithreading runs threads T0-T5 one at a time on the CPU, while a Hyper-Threading processor exposes two logical processors (LP0, LP1) and runs two threads per processor. Source: http://www.intel.com; references [6], [29], [31]. P/C: microprocessor and cache; SM: shared memory.]

    Relationship among processors, processes, and threads ...
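The hyper-threading figure summarized above shows each physical core exposing two logical processors (LP0, LP1), so the operating system sees twice as many processors as there are cores. Below is a small C# sketch, assuming only standard .NET APIs, of how managed code typically reads the logical processor count and caps its degree of parallelism to avoid oversubscribing compute-bound work; the loop body is a placeholder.

```csharp
using System;
using System.Threading.Tasks;

class ProcessorCountSketch
{
    static void Main()
    {
        // Environment.ProcessorCount reports logical processors, so a 4-core CPU
        // with Hyper-Threading enabled typically shows up as 8.
        int logicalProcessors = Environment.ProcessorCount;
        Console.WriteLine("Logical processors: {0}", logicalProcessors);

        // Capping the degree of parallelism at the logical processor count is one
        // way to avoid oversubscribing the CPU with compute-bound work.
        var options = new ParallelOptions { MaxDegreeOfParallelism = logicalProcessors };
        Parallel.For(0, 100, options, i =>
        {
            double dummy = Math.Sqrt(i);   // placeholder for CPU-bound work
        });
    }
}
```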
  • The Relevance of OpenCL to HPC
    The Relevance of OpenCL to HPC
    Paul Preney, OCT, M.Sc., B.Ed., B.Sc., [email protected], School of Computer Science, University of Windsor, Windsor, Ontario, Canada. Copyright © 2015 Paul Preney. All Rights Reserved. March 4, 2015.

    Announced Abstract: Today's high performance computing programs are evolving to be increasingly more parallel, increasingly more wait-free, increasingly deployed on many different kinds of hardware including general purpose CPUs, GPGPUs, FPGAs, other custom specific-purpose hardware, etc. The OpenCL standards are platforms providing interfaces that enable deployment of programs to virtually any heterogeneous computing device. The OpenCL standard defines a highly-vectorizable programming language, OpenCL C, which enables the deployment of programming logic to arbitrary hardware without requiring low-level, "machine-coding" knowledge of such. The OpenCL standard is a critical component of exascale initiatives given that it is hardware neutral, with significant support and participation from all the major processor vendors. Unfortunately the main source of information about OpenCL is in the form of its final specifications, so there is a lot of misinformation about it. This talk will explain the relevance of the OpenCL standard to the HPC community, and offer a glimpse into what high-level abstractions for OpenCL, under development by software engineers, might look like.

    Presentation Overview: 1. HPC and OpenCL; 2. The Relevance of OpenCL to HPC; 3. A Glimpse Into Possible High-Level Abstractions; 4. The Future and Discussion; 5. References. Table of Contents: 1. ...
  • Speculative Parallelism in Cilk++
    Speculative Parallelism in Cilk++
    Ruben Perez (MIT, [email protected]); Gregory Malecha (Harvard University SEAS, [email protected])

    ABSTRACT: Backtracking search algorithms are useful in many domains, from SAT solvers to artificial intelligences for playing games such as chess. Searching disjoint branches can, inherently, be done in parallel, though it can considerably increase the amount of work that the algorithm does. Such parallelism is speculative: once a solution is found, additional work is irrelevant, but the individual branches each have their own potential to find the solution. In sequential algorithms, heuristics are used to prune regions of the search space. In parallel implementations this pruning often corresponds to aborting existing computations that can be shown to be pursuing dead-ends. While some systems provide native support for aborting work, Intel's current parallel extensions to C++, implemented in the cilk++ compiler [Lei09], do not.

    ... and are based around backtracking search, often optimized with heuristics. Algorithms such as these can benefit tremendously from parallel processing because separate threads can be used to search independent paths in parallel. We call this strategy speculative parallelism because only one solution matters and as soon as that is found, all other work can be stopped. While simple to describe, this use of parallelism is not always easy to implement. One parallel programming platform, cilk5 [FLR98], included language support for this type of parallelism, but it has since been removed in more recent versions of the platform adapted to C++, cilk++ [Lei09]. Our goal is to regain the power of speculative execution in a ...
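The excerpt describes speculative parallelism: independent branches of a backtracking search run concurrently, and as soon as one branch finds a solution the remaining work is aborted. In .NET the abort is expressed with cooperative cancellation (cf. section 3.6 of the thesis above). The following is a hedged C# sketch of that pattern; the Search method and its hard-coded "winning branch" are invented placeholders, not taken from the paper.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class SpeculativeSearchSketch
{
    // Invented stand-in for one branch of a backtracking search:
    // returns a solution string, or null if this branch is a dead end.
    static string Search(int branch, CancellationToken token)
    {
        for (int step = 0; step < 1000000; step++)
        {
            token.ThrowIfCancellationRequested();     // give up if another branch already won
            if (branch == 3 && step == 500000)        // pretend branch 3 finds the answer
                return "solution found in branch " + branch;
        }
        return null;
    }

    static void Main()
    {
        var cts = new CancellationTokenSource();
        string solution = null;

        // Speculatively explore eight branches in parallel.
        Task[] tasks = Enumerable.Range(0, 8).Select(b => Task.Run(() =>
        {
            string result = Search(b, cts.Token);
            if (result != null)
            {
                // The first branch to succeed records its result and aborts the rest.
                Interlocked.CompareExchange(ref solution, result, null);
                cts.Cancel();
            }
        }, cts.Token)).ToArray();

        try { Task.WaitAll(tasks); }
        catch (AggregateException) { /* cancelled branches surface here; ignore them */ }

        Console.WriteLine(solution ?? "no solution found");
    }
}
```

Because the cancellation is cooperative, each branch must poll the token; this is the .NET analogue of the "abort" support the paper says cilk++ lacks.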
  • Parallel Programming with MatlabMPI
    Parallel Programming with MatlabMPI
    Jeremy Kepner ([email protected]), MIT Lincoln Laboratory, Lexington, MA 02420. July 23, 2001

    Abstract: MatlabMPI is a Matlab implementation of the Message Passing Interface (MPI) standard and allows any Matlab program to exploit multiple processors. MatlabMPI currently implements the basic six functions that are the core of the MPI point-to-point communications standard. The key technical innovation of MatlabMPI is that it implements the widely used MPI "look and feel" on top of standard Matlab file I/O, resulting in an extremely compact (~100 lines) and "pure" implementation which runs anywhere Matlab runs. The performance has been tested on both shared and distributed memory parallel computers. MatlabMPI can match the bandwidth of C based MPI at large message sizes. A test image filtering application using MatlabMPI achieved a speedup of ~70 on a parallel computer.

    1 Introduction

    Matlab [1] is the dominant programming language for implementing numerical computations and is widely used for algorithm development, simulation, data reduction, testing and system evaluation. Many of these computations can benefit from faster execution on a parallel computer. There have been many previous attempts to provide an efficient mechanism for running Matlab programs on parallel computers [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. These efforts have faced numerous challenges and none have received widespread acceptance. In the world of parallel computing the Message Passing Interface (MPI) [2] is the de facto standard for implementing programs on multiple processors.
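The key technical point of the excerpt is that MatlabMPI layers MPI-style send/receive semantics on top of ordinary file I/O in a shared directory. The C# sketch below illustrates that mechanism in miniature; it is not MatlabMPI's actual protocol, and the file-naming scheme, flag-file convention and rank numbers are assumptions made only for illustration.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Minimal file-based point-to-point messaging, imitating the MatlabMPI idea of
// layering an MPI-like send/receive on top of shared-file I/O (names are illustrative).
class FileMessagingSketch
{
    static string DataFile(string dir, int src, int dst, int tag)
    {
        return Path.Combine(dir, string.Format("msg_{0}_{1}_{2}.dat", src, dst, tag));
    }

    static string FlagFile(string dir, int src, int dst, int tag)
    {
        return Path.Combine(dir, string.Format("msg_{0}_{1}_{2}.flag", src, dst, tag));
    }

    static void Send(string dir, int src, int dst, int tag, string payload)
    {
        File.WriteAllText(DataFile(dir, src, dst, tag), payload); // write the message body first
        File.WriteAllText(FlagFile(dir, src, dst, tag), "");      // then the flag that signals completion
    }

    static string Receive(string dir, int src, int dst, int tag)
    {
        while (!File.Exists(FlagFile(dir, src, dst, tag)))        // poll until the sender is done
            Thread.Sleep(10);
        return File.ReadAllText(DataFile(dir, src, dst, tag));
    }

    static void Main()
    {
        string dir = Path.Combine(Path.GetTempPath(), "file-mpi-demo");
        if (Directory.Exists(dir)) Directory.Delete(dir, true);   // clear any stale messages
        Directory.CreateDirectory(dir);

        // "Rank 0" sends, "rank 1" receives; here both ranks are just tasks in one process.
        var sender = Task.Run(() => Send(dir, 0, 1, tag: 7, payload: "hello from rank 0"));
        var receiver = Task.Run(() => Receive(dir, 0, 1, tag: 7));

        Task.WaitAll(sender, receiver);
        Console.WriteLine(receiver.Result);
    }
}
```

Writing the data file before the flag file is what makes the polling receiver safe: the flag only appears once the payload is complete.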
  • United States Patent No. US 8,209,664 B2, Yu et al.
    (12) United States Patent, Yu et al. (10) Patent No.: US 8,209,664 B2. (45) Date of Patent: Jun. 26, 2012.
    (54) HIGH LEVEL PROGRAMMING EXTENSIONS FOR DISTRIBUTED DATA PARALLEL PROCESSING
    (75) Inventors: Yuan Yu, Cupertino, CA (US); Ulfar Erlingsson, Reykjavik (IS); Michael A. Isard, San Francisco, CA (US); Frank McSherry, San Francisco, CA (US)
    (73) Assignee: Microsoft Corporation, Redmond, WA (US)
    (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 704 days.
    (21) Appl. No.: 12/406,826. (22) Filed: Mar. 18, 2009.

    Cited U.S. patent documents: 2008/0086442 A1, 4/2008, Dasdan et al.; 2008/0098375 A1, 4/2008, Isard; 2008/0120314 A1, 5/2008, Yang et al.; 2008/0127200 A1, 5/2008, Richards et al.; 2008/0250227 A1, 10/2008, Linderman et al.; 2009/0043803 A1*, 2/2009, Frishberg et al., 707/102; 2009/0300656 A1*, 12/2009, Bosworth et al., 719/320.

    OTHER PUBLICATIONS: Blelloch, Guy E., et al., "Implementation of a Portable Nested Data Parallel Language," School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Feb. 1993, Abstract, pp. 1-28. Isard, Michael, et al., "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," Microsoft Research, Silicon Valley, http://research.microsoft.com/research/dv/dryad/eurosys07.pdf, Mar. 23, 2007, 14 pages. Ferreira, Renato, et al., "Data parallel language and compiler support for data intensive applications," Elsevier Science B.V., Parallel Computing, vol. 28, Issue 5, May 2002, 3 pages. Carpenter, Bryan, et al., "HPJava: Data Parallel Extensions to Java," NPAC at Syracuse University, Syracuse, NY, Feb. ...
  • Parallel Programming in OpenMP: About the Authors
    Parallel Programming in OpenMP

    About the Authors

    Rohit Chandra is a chief scientist at NARUS, Inc., a provider of internet business infrastructure solutions. He previously was a principal engineer in the Compiler Group at Silicon Graphics, where he helped design and implement OpenMP.

    Leonardo Dagum works for Silicon Graphics in the Linux Server Platform Group, where he is responsible for the I/O infrastructure in SGI's scalable Linux server systems. He helped define the OpenMP Fortran API. His research interests include parallel algorithms and performance modeling for parallel systems.

    Dave Kohr is a member of the technical staff at NARUS, Inc. He previously was a member of the technical staff in the Compiler Group at Silicon Graphics, where he helped define and implement OpenMP.

    Dror Maydan is director of software at Tensilica, Inc., a provider of application-specific processor technology. He previously was an engineering department manager in the Compiler Group of Silicon Graphics, where he helped design and implement OpenMP.

    Jeff McDonald owns SolidFX, a private software development company. As the engineering department manager at Silicon Graphics, he proposed the OpenMP API effort and helped develop it into the industry standard it is today.

    Ramesh Menon is a staff engineer at NARUS, Inc. Prior to NARUS, Ramesh was a staff engineer at SGI, representing SGI in the OpenMP forum. He was the founding chairman of the OpenMP Architecture Review Board (ARB) and supervised the writing of the first OpenMP specifications.

    Parallel Programming in OpenMP, by Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon. Senior Editor Denise E. ...
  • DryadLINQ for Scientific Analyses
    DryadLINQ for Scientific Analyses
    Jaliya Ekanayake (1,a), Atilla Soner Balkir (c), Thilina Gunarathne (a), Geoffrey Fox (a,b), Christophe Poulain (d), Nelson Araujo (d), Roger Barga (d)
    (a) School of Informatics and Computing, Indiana University Bloomington; (b) Pervasive Technology Institute, Indiana University Bloomington; (c) Department of Computer Science, University of Chicago; (d) Microsoft Research
    {jekanaya, tgunarat, gcf}@indiana.edu, [email protected], {cpoulain,nelson,barga}@microsoft.com

    Abstract: Applying high level parallel runtimes to data/compute intensive applications is becoming increasingly common. The simplicity of the MapReduce programming model and the availability of open source MapReduce runtimes such as Hadoop, attract more users around MapReduce programming model. Recently, Microsoft has released DryadLINQ for academic use, allowing users to experience a new programming model and a runtime that is capable of performing large scale data/compute intensive analyses. In this paper, we present our experience in applying DryadLINQ for a series of scientific data analysis applications, identify their mapping to the DryadLINQ programming model, and compare their ...

    ... data-centered approach to parallel processing. In these frameworks, the data is staged in data/compute nodes of clusters or large-scale data centers and the computations are shipped to the data in order to perform data processing. HDFS allows Hadoop to access data via a customized distributed storage system built on top of heterogeneous compute nodes, while DryadLINQ and CGL-MapReduce access data from local disks and shared file systems. The simplicity of these programming models enables better support for quality of services such as fault tolerance and monitoring. Although DryadLINQ comes with standard samples such as Terasort, word count, its applicability for large scale data/compute intensive scientific applications is not studied well.
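DryadLINQ expresses distributed data-parallel computations as LINQ queries over partitioned data. As a rough single-machine analogue (not the DryadLINQ API itself), the same query shape can be written with PLINQ in C#; the word count below mirrors one of the standard samples the excerpt mentions, and the input lines are made up for the example.

```csharp
using System;
using System.Linq;

class WordCountSketch
{
    static void Main()
    {
        string[] lines =
        {
            "the quick brown fox",
            "jumps over the lazy dog",
            "the dog barks"
        };

        // LINQ query shape similar to what DryadLINQ would distribute over a cluster;
        // AsParallel() here only spreads the work over local cores.
        var counts = lines
            .AsParallel()
            .SelectMany(line => line.Split(' '))
            .GroupBy(word => word)
            .Select(g => new { Word = g.Key, Count = g.Count() })
            .OrderByDescending(x => x.Count);

        foreach (var entry in counts)
            Console.WriteLine("{0}: {1}", entry.Word, entry.Count);
    }
}
```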
  • Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models
    Linköping Studies in Science and Technology, Licentiate Thesis No. 1716

    Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models
    by Mahder Gebremedhin
    Department of Computer and Information Science, Linköping University, SE-581 83 Linköping, Sweden. Linköping 2015

    This is a Swedish Licentiate's Thesis. Swedish postgraduate education leads to a doctor's degree and/or a licentiate's degree. A doctor's degree comprises 240 ECTS credits (4 years of full-time studies). A licentiate's degree comprises 120 ECTS credits.

    Copyright © 2015 Mahder Gebremedhin
    ISBN 978-91-7519-048-8, ISSN 0280-7971
    Printed by LiU Tryck 2015
    URL: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-117338

    Abstract

    The move from single-core processor systems to multi-core and many-processor systems comes with the requirement of implementing computations in a way that can utilize these multiple computational units efficiently. This task of writing efficient parallel algorithms will not be possible without improving programming languages and compilers to provide the supporting mechanisms. Computer aided mathematical modeling and simulation is one of the most computationally intensive areas of computer science. Even simplified models of physical systems can impose a considerable computational load on the processors at hand. Being able to take advantage of the potential computational power provided by multi-core systems is vital in this area of application. This thesis tries to address how to take advantage of the potential computational power provided by these modern processors in order to improve the performance of simulations, especially for models in the Modelica modeling language compiled and simulated using the OpenModelica compiler and run-time environment.