Parallelization of Sequential Applications using .NET Framework 4.5

An evaluation, application and discussion about the parallel extensions of the .NET managed concurrency library.

JOHAN LITSFELDT

KTH Information and Communication Technology

Master of Science Thesis Stockholm, Sweden 2013

TRITA-ICT-EX-2013:160

Parallelization of Sequential Applications using .NET Framework 4.5

An evaluation, application and discussion about the parallel extensions of the .NET managed concurrency library.

JOHAN LITSFELDT

Degree project in Program System Technology at KTH Information and Communication Technology
Supervisors: Mats Brorsson, Marika Engström
Examiner: Mats Brorsson

TRITA-ICT-EX-2013:160

Abstract

Modern processor construction has taken a new turn in that adding more cores to processors appears to be the norm instead of simply relying on clock speed improvements. Much of the responsibility for writing efficient applications has thus been moved from the hardware designers to the software developers. However, issues related to scalability, synchronization, data dependencies and debugging make this troublesome for developers.

By using the .NET Framework 4.5, a lot of the mentioned issues are alleviated through the use of the parallel extensions, including TPL, PLINQ and other constructs designed specifically for highly concurrent applications. Analysis, profiling and debugging of parallel applications have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.

In this thesis, the parallel extensions as well as explicit threading techniques are evaluated along with a parallelization attempt on an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent but even so, the parallel extensions proved useful as the running times of the parallelized parts were lowered by a factor of about 3.5–4.1.

Sammanfattning

Parallelization of Sequential Applications using .NET Framework 4.5

Modern processor construction has taken a turn in that the norm now appears to be adding more cores to processors instead of relying on increases in clock speed. Much of the responsibility has thus been moved from the hardware manufacturers to the software developers. Scalability, synchronization, data dependencies and debugging can, however, make this troublesome for developers.

Many of the problems mentioned above can be alleviated by using .NET Framework 4.5 and the new parallel extensions, which include TPL, PLINQ and other concepts designed for highly concurrent applications. Analysis, profiling and debugging have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.

The parallel extensions as well as explicit threading techniques are evaluated in this thesis, together with an attempt to parallelize an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent, but even so the parallel extensions proved useful as the running time of the parallelized parts could be reduced by a factor of about 3.5–4.1.

Contents

List of Tables

List of Figures

I Background

1 Introduction
  1.1 The Modern Processor
  1.2 .NET Framework 4.5 Overview
      1.2.1 Common Language Runtime
      1.2.2 The Parallel Extension
  1.3 Problem Definition
  1.4 Motivation

II Theory

2 Modern Processor Architectures
  2.1 Instruction Handling
  2.2 Memory Principles
  2.3 Threads and Processes
  2.4 Multi Processor Systems
      2.4.1 Parallel Processor Architectures
      2.4.2 Memory Architectures
      2.4.3 Simultaneous Multi-threading

3 Parallel Programming Techniques
  3.1 General Concepts
      3.1.1 When to go Parallel
      3.1.2 Overview of Parallelization Steps
  3.2 .NET Threading
      3.2.1 Using Threads
      3.2.2 The Thread Pool
      3.2.3 Blocking and Spinning
      3.2.4 Signaling Constructs
      3.2.5 Locking Constructs
  3.3 Task Parallel Library
      3.3.1 Task Parallelism
      3.3.2 The Parallel Class
  3.4 Parallel LINQ
  3.5 Thread-safe Data Collections
  3.6 Cancellation
  3.7 Exception Handling

4 Patterns of Parallel Programming
  4.1 Parallel Loops
  4.2 Forking and Joining
      4.2.1 Recursive Decomposition
      4.2.2 The Parent/Child Relation
  4.3 Aggregation and Reduction
      4.3.1 Map/Reduce
  4.4 Futures and Continuation Tasks
  4.5 Producer/Consumer
      4.5.1 Pipelines
  4.6 Asynchronous Programming
      4.6.1 The async and await Modifiers
  4.7 Passing Data

5 Analysis, Profiling and Debugging
  5.1 Application Analysis
  5.2 Visual Studio 2012
      5.2.1 Debugging Tools
      5.2.2 Concurrency Visualizer
  5.3 Common Performance Sinks
      5.3.1 Load Balancing
      5.3.2 Data Dependencies
      5.3.3 Processor Oversubscription
      5.3.4 Profiling

III Implementation

6 Pre-study
  6.1 The Local Traffic Prescription (LTF)
  6.2 Application Design Overview
  6.3 Choice of Method
  6.4 Application Profile
      6.4.1 Hardware
      6.4.2 Performance Overview
      6.4.3 Method Calls

7 Parallel Database Concepts
  7.1 I/O Parallelism
  7.2 Interquery Parallelism
  7.3 Intraquery Parallelism
  7.4 Query Optimization

8 Solution Design
  8.1 Problem Decomposition
  8.2 Applying Patterns
      8.2.1 Explicit Threading Approach
      8.2.2 Queuing Approach
      8.2.3 TPL Approaches
      8.2.4 PLINQ Approach

IV Execution

9 Results
  9.1 Performance Survey
  9.2 Discussion
      9.2.1 Implementation Analysis
      9.2.2 Parallel Extensions Design Analysis
  9.3 Conclusion
  9.4 Future Work

V Appendices

A Code Snippets
  A.1 The ReadData Method
  A.2 The ThreadObject Class

B Raw Data
  B.1 Different Technique Measurements
  B.2 Varying the Thread Count using Explicit Threading
  B.3 Varying the Core Count using TPL

Bibliography

List of Tables

2.1 Specifications for some common CPU architectures. [1]
2.2 Typical specifications for different memory units (as of 2013). [2]
2.3 Specifications for some general processor classifications.

3.1 Typical overheads for threads in .NET. [3]
3.2 Typical overheads for signaling constructs. [4]
3.3 Properties and typical overheads for locking constructs. [4]

6.1 Processor specifications of the hardware used for evaluating the application.
6.2 Other hardware specifications of the hardware used for evaluating the application.
6.3 Methods with respective exclusive instrumentation percentages.
6.4 Methods with respective CPU exclusive samplings.

9.1 Methods with the highest elapsed exclusive time spent executing for the sequential and TPL based approaches.
9.2 Perceived difficulty and applicability levels of implementing the techniques of parallelization.

B.1 Running times measured using the different techniques of parallelization.
B.2 Running time measurements using explicit threading with different amounts of threads.
B.3 Running time measurements using TPL with different amounts of cores.

List of Figures

1.1 Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to great extent in this thesis. [3]
1.2 A high level flow graph of typical parallel execution in the .NET Framework. [5]

2.1 Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]

3.1 Thread 1 steals work from thread 2 since both its local as well as the global queue are empty. [7]
3.2 The different states of a thread. [3]
3.3 A possible execution plan for a PLINQ query. Note that the ordering may change after execution.

4.1 Example of dynamic task parallelism. New child tasks are forked from their respective parent tasks.
4.2 An illustration of typical Map/Reduce execution.

5.1 Load imbalance as shown by the concurrency visualizer.
5.2 Data dependencies as shown by the concurrency visualizer.
5.3 Processor oversubscription as shown by the concurrency visualizer.

6.1 The internal steps of the application showing database accesses.
6.2 The figure shows how the LTF/Reduce Interpreter is used by other applications.
6.3 The CPU usage of the application over time.
6.4 The call tree of the application showing inclusive percentages of time spent executing methods (instrumentation).
6.5 The call tree of the application showing inclusive percentages of CPU samplings of methods.

9.1 Average total running times for different techniques.
9.2 Total running times for the techniques represented as a box-and-whiskers graph.
9.3 Box-and-whiskers diagrams for the running times of the methods MapLastzon, Map, Geolocate and Stretch.
9.4 Average total running times for TPL using different amount of cores.
9.5 Total running times for different amount of threads using the explicit threading approach.

List of Listings

3.1 Explicit thread creation.
3.2 Locking causes threads unable to be granted the lock to block.
3.3 Explicit task creation (task parallelism).
3.4 The Parallel.Invoke method.
3.5 The Parallel.For loop.
3.6 An example of a PLINQ query.
4.1 Applying the aggregation pattern using PLINQ.
4.2 Example of the future pattern for calculating f4(f1(4), f3(f2(7))).
4.3 Usage of the BlockingCollection class for a producer/consumer problem.
4.4 An example of the async/await pattern for downloading HTML.
4.5 Downloading HTML asynchronously without async/await.
5.1 A loop with unbalanced work among iterations (we assume that the factorial method is completely independent between calls).
8.1 Bottleneck #1: Reducing LTFs as they become available.
8.2 Bottleneck #2.1: Geolocating service days.
8.3 Bottleneck #2.2: Stretching of LTFs.
8.4 Bottleneck #3: Storage/persist of modified LTFs.
8.5 Bottleneck #1: Reduce using explicit threading.
8.6 Bottleneck #2.1: Geolocate using explicit threading.
8.7 Bottleneck #2.2: Stretch using explicit threading.
8.8 Bottleneck #1: Reduce using thread pool queuing.
8.9 Bottleneck #2.1: Geolocate using thread pool queuing.
8.10 Bottleneck #2.2: Stretch using thread pool queuing.
8.11 Bottleneck #1: Reduce using TPL.
8.12 Bottleneck #2.1: Geolocate using TPL.
8.13 Bottleneck #2.2: Stretch using TPL.
8.14 Bottleneck #1: Reduce using PLINQ.
8.15 Bottleneck #2.1: Geolocate using PLINQ.
8.16 Bottleneck #2.2: Stretch using PLINQ.
A.1 The ReadData method used in bottleneck #1.
A.2 The ThreadObject class used for the explicit threading technique.

Part I

Background

Chapter 1

Introduction

The art of parallel programming has long been considered difficult and not worth the investment. This chapter describes why parallel programming has become ever so important and why there is a need for modern parallelization technologies.

1.1 The Modern Processor

A system composed of a single core processor executes instructions, performs calculations and handles data sequentially, switching between threads when necessary. Moore's law states that the number of transistors used in single processor cores doubles approximately every two years. Even so, this law, which is predicted to hold for only a few more years, has been subject to debate. [8] Recent developments in processor construction have enabled a paradigm shift. Instead of naively relying on Moore's law for future clock speeds, other technologies have emerged. After all, a transistor cannot be made infinitely small, as cost efficiency, heat dissipation and other physical constraints impose limits. [9] In recent years we have witnessed a turn towards multi core processors as well as systems consisting of multiple processors. A number of benefits come with such an approach, including less context switching and the ability to solve certain problems more efficiently. Unfortunately, the increased complexity also introduces problems with synchronization, data dependencies and large overheads.

“The way the processor industry is going, is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it.” - Steve Jobs, Apple. [10]

Even though the subject is regarded as problematic, all hope is not lost. Developed by Microsoft, the .NET Framework (pronounced dot net) is a software framework which includes large class libraries while being highly interoperable, portable and scalable. A feature called the Parallel Extensions was introduced in version 4.0, released in 2010, targeting parallel and distributed systems. With the parallel extensions, writing parallel applications using .NET has been made tremendously less painful.


1.2 .NET Framework 4.5 Overview

The .NET Framework is a large and complex system framework with several layers of abstraction. This section scratches the surface by providing the basics of the Common Language Runtime as well as the new parallel extensions of .NET 4.0.

1.2.1 Common Language Runtime

The Common Language Runtime (CLR) is one of the main components in .NET. It constitutes an implementation of the Common Language Infrastructure (CLI) and represents a virtual machine as well as an execution environment. After code has been written in a programming language supported by .NET (e.g. C#, F# or VB.NET) it is compiled into an assembly consisting of Common Intermediate Language (CIL) code. When the assembly is executed, the CIL code is compiled into machine code by the Just In Time (JIT) compiler at runtime (alternatively at compile time for performance reasons). [11] See figure 1.2 for an overview.

The CLR also supports handling of issues regarding memory, threads and exceptions as well as garbage collection∗ and enforcement of security and robustness. Code executed by the CLR is called managed code, which means that the framework handles the above mentioned issues so that the programmer does not have to tend to them. Native code, in contrast, represents machine specific code executed on the operating system directly. This is typically associated with C/C++ and has certain potential performance benefits since such applications operate at a lower level than managed ones. [12]

1.2.2 The Parallel Extension

The Task Parallel Library (TPL) is made to simplify the process of parallelizing code by offering a set of types and application programming interfaces (APIs). The TPL has many useful features including work partitioning and proper thread scheduling. TPL may be used to solve issues related to data parallelism, task parallelism and other patterns (see chapter 4).

Another parallelization method introduced in the .NET Framework 4.0 is called Parallel LINQ (PLINQ), which is a parallel extension to the regular LINQ query. PLINQ offers a declarative and typically cleaner style of programming than that of TPL usage.

The parallel extensions further introduce a set of concurrent lock-free data structures as well as slimmed constructs for locking and event handling specifically designed to meet the new standards of highly concurrent programming models. See figure 1.1 for an overview.

∗The garbage collector reclaims objects no longer in use from memory when certain conditions are met and thus prevents memory leaks and other related problems.


1.3 Problem Definition

This thesis presents the steps and best practices for parallelizing a sequential application using the .NET Framework. Parallelization techniques, patterns and analysis are discussed and evaluated in detail along with an overview of modern processor design and the profiling/analysis methods of Visual Studio 2012. The thesis also includes a parallelization attempt on an industry deployed application developed by the company Norconsult Astando AB.

This thesis does cover:

• Decomposition and Scalability (potential parallelism, granularity, etc.).
• Coordination (data races, locks, thread safety, etc.).
• Regular Threading Concepts in .NET (incl. thread pool usage).
• Parallel Extensions (PLINQ, TPL, etc.).
• Patterns of Parallel Programming (i.e. best practices).
• Profiling and debugging using Visual Studio 2012.
• Implementation details, results and conclusions.

This thesis does not cover:

• Basics of .NET programming (C#, delegates, LINQ, etc.).
• Advanced parallel algorithms.
• Thread pool optimizations (e.g. custom thread schedulers).

Note that this thesis is based on the C# programming language, although the concepts are more or less the same when written in one of the other .NET languages such as VB.NET or F#. The goal of this thesis is to provide an evaluation of .NET parallel programming concepts and, in doing so, to give insights into how sequential applications should be parallelized, especially using .NET technology. The theory presented along with experiments, results and insights should provide a useful reference, both for Norconsult Astando and for future academic research.

1.4 Motivation

With rapid advancements in multi core application programming, it is often the case that companies using .NET technology are unable to keep up with emerging techniques for parallel programming, leaving them with inefficient software. It is therefore of utmost importance that best practices for identifying, applying and examining parallel code using modern technologies are investigated and thoroughly evaluated. Applying patterns of parallel programming to already existing code is often more difficult than writing parallel code from scratch, as extensive profiling and

possible redesign of the system architecture may be required. Because many .NET applications in use today are designed for sequential execution, this thesis targets the iterative approach to parallelization, which also includes thorough analysis and decomposition of sequential code.



Figure 1.1. Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to great extent in this thesis. [3]


Figure 1.2. A high level flow graph of typical parallel execution in the .NET Frame- work. [5]

Part II

Theory

Chapter 2

Modern Processor Architectures

The term processor has been used since the early 1960s, and the processor has since undergone many changes and improvements to become what it is today [13]. This chapter includes an overview of modern processor technology with a focus on multi core processors.

2.1 Instruction Handling

The processor architecture states how data paths, control units, memory components and clock circuitry are composed. The main goal of a processor is to fetch and execute instructions to perform calculations and handle data. The only part of the processor normally visible to the programmer is the set of registers where variables and results of calculations are stored. [14]

Instructions to be carried out by the processor include arithmetic, load/store and jump instructions among others. After an instruction has been fetched from memory using the value of the program counter (PC) register, the program counter needs to be updated and the instruction decoded. This is followed by fetching the operands of the instruction from the registers (or from the instruction itself) so that the instruction can be executed. The execution is typically an ALU operation, a memory reference or a jump in the program flow. If there is a result from the execution it will be stored in a register. [14]

For a processor to be able to execute instructions, certain hardware is needed. A memory holding the instructions is needed as well as space for the registers. An adder needs to be in place for incrementing the program counter along with a clock signal for synchronization. The arithmetic logic unit (ALU) is used for performing arithmetic and logical operations on binary numbers with operands fetched from the registers. Combining the ALU with multiplexers and control units allows the different instructions to be carried out. [14]

RISC stands for Reduced Instruction Set Computer and has certain properties such as easy-to-decode instructions, many registers without special functions and memory references restricted to dedicated load/store instructions.


Architecture   Bits   Design   Registers   Year
x86            32     CISC     8           1978
x86-64         64     CISC     16          2003
MIPS           64     RISC     32          1981
ARMv7          32     RISC     16          1983
ARMv8          64     RISC     30          2011
SPARC          64     RISC     31          1985

Table 2.1. Specifications for some common CPU architectures. [1]

The contrasting architecture is called CISC, which stands for Complex Instruction Set Computer, and is built on the philosophy that more complex instructions make it easier for programmers and compilers to write assembly code. [14]

2.2 Memory Principles

Memories are often divided into two groups: dynamic random access memory (DRAM) and static random access memory (SRAM). SRAMs are typically easier to use and faster while DRAMs are cheaper with a less complex design. [14]

When certain memory cells are used more often than others, the cells show locality of reference. Locality of reference can further be divided into temporal locality, where cells recently accessed are likely to be accessed again, and spatial locality, where cells close to previously accessed ones are more likely than others to be accessed. Locality of reference can be taken advantage of by placing the working set in a smaller and faster memory. [14]

The CLR of .NET improves locality of reference automatically. Examples of this include that objects allocated consecutively are placed adjacently in memory and that the garbage collector defragments memory so that objects are kept close together. [15]

The memory hierarchy of a system indicates the different levels of memory available to the processor. Closest to the processor are the SRAM cache memories, which may further be divided into several sub levels (L1, L2, etc.). At the next level is the DRAM primary memory, followed by the secondary hard drive memory. [14]

Memory                Size     Access time
Processor registers   128 B    1 cycle
L1 Cache              32 KB    1-2 cycles
L2 Cache              256 KB   8 cycles
Primary Memory        GBs      30-100 cycles
Secondary Memory      GBs+     > 500 cycles

Table 2.2. Typical specifications for different memory units (as of 2013). [2]


When data is requested from memory, the entire cache line in which it resides is fetched, for spatial locality reasons. Cache lines may vary in size depending on the level of memory but are always aligned at multiples of their size. After the line has been fetched it is stored in a lower level cache for temporal locality. Different levels of cache typically have different sizes, the reason being that finding an item in a smaller cache is faster when it is temporally referenced. [16]

The associativity of a cache refers to the number of positions in memory that map to a single position in the cache. Increasing associativity may reduce the possibility of conflicts in the cache. A directly mapped cache is a cache where each line in memory maps to exactly one position. In a fully associative cache, any position in memory can map to any line of the cache. This approach is however very complex and is thus rarely implemented. [16]

On multi core processors, cores typically share a single cache. Two issues related to these caches are capacity misses and conflict misses. In a conflict miss, one thread causes data needed by another thread to be evicted. This can potentially lead to thrashing, where multiple threads map their data to the same cache line. This is usually not an issue but may cause problems on certain systems. Conflict misses can be mitigated by padding and spacing of data. [16]

Capacity misses occur when the cache only fits the working sets of a certain number of threads. Adding more threads then causes data to be fetched from higher level caches or from main memory; the threads are thus no longer cache resident. Because of these issues, a high level of associativity should be preferred. [16]
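
To illustrate the padding idea mentioned above, the following C# sketch (illustrative only; the type name, worker count and stride values are assumptions, not taken from the thesis) gives each worker its own counter slot in a shared array. With a stride of 1 the counters end up on the same cache line and may suffer from false sharing, while a stride of 16 longs (128 bytes) keeps each counter on its own line.

using System.Threading.Tasks;

class PaddingSketch
{
    // Each worker increments only its own counter, so there is no data race,
    // but adjacent counters may still share a cache line (false sharing).
    static void Run(int workers, int stride)
    {
        long[] counters = new long[workers * stride];
        Parallel.For(0, workers, w =>
        {
            int slot = w * stride;          // private slot for this worker
            for (int i = 0; i < 10000000; i++)
                counters[slot]++;           // contended cache line if stride is small
        });
    }

    static void Main()
    {
        Run(workers: 4, stride: 1);   // counters packed together: possible false sharing
        Run(workers: 4, stride: 16);  // counters padded 128 bytes apart
    }
}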

2.3 Threads and Processes

A software thread is a stream of instructions for the processor to execute, whereas a hardware thread represents the resources that execute a single software thread. Processors usually have multiple hardware threads (also called virtual CPUs or strands) which are considered equal performance wise. [16]

Support for multiple threads on a single chip can be achieved in several ways. The simplest way is to replicate the cores and have each of them share an interface with the rest of the system. An alternative approach is to have multiple threads run on a single core, cycling between the threads. Having multiple threads share a core means that they will get a fair share of the resources depending on activity and currently available resources. Most modern multi core processors use a combination of the two techniques, e.g. a processor with two cores, each capable of running two threads. From the perspective of the user, the system appears to have many virtual CPUs running multiple threads; this is called chip multi threading (CMT). [16]

A process is a running application and consists of instructions, data and a state. The state consists of processor registers, currently executing instructions and other values that belong to the process. Multiple threads may run in a single process but not the other way around. A thread also has a state, although it is much simpler

than that of a process. Advantages of using threads over processes are that they can communicate to a high degree via the shared heap space, and it is often very natural to decompose problems into multiple threads with low costs of data sharing. Processes have the advantage of isolation, although this also means that they require their own TLB entries. If one thread fails, the entire application might fail. [16]

2.4 Multi Processor Systems

2.4.1 Parallel Processor Architectures

Proposed by M. J. Flynn, the taxonomy of computer systems divides them into four classifications (see table 2.3) based on the number of concurrent instruction and data streams available. [17]

Classification   I-Parallelism   D-Parallelism   Example
SISD             No              No              Uniprocessor machines.
SIMD             No              Yes             GPU or array processors.
MISD             Yes             No              Fault tolerant systems.
MIMD             Yes             Yes             Multi core processors.

Table 2.3. Specifications for some general processor classifications.

The Single Instruction, Single Data (SISD) architecture cannot constitute a parallel machine as there is only one instruction carried out per clock cycle, operating on only one data element. Using multiple data streams (SIMD), however, the model is extended to having instructions operate on several data elements. The instruction type is still limited to one per clock cycle. This is useful for graphics and array/matrix processing. [17]

The Multiple Instruction, Single Data stream (MISD) architecture supports different types of instructions to be carried out each clock cycle, but only on the same data elements. The MISD architecture is rarely used due to its limitations but can provide fault tolerance for usage in aircraft systems and the like. Most modern multi processor computers instead fall into the Multiple Instruction, Multiple Data (MIMD) stream category, where both the instruction and the data streams are parallelized. This means that every core can have its own instructions operating on its own data. [17]

2.4.2 Memory Architectures

When more than one processor is present in the system, memory handling becomes much more complex. Not only can data be stored in main memory but also in the caches of one of the other processors. An important concept is cache coherence, which essentially means that data requested from some memory should always be the most up-to-date version of it.



Figure 2.1. Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]

There are several methods to maintain cache coherence often built on the concept of tracking the state of sharing of data blocks. One method is called directory based tracking in which the state of blocks is kept at a single location called the directory. Another approach is called snooping in which no centralized state is kept. Instead, every cache has its own sharing status of blocks and snoops other caches to determine whether the data is relevant. [18]

Shared Memory

When every processor has access to the same global memory space, the system constitutes a shared memory architecture. The advantages of such an approach are that data sharing is fast due to the short distance between processors and memory. Another advantage is that programming applications for such an architecture is rather simple, even though it is the programmer's responsibility to provide synchronization between processors. The shared memory approach however lacks scalability between memory and processor count and is rather difficult and expensive to design. [6]

Memory may be attached to CPUs in different constellations, and for every link between the CPU and the memory requested there is some latency involved. Typically one wants to have similar memory latencies for every CPU. This can be achieved through an interconnection network through which every processor communicates with memory. This is what uniform memory access (UMA) is based on and what is typically used in modern multi processor machines. [6]

The other approach to shared memory is non-uniform memory access (NUMA), in which processors have local memory areas. This results in constant access times when requested data is present in the local storage, but slightly slower access than UMA when it is not. [6] See figure 2.1 for an illustration of the UMA and NUMA designs.


Distributed Memory

Using distributed memory, every processor has its own memory space and addresses. Advantages of the distributed memory approach are that the processor count scales with memory and that it is rather cost effective. Each processor typically has instant access to its own memory as no external communication is needed for such cases. [6]

The disadvantages of distributed memory are that the programmer is responsible for inter-processor communication, as there is no straightforward way of sharing data. It may also prove difficult to properly store data structures across processor memories. [6]

2.4.3 Simultaneous Multi-threading

Simultaneous multi-threading (often known through the Intel technology hyper-threading) is a technology in which cores are able to execute two streams of instructions concurrently, improving CPU efficiency. This means that each core is divided into two logical ones with their own states and instruction pointers but with the same memory. By switching between the two streams, the use of the pipeline can be made more efficient. When one of the instruction streams is stalled waiting for other steps to finish (e.g. memory being fetched), the CPU may execute instructions from the other instruction stream until the stall is resolved. Studies have shown that this technology may improve performance by up to 40 percent. [15]

One of the key elements when optimizing code for simultaneously multi-threaded hardware is to make good use of the CPU cache. Storing data which is often accessed in close proximity is highly important due to locality of reference. Memory usage should generally be kept to a minimum. Instead of caching easily calculated values, it may be more efficient to just recalculate them. Besides the issues related to memory, all of the problems associated with multi core programming apply to simultaneous multi-threading as well. Note also that the CLR in .NET is able to optimize code for simultaneous multi-threading and different cache sizes to a certain extent. [15]

Chapter 3

Parallel Programming Techniques

Many techniques have been developed to make parallel programming a more accessible subject for developers. This chapter discusses topics related to identifying problems suited for parallelization as well as modern techniques used in parallel programming.

3.1 General Concepts

Identifying, decomposing and synchronizing units of work are all examples of general problems related to parallel programming. This section gives an overview of these subjects.

3.1.1 When to go Parallel

The main reason for writing parallel code is performance. Parallelizing an application so that it runs on four cores instead of one can potentially cut the computation time by a factor of 4. The parallel approach is however not always suitable, and one should always weigh the gains of parallelization against the introduced costs of increased complexity and overhead.

Amdahl's law is an approximation of the potential runtime gains when parallelizing applications. Let S represent time spent executing serial code and P time spent executing parallelized code. Amdahl's law states that the total runtime is S + P/N, where N is the number of threads executing the parallel code. This is obviously very unrealistic as the overheads (synchronization, thread handling etc.) of using multiple threads are not taken into account. [16]

Let the overhead of N threads be denoted F(N). The estimate of F(N) could vary between constant, linear or even exponential running time depending on implementation. A fair estimate is to let F(N) = K · ln(N), where K is some constant communication latency. The logarithm could for example represent the communication costs when threads form a tree structure.

The total runtime including the overhead cost would then be

S + (P/N) + K ln (N). (3.1)

By plotting the running time over an increasing number of threads it is apparent that the performance at some point will start decreasing. By differentiating the running time with respect to the thread count, this exact point may be calculated from

−P/N² + K/N = 0 (3.2)

which, when solved for N, gives

N = P/K. (3.3)

In other words, a suitable number of threads for a particular task is determined by the runtime of the parallelizable code divided by the communication latency K. This also means that the scalability of an application can be increased by finding larger proportions of code to parallelize or by minimizing the synchronization costs. [16]

A side note is that lower communication latencies can be achieved when threads share the same level of cached data than when they communicate through memory. Multi core processors therefore have the opportunity to lower the value of K through efficient memory handling. [16]
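
As a small worked example of equations (3.1)–(3.3), the C# sketch below plugs in made-up values for S, P and K (illustrative assumptions, not measurements from the thesis) and prints the estimated runtime for a few thread counts together with the analytic optimum N = P/K.

using System;

class AmdahlSketch
{
    // Estimated total runtime according to equation (3.1):
    // serial part S, parallel part P spread over N threads,
    // plus a logarithmic coordination overhead K * ln(N).
    static double Runtime(double s, double p, double k, int n)
    {
        return s + p / n + k * Math.Log(n);
    }

    static void Main()
    {
        double s = 10, p = 90, k = 5;   // illustrative values only

        for (int n = 1; n <= 64; n *= 2)
            Console.WriteLine("N = {0,2}: T = {1:F1}", n, Runtime(s, p, k, n));

        // Analytic optimum from equation (3.3): N = P / K.
        Console.WriteLine("Optimal N = {0}", p / k);
    }
}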

3.1.2 Overview of Parallelization Steps

When the sequential application has been properly analyzed and profiled (see chapter 5), the steps for parallelizing the application typically are as follows:

1. Decomposition of code into units of work.
2. Distribution of work among processor cores.
3. Synchronizing work.

Decomposition and Scalability

An important concept when parallelizing an application is that of potential parallelism. Potential parallelism means that an application should utilize the cores of the hardware regardless of how many of them are present. The application should be able to run on both single core systems as well as multi-core ones and perform accordingly. For some applications the level of parallelism may be hard coded based on the underlying hardware. This approach should typically be used only when the hardware running the application is known beforehand and is guaranteed not to change over time. Such cases are rarely seen in modern hardware but still exist to some extent, for example in gaming consoles. [7]

To provide potential parallelism in a proper way, one might use the concept of a task. Tasks are mostly independent units of work into which an application can

be divided. They are typically distributed among threads and work towards fulfilling a common goal. The size of a task is called its granularity and should be carefully chosen. If the granularity is too fine, the overheads of managing threads might dominate, while a too coarse granularity leads to a possible loss of potential parallelism. The general guideline for choosing task granularity is that tasks should be as large as possible while properly occupying the cores and being as independent of one another as possible. Making this choice requires good knowledge of the underlying algorithms, data structures and the overall design of the code to be parallelized. [7]
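
A minimal sketch of how granularity can be tuned in practice, assuming a cheap per-element operation (the array size and the Math.Sqrt workload are arbitrary stand-ins): the first loop schedules one tiny body per element, while the second uses Partitioner.Create to hand each worker a whole index range, coarsening the granularity.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class GranularitySketch
{
    static void Main()
    {
        double[] data = new double[1000000];

        // Fine-grained: one tiny task body per element, so scheduling
        // overhead can dominate the actual work.
        Parallel.For(0, data.Length, i =>
        {
            data[i] = Math.Sqrt(i);
        });

        // Coarser granularity: each worker receives a range of indices
        // and processes it in a tight sequential loop.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(i);
        });
    }
}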

Data Dependencies and Synchronization

Tasks of a parallel program are usually created to run in parallel. In some cases there is no synchronization between tasks. Such problems are called embarrassingly parallel and, as the name suggests, impose very few problems while providing good potential parallelism. One does not always have the luxury of encountering such problems, which is why the concept of synchronization is important. Task synchronization typically has different designs depending on the pattern used for parallelization (see chapter 4). Common for all patterns is however that the tasks must be coordinated in one way or another.

When data needs to be shared between tasks the problem of data races becomes prevalent. Data races can be solved in numerous ways. One way is to use a locking structure around the variable raced for (see section 3.2.5). Another solution is to make variables immutable, which can be enforced by using copies of data instead of references where possible. A final approach to utilize when all else fails is to redesign the code for lower reliance on shared variables.

It is important to note that synchronization limits parallelism and may in the worst case serialize the application. There is also the possibility of deadlocks when using locking constructs. As an alternative to explicit locking, a number of lock-free, concurrent collections were introduced in .NET Framework 4.0 to minimize synchronization for certain data structures such as queues, stacks and dictionaries (see section 3.5). Note that these constructs come with a set of quite heavy limitations which should be studied before designing an application based on them.

One should generally try to eliminate as much synchronization as possible, although it is important to note that it is not always possible to do so. Choosing the simplest, least error prone solution to the problem is then recommended, as parallel programming in itself is difficult enough as it is.
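
The following sketch shows the kind of data race described above and one way to remove it; the field names and iteration count are illustrative assumptions. The unsynchronized increment is a read-modify-write that loses updates, while the locked version (or, for a plain counter, Interlocked.Increment) is safe.

using System.Threading;
using System.Threading.Tasks;

class DataRaceSketch
{
    static int _unsafeCount;                       // incremented without synchronization
    static int _lockedCount;                       // protected by a lock
    static readonly object _sync = new object();

    static void Main()
    {
        Parallel.For(0, 1000000, i =>
        {
            _unsafeCount++;                        // data race: increments can be lost

            lock (_sync)                           // one fix: serialize access to the shared field
            {
                _lockedCount++;
            }

            // For a simple counter, Interlocked.Increment(ref _lockedCount)
            // would be a cheaper alternative to taking a full lock.
        });
    }
}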

3.2 .NET Threading

Before the introduction of .NET Framework 4.0, developers were limited to using explicit threading methods. Even though the new parallel extensions have been introduced, explicit threading is still used to great extent. This section describes how threading is carried out in .NET along with other relevant subjects.


3.2.1 Using Threads

Threads in .NET are handled by the thread scheduler provided by the CLR and are typically pre-empted after a certain time slice which depends on the underlying operating system (typically 10 to 15 ms [19]). On multi core systems, time slicing is mixed with true concurrency as multiple threads may run simultaneously on different cores.

Listing 3.1. Explicit thread creation.

new Thread(() => {
    Work();
}).Start();

Threads created explicitly are called foreground threads unless the property IsBackground is set to true, in which case the thread is a background thread. When every foreground thread has terminated the application ends, and background threads are terminated as a result. Waiting for threads to finish is typically done using event wait handles (see section 3.2.4). Note that exception handling should be carried out within the thread, where the catch block typically is used for signaling another thread or logging the error. [3]

As mentioned in previous sections, writing threaded code has issues regarding complexity. A good practice to follow is to encapsulate as much of the threaded code as possible for unit testing. The other main problem is the overhead of creating and destroying threads, as seen in table 3.1 below. These overheads can however be limited by using the .NET thread pool as described in the following section. [3]

Action                    Overhead
Allocating a thread       1 MB stack space
Context switch            6000-8000 CPU cycles
Creation of a thread      200 000 CPU cycles
Destruction of a thread   100 000 CPU cycles

Table 3.1. Typical overheads for threads in .NET. [3]
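
A minimal sketch tying together the points above (the Work method is a hypothetical placeholder): the thread is created explicitly, marked as a background thread, handles its own exceptions inside the thread body, and the caller waits for it with Join instead of a wait handle.

using System;
using System.Threading;

class ThreadSketch
{
    static void Main()
    {
        var worker = new Thread(() =>
        {
            try
            {
                Work();               // hypothetical unit of work
            }
            catch (Exception ex)
            {
                // Exceptions must be handled inside the thread; here we just log.
                Console.WriteLine("Worker failed: " + ex.Message);
            }
        });

        worker.IsBackground = true;   // will not keep the process alive on its own
        worker.Start();
        worker.Join();                // block the caller until the worker finishes
    }

    static void Work() { }
}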

3.2.2 The Thread Pool

The thread pool consists of a queue of waiting threads ready to be used. The easiest way of accessing the thread pool is by adding units of work to its global queue by calling the ThreadPool.QueueUserWorkItem method. The upper limit of threads for the thread pool in .NET 4.0 is 1023 and 32768 for 32-bit and 64-bit systems respectively, while the lower limit is determined by the number of processor cores. The thread pool is dynamic in that it typically starts out with few threads and injects more as long as performance is gained. [3]

The .NET Framework is designed to run applications using millions of tasks, each possibly as small as a couple of hundred clock cycles.



Figure 3.1. Thread 1 steals work from thread 2 since both its local as well as the global queue are empty. [7]

This would normally not be possible using a single, global thread pool because of synchronization overheads. However, the .NET Framework solves this issue by using a decentralized approach. [7]

In the .NET Framework, every thread of the thread pool is assigned a local task queue in addition to having access to the global queue. When new tasks are added, they are sometimes put on local queues (sub-level) and sometimes on the global one (top-level). Threads not in the thread pool always have to place tasks on the global queue. [7]

The local queues are typically double headed and lock-free, which opens up for the possibility of a concept called work stealing. A thread with a local queue operates on one end of that queue while others may pull work from the other, public end (see figure 3.1). Work stealing has also been shown to provide good cache properties and fairness of work distribution. [7]

In certain scenarios, a task has to wait for another task to complete. If threads have to wait for other tasks to be carried out it might lead to long waits and in the worst case deadlocks. The thread scheduler can detect such issues and let the thread waiting for another task run it inline. It is important to note that top-level as well as long running (see section 3.3.1) tasks are unable to be inlined. [7]

The number of threads in the pool is automatically managed in .NET using complex heuristics. Two main approaches should be noted. The first one tries to reduce starvation and deadlocks by injecting more threads if little progress is made. The other one is hill-climbing, which maximizes throughput while keeping the number of threads to a minimum. This is typically done by monitoring whether injected threads increase throughput or not. [7]

Threads can be injected when a task is completed or at 500 ms intervals. A good reason for keeping tasks short is therefore to let the task scheduler get more opportunities for optimization. A second approach is to implement a custom thread scheduler that injects threads in the sought ways. Note that the lower and upper limits of threads in the pool may also be explicitly set (within certain limits). [7]
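
The sketch below uses the ThreadPool.QueueUserWorkItem method mentioned above to push a few work items onto the global queue and then waits for all of them with WaitHandle.WaitAll; the job count and console output are illustrative assumptions.

using System;
using System.Threading;

class ThreadPoolSketch
{
    static void Main()
    {
        const int jobs = 4;
        var done = new ManualResetEvent[jobs];

        for (int i = 0; i < jobs; i++)
        {
            done[i] = new ManualResetEvent(false);
            int jobIndex = i;                       // capture a copy, not the loop variable

            // Queue the work item on the thread pool's global queue.
            ThreadPool.QueueUserWorkItem(_ =>
            {
                Console.WriteLine("Job {0} running", jobIndex);
                done[jobIndex].Set();               // signal completion
            });
        }

        WaitHandle.WaitAll(done);                   // block until every job has signaled
        Console.WriteLine("All queued work finished");
    }
}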



Figure 3.2. The different states of a thread. [3]

3.2.3 Blocking and Spinning

When a thread is blocked its execution is paused and its time slice yielded, resulting in a context switch. The thread is unblocked, and again context switched, either when the blocking condition is met, when the operation times out, or by interruption or abortion. To have a thread blocked until a certain condition is met, signaling and/or locking constructs may be used.

Instead of having a thread block and perform the context switch, it may spin for a short amount of time, constantly polling for the signal or lock. Obviously this wastes processing time but may be effective when the condition is expected to be met within a very short time. A set of slimmed variants of locks and signaling constructs was introduced in .NET Framework 4.0 targeting this issue. Slimmed constructs can be used between threads but not between processes. [3]

3.2.4 Signaling Constructs

Signaling is the concept of having one thread wait until it receives a notification from one or more other threads. Event wait handles are the simplest of signaling constructs and come in a number of forms. One particularly useful feature of signaling constructs is to wait for multiple threads to finish using the WaitAll(WaitHandle[]) method, where the wait handles usually are distributed among the threads waited for. [3]

• Using a ManualResetEvent allows communication between threads using a Set call. Threads waiting for the signal are all unblocked until the event is reset manually. If threads wait using the WaitOne call, they will form a queue.

• The AutoResetEvent is similar to ManualResetEvent with the difference that the event is automatically reset when a thread has been unblocked by it.


• A CountdownEvent unblocks waiting threads once its counter has reached zero. The counter is initially set to some number and is decremented by one for each signal.

Construct              Cross-process   Overhead
AutoResetEvent         Yes             1000 ns
ManualResetEvent       Yes             1000 ns
ManualResetEventSlim   No              40 ns
CountdownEvent         No              40 ns
Barrier                No              80 ns
Wait and Pulse         No              120 ns (for Pulse)

Table 3.2. Typical overheads for signaling constructs. [4]
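
A minimal sketch of the CountdownEvent construct described above, assuming three worker threads (the count and messages are illustrative): each worker signals the event once, and the main thread's Wait call unblocks when the counter reaches zero.

using System;
using System.Threading;

class CountdownSketch
{
    static void Main()
    {
        const int workers = 3;
        var countdown = new CountdownEvent(workers);   // counter starts at 3

        for (int i = 0; i < workers; i++)
        {
            new Thread(() =>
            {
                Console.WriteLine("Worker done");
                countdown.Signal();                    // decrement the counter by one
            }).Start();
        }

        countdown.Wait();                              // unblocks once the counter reaches zero
        Console.WriteLine("All workers have signaled");
    }
}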

3.2.5 Locking Constructs

Locking is used to ensure that only a set number of threads are able to be granted access to some resource at the same time. To ensure thread safety, locking should be performed around any writable shared field independently of its complexity. Writing thread-safe code is however usually more time consuming and typically induces performance costs. [3]

It is important to note that making one method of a class thread-safe does not mean that the whole object is thread-safe. Locking a complete, thread-unsafe object with one lock may prove to be inefficient. It is therefore important to choose the right level of locking so that the program may utilize multiple threads in a safe way while being as efficient as possible.

Exclusive Locks

The locking structures for exclusive locking in .NET are lock and Mutex, which both let at most one thread hold the lock at a time. The lock construct is typically faster but cannot be shared between processes, in contrast to the Mutex construct. [3]

Listing 3.2. Locking causes threads unable to be granted the lock to block.

lock (syncObj) {
    // thread-safe area
}

A SpinLock is a lock which continuously polls to see if the lock has been released. This consumes processor resources but typically grants good performance when locks are expected to be held only for short amounts of time.


Non-Exclusive Locks

A variant of the mutex is the Semaphore, which lets n threads hold the lock. Typical usage of the semaphore is to limit the maximum number of database connections or to limit certain CPU/memory intensive operations to avoid starvation.

The ReaderWriterLock allows multiple threads to simultaneously read a value while at most one thread may update it. Two locks are used, one for reading and one for writing. The writer acquires both locks when writing to ensure that no reader gets inconsistent data. The ReaderWriterLock should be used when reads of the protected data are frequent compared to updates. [3]

Lock                   Cross-process   Overhead
Mutex                  Yes             1000 ns
Semaphore              Yes             1000 ns
SemaphoreSlim          No              200 ns
ReaderWriterLock       No              100 ns
ReaderWriterLockSlim   No              40 ns

Table 3.3. Properties and typical overheads for locking constructs. [4]
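
As a sketch of the reader/writer idea, the class below (a hypothetical read-mostly cache, not taken from the thesis) uses the slimmed ReaderWriterLockSlim from table 3.3: many threads may read the dictionary concurrently, while writers take the exclusive write lock.

using System.Collections.Generic;
using System.Threading;

class ReadMostlyCache
{
    private readonly Dictionary<string, string> _items = new Dictionary<string, string>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    // Many threads may hold the read lock at the same time.
    public bool TryGet(string key, out string value)
    {
        _lock.EnterReadLock();
        try { return _items.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    // Writers get exclusive access while updating.
    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();
        try { _items[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}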

Thread-Local Storage

Data is typically shared among threads but may be isolated through thread-local storage methods; this is highly useful for parallel code. One such example is the usage of the Random class, which is not thread-safe. To use this class properly, the object either has to be locked around or be local to every thread. The latter is typically preferred for performance reasons and may be implemented using thread-local storage. [3]

Introduced in .NET 4.0, the ThreadLocal class offers thread-local storage for both static as well as instance fields. A useful feature of the ThreadLocal class is that the data it holds is lazily∗ evaluated. A second way of implementing thread-local storage is to use the GetData and SetData methods of the Thread class. These are used to access thread-specific slots in which data can be stored and retrieved. [3]
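
A minimal sketch of the ThreadLocal-based Random usage described above (the seeding strategy is an assumption chosen to avoid identical seeds): every thread lazily gets its own Random instance, so no locking is needed.

using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadLocalSketch
{
    // One Random instance per thread; the factory runs lazily on first use.
    private static readonly ThreadLocal<Random> Rng =
        new ThreadLocal<Random>(() => new Random(Thread.CurrentThread.ManagedThreadId));

    static void Main()
    {
        // Each worker uses its own Random without any synchronization.
        Parallel.For(0, 4, i =>
            Console.WriteLine("Thread {0} drew {1}",
                Thread.CurrentThread.ManagedThreadId, Rng.Value.Next(100)));
    }
}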

3.3 Task Parallel Library

The Task Parallel Library (TPL) consists of two techniques: task parallelism and the Parallel class. The two are quite similar but typically have different areas of usage. Both techniques are described in this section.

∗Lazy evaluation delays the evaluation of an expression until its value is needed. In some circumstances, lazy evaluation can also be used to avoid repeated evaluations among shared data.


3.3.1 Task Parallelism

The Task represents an object responsible for carrying out some independent unit of work. Both the Parallel class as well as PLINQ are built on top of task parallelism. Task parallelism offers the lowest level of parallelization without using threads explicitly while offering a simple way of utilizing the thread pool. It may be used for any concurrent application even though it was meant for multi core applications.

Listing 3.3. Explicit task creation (task parallelism).

Task.Factory.StartNew(() =>
    Work());

Tasks in .NET have several features which make them highly useful:

• The scheduling of tasks may be altered.
• Relationships between tasks can be established.
• Efficient cancellation and exception handling.
• Waiting on tasks and continuations.

Parallel Options

If no parallel options are provided during task creation there is no explicit fairness, no logical parent, and the task is assumed to be running for a short amount of time. These options may be specified using TaskCreationOptions. [3]

The PreferFairness task creation option forces the default task scheduler to place the task in the global (top-level) queue. This is also the case when a task is created from a thread which does not belong to one of the worker threads of the thread pool. Tasks created under this option typically follow a FIFO ordering if scheduled by the default task scheduler. [3]

Sometimes it is not preferred to have tasks using worker threads from the thread pool. This is often the case when there are a few threads that are known to be running for a long period of time (e.g. long I/O and other background work). When this is the case, including the LongRunning task creation option creates a new thread which bypasses the thread pool. As usual with parallel programming, one should generally carry out performance tests before deciding on whether to use this option or not. [3]

A parent-child relationship may be established using the AttachedToParent task creation option. When such an option is provided it is imposed that the parent will not finish until all of its children finish. [20]
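
The sketch below shows the LongRunning and AttachedToParent options described above; the MonitorSomething method and the console output are hypothetical placeholders.

using System;
using System.Threading.Tasks;

class TaskOptionsSketch
{
    static void Main()
    {
        // A long-running task bypasses the thread pool and gets a dedicated thread.
        var background = Task.Factory.StartNew(
            () => MonitorSomething(),
            TaskCreationOptions.LongRunning);

        // The child is attached to its parent, so the parent does not
        // complete until the child has finished.
        var parent = Task.Factory.StartNew(() =>
        {
            Task.Factory.StartNew(
                () => Console.WriteLine("child running"),
                TaskCreationOptions.AttachedToParent);
        });

        parent.Wait();       // waits for the parent and its attached child
        background.Wait();
    }

    // Placeholder for some long-lived background job.
    static void MonitorSomething() { Console.WriteLine("long-running work"); }
}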

3.3.2 The Parallel Class

The Parallel class includes the methods Parallel.Invoke, Parallel.For and Parallel.ForEach. These methods are highly useful for both data as well as task parallelism, and they all block until all of the work has completed, in contrast to the usage of explicit tasks.


Parallel.Invoke

The Parallel.Invoke static method of the Parallel class is used for executing multiple Action delegates in parallel and then waiting for the work to complete. The main difference between this way of performing parallel work and explicit task creation and waiting is that the work is partitioned into properly sized batches. Note that the tasks need to be known beforehand to properly utilize the method. [3]

Listing 3.4. The Parallel.Invoke method.

Parallel.Invoke(
    () => Work1(),
    () => Work2());

Parallel.For and Parallel.ForEach

The Parallel.For and Parallel.ForEach methods are used for performing the steps of looping constructs in parallel. Just like the Parallel.Invoke method, these methods partition the work properly. Parallel.ForEach is used for iterating over an enumerable data set just like its sequential counterpart, with the exception that the parallel method will use multiple threads. The data set should implement the IEnumerable† interface. [3]

Listing 3.5. The Parallel.For loop.

Parallel.For(0, 100,
    i => Work(i));

The two important concepts parallel break and parallel stop are used for exiting a loop early. A parallel break at index i guarantees that every iteration indexed lower than i will be or has been executed. It does not guarantee whether iterations indexed higher than i have been executed or not. A parallel stop does not guarantee anything other than the fact that the iteration indexed i has reached the statement. It is typically used when some particular condition is searched for using the parallel loop. Canceling a loop externally is typically done using cancellation tokens (see section 3.6). [3]
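
A minimal sketch of the parallel break semantics described above, assuming a small illustrative data array: the loop breaks when it finds the value 5, and the ParallelLoopResult reports the lowest iteration at which Break was called.

using System;
using System.Threading.Tasks;

class ParallelBreakSketch
{
    static void Main()
    {
        int[] data = { 3, 1, 4, 1, 5, 9, 2, 6 };

        ParallelLoopResult result = Parallel.For(0, data.Length, (i, state) =>
        {
            if (data[i] == 5)
            {
                // Break guarantees that all iterations with a lower index run;
                // iterations above i may or may not have started.
                state.Break();
                return;
            }
            Console.WriteLine("Processed index {0}", i);
        });

        Console.WriteLine("Lowest break iteration: {0}", result.LowestBreakIteration);
    }
}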

3.4 Parallel LINQ

The Parallel LINQ technique offers the highest level of parallelism by automating most of the implementation, such as partitioning of work into tasks, execution of the tasks using threads and collation of the results into a single output sequence.

†The IEnumerable interface ensures that the underlying type implements the GetEnumerator method which in turn should return an enumerator for iterating through some collection.


The PLINQ programming model works by combining set-based operators such as projections, filters, aggregations and more.

PLINQ does not work with non-local collections such as LINQ-to-SQL or LINQ-to-Entities. The reason for this is that the IQueryProvider implementation translates LINQ queries to single SQL queries that are processed by the SQL engine. The results, however, may be digested in a parallel manner. [21]

Some query operators cannot be parallelized by PLINQ for practical reasons: Take, TakeWhile, Skip, SkipWhile and the indexed versions of Select, SelectMany and ElementAt. The following operators are slow and should be used with care: Join, GroupBy, GroupJoin, Distinct, Union, Intersect and Except. Other operators are parallelizable, although PLINQ may decide to run them sequentially if it estimates that parallel execution would result in bad performance. [3]

Usually when parallelizing a query using PLINQ, the order in which the results are collated is not always the same as in the original collection. This can be resolved by supplying the AsOrdered() operator. Every operator supplied after the AsOrdered() operator will have its order preserved. To turn off ordering one may supply the AsUnordered() operator. Note that preserving order can impact the performance of the application. [22]

Listing 3.6. An example of a PLINQ query.

(new int[] {1,2,3,4,5}).AsParallel()
    .Select(i => i+1)
    .ToArray();

The PLINQ method should typically be used when the problem can be expressed as a query. PLINQ should, however, not be used when the subproblems are heavily coupled or do not need to have their results collated. Figure 3.3 below shows a possible execution plan for the PLINQ query in listing 3.6. As soon as AsParallel() is called, work is forked into tasks to be run on several threads. As threads finish, the work is collated back into an array.


Figure 3.3. A possible execution plan for a PLINQ query. Note that the ordering may change after execution.
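
A minimal sketch of how ordering is controlled follows; it only assumes the standard PLINQ operators discussed above (requires System.Linq).

int[] numbers = { 1, 2, 3, 4, 5 };

// Output order matches the input order.
var ordered = numbers.AsParallel()
                     .AsOrdered()
                     .Select(i => i + 1)
                     .ToArray();

// Ordering is relaxed again for the rest of the query.
var relaxed = numbers.AsParallel()
                     .AsOrdered()
                     .Select(i => i + 1)
                     .AsUnordered()
                     .Select(i => i * 2)
                     .ToArray();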


3.5 Thread-safe Data Collections

The .NET Framework 4.0 contains a number of concurrent collection classes. These classes should typically be preferred to locking around similar, non-concurrent data structures because of their simple use, safety precautions and optimized behavior. It is important to have an understanding of the scalability of the concurrent collections and how well they fare against their lock based counterparts. One should note that the main use for the concurrent collections is in parallel applications. Using them in sequential code may lead to unnecessary performance costs. Just because a concurrent collection is thread-safe does not mean that code using it is. Enumerating over a concurrent collection while another thread is modifying it will not throw an exception. ConcurrentStack, ConcurrentBag and ConcurrentQueue are less memory efficient because their implementations are based on linked lists. [23]

BlockingCollection

The BlockingCollection class is useful for producer-consumer scenarios. Multiple threads may add or take items from the collection concurrently. Bounding and blocking are supported in that if a thread tries to insert an item into a full collection, or remove an item from an empty one, it will block until the action can be carried out. Cancellation may be performed using cancellation tokens, and the collection can be enumerated using either read-only enumeration or removal of items during enumeration. A lock based implementation in which a queue is locked around would lead to similar performance, although the BlockingCollection should be preferred due to its rich functionality. [23]

ConcurrentQueue

The ConcurrentQueue provides a thread-safe FIFO ordered queue. The obvious lock based implementation would be locking around a regular Queue object. Studies have shown that ConcurrentQueue outperforms the lock based variant in almost every case. The only case where the lock based implementation can prove better is when the workload is very light (close to none) and the number of threads working on the collection is higher than two. For scenarios where the work is a few hundred FLOPS or larger, ConcurrentQueue scales and performs better. [23]

ConcurrentStack

ConcurrentStack is a thread-safe, LIFO ordered stack. The lock based counterpart provides almost identical performance and scalability to the concurrent variant. Just like the concurrent queue, however, ConcurrentStack provides richer functionality. One example is the PushRange() method for efficiently pushing several items onto the stack. [23]


ConcurrentBag

The ConcurrentBag collection is a new type in .NET 4 with no distinct counterpart. It works much like a stack or a queue with the difference that there is no order preservation. The collection uses ThreadLocal storage so that every thread using the collection has a local list of the items. Because of this, threads may add or take items from the collection with very little synchronization overhead. If ordering does not matter, ConcurrentBag provides an excellent alternative to BlockingCollection for mixed producer-consumer scenarios. ConcurrentBag has been shown to provide huge performance gains compared to ConcurrentStack and ConcurrentQueue for such scenarios. A note of caution is that the thread-local lists use quite a lot of memory, which is not released until the garbage collector collects it. [23]

ConcurrentDictionary

The concurrent dictionary class is a thread-safe variant of the regular Dictionary. The regular dictionary may be used in multi-threaded scenarios if it is only read from, although it has been shown to perform worse than the concurrent equivalent. When the level of concurrency is high, the concurrent dictionary also provides better scalability and performance for scenarios with both reading and writing. Note that the properties and methods Count, Keys, Values and ToArray() completely lock the data structure and should be used with care. The GetEnumerator() method allows for lock free enumeration, although the results may be a mixture of new and old data if other threads have used the object concurrently. [23]
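
As a brief illustration of thread-safe updates, the sketch below uses the atomic AddOrUpdate method to count words in parallel; the words collection is an assumption made for the example (requires System.Collections.Concurrent and System.Threading.Tasks).

var counts = new ConcurrentDictionary<string, int>();

Parallel.ForEach(words, word =>
{
    // Atomically inserts the value 1 or increments the existing count.
    counts.AddOrUpdate(word, 1, (key, old) => old + 1);
});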

Why is there no concurrent list?

There is no concurrent version of IList in the .NET Framework. There are several reasons behind this choice, but they mainly stem from the fact that data lookup should have O(1) access time. To get constant time access one typically needs to use an array for storage of elements instead of linked lists. The problem with using arrays, though, is that they have to be resized at certain intervals. At certain Add operations, new arrays need to be allocated and copied into. Several solutions have been proposed to solve this issue, so there is hope for a future implementation of a concurrent list. [24] Until then, it is recommended to simply lock around a regular list or rewrite the code so that it is not dependent on concurrent lists.

3.6 Cancellation

Cancellation of tasks is usually done by providing a cancellation token. The .NET Framework does not force tasks to finish; instead, the token has to be polled for explicitly. If the cancellation token is passed to several tasks, a single cancellation can shut down all of the involved tasks. Cancellation tokens are used in a similar fashion for the parallel static methods Parallel.Invoke, Parallel.For and Parallel.ForEach as well as for task parallelism and PLINQ. [3]


If no check for cancellation is performed, an exception is thrown after the current iteration has finished. Note that for PLINQ, the cancellation token polling is, unless otherwise specified, performed automatically at regular intervals. [7]
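
A minimal sketch of cooperative cancellation of a parallel loop is shown below; the Work method is a placeholder and the loop bound is arbitrary (requires System.Threading and System.Threading.Tasks).

var cts = new CancellationTokenSource();
var options = new ParallelOptions { CancellationToken = cts.Token };

try
{
    Parallel.For(0, 1000, options, i =>
    {
        Work(i); // placeholder for the actual loop body
    });
}
catch (OperationCanceledException)
{
    // Reached if cts.Cancel() is called from another thread.
}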

3.7 Exception Handling

A handled exception is one that has been observed. Exceptions may be observed in multiple ways, either by using one of the Wait methods or by getting the Exception property of the faulted task. The Parallel static methods as well as task parallelism and PLINQ handle observing implicitly. If an exception goes unhandled, the .NET exception escalation policy will force the process to be terminated once the faulted task is garbage collected. [7] The exceptions thrown when multiple tasks are waited for are bundled together into an AggregateException. To handle the inner exceptions of the aggregate exception, the Handle method is used. The Handle method invokes a delegate for every inner exception in order to observe them. The Flatten method can be used for observing nested exceptions by transforming the tree structure of exceptions into a flat sequence. [7]
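
The following sketch shows one common way of observing such exceptions; the tasks t1, t2 and t3 are assumed to exist and only cancellation exceptions are treated as handled here.

try
{
    Task.WaitAll(t1, t2, t3);
}
catch (AggregateException ae)
{
    // Flatten unwraps nested AggregateExceptions; Handle observes each
    // inner exception and rethrows the ones the delegate returns false for.
    ae.Flatten().Handle(ex => ex is OperationCanceledException);
}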

Chapter 4

Patterns of Parallel Programming

Parallel programming patterns are proven abstractions of the most important and frequently used ways of parallelizing code. Usage of these patterns typically leads to more readable, easier to maintain, better performing and less error-prone code.

4.1 Parallel Loops

Parallel loops should be used when the same independent operation is to be performed on every iteration of a loop. Many such problems parallelize well because of their few to non-existent dependencies. The parallel loop pattern can be applied either by using explicit threading or by using the Parallel class. Usage of PLINQ is typically not recommended for both design and technical reasons: [25]

1. It is easier to think of the pattern as a loop than as a query.
2. PLINQ overheads may prove too heavy.
3. A maximum number of threads cannot be specified, only an exact amount.

When parallelizing a for loop using explicit threads, loop iterations need to be distributed and provided as delegates for the threads. The work may be queued onto the thread pool for efficiency. Explicit threading is useful when results do not have to be collated or canceled and when the overheads of other approaches prove too large. The Parallel class is typically the most straightforward method for this pattern. The static methods Parallel.For and Parallel.ForEach use the thread pool by default and provide rich functionality (see section 3.3.2). Having this in mind, the Parallel class should be the go-to alternative when faced with the parallel loop pattern. The Parallel methods typically scale as more cores are introduced and continue to do so until the overhead of synchronization starts to limit throughput. [7]

The iteration steps of a parallel loop must be independent to the point where iterations may safely execute in any order. If the loop index value goes from a higher value to a lower one it often means that there is some kind of dependency between iterations. Such loops should be refactored into loops that go from 1 to n. Another problematic loop to watch out for is one where the looping index is incremented by anything other than 1. [7]

The main overhead of a for loop implemented using the Parallel class is delegate invocations and synchronization between threads for load balancing. The mentioned overhead costs are only noticeable when the loop bodies are very small. One way to overcome the problem is to use a partitioner to amortize the cost of delegate invocation while minimizing synchronization. [3]
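
A minimal sketch of the partitioner idea follows; Partitioner.Create hands each delegate invocation a whole range of indices instead of a single one. The names n and SmallWork are placeholders (requires System.Collections.Concurrent and System.Threading.Tasks).

Parallel.ForEach(Partitioner.Create(0, n), range =>
{
    // One delegate invocation now covers a whole chunk of iterations.
    for (int i = range.Item1; i < range.Item2; i++)
        SmallWork(i);
});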

4.2 Forking and Joining

When program flow can be divided into several chunks of operations it is possible to use parallel tasks. Work is forked into several tasks which are run asynchronously and later joined to ensure all processing has been completed. The parallel loop is an example of this and is typically regarded as a subproblem of the fork/join pattern.

4.2.1 Recursive Decomposition

Dynamic task parallelism is used in situations where tasks are added over time as the computation proceeds. Many such problems are of divide and conquer character, e.g. traversal of trees or graphs. It is also useful in scenarios where the problem can be partitioned spatially, such as for geographic or geometric problems. A typical problem solvable using dynamic tasks is the Quicksort algorithm, where the recursive steps can be divided among tasks (see the sketch below). [7] Note that recursive decomposition should be limited in some way so that the granularity does not become too fine. In the Quicksort example, one could for instance limit the recursion depth or keep track of how much data is left to process.

Figure 4.1. Example of dynamic task parallelism. New child tasks are forked from their respective parent tasks.
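
A minimal sketch of recursive decomposition applied to Quicksort is shown below. The Partition and SequentialQuickSort helpers are assumed to exist; the depth parameter limits how fine-grained the task creation becomes (requires System.Threading.Tasks).

void ParallelQuickSort(int[] a, int lo, int hi, int depth)
{
    if (lo >= hi) return;

    if (depth <= 0)
    {
        // Fall back to a plain sequential sort when deep enough.
        SequentialQuickSort(a, lo, hi);
        return;
    }

    int p = Partition(a, lo, hi);
    var left  = Task.Factory.StartNew(() => ParallelQuickSort(a, lo, p - 1, depth - 1));
    var right = Task.Factory.StartNew(() => ParallelQuickSort(a, p + 1, hi, depth - 1));
    Task.WaitAll(left, right);
}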

4.2.2 The Parent/Child Relation

Tasks may be chained together into parent/child relations. Two tasks are called attached if one of them is a child of the other, which is called the parent. Tasks not involved in parent/child relations are called detached tasks. When a task has finished its own work but has at least one unfinished child task, it will enter the blocking state WaitingForChildrenToComplete. Only when the child tasks have finished will the task leave the blocking state. The waiting is done implicitly. [26]
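
A small sketch of an attached child task follows; ChildWork is a placeholder for the actual work.

var parent = Task.Factory.StartNew(() =>
{
    // The child is attached, so the parent will not complete until it has finished.
    Task.Factory.StartNew(() => ChildWork(),
        TaskCreationOptions.AttachedToParent);
});

parent.Wait(); // returns only after the attached child is done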

4.3 Aggregation and Reduction

Parallel aggregation deals with performing some associative binary operation on the elements of a set in parallel and then combining the results. An example of this would be calculating the sum of n integers. This is normally an issue because steps performed concurrently will operate on some shared data area. There are several ways to solve the issue, such as by using locks or one of the concurrent collections. The PLINQ way of solving it probably leads to the simplest and most straightforward code:

Listing 4.1. Applying the aggregation pattern using PLINQ.

int[] seq = {1,2,3,4,5};
return (from i in seq.AsParallel()
        select i).Sum();

Other approaches such as Parallel.For or Parallel.ForEach lead to slightly more complex code. Threads need to maintain a thread-local storage (see section 3.2.5) for intermediate values before they are aggregated into the global collection. Because of such issues it is recommended to stick with the simpler PLINQ approach for the Aggregation and Reduction pattern. [25]
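
For completeness, the sketch below shows the thread-local variant using the localInit/localFinally overload of Parallel.ForEach; the numbers collection is an assumption made for the example (requires System.Threading and System.Threading.Tasks).

int total = 0;
Parallel.ForEach(
    numbers,                                     // assumed IEnumerable<int>
    () => 0,                                     // initialize the thread-local sum
    (n, state, local) => local + n,              // accumulate without locking
    local => Interlocked.Add(ref total, local)); // merge once per thread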

4.3.1 Map/Reduce

A variation of the aggregation pattern is the Map/Reduce pattern, used to simplify the problem of work distribution. The model is focused around the Map, Reduce and Group functions, in which most of the logic of the problem at hand is specified. To illustrate the pattern, imagine the problem of counting the number of words of lengths 1, 2, ..., k in n documents. [27] The input to the Map/Reduce pattern is the n documents, which are distributed to workers (e.g. processor cores) in groups of m. Each of the workers runs the Map function which goes through the documents and collects every word along with a mapping to its length. This data is passed on to the intermediate layer. In the intermediate layer, the Group function is called for the mappings obtained from the workers of the previous step. This function puts the words into one out of k buckets depending on the mapping of the word to its length; note that the grouping function does not have to compute the word length since that has already been done. The k buckets are distributed to workers for reduction.



Figure 4.2. An illustration of typical Map/Reduce execution.

The workers performing the Reduce function each take one of the k buckets and count how many words it contains. These numbers are then provided as the output of the Map/Reduce problem. The Map/Reduce pattern has the advantage that within each layer there is no synchronization, which makes the pattern highly amenable to parallelization.
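
The word-length example can be sketched directly as a PLINQ query, where SelectMany plays the role of Map, GroupBy the role of Group and the final count the role of Reduce. The documents collection (a set of strings) is an assumption made for the example (requires System.Linq).

var lengthCounts =
    documents.AsParallel()
             .SelectMany(doc => doc.Split(' '))   // map: extract the words
             .GroupBy(word => word.Length)        // group: bucket by length 1..k
             .Select(g => new { Length = g.Key, Count = g.Count() }) // reduce
             .ToList();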

4.4 Futures and Continuation Tasks

A Future is a task which returns a value that is initially unknown but available at a later time. Problems involving futures can typically be depicted as a graph of dependencies between tasks, where the futures are blocked until reached in the task graph. Futures are often implemented using the Task class so that the result may be accessed through the Result property. A related concept is continuation tasks, in which the specified task will start only when a certain task has finished executing. This is useful for chaining tasks, passing data, specifying precise continuation options etc. Continuation chaining is typically carried out using the Task.ContinueWith or Task.Factory.ContinueWhenAll method. Listing 4.2 below shows an example of such futures.

Listing 4.2. Example of the future pattern for calculating f4(f1(4),f3(f2(7))).

var t1 = Task.Factory.StartNew(() => f1(4));
var t2 = Task.Factory.StartNew(() => f3(f2(7)));
var t3 = Task.Factory.ContinueWhenAll(
    new[] {t1, t2},
    (tasks) => f4(t1.Result, t2.Result));

Speculative execution is a technique where multiple tasks that are dependent on each other are run in parallel anyway. This is done by having a task A, dependent on another task B, speculate on the result produced by B and start executing using the speculated parameters. If the speculation is wrong, A will have to start over.


4.5 Producer/Consumer

In a Producer/Consumer scenario, certain entities are responsible for producing items while others are responsible for consuming them. There may be multiple producers, multiple consumers, producers being consumers and consumers being producers. Synchronization is typically an issue with this pattern as producers need to store their items somewhere and the consumers have to retrieve the items from the same place. Additionally, consumers need to be aware of when there is work to be consumed and producers need to be aware of when the collection is full. The BlockingCollection is very well suited for the producer/consumer pattern as it solves many of the above mentioned issues in an elegant way.

Listing 4.3. Usage of the BlockingCollection class for a producer/consumer problem.

BlockingCollection<T> collection = new BlockingCollection<T>();

var consumer = Task.Factory.StartNew(() => {
    foreach (T item in collection.GetConsumingEnumerable())
        Consume(item);
});

var producer = Task.Factory.StartNew(() => {
    foreach (T item in producerCollection)
        collection.Add(item);
    collection.CompleteAdding();
});

Task.WaitAll(producer, consumer);

4.5.1 Pipelines

Pipelines are similar to futures in that they are used when tasks form a graph of sequential dependencies. The biggest difference is that in the pipeline pattern, tasks communicate via queues. Pipelines therefore represent series of producers and consumers. Note that slow steps of the process indicate bottlenecks and that such steps may have to be divided into multiple tasks to balance the load. The pipeline pattern is useful in scenarios where data is received from real-time streams such as packets arriving over a network. The packets may be passed through several filters, each representing a stage of the pipeline structure. When several dependent steps are involved in the pipelining process, the method typically does not scale with the number of cores. [7]

If exceptions in pipelines are not handled properly they may result in deadlocks. If an exception is thrown somewhere in the middle of the pipeline stages, the other stages are not implicitly notified. To handle a coordinated shutdown of the pipeline, a linked cancellation token needs to be provided to all of the tasks involved. The token should be canceled by the faulting task. [7]

Thread starvation may be an issue if there are not enough threads to support all of the pipeline stages. This is normally not a problem since the task scheduler will inject new threads when it detects that no progress is made. To guarantee that enough threads are present, the LongRunning option may be provided to the tasks. [7]
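
A minimal two-stage pipeline sketch with a linked cancellation token is shown below. The names Packet, Parse, Filter, incoming and externalToken are assumptions made for the example (requires System.Collections.Concurrent, System.Threading and System.Threading.Tasks).

var cts = CancellationTokenSource.CreateLinkedTokenSource(externalToken);
var stage1Out = new BlockingCollection<Packet>(boundedCapacity: 100);

var stage1 = Task.Factory.StartNew(() =>
{
    try
    {
        foreach (var packet in incoming.GetConsumingEnumerable(cts.Token))
            stage1Out.Add(Parse(packet), cts.Token);
    }
    catch (Exception) { cts.Cancel(); throw; } // coordinated shutdown on failure
    finally { stage1Out.CompleteAdding(); }
}, TaskCreationOptions.LongRunning);

var stage2 = Task.Factory.StartNew(() =>
{
    foreach (var item in stage1Out.GetConsumingEnumerable(cts.Token))
        Filter(item);
}, TaskCreationOptions.LongRunning);

Task.WaitAll(stage1, stage2);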

4.6 Asynchronous Programming

The concept of asynchronous programming builds on letting certain events or actions that would otherwise block the program execute in a non-blocking way. Instead of having the caller manage concurrency, the callee does so in asynchronous programming. Built on the concept of tasks, version 4.5 of the .NET Framework (released in 2012) introduces the modifiers async and await in version 5 of the C# language to greatly simplify this matter. Note that asynchronous programming is not the same as parallel programming, but the two share some similarities. [28] Asynchronous programming is useful when one wants the main thread to continue running while waiting for other threads or processes to finish. This is often the case for UI threads (e.g. in WPF or similar UI frameworks) as the window would otherwise freeze if the calls were made synchronously. The blocking calls are often of I/O-bound character, such as calls to some web server or database. [28]

4.6.1 The async and await Modifiers

The async modifier indicates that the method or expression it modifies is asynchronous, while the await operator applied to a task in an asynchronous method suspends further execution of the method until the awaited task completes. In fact, using this pattern (see listing 4.4) is almost identical to creating a task and hooking on ContinueWith methods for the other iterations (see listing 4.5). [28]

Listing 4.4. An example of the async/await pattern for downloading HTML.

private async void Execute() {
    foreach (string url in urls) {
        string result = await DownloadHTML(url);
        Process(result);
    }
}

private async Task<string> DownloadHTML(string url) {
    WebClient client = new WebClient();
    string html = await client.DownloadStringTaskAsync(url);
    return html;
}


Listing 4.5. Downloading HTML asynchronously without async/await.

Task.Factory.StartNew(() => DownloadHTML(urls[0]))
    .ContinueWith(t => DownloadHTML(t.Result, urls[1]))
    .ContinueWith(t => DownloadHTML(t.Result, urls[2]))
    ...,
    TaskScheduler.FromCurrentSynchronizationContext());

The FromCurrentSynchronizationContext method tells the task scheduler that the continuation must run on the thread calling it. When using await, this is assumed implicitly. Note in listing 4.4 that the signature of DownloadHTML says Task<string> while the body returns a string object. [28]

4.7 Passing Data

Passing data between threads is an issue which can be solved in multiple ways. [29]

1. Data may be shared using the concept of closures. The variable is made accessible to the delegates that close over it. One should watch out for closing over data which is not shared in a proper way (e.g. the index variable of a for loop). Creating local copies of such data is one possible solution to the problem (see the sketch after this list).

2. Many constructs for creating asynchronous work accept a state object to be passed into the work body. Such constructs are ThreadPool.QueueUserWorkItem and Task.Factory.StartNew. Note that when passing multiple data items they have to be wrapped into a single object, for example using the Tuple class.

3. A final approach is to supply data to member variables of an object representing the unit of work, which is then referenced from within the work body using the this keyword.
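
The sketch below illustrates the first two approaches; Process is a placeholder for the work performed on the data (requires System and System.Threading.Tasks).

for (int i = 0; i < 10; i++)
{
    int copy = i;                                // local copy, not the shared loop variable
    Task.Factory.StartNew(() => Process(copy));
}

// Passing several items through the state object using a Tuple.
Task.Factory.StartNew(state =>
{
    var data = (Tuple<int, string>)state;
    Process(data.Item1, data.Item2);
}, Tuple.Create(42, "hello"));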

Chapter 5

Analysis, Profiling and Debugging

When working with both sequential and parallel code it is important to be familiar with appropriate profiling and debugging tools. This chapter includes an overview of application analysis, common performance sinks and how to use Visual Studio 2012 for performance evaluation.

5.1 Application Analysis

There are two issues which are especially important when writing parallel code:

1. The first issue is performance, which is the sole reason parallel code is written to begin with. If a considerable decrease in running time cannot be achieved it is usually better to stick with sequential code, as it generally is both easier to understand and less prone to bugs. It is also important to note that even if the application at hand runs more efficiently after parallelization, it may hinder the performance of other processes on the system.

2. The second issue is correctness, which means that the parallel implementation should produce exactly the same results as its sequential counterpart every time it is run with the same input. Proving correctness in a parallel system is typically more difficult than in the sequential case because of the more complex and error-prone code. Parallel code can also produce irregular results if not implemented correctly.

Before applying patterns of parallelization it is important to understand the application design. To acquire such an understanding it may be helpful to create UML diagrams as well as program flow charts. Understanding how the application operates before parallelizing it is often key to getting good results. Profiling the code base as described in section 5.3.4 is helpful for identifying the performance bottlenecks. Often there are standards that need to be upheld, such as a responsive UI or executing within a certain time limit. Understanding the amount of resources available, which tasks are important and the price of parallelization is essential to gaining from the potential benefits.

5.2 Visual Studio 2012

Microsoft Visual Studio is an integrated development environment (IDE) including compilers for the .NET languages. The first version was released in 1997 and has since reached version 11.0 (Visual Studio 2012 as of June 2013). [30]

5.2.1 Debugging Tools

Parallel code is usually more difficult to debug than sequential code. This section describes the two windows Parallel Tasks and Parallel Stacks used for debugging parallel applications in Visual Studio 2012.

Parallel Tasks Window

The Parallel Tasks window shows information about tasks and is typically used as a complement to the debugger (e.g. together with breakpoints). This is especially useful when Task objects are used in managed code. Following are descriptions of the most important pieces of information available about tasks through the Parallel Tasks window. [31]

• Icons indicate which task is currently running on the current thread (yellow arrow), which task was current when the debugger was invoked (white arrow) or whether the thread was frozen by the user (pause icon).

• Status represents the current state of the task, which could be scheduled, running, deadlocked or waiting. An important note about the status is that it is only shown for tasks blocked on synchronization primitives supported by Wait Chain Traversal (WCT). [31] If not, the task is marked as waiting.

• Location is the current location of the task in its call stack.

• Thread Assignment states the ID and name of the thread running the task.

• Process ID states the ID of the process running the task.

Parallel Stacks Window

The Parallel Stacks window is able to show a bird's-eye view over threads or tasks along with their call stacks or methods. By selecting one of the options Threads or Tasks from the toolbar it is possible to get a graphical representation of the threads or tasks as nodes along with arrows showing how they are related. The call path of the current thread is highlighted. [31]


The icons of methods in threads or tasks indicate whether the method contains the active stack frame of the current thread (yellow arrow) or of a non-current thread. If the method shows a green arrow it contains the current stack frame. The Parallel Stacks window is highly useful for debugging concurrent applications in that it gives the user a great overview of how tasks interact. [31]

5.2.2 Concurrency Visualizer

The concurrency visualizer consists of three views: the CPU Utilization View, the Threads View and the Cores View. This subsection describes the different views and which problems are typically detected using them.

CPU Utilization View

The CPU Utilization View of the concurrency visualizer shows the core activity, i.e. which processes are occupying the cores at certain time units. Typically, when parallelizing an application one wants to look for areas of execution where the CPU utilization is high. Other potential issues to be investigated through this view are the following: [32]

• Parallel code not occupying the expected number of cores.
• Other processes occupying processor time.
• Load imbalance (see section 5.3.1).

Threads View

The Threads View of the concurrency visualizer shows details about thread execution during the profiling session. This view shows every thread of the system, not only the ones run by the evaluated process. The view does not show which task a thread is running, which may be problematic, although this can to some degree be alleviated through the stack traces of the threads. [32] Issues to identify through the Threads View are in particular the following:

• Processor oversubscription when yellow areas are frequent (see section 5.3.3).
• Lock contention when there are stair shaped green regions (see figure 5.2).
• Load imbalance (see section 5.3.1).

Cores View

Finally, the Cores View of the concurrency visualizer shows which threads the processor cores have been running over time. Ensuring that work has been divided properly among the cores is one issue that can be visualized using this view. The Cores View is also helpful when threads or tasks have been explicitly allocated to certain cores and the implementation needs to be evaluated for proper thread allocation. [32]


5.3 Common Performance Sinks

Issues related to parallel programming often appear as certain patterns. This section covers the most common performance sinks and how they can be observed using the concurrency visualizer.

5.3.1 Load Balancing

Load balancing issues arise when work is not evenly distributed between threads, so that some threads run for a longer time than others. This may result in inefficient usage of the processor cores (see figure 5.1). An example of a problem which may show load imbalance when parallelized is shown in listing 5.1.

Listing 5.1. A loop with unbalanced work among iterations (we assume that the factorial method is completely independent between calls).

for (var i = 0; i < n; i++) {
    factorial(i);
}

Higher indexed iterations are obviously more costly than lower ones. When statically partitioned, a parallelization may lead to load imbalance where some threads run for considerably more time than others. One solution to the problem is to let threads process iterations in a round-robin fashion. Choosing the right amount of dynamism is typically a trade-off between less synchronization and better load balancing. [7]


Figure 5.1. Load imbalance as shown by the concurrency visualizer.

5.3.2 Data Dependencies

Data dependencies occur when threads are dependent on one another to continue. An example of this is when a thread tries to acquire a lock held by another thread, resulting in blocking of the former thread. This may lead to efficiency problems, as certain tasks become bottlenecks of the system and additional time is spent blocking. In the worst case this may lead to a serialization of the program, as seen in figure 5.2. Data dependencies can be identified using the Threads View of the concurrency visualizer. [7]


Figure 5.2. Data dependencies as shown by the concurrency visualizer.

5.3.3 Processor Oversubscription

Processor oversubscription is the problem of having too many threads in relation to the number of cores. This results in a high number of context switches, which incurs additional processing time as well as caching issues. Processor oversubscription may be observed through the Threads View of the concurrency visualizer, as depicted in figure 5.3. [7]


Figure 5.3. Processor oversubscription as shown by the concurrency visualizer.

5.3.4 Profiling

Visual Studio 2012 offers the four methods Sampling, Instrumentation, Concurrency and .NET Memory for collecting performance data, described in this section.

Sampling

The main usage of the sampling profiling method is to get an overview of the performance of the application and gain insight into performance issues related to the CPU (CPU-bound applications). This method works by sampling the application at set intervals and incrementing exclusive and inclusive counts for methods. Exclusive time is the time spent within a method excluding time spent in methods called from that method. Inclusive time, on the other hand, also includes time spent in methods called from that method. [33]

Instrumentation

The instrumentation method collects timing information for method calls in the application. Instrumentation also identifies function calls into the OS, such as writing to a file. The method is thus useful for investigating I/O bottlenecks and for closer examination of method calls. [33]

Concurrency

The concurrency profiling method collects data regarding threads. The resource contention report contains call stack information for scenarios where competing threads wait for access to shared resources. This information includes the total number of contentions, time spent waiting for resources as well as the actual lines and instructions where the contention occurred, along with a time line graph. This profiling method also contains the concurrency visualizer described earlier. [33]

.NET Memory

During profiling for .NET memory, the program is interrupted when objects are created and destroyed. Memory information such as type, size and number of created/destroyed objects is collected at these points. The Object Lifetime report displays the number and size of objects used in the application and their generations in the garbage collector. [33]

Part III

Implementation

Chapter 6

Pre-study

This chapter contains an analysis of the application to be parallelized including background information, design overview and its profile. This analysis is highly important for the next steps of the parallelization process.

6.1 The Local Traffic Prescription (LTF)

The Swedish Transport Agency (sv. Transportstyrelsen) has, with the help of Swedish municipalities, the county administrative boards of Sweden and IT companies, developed a standard for formulating traffic regulations. The national database for traffic regulations is called RDT, from which the national road database (NVDB) is populated with traffic regulation data. [34] Local Traffic Regulations contain rules written in clear text explaining what is allowed for road users on certain road segments, e.g. ”On A-Street between the crossing with B-street and 50 meters north of the crossing with the C-street, vehicles are allowed to...”. Alternatively, the text contains an attached map sketch containing the information: ”On A-street, according to the attached map sketch, vehicles are allowed to...”. [34] Along with the descriptions, the RDT also handles linking of prescriptions to the road network of the national road database (NVDB). The road network in NVDB consists of nodes and links where the nodes represent connected roads. [34] As an extension to the traffic regulations, a concept called BTR has been developed for machine interpretation of the traffic prescriptions. The BTR is described in BNF (Backus-Naur Form), in which a traffic prescription typically consists of a header, an ingress, several traffic regulations and data for which the rule is active. [34]

6.2 Application Design Overview

The LTF/Reduce interpreter is a server side application built for interpretation and packaging of local traffic prescriptions (LTFs). The two external databases N2-LV and N2-LTF are queried for LTFs, which are then returned in a single table. The application iterates through the LTFs and reduces them into a simpler format. The transformed LTFs are finally stored in a PostGIS database for further usage in web, mobile or other applications. The program flow of the LTF/Reduce interpreter can be summarized as follows:

1. Create attribute phrases to be used in reduction.
2. Retrieval of LTFs from Oracle DB.
3. Reduce LTFs. For each row:
   a) Parse and convert phrases.
   b) Load features from Oracle DB depending on phrases.
   c) Store converted LTFs in lists.
4. Post process LTFs:
   a) Validate permissions and prohibitions.
   b) Derive and stretch service days.
5. Persist LTFs into Oracle DB.
6. Divide LTFs into smaller chunks and move them to PostGIS DB.

6.3 Choice of Method

The chosen method for parallelizing the sequential application is to begin by profiling it using both sampling and instrumentation. By sampling the application it is possible to investigate where CPU intensive work is spent, and by instrumentation to observe the total wall times. In doing so, it is possible to conclude where bottlenecks are located and where attempts at parallelization are favorable. Once the bottlenecks have been found, separate implementations for the different parallelization techniques (explicit threading, thread pool queuing, TPL and PLINQ) are implemented and compared using the Stopwatch class. Asynchronous programming was not tested as it is not regarded as a method of parallel programming. Two additional tests are conducted to gain further insight. To test different numbers of cores, the ProcessorAffinity property is used to alter the number of cores available to the process running the application. A test with different numbers of threads is also carried out using the explicit threading method. In doing so, one may investigate the optimal number of database connections. Usage of the Stopwatch class makes it possible to investigate several parts of the application separately. It is also important to run the applications several times to limit diverging results. Certain measures were taken to obtain proper results. The applications were all run in Release mode (instead of Debug mode) and the number of other running applications was kept to a minimum. Further, by running the applications once before the actual evaluation, the involved systems were warmed up and thus not run from a cold start.
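
A minimal sketch of how such a measurement can be set up is shown below; RunReduceStep is a placeholder for the measured part and the affinity mask 0x3 (the first two logical cores) is only an example value (requires System and System.Diagnostics).

// Restrict the process to a subset of the logical cores.
Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)0x3;

var sw = Stopwatch.StartNew();
RunReduceStep();                 // placeholder for the part being measured
sw.Stop();
Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);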



Figure 6.1. The internal steps of the application showing database accesses.


Figure 6.2. The figure shows how the LTF/Reduce Interpreter is used by other applications.

44 CHAPTER 6. PRE-STUDY

6.4 Application Profile

This section consists of the profile for the entire application when run sequentially. It includes the hardware on which the application is run as well as a performance overview.

6.4.1 Hardware

Processor            Intel(R) Core(TM) i7-3820QM CPU
Clock Rate           2.70 GHz
L1 D-Cache           4x32 KB
L1 I-Cache           4x32 KB
L2 Cache             4x256 KB
L3 Cache             6144 KB
Cores / Threads      4 / 8 (Hyper-Threading)
Instruction Set      64-bit

Table 6.1. Processor specifications of the hardware used for evaluating the application.

Memory               1x8192 MB (DDR3-SDRAM)
GPU                  Intel(R) HD Graphics 4000
Operating System     Windows 7 Ultimate (64-bit, SP1)

Table 6.2. Other hardware specifications of the hardware used for evaluating the application.

6.4.2 Performance Overview

The total wall clock time for running the application was 3 hours, 18 minutes and 47 seconds. The number of database accesses amounts to a total of 123 985. Below is the CPU usage of the application over time.


Figure 6.3. The CPU usage of the application over time.


6.4.3 Method Calls

One possible way of illustrating the results of the profile is by listing the method calls. This subsection presents the instrumentation and sampling results both as tables over the hottest methods (time spent/CPU samplings) and as call trees.

Instrumentation Results

Following are the instrumentation results from running the sequential code. From the results depicted below it is evident that most method calls eventually result in database accesses.

Method Name                      Excl. elapsed time %
IDbCommand.ExecuteNonQuery       61,2
IDbCommand.ExecuteReader         37,0
IDisposable.Dispose              1,0
IDbTransaction.Commit            0,2
Invoke                           0,2

Table 6.3. Methods with respective exclusive instrumentation percentages.


Figure 6.4. The call tree of the application showing inclusive percentages of time spent executing methods (instrumentation).

CPU Sampling Results

When measuring CPU sampling it can be observed that fewer samples are exclusive to database access methods. From this, along with the instrumentation results, we can conclude that not a lot of CPU processing is performed during the database calls.

46 CHAPTER 6. PRE-STUDY

Method Name                          Excl. Samples    Excl. Samples %
IDbCommand.ExecuteReader             5144             39,1
OleDbConnection.ExecuteNonQuery      1742             13,2
OleDbConnection.Open                 1310             10,0
DbDataReader.Dispose                 1034             7,9
Activator.CreateInstance             706              5,4

Table 6.4. Methods with respective CPU exclusive samplings.


Figure 6.5. The call tree of the application showing inclusive percentages of CPU samplings of methods.

Chapter 7

Parallel Database Concepts

Because the LTF/Reduce Interpreter is very database dependent, it is useful to investigate how databases handle parallelism. This chapter covers I/O parallelism, inter- and intraquery parallelism as well as query optimization.

7.1 I/O Parallelism

I/O parallelism refers to the distribution of relations over multiple disks in an attempt to improve retrieval times. Several partitioning strategies have been proposed to support I/O parallelism, the most useful being hash partitioning and range partitioning: [35]

• The hash partitioning strategy uses attributes of relations as parameters to a hash function which returns a value i between 0 and n − 1. The relation will then be placed on disk i out of n disks.

• Range partitioning uses a partitioning vector A = [v_0, v_1, ..., v_n]. A tuple indexed i will then be placed on disk j where v_j ≤ i < v_{j+1}.

After defining the partitioning strategy, the tuples may be retrieved in parallel. Tuples can be accessed in several ways, where the most common are scanning (over the entire relation), point queries (seeking tuples with specific values) and range queries (seeking tuples within a certain range). [35] The hash partitioning strategy is particularly useful for point queries as the disk to be scanned can be chosen in constant time, leaving other disks free to process other queries. The strategy is however not very well suited for range queries as hash functions may not preserve proximity. If the hash function is properly defined, scanning over the entire relation should take about 1/n of the time taken for sequential scanning. [35] Range partitioning is useful for scanning, point queries and range queries alike, as long as the tuples are properly distributed (low skew). Partitioning related skew is however more likely with range partitioning than with hash partitioning, as the hash function typically is designed to distribute the tuples evenly. [35]

48 CHAPTER 7. PARALLEL DATABASE CONCEPTS

7.2 Interquery Parallelism

Interquery parallelism deals with different queries and transactions being executed in parallel. This in turn increases throughput while individual response times are no faster than during sequential interquery processing. In a shared disk model, locking and logging may be needed to support interquery parallelism. These sorts of problems are very much like the ones discussed in section 3.1.2 with similar, more or less complex solutions. The Oracle database is an example of a shared disk parallel database supporting interquery parallelism. [35]

7.3 Intraquery Parallelism

Intraquery parallelism refers to executing a single query in parallel. Contrary to interquery parallelism, intraquery parallelism improves individual response times, which may be useful when there are long running queries. Single queries can be parallelized by either intra- or interoperation parallelism. [35] Intraoperation parallelism deals with executing an operation such as sort, select, project or join in parallel. This sort of parallelism is very natural for database systems as the number of tuples typically is very large and the operations are set based. A concept commonly used for intraoperation parallelism is the aggregation pattern (see section 4.3). [35] Interoperation parallelism refers to executing the different operations in parallel. Operations are typically either completely, partially or not dependent on one another, and different strategies may be used for the different cases. When completely dependent there are typically not many gains to be had from parallel execution, except that writing to disk may possibly be avoided. When partially dependent, one might use a pipeline similar to what is discussed in section 4.5.1. Finally, if the operations are not dependent they may be run completely in parallel (possibly with the added cost of aggregating the results). [35]

7.4 Query Optimization

Query optimization translates a query into as efficient an execution plan as possible. Optimizing queries to be performed in parallel is more complex than the already very complex problem of optimizing sequential queries. Partitioning costs, handling skew and resource contention are all parameters to weigh in when choosing how to parallelize the operations, which ones to pipeline and so forth. The number of possible execution plans for parallel queries is much larger than for sequential queries and, because of this, certain heuristics have to be used. [35]

Chapter 8

Solution Design

To be able to parallelize the LTF/Reduce Interpreter, the decomposition technique is used to identify independent units of work subject to parallelization. The division of work is based on the profile along with the code analysis described in the following sections. Section 8.2 gives an overview of the implementations (explicit threading, thread pool queuing, TPL and PLINQ) used for evaluation.

8.1 Problem Decomposition

From the application design overview along with the profile it is obvious that the application is very I/O intensive. The number of database accesses scales linearly with the number of LTFs to be interpreted. Since the Oracle database can handle several accesses simultaneously it is possible to query the database concurrently. The question, however, is how much performance gain can be achieved by utilizing multiple cores, as most of the processing occurs in the database. Simply running multiple threads for the database accesses might be sufficient to effectively lower the wall time. The Map.Reduce method has the highest percentage of inclusive samples and should therefore be considered the first bottleneck. The problem at hand has the character of data parallelism, where the same method (Reduce) is executed for every row of the initial query. The Reduce method is mostly independent, with the exception of when the results are aggregated into the shared List objects, for which proper thread-safety has to be implemented. As previously stated in section 3.5, there is no straightforward way of replacing the list with a concurrent, lock-free equivalent. The simplest approach is to use a lock around the Add operations on the list. As the work carried out within the locked area is quite small this should not result in too big a performance loss. If the locked work were larger one could, for example, use a thread-local state for each thread containing reduced LTFs. This would in turn lead to n accesses to the lock, where n is the number of threads.

50 CHAPTER 8. SOLUTION DESIGN

Listing 8.1. Bottleneck #1: Reducing LTFs as they become available.

using (var reader = Eval(sql))
{
    while (reader.Read())
    {
        var phrases = reader.GetValue(...);
        var reduced = Reduce(phrases); // Further DB calls
        AddToList(reduced);
    }
}

The second bottleneck is comprised of the post processing of LTFs, in which two subproblems are particularly time consuming: geolocating LTFs and stretching service days. The geolocation problem consists of four independent methods which in turn involve several database calls. Running the already defined Geolocate methods in parallel is a simple yet effective way of speeding up this subproblem.

Listing 8.2. Bottleneck #2.1: Geolocating service days.

void PostProcess()
{
    Geolocate(ParkingForBusses);
    Geolocate(ParkingForDisabled);
    Geolocate(ParkingForMotorcycles);
    Geolocate(ParkingForTrucks);
    ...
    Stretch(ServiceDays);
}

The second subproblem is the stretching of service days, which consists of a loop over several service days where the body of the loop consists of further database calls along with an Add call to a shared collection which needs to be locked around. The problem is similar to that of the first bottleneck, with the main difference that all of the data is initially present in memory for this subproblem.

Listing 8.3. Bottleneck #2.2: Stretching of LTFs.

private List<Foreskrift> Stretch(IEnumerable<Foreskrift> features) {
    ...
    foreach (Foreskrift feature in features)
    {
        var copy = feature.Clone();
        copy.Extent = feature.Extent.Stretch();
        stretched.Add(copy);
    }
}

51 CHAPTER 8. SOLUTION DESIGN

Persisting data to a database constitutes the third bottleneck. A large number of SQL INSERT queries are carried out as a transaction, leading to slow performance. There are however several ways to approach the problem. The first solution is to simply perform the INSERT queries completely in parallel. Depending on the database this is more or less of a viable option. If the entire table is locked during the insertion this will not lead to performance gains but to connection timeouts and slow performance. A better solution is to use bulk insertion, in which local data tables are created and populated before being pushed to the database. This has the benefit that populating the data tables can be performed locally in parallel while calls to the database are kept to a minimum. Furthermore, bulk insertion has proven to be one of the most efficient ways of quickly populating a database.

Listing 8.4. Bottleneck #3: Storage/Persist of modified LTFs.

void Persist() {
    DataAccess.Reduce.Persist("LTFR_LASTZON", ..., Lastzon);
    DataAccess.Reduce.Persist("LTFR_P_TILLATEN", ..., Permissions);
    ...
}

void Persist(..., IEnumerable features) {
    using (var connection = CreateConnection())
    {
        IDbTransaction tx = connection.BeginTransaction();
        using (var command = connection.CreateCommand())
        {
            foreach (var feature in features)
            {
                command.CommandText = feature.GetCreateSQL(...);
                command.ExecuteNonQuery();
            }
        }
        tx.Commit();
    }
}

Unfortunately, this third bottleneck could not be parallelized for reasons outside the scope of this work. For unknown reasons, the database crashed when INSERT queries were run concurrently against it. There may be several explanations for this, such as a bad configuration of the database or problems with the connection. Instead of spending too much work on solving this issue, the focus was put on the other bottlenecks.

52 CHAPTER 8. SOLUTION DESIGN

8.2 Applying Patterns

The following subsections describe how the different parallelization techniques are applied to bottlenecks 1, 2.1 and 2.2.

8.2.1 Explicit Threading Approach

This section describes how the bottlenecks may be solved using different methods of basic threading. The approaches in this as well as the next subsection are implemented using constructs available in all .NET Framework versions and are examples of how parallel programming was, and to some extent still is, implemented prior to the constructs added in version 4.0.

Bottleneck #1

For parallelizing the first bottleneck using threads there is typically a need for a thread object which stores work in a local queue. This method uses explicit locking around the queue for adding and removing work, which ensures that the thread objects are thread safe. As shown in listing 8.5, work is distributed in a round-robin fashion to ensure threads are started as soon as possible with interleaved locking. The workLeft variable is decremented from the ThreadObject (see A.2) using the Interlocked class and when its value reaches zero, the ManualResetEvent waited for is set.

Listing 8.5. Bottleneck #1: Reduce using explicit threading.

ThreadObject[] threads = new ThreadObject[thread_count];

workLeft = thread_count;

for (var i = 0; i < thread_count; i++)
{
    threads[i] = new ThreadObject(this);
    threads[i].Start();
}

int j = 0;
foreach (WorkObject work in this.ReadData())
    threads[j++ % thread_count].AddWork(work);

// This work will not be processed
for (var i = 0; i < thread_count; i++)
    threads[i].AddWork(new WorkObject() { isLast = true });

finished.WaitOne();


Bottleneck #2

As shown in listing 8.6, applying multi-threading to bottleneck 2.1 is a simple matter as methods can be supplied as delegates to Thread objects which are started and later joined. This is an example of the fork/join pattern.

Listing 8.6. Bottleneck #2.1: Geolocate using explicit threading.

Thread[] threads = new Thread[4];

threads[0] = new Thread(() => { Geolocate(ParkingForBusses); });
threads[1] = new Thread(() => { Geolocate(ParkingForDisabled); });
threads[2] = new Thread(() => { Geolocate(ParkingForMotorcycles); });
threads[3] = new Thread(() => { Geolocate(ParkingForTrucks); });

foreach (Thread t in threads)
    t.Start();

foreach (Thread t in threads)
    t.Join();

Finally, bottleneck 2.2 can be parallelized in a similar fashion to that of bottleneck 1. The difference is that since the work is available in memory from the beginning it may be partitioned in a way so that threads operate on different parts of the list concurrently. As shown in listing 8.7, thread i may be responsible for the work indexed

(f/t) · i          (8.1)

to

(f/t) · (i + 1)    (8.2)

where f is the number of features and t is the number of threads.

Listing 8.7. Bottleneck #2.2: Stretch using explicit threading.

var stretched = new List<Foreskrift>();
ThreadObjectStretch[] thread = new ThreadObjectStretch[threads];
globalInt = threads;
for (var i = 0; i < threads; i++)
{
    thread[i] = new ThreadObjectStretch(this, features, stretched,
        (features.Count() / threads) * i, (features.Count() / threads));
    thread[i].Start();
}

finished.WaitOne();
return stretched;


8.2.2 Thread Pool Queuing Approach

As mentioned in the previous section, thread pool queuing is the other method which can be used in .NET versions prior to 4.0. Thread pool queuing, however, is typically easier to implement than explicit threading.

Bottleneck #1

When work is to be queued onto the thread pool it needs to be decomposed into a callback function which is fed the data to be used. Note that handling of the workLeft variable needs to be interlocked to ensure thread safety.

Listing 8.8. Bottleneck #1: Reduce using thread pool queuing.

workLeft = 0;

foreach (WorkObject work in this.ReadData())
{
    Interlocked.Add(ref workLeft, 1);
    ThreadPool.QueueUserWorkItem(new WaitCallback(ReduceWorkAndDecrement), work);
}

finished.WaitOne(); // Manual Reset Event

Bottleneck #2

Geolocating and stretching work can be parallelized similarly to the reduction phase, as seen in listings 8.9 and 8.10.

Listing 8.9. Bottleneck #2.1: Geolocate using thread pool queuing.

workLeft = 4;

ThreadPool.QueueUserWorkItem(new WaitCallback(GeolocateWork), ParkingForBusses);
ThreadPool.QueueUserWorkItem(new WaitCallback(GeolocateWork), ParkingForDisabled);
ThreadPool.QueueUserWorkItem(new WaitCallback(GeolocateWork), ParkingForMotorcycles);
ThreadPool.QueueUserWorkItem(new WaitCallback(GeolocateWork), ParkingForTrucks);

finished.WaitOne();


Listing 8.10. Bottleneck #2.2: Stretch using thread pool queuing.

workLeft = features.Count();

foreach (Foreskrift feature in features)
    ThreadPool.QueueUserWorkItem(
        new WaitCallback(StretchFeatureAndDecrement),
        feature);

finished.WaitOne();

return globalList;

8.2.3 TPL Approaches

The TPL approaches described in this section include both the Parallel class and task parallelism where applicable. The methods described in this section as well as the next one (PLINQ) require .NET Framework 4.0 or later.

Bottleneck #1

Below is the reduction problem parallelized using TPL (the Parallel class). This step could also have been parallelized using task parallelism (Task.Factory.StartNew). Note that waiting for the work to complete is done implicitly.

Listing 8.11. Bottleneck #1: Reduce using TPL.

Parallel.ForEach(this.ReadData(), work =>
{
    ReduceWork(work);
});
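As noted above, the same step could also be expressed with explicit tasks instead of the parallel loop. The sketch below is only an illustration of the Task.Factory.StartNew alternative mentioned in the text, not the implementation used in this work; the explicit Task.WaitAll call replaces the implicit waiting of Parallel.ForEach.

var tasks = new List<Task>();

foreach (WorkObject work in this.ReadData())
    tasks.Add(Task.Factory.StartNew(() => ReduceWork(work)));   // one task per work item

Task.WaitAll(tasks.ToArray());   // wait explicitly for all reduce tasks to complete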

Bottleneck #2

The Parallel.Invoke method suits bottleneck 2.1 well because the number of tasks is known beforehand.

Listing 8.12. Bottleneck #2.1: Geolocate using TPL.

Action[] actions = {
    () => Geolocate(ParkingForBusses),
    () => Geolocate(ParkingForDisabled),
    () => Geolocate(ParkingForMotorcycles),
    () => Geolocate(ParkingForTrucks)
};

Parallel.Invoke(actions);


For bottleneck 2.2 we introduce the ConcurrentBag as a lock-free collection for the stretched features. Note in listing 8.13 that it is captured by the closure passed to the parallel loop.

Listing 8.13. Bottleneck #2.2: Stretch using TPL.

var stretched = new ConcurrentBag<Foreskrift>();

Parallel.ForEach(features, options, (feature) =>
{
    var copy = feature.Clone();
    copy.Extent = feature.Extent.Stretch();
    stretched.Add(copy);
});

return stretched.ToList();
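The options argument in listing 8.13 is not defined in the snippet. It is presumably a ParallelOptions instance; a minimal sketch is shown below, where the value of MaxDegreeOfParallelism is only an example and not taken from the thesis implementation.

// Hypothetical configuration of the parallel loop used in listing 8.13.
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = 10   // example value; limits the number of concurrent iterations
};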

8.2.4 PLINQ Approach

Finally, this section describes the PLINQ approach, which in many ways is the simplest one.

Bottleneck #1

Below is the PLINQ query for reducing the work. Note its simplicity.

Listing 8.14. Bottleneck #1: Reduce using PLINQ.

this.ReadData()
    .AsParallel()
    .ForAll(work =>
    {
        ReduceWork(work);
    });

Bottleneck #2

Geolocating and stretching work as PLINQ queries. Parallelizing these types of problems is very similar to how it is done using TPL. LINQ does, however, offer rich and elegant functionality for more advanced queries when applicable.


Listing 8.15. Bottleneck #2.1: Geolocate using PLINQ.

var lists = new[] { ParkingForBusses,
                    ParkingForDisabled,
                    ParkingForMotorcycles,
                    ParkingForTrucks };

lists.AsParallel()
     .ForAll(list =>
     {
         Geolocate(list);
     });

Listing 8.16. Bottleneck #2.2: Stretch using PLINQ.

var stretched = new ConcurrentBag<Foreskrift>();

features.AsParallel()
        .ForAll(feature =>
        {
            var copy = feature.Clone();
            copy.Extent = feature.Extent.Stretch();
            stretched.Add(copy);
        });

return stretched.ToList();

Part IV

Execution

Chapter 9

Results

This chapter contains the results from running the implementations described in the previous chapter, followed by a discussion and the conclusions drawn from the work carried out. The chapter ends with a section about possible improvements.

9.1 Performance Survey

The wall times presented in this section are measured using the Stopwatch class. Each technique was run five times and the running times are presented below as bar graphs as well as box-and-whiskers graphs. The first two figures show the total running times and the following figures show the running times for the individual methods MapLastzon, Map, Geolocate and Stretch. The final two figures show the total running times when the number of cores used is altered and the running time as a function of the number of threads used in the explicit threading technique (see figure 9.5).

The application was run in Release mode to ensure a proper evaluation. No other heavy processes were run on the client machine. Although unlikely, there is no guarantee that other clients were not accessing the database while the tests were running. Appendix B shows the raw data that was used to create the diagrams.
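As a simple illustration of how such wall times can be collected with the Stopwatch class, a minimal sketch is given below. The method name MapLastzon is taken from the result tables, while the surrounding measurement loop is an assumption and not the actual measurement harness used in this work.

// Measure the wall time of one method, repeated five times (System.Diagnostics.Stopwatch).
for (int run = 1; run <= 5; run++)
{
    Stopwatch watch = Stopwatch.StartNew();
    MapLastzon();                           // the method under measurement
    watch.Stop();

    Console.WriteLine("Run {0}: {1}", run, watch.Elapsed);
}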


[Bar chart omitted: y-axis in seconds (0–5 000); x-axis: Sequential, 10 Threads, TP Queuing, TPL, PLINQ.]

Figure 9.1. Average total running times for different techniques.

[Box-and-whiskers chart omitted: Total Running Times in seconds (roughly 1 020–1 380); one box per technique: 10 Threads, TP Queuing, TPL, PLINQ.]

Figure 9.2. Total running times for the techniques represented as a box-and-whiskers graph.


[Four box-and-whiskers panels omitted, each comparing 10 Threads, TP Queuing, TPL and PLINQ: MapLastzon (roughly 60–105 seconds), Map (roughly 540–780 seconds), Geolocate (roughly 330–420 seconds) and Stretch (roughly 75–120 seconds).]

Figure 9.3. Box-and-whiskers diagrams for the running times of the methods MapLastzon, Map, Geolocate and Stretch.


[Bar chart omitted: y-axis in seconds (0–1 200); x-axis: 1 core, 2 cores, 3 cores.]

Figure 9.4. Average total running times for TPL using different numbers of cores.

[Line chart omitted: y-axis in seconds (roughly 1 200–1 900); x-axis: 2–20 threads.]

Figure 9.5. Total running times for different numbers of threads using the explicit threading approach.


Technique            Difficulty  Applicability
Explicit Threading   Hard        Higher
Thread Pool Queuing  Medium      Lower
TPL                  Easy        Higher
PLINQ                Easy        Lower

Table 9.2. Perceived difficulty and applicability levels of implementing the techniques of parallelization.

Sequential                           TPL
IDbCommand.ExecuteReader (95.7 %)    Threading.Monitor.Enter (50.9 %)
IDisposable.Dispose (2.8 %)          IDbCommand.ExecuteReader (41.7 %)
ILog.ErrorFormat (0.5 %)             Tasks.Parallel.Invoke (3.8 %)

Table 9.1. Methods with the highest elapsed exclusive time spent executing for the sequential and TPL-based approaches.

9.2 Discussion

This section includes a discussion of the new parallel extensions as well as the results obtained for the implementation of the LTF/Reduce Interpreter.

9.2.1 Implementation Analysis

Depending on the method used, the results show that parallelizing the LTF/Reduce Interpreter lowers the wall times of the parallelized parts by a factor of 3.5–4.1, reducing the total wall time of the LTF/Reduce Interpreter by about one hour. This can be considered a relatively good result. There are, however, two things that are unfortunate about the work:

• The application was highly I/O dependent; it would have been interesting to investigate an application that was more CPU-bound.

• Bottleneck #3, which consisted of a high number of INSERT queries, could not be parallelized due to database crashes for unknown reasons.

The parallel extensions of .NET 4.0 proved to be very easy to work with while being both comprehensive and efficient. The TPL methods proved useful for most patterns of parallel programming, while PLINQ provides additional support for query-like problems. The automatic thread pool usage ensured that a proper number of threads was used at all times, while issues such as aggregation, cancellation and exceptions were straightforward to work with, as opposed to the thread pool queuing technique. Explicit threading turned out to be the most difficult technique, as the actual threads had to be assigned work while being coordinated properly.

The number of cores used during execution turned out not to matter very much, as observed in figure 9.4. This was, however, expected after observing the initial

profiling (see section 6.4.2). After testing the application with different numbers of threads it was apparent that the efficiency gains started to stagnate after about 9 threads. This should correlate with the optimal number of connections to the database. The reason why fewer threads result in worse performance is the lower number of queries performed concurrently. A higher thread count, however, does not increase performance much, since the database cannot handle such a high number of connections.

PLINQ shows a slightly higher running time than the other two methods utilizing the thread pool. The reason behind this is most likely the design of PLINQ, which typically requires some form of thread communication. Increasing the number of threads used over time is therefore more limited than for TPL and TP queuing, which may explain the results. This is also the reason why there is no way of specifying a maximum degree of parallelism using PLINQ.

Table 9.1 shows the top three methods with the highest elapsed exclusive time spent executing (instrumentation). The TPL column is particularly interesting in that the time spent in Monitor.Enter amounts to 50.9 % of the total. This in turn means that threads are blocking half of the time when the application is parallelized and that the thread pool has decided that this is the optimal number. This also makes sense: for every thread accessing the database there should be one waiting, so that as soon as a thread completes, its connection is immediately occupied by another thread.

Even though thread pool queuing shows similar performance to the TPL method, the latter should be preferred due to its richer functionality. When choosing between PLINQ and TPL one should typically look at the problem at hand to figure out which method suits it best. If the maximum amount of parallelism is not known beforehand, though, the TPL method may perform better due to its dynamic thread balancing.

It should also be noted that the implementation of explicit threading using 10 threads turned out rather well. This is not very surprising, as the other methods boil down to explicit threading when actually executed. The benefits of implementing such an approach are that the overhead becomes minimal and that earlier versions of the .NET Framework may be used. The drawbacks, however, typically outweigh the benefits: decomposing work, handling errors, dynamically inserting threads and so on becomes much more difficult and is typically more time consuming to implement.

9.2.2 Parallel Extensions Design Analysis

The first issue to be discussed is the division of the parallel extensions into TPL (which includes the Parallel class and task parallelism) and PLINQ. The design of the Parallel class of TPL is worth discussing: why did Microsoft provide explicit support for parallel loops but not for other patterns? A possible answer is that the parallel loop is a clearly defined problem which is also common and quite difficult to implement otherwise. But this also raises the design

question: which patterns or algorithms should be provided in TPL?

The design of PLINQ is also quite interesting. First of all one might question why PLINQ is not the default instead of LINQ. The answer is that some properties, such as element ordering, are lost in the process and that PLINQ additionally introduces some overhead. Thus, there is a legitimate reason for separating LINQ from PLINQ. There are also many pitfalls in PLINQ which may confuse the naive programmer. Some of the query operators are more efficient than others, and some are even run sequentially when called (see section 3.4). It can also be difficult to decide when to use PLINQ and when not to. The naive programmer might simply try parallelizing every LINQ query for potential efficiency benefits. This is typically a bad approach, as problems for which LINQ is the best solution already tend to execute efficiently. Instead, one should identify CPU intensive bottlenecks and try to express them as PLINQ queries. [3]

The concurrent collections BlockingCollection, ConcurrentDictionary, ConcurrentQueue, ConcurrentStack and ConcurrentBag provide the programmer with easy-to-use thread-safe data structures. The pros of using these constructs are that they are well tested, efficient and simple. However, there are certain design issues which may be questioned. First of all, there is no straightforward way of determining the additional overhead posed by the constructs, and while they are often suitable, the naive programmer might shoehorn in the collections instead of rethinking the overall application design. This also poses the question: which data structures should have concurrent counterparts and be provided in libraries? As seen in section 3.5, there is currently no implementation of a concurrent list because such an implementation would typically not yield large efficiency benefits. In fact, some data structures could potentially give even worse performance than their sequential counterparts due to heavy synchronization. These are all important questions that need to be discussed sooner or later.

Along with the concurrent collections, slim constructs were introduced in .NET 4.0 to provide fast, intra-process usage of certain locks and event handlers. While useful and efficient, these classes also introduce some design issues. The addition of Slim to the class name might be contradictory to the actual functionality: it is true that the slim constructs are limited in that no inter-process communication is possible, but they can instead handle cancellation tokens, so in some ways their functionality actually proves richer than that of their respective counterparts (see the sketch at the end of this section).

The profiling tools of Visual Studio 2012 worked well for evaluating the application, though they are not optimal. The interface is quite disorderly and difficult to navigate. An example is the threads view of the concurrency visualizer, which becomes difficult to read when the number of threads used is high. Another slight annoyance is that it is not possible to perform more than one performance analysis per run (e.g. instrumentation and sampling). Visual Studio 2012 does its job, but some UI improvements could probably make it significantly better.
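To illustrate the cancellation support of the slim constructs mentioned above, a minimal standalone sketch using ManualResetEventSlim is given below; DoWork is a placeholder and the example is not code from the LTF/Reduce Interpreter.

var finished = new ManualResetEventSlim(false);
var cts = new CancellationTokenSource();

// A worker signals the slim event when its work is done.
ThreadPool.QueueUserWorkItem(_ =>
{
    DoWork();          // placeholder for the actual work
    finished.Set();
});

try
{
    // Unlike ManualResetEvent.WaitOne, the slim variant can wait with a cancellation token.
    finished.Wait(cts.Token);
}
catch (OperationCanceledException)
{
    // The wait was cancelled via cts.Cancel() before the event was set.
}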


9.3 Conclusion

Modern processor construction has taken a new turn in that adding more cores seems to be the new standard. As we transition from single-core to multi-core and from multi-core to many-core architectures, it is not surprising that better support for writing software that exploits the hardware is highly necessary. The parallel extensions of .NET 4.0 offer such functionality and, while possibly having some design issues, TPL and PLINQ have shown to be very efficient and simple ways of parallelizing applications in contrast to explicit threading techniques.

For query-like problems, PLINQ should be used; for others, TPL should. When the problem is of the embarrassingly parallel kind, Parallel.For and Parallel.ForEach should be used, and when the number of tasks is known beforehand, Parallel.Invoke should be used. For patterns such as parent-child relations, dynamic task parallelism and producer/consumer problems, explicit task creation is recommended because of its flexibility (a small sketch of a parent-child relation follows below). Additionally, the concurrent data structures and slim constructs introduced in .NET 4.0 are useful but should still be used with care and consideration (solutions based on immutability and low synchronization are typically more efficient).

Accessing databases in parallel has shown to give certain performance benefits. As long as there are resources sitting idle on the database side there is a possibility for parallelization gains. Depending on the database and the underlying hardware, it may be difficult to decide on the level of concurrency when accessing the database. Because of this, one should typically try different levels of parallelism while weighing in other applications that may be running on the database side. Such an experiment was carried out in this work and, for the Oracle database that was used, the appropriate number of threads (which corresponds to connections) turned out to be around 9–10. The total running time of the parts parallelized was lowered by a factor of about 3.5–4.1 depending on the technique used. Unfortunately, the number of cores used for executing the application did not have a significant impact on the running time due to the low amount of CPU intensive work carried out by the application.
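As a minimal sketch of the parent-child relation referred to above, the example below uses explicit task creation with TaskCreationOptions.AttachedToParent; the DoChildWork method is a placeholder and not part of the thesis implementation.

var parent = Task.Factory.StartNew(() =>
{
    // Child tasks attached to the parent; the parent does not complete until they do.
    Task.Factory.StartNew(() => DoChildWork(1), TaskCreationOptions.AttachedToParent);
    Task.Factory.StartNew(() => DoChildWork(2), TaskCreationOptions.AttachedToParent);
});

parent.Wait();   // waits for the parent and, implicitly, for both attached children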

9.4 Future Work

There are multiple ways in which the work of this thesis can be extended. First of all, one might investigate the problem of the database crashing when parallelizing the INSERT queries. An interesting approach would be to implement a bulk insert from a DataSet constructed locally using the parallel extensions and then executed using just one query. A second possible extension would be to try out the asynchronous programming model for introducing concurrency to the application (a small sketch is given at the end of this section). In doing so, the application could potentially be run without parallelism but still give good results. A third possible extension would be to perform parallelization on more CPU

intensive work, as such an experiment would better showcase threads actually performing work instead of just waiting for database calls to finish. Additionally, such problems are more likely to be less embarrassingly parallel and thus more interesting from an algorithmic standpoint.

Finally, it should be mentioned that technologies such as the .NET Framework change over time. The techniques described in this thesis may have changed completely a couple of years from now and, because of this, evaluations of newer versions of the .NET Framework will have to be carried out.
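As a starting point for the asynchronous programming model mentioned above, a minimal sketch using async/await is given below. It assumes a provider that exposes asynchronous commands, as System.Data.SqlClient does in .NET 4.5; the connection string, query and method name are placeholders and not taken from the LTF/Reduce Interpreter.

async Task<int> CountFeaturesAsync(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        await connection.OpenAsync();                        // asynchronous connect

        using (var command = new SqlCommand("SELECT ...", connection))
        using (var reader = await command.ExecuteReaderAsync())
        {
            int count = 0;
            while (await reader.ReadAsync())                 // asynchronous row reads
                count++;
            return count;
        }
    }
}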

Part V

Appendices

Appendix A

Code Snippets

A.1 The ReadData Method

Listing A.1. The ReadData method used in bottleneck #1.

IEnumerable<WorkObject> ReadData()
{
    string sql =
        string.Format(
            @"Select fs.ltf_beteckning, fs.ikrafttradandedatum, ...
              From {0}.lvd_object_version ov
              Inner Join {1}.ltf_foreskrift fs
                  On fs.foreteelse_id = ov.object_id
              Left Outer Join {1}.lvd_attribute_value fras118AF ...",
            ...);

    using (IDbConnection connection = CreateConnection())
    {
        using (var reader = Eval(sql, connection))
        {
            while (reader.Read())
            {
                DateTime? validFrom = DateTime.ParseExact(
                    reader.GetValue(...));
                int featureObjectId = Convert.ToInt32(
                    reader.GetValue(...));
                ...
                yield return new WorkObject(validFrom,
                    featureObjectId, phrase188AF, ...);
            }
        }
    }
}


A.2 The ThreadObject Class

Listing A.2. The ThreadObject class used for the explicit threading technique.

class ThreadObject
{
    private Queue<WorkObject> workQueue = new Queue<WorkObject>();
    private ManualResetEvent hasWork = new ManualResetEvent(false);
    private User user;

    public ThreadObject(User user) {
        this.user = user;
        new Thread(Run).Start();
    }

    public void Run() {
        bool threadAlive = true;
        while (threadAlive) {
            hasWork.WaitOne();
            WorkObject work = GetWork();
            if (work.isLast) {
                if (Interlocked.Decrement(ref user.workLeft) == 0) {
                    user.finished.Set();
                }
                threadAlive = false;
            }
            else {
                user.ProcessWork(work);
            }
        }
    }

    private WorkObject GetWork() {
        lock (workQueue) {
            if (workQueue.Count == 1)
                hasWork.Reset();
            return workQueue.Dequeue();
        }
    }

    public void AddWork(WorkObject work) {
        lock (workQueue) {
            hasWork.Set();
            workQueue.Enqueue(work);
        }
    }
}
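The listing shows the worker side only. A minimal sketch of how the ThreadObjects might be fed work for bottleneck #1 is given below; the round-robin distribution and the sentinel WorkObject marked with isLast are assumptions based on the class above, not code reproduced from the thesis.

// Create one ThreadObject per worker thread (the user object carries workLeft and finished).
ThreadObject[] workers = new ThreadObject[threads];
user.workLeft = threads;

for (int i = 0; i < threads; i++)
    workers[i] = new ThreadObject(user);

// Distribute the work items round-robin over the workers.
int next = 0;
foreach (WorkObject work in this.ReadData())
    workers[next++ % threads].AddWork(work);

// Queue one sentinel item per worker so that each Run loop terminates.
// Assumes isLast is settable; the actual WorkObject construction is not shown in the thesis.
foreach (ThreadObject worker in workers)
    worker.AddWork(new WorkObject { isLast = true });

// Wait until the last worker has decremented workLeft to zero.
user.finished.WaitOne();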

Appendix B

Raw Data

B.1 Different Technique Measurements

Method           MapLastzon  Map       Geolocate  Stretch   Total
Sequential (1)   09:58.29    43:30.07  09:58.29   10:15.62  01:13:42
Sequential (2)   04:11.19    42:50.56  09:11.58   11:10.72  01:07:24
Sequential (3)   09:04.14    53:31.63  09:45.03   11:22.70  01:23:44
Sequential (4)   05:03.91    46:44.10  10:19.35   13:22.49  01:15:30
Sequential (5)   07:10.55    54:56.03  09:41.93   14:45.74  01:26:34
10 Threads (1)   01:08.22    11:36.62  05:52.34   01:23.85  00:20:01
10 Threads (2)   01:07.92    12:15.16  06:23.46   01:34.74  00:21:21
10 Threads (3)   01:13.87    12:26.47  06:53.59   01:59.65  00:22:33
10 Threads (4)   01:10.18    12:08.52  06:00.99   01:41.20  00:21:01
10 Threads (5)   01:09.80    12:31.27  06:45.77   01:50.61  00:22:17
TP queuing (1)   01:07.58    09:13.04  06:04.20   01:20.38  00:17:45
TP queuing (2)   01:05.90    09:21.08  05:55.97   01:18.25  00:17:41
TP queuing (3)   01:14.83    10:03.67  05:38.09   01:18.14  00:18:15
TP queuing (4)   01:09.79    09:37.58  06:22.84   01:22.65  00:18:33
TP queuing (5)   01:11.66    10:30.17  06:31.70   01:40.92  00:18:54
TPL (1)          01:13.09    10:01.60  05:50.30   01:13.95  00:18:19
TPL (2)          01:06.97    10:35.25  05:56.93   01:19.34  00:18:57
TPL (3)          01:19.51    08:18.30  05:58.66   01:30.18  00:17:07
TPL (4)          01:39.81    08:53.60  06:04.49   01:29.04  00:18:07
TPL (5)          01:14.94    09:57.72  06:37.03   01:13.33  00:19:03
PLINQ (1)        01:20.09    12:03.31  06:16.14   01:38.28  00:21:18
PLINQ (2)        01:05.27    11:33.07  06:02.76   01:35.59  00:20:17
PLINQ (3)        01:09.60    11:46.46  06:58.00   01:56.42  00:21:50
PLINQ (4)        01:40.65    12:08.31  06:56.41   01:41.07  00:22:26
PLINQ (5)        01:18.08    12:33.15  06:53.83   01:57.68  00:22:43

Table B.1. Running times measured using the different techniques of parallelization.


B.2 Varying the Thread Count using Explicit Threading

Threads     MapLastzon  Map       Geolocate  Stretch   Total
3 threads   01:43.34    18:53.67  05:58.17   04:32.33  00:31:08
4 threads   01:40.12    15:57.46  05:09.39   02:57.53  00:25:45
5 threads   01:24.21    14:38.08  05:06.90   02:35.55  00:23:45
6 threads   01:32.10    13:44.93  05:58.32   02:30.09  00:23:45
7 threads   01:18.96    13:29.32  05:04.01   02:14.00  00:22:06
8 threads   01:20.87    12:47.08  06:04.83   02:01.48  00:22:14
9 threads   01:31.24    13:04.60  04:29.09   01:47.00  00:20:52
10 threads  01:19.40    13:35.64  05:52.43   01:39.77  00:22:27
11 threads  01:16.58    12:55.68  06:02.42   01:42.16  00:21:57
12 threads  01:19.02    13:05.22  06:06.06   01:37.61  00:22:08
13 threads  01:22.49    12:36.58  04:42.55   01:30.29  00:20:12
14 threads  01:21.52    12:43.15  04:51.83   01:36.60  00:20:33
16 threads  01:16.17    12:45.63  06:29.05   01:28.23  00:21:59
18 threads  01:36.72    12:33.37  06:45.20   01:26.08  00:22:21
20 threads  01:14.97    12:21.05  06:36.20   01:32.46  00:21:45

Table B.2. Running time measurements using explicit threading with different numbers of threads.

B.3 Varying the Core Count using TPL

Cores        MapLastzon  Map       Geolocate  Stretch   Total
1 core (1)   01:14.73    11:21.35  07:05.37   01:29.42  00:21:11
1 core (2)   01:20.25    11:26.30  05:08.84   01:33.71  00:19:29
1 core (3)   01:19.14    11:35.34  05:57.63   01:20.76  00:20:13
2 cores (1)  01:16.62    11:09.48  07:27.26   01:22.48  00:21:16
2 cores (2)  01:17.20    11:44.20  06:38.29   01:21.06  00:21:01
2 cores (3)  01:19.75    11:36.37  06:13.67   01:29.18  00:20:39
3 cores (1)  01:22.91    11:17.72  06:35.88   01:26.77  00:20:43
3 cores (2)  01:21.26    11:12.32  06:26.34   01:22.75  00:20:23
3 cores (3)  01:18.84    11:00.79  06:19.64   01:19.91  00:19:59

Table B.3. Running time measurements using TPL with different numbers of cores.

Bibliography

[1] Comparison of CPU Architectures [Wikipedia Page]. Wikipedia; [cited 2013 May 13]. Available from: http://en.wikipedia.org/wiki/Comparison_of_CPU_architectures/.

[2] AMBA Level 2 Cache Controller Technical Reference Manual. ARM; 2010.

[3] Threading in C# [homepage on the Internet]. Joseph Albahari; 2011 [updated 2011 Apr; cited 2013 Mar 11]. Available from: http://www.albahari.com/threading/.

[4] Omara E, editor. Performance Characteristics of New Synchronization Primitives in the .NET Framework 4; 2010.

[5] Parallel Programming in the .NET Framework [Microsoft Developer Network]. Microsoft MSDN; [cited 2013 Jun 1]. Available from: http://msdn.microsoft.com/en-us/library/dd460693.aspx/.

[6] Barney B, Livermore L, editors. Introduction to Parallel Computing [homepage on the Internet]; [cited 2013 May 25]. Available from: https://computing.llnl.gov/tutorials/parallel_comp/.

[7] Parallel Programming with Microsoft .NET [Microsoft Developer Network]. Microsoft MSDN; [cited 2013 Jun 1]. Available from: http://msdn.microsoft.com/en-us/library/ff963553.aspx/.

[8] Kanellos M, editor. Moore's Law to Roll On For Another Decade; 2003.

[9] Kanellos M, editor. New Life for Moore's Law; 2005.

[10] Markoff J, editor. Apple in Parallel: Turning the PC World Upside Down? New York Times; 2008.

[11] Box D, Sells C, editors. .NET Common Language Runtime Components; 2003.

[12] Toub S, Molloy R, McCrady D, editors. Concurrency and Parallelism: Native (C/C++) and Managed (.NET) Perspectives [Interview]; [cited 2013 Jun 1].

[13] Weik MH, editor. A Third Survey of Domestic Electronic Digital Computing Systems; 1961.


[14] Brorsson M, editor. Datorsystem: Program- och Maskinvara; 1999.

[15] Pessach Y, editor. Juice Up Your App with the Power of Hyper-Threading; 2005.

[16] Gove D, editor. Multicore Application Programming: for Windows, Linux and Oracle Solaris. 1st ed. Addison-Wesley; 2011.

[17] Duncan R, editor. A Survey of Parallel Computer Architectures. Survey and Tutorial Series; 1990.

[18] Hennessy JL, Patterson DA, editors. Computer Architecture: A Quantitative Approach; 2007.

[19] Duffy J, editor. Concurrent Programming on Windows. 1st ed. Pearson Education; 2009.

[20] Hoag JE, editor. A Tour of Various TPL Options; 2009.

[21] Kimmel P, editor. Parallel LINQ; 2009.

[22] Tan RP, editor. PLINQ's Ordering Model; 2010.

[23] Song C, Omara E, Liddell M, editors. Thread-safe Collections in .NET Framework 4 and Their Performance Characteristics; 2009.

[24] Stroustrup B, Dechev D, Pirkelbauer P, editors. Lock-free Dynamically Resizable Arrays; 2006.

[25] Vagata P, editor. When Should I Use Parallel.ForEach? When Should I Use PLINQ?; 2010.

[26] Ling Wo CM, editor. Parent-Child Task Relationships In The .NET Framework; 2009.

[27] Dean J, Ghemawat S, editors. MapReduce: Simplified Data Processing on Large Clusters; 2004.

[28] Asynchronous Programming with Async and Await (C# and Visual Basic) [Microsoft Developer Network]. Microsoft MSDN; [cited 2013 Jun 4]. Available from: http://msdn.microsoft.com/en-us/library/vstudio/hh191443.aspx.

[29] Toub S, editor. Patterns of Parallel Programming; 2010.

[30] Microsoft Visual Studio [Wikipedia Page]. Wikipedia; [cited 2013 May 27]. Available from: http://en.wikipedia.org/wiki/Microsoft_Visual_Studio/.

[31] Debug Multithreaded Applications in Visual Studio [Microsoft Developer Network]. Microsoft MSDN; [cited 2013 Jun 1]. Available from: http://msdn.microsoft.com/en-us/library/ms164746.aspx/.


[32] George B, Nagpal P, editors. Optimizing Parallel Applications Using Concurrency Visualizer: A Case Study; 2010.

[33] Analyzing Application Performance by Using Profiling Tools [Microsoft Developer Network]. Microsoft MSDN; [cited 2013 Jun 1]. Available from: http://msdn.microsoft.com/en-us/library/z9z62c29.aspx/.

[34] RDT Handboken: BTR Teknisk Beskrivning. Transportstyrelsen; 2010.

[35] Silberschatz A, Korth H, Sudarshan S, editors. Database System Concepts. 5th ed.; 2006.

