
Master Thesis

Optimizing MathZonnon using OpenCL

Author(s): Ernst, Benjamin

Publication Date: 2011

Permanent Link: https://doi.org/10.3929/ethz-a-006607456

Rights / License: In Copyright - Non-Commercial Use Permitted



Optimizing MathZonnon using OpenCL

Benjamin Ernst
Master Thesis
July 2011

Supervising Professor: Prof. Dr. Jürg Gutknecht
Supervising Assistant: Roman Mitin

Native Systems Group
Institute of Computer Systems
Swiss Federal Institute of Technology
Zurich, Switzerland


Abstract

The general purpose language Zonnon features MATLAB-like operators that work with built-in mathematical arrays. These operators can be computationally very expensive, and executing them at runtime on the CPU can take a long time.

The majority of today’s consumer systems contain a graphics processing unit. With the ongoing shift of GPUs towards general-purpose processing devices comes the ability to run computations on them that are not related to graphics. We use OpenCL to speed up the computations in Zonnon using GPUs.

This thesis introduces the Compute Framework into Zonnon. It is integrated into the Zonnon compiler and runtime library to coordinate computations across multiple OpenCL devices. We give an overview of the Compute Framework, explain the implementation strategy and details, discuss benchmarks, and conclude with an analysis of the results.


Acknowledgements

I would like to thank a number of people who have enabled, helped with, optimized, followed, had to bear and/or supported this thesis.

My first thanks go to Prof. Dr. Jürg Gutknecht for letting me work with the Zonnon language and providing me with the hardware I needed to test my developments.

I would like to thank my supervisor Roman Mitin very much for introducing me to the Zonnon compiler. I especially appreciated having a supervisor who is an expert in what I was working on and answered all my questions with low latency.

Then my thanks go to Georg Ofenbeck for his support for the thesis and for sharing his knowledge about GPUs.

Many thanks go to Nina Gonova who introduced the mathematical data types into Zonnon in her master thesis so that I could build my optimizations on top of it.

I would like to thank Alexey Morozov for his idea of using the alternating least squares algorithm as a benchmark and for providing me with a very understandable MATLAB implementation.

My further thanks go to Lukas Schwab for ensuring a pleasant working environment in the Student Lab and for bearing the noise and heat produced by the GPUs.

I would like to thank my parents for financing my studies at this great university.

Last but not least I would like to thank Ramona for sharing her life with me and taking a share of my own life into hers.


Contents

1 Introduction
1.1 Task Description
2 Background
2.1 Zonnon Compiler
2.1.1 Common Compiler Infrastructure
2.1.2 Math
2.2 OpenCL
2.2.1 Platforms and Devices
2.2.2 Programs and Kernels
2.2.3 Buffers
2.2.4 Work Items and Concurrency
2.3 StarPU
2.3.1 Codelets and Tasks
2.3.2 Dependencies
2.3.3 Time Estimation and Scheduling
2.3.4 Data Management and Consistency
3 Concept
3.1 Using Zonnon for Mathematical Computations
3.2 Compute Framework Architecture
3.3 Assignments
3.4 Introductory Example
4 Runtime Architecture
4.1 Accelerators
4.2 Data
4.3 Tasks
4.4 Dependencies
4.5 Scheduling and Data Management
4.6 Running Tasks
4.7 Task Completion
4.8 Concurrency Model
5 Compilation Process
5.1 Arrays
5.2 Method Calls
5.3 Assignments
5.3.1 Grouping Operators
5.3.2 Assignment Target
5.3.3 Kernel Generation
5.3.4 Kernel Reuse
5.3.5 Assignment Reuse
5.4 Limitations
5.5 Kernel Generation Example
5.6 CCI Generation Example
6 Experimental Results
6.1 Matrix Multiplication
6.2 Alternating Least Squares (ALS)
6.3 Discussion
7 Conclusion
7.1 Future Work
7.2 Conclusive Statement
8 Bibliography
A Big Operator Kernel Templates
A.1 Matrix-Matrix Multiplication
A.2 Matrix-Vector Multiplication
A.3 Vector-Matrix Multiplication
A.4 Element-Wise Copy
B Test Source Code
B.1 Matrix Multiplication
B.2 Alternating Least Squares
C Zonnon EBNF


“… have no sense of humor”
– Alan Freed, Oberon Day 2011

1 Introduction

High-performance computing, a field in computer science, combines, among others, the realms of hardware and computer architecture, programming models and languages, as well as algorithms. While earlier developments focused on supercomputing clusters and centers, the field is becoming more and more popular in the general market as well. A change in hardware has since led to improved availability and accessibility of computationally fast devices. Although graphics processing units had traditionally been designed for drawing to the screen only, modifications and new hardware designs enabled running specialized computations on them as well. Recently, device manufacturers have even built specialized stream processing hardware that can no longer output graphics at all but is solely targeted at intensive computing.

Nowadays, these general purpose graphics processing units (GPGPUs) are present in nearly every consumer system. With this vast computing potential available so broadly, the question arises as to how we can exploit it. The programming languages and models typically used to write programs for clusters and GPGPUs are heavily specialized. Understanding and working with these languages takes time and experience, and not every programmer is willing or able to spend resources on learning them.

Instead of using these languages directly, we focus on the integration of GPGPU-enabled computations into a general purpose language. This has been done numerous times in the form of libraries (e.g. [1]) that the programmer can use from within her favorite language. While this is a beneficial step, it still requires the programmer to learn how to use the library correctly. We think it would be better if the compiler instead did the heavy lifting of compiling the code such that the computations run in an optimal way, dynamically using the hardware that is available at runtime.

Zonnon [2] is such a general purpose language; it is developed in the Native Systems Group [3] at ETH Zurich. It features built-in mathematical operators that can operate on large arrays. These computations take a long time when run on the CPU, so we decided to use the computational power of GPGPUs to make these operations run much faster. The programmer specifies what should be computed, and the compiler or the runtime decides how it should be executed.

This thesis introduces the Compute Framework into the Zonnon compiler to support this decision. It analyzes the computations written in source code using the mathematical operators. At compile time the Compute Framework cannot know what hardware will be available when the compiled program is executed. Therefore it transforms the computations in a way such that it can decide at runtime how to use the available hardware to efficiently perform the computations.


1.1 Task Description

Institute of Computer Systems
Native Systems Group

Optimizing MathZonnon with OpenCL

Benjamin Ernst
Master Project (start: 1.02.2010, end: 31.07.2010)

1. Project Description

MATLAB-like mathematical extensions in the Zonnon programming language (MathZonnon) are currently translated into MSIL code. Compute shaders are incredibly powerful when it comes to numerical calculations, and utilizing them through managed code would yield huge performance gains. This would put managed code into the position of yielding the same performance as unmanaged code while still providing interoperability with all the libraries of the .NET world. The goal of this project is to map MathZonnon onto OpenCL.

2. Specific Goals

• Revise the semantics of the mathematical extensions to provide clean, transparent constructs for high-performance mathematical computations on GPGPUs. The code should remain on the MATLAB-like high level of abstraction, but be transparent enough for the programmer to see performance implications. Consolidate the syntax of MathZonnon and MathOberon.
• Implement mappings to OpenCL in the ETH Zonnon compiler. The implementation should be sufficient to compile and run selected benchmarks. The compiler should be able to target both MSIL and OpenCL.
• Study and describe the optimizations that can be performed by the compiler on the mathematical code. This includes typical compiler optimizations such as constant propagation and common subexpression elimination, but applied on the higher level of matrices and vectors, as well as specific mathematical optimizations such as the selection of an appropriate algorithm. Select one or two optimizations which are expected to be beneficial and implement them as a proof of concept.
• Port one or several of the existing applications in MathOberon and compare the performance with native MathOberon by Felix Friedrich and the GPGPU library for MathOberon by Alexey Morozov.


3. Deliverables

• An extensive example illustrating the mathematical extensions.
• A version of the ETH Zonnon compiler capable of compiling this example into a properly functioning application that uses the GPGPU.
• A digital and two printed copies of the master thesis containing a description of the problem, the motivation, an overview of related work and existing approaches, a description of the programming model, a description of the implementation, and evaluation results.

4. Organisation

• We will have weekly meetings. All documents and source code should be committed to the repository on a regular basis. A high-level project plan should be defined within the first two weeks of the project.
• A workspace in the RZ building will be provided for the duration of the master thesis.

Professor: Prof. J. Gutknecht
Assistants: Roman Mitin and Georg Ofenbeck
Zürich, 8 December 2010


2 Background

2.1 Zonnon Compiler

Zonnon [2] is an imperative language that is developed in the Native Systems Group at ETH Zurich. It is a successor to the languages Oberon, Modula-2 and Pascal. While these influential languages traditionally compile directly to native code, Zonnon takes a different approach by targeting the Common Intermediate Language (CIL) [4], sometimes also referred to as .NET code.

Zonnon integrates nicely into Microsoft’s developer tools family. Zonnon projects in Visual Studio feature syntax highlighting and intellisense that ease programming for the developer significantly. It is not uncommon to have both a Zonnon and a C# project in a shared solution (workspace) where then one of the projects references and uses the other. Figure 1 shows an example setup.

2.1.1 Common Compiler Infrastructure

Zonnon does not compile to CIL code directly, but uses the Common Compiler Infrastructure (CCI) [5] as its backend compiler. After parsing, the Zonnon compiler converts the abstract syntax tree (AST) into the corresponding CCI AST and then hands this AST off to the CCI framework. CCI performs the lower-level compiler tasks such as register allocation and CIL code generation.

Figure 1: A Visual Studio solution that contains both a Zonnon and a C# project.


All the semantic definitions of Zonnon are accounted for in this conversion step. The CCI itself knows nothing about Zonnon.

The shape of the AST that CCI consumes looks very much like a C# abstract syntax tree. It features Namespaces, classes, fields, methods, parameters, local variables, conditionals, loops, assignments, expressions and even generic type arguments.

2.1.2 Math

As a predecessor to this work, mathematical data types [6] have been implemented in Zonnon to support MATLAB-like syntax of operations. They are enabled by adding the modifier {math} to declarations of variables of array type:

m : array {math} *, * of real

The operations supported by arrays using this modifier include matrix multiplication, element wise operations like addition or comparison, reductions like sum or general comparison, as well as solving a square system of linear equations.

2.2 OpenCL

OpenCL [7] is a programming interface used for executing intensive computations. It was originally created by Apple and developed further in collaboration with AMD, IBM, Intel and NVidia. The Khronos Group [8] standardized it in 2008. It is the first computation interface that is truly platform-independent and was designed to be used in a heterogeneous computation environment.

OpenCL unifies features of e.g. NVidia’s Compute Unified Device Architecture (CUDA) [9] and AMD’s Accelerated Parallel Processing platform (APP) [10]. These two were among the technologies that formed the term GPGPU – General Purpose computing on Graphics Processing Units. Before the evolution of GPGPU, interfaces like DirectX and OpenGL had to be used to gain access to the computational power of graphics cards. Since those interfaces were not designed for general purpose computations but rather to render graphics for output on a display device, there was an impedance mismatch when programming them for computations. This gave a boost to technologies for GPGPU that abstract the hardware device details away and concentrate on the actual computations to be performed.

In addition to using dedicated stream processing hardware to run OpenCL, there are also implementations to run it on any CPU which supports Intel’s Streaming SIMD Extensions (SSE) 4.1 or higher. This is especially useful for testing OpenCL in scenarios where expensive hardware is not (yet) available.

OpenCL defines its own programming language, which is a narrow subset of the C language. It does not allow the use of, for example, recursion and variable-length arrays, nor of advanced C features like function pointers. Additionally, it defines parallel constructs to support efficient computations on large amounts of data, such as vector operations and synchronization primitives.

In contrast to other high-performance computing platforms like MPI, OpenCL is designed to be executed on a single machine in isolation. There is no facility to communicate between machines or to bundle the computational power of multiple machines. Such setups would need a higher level framework and OpenCL, if used, would only be a part of it.

2.2.1 Platforms and Devices

Multiple OpenCL implementations of different hardware vendors may be installed at the same time. In order to support access to the implementations, OpenCL defines the concept of a platform. Each platform represents one implementation of the standard. Available implementations can be queried and then used in the same operating system process. This makes it possible to run computations e.g. on an AMD graphics card and an Intel CPU at the same time using the same interface.

Platforms provide access to specific devices, such as a GPU or a CPU. While it is typical to use the vendor’s implementation for the corresponding hardware, multiple platforms may also contain devices that use the same underlying hardware. This is currently the case for the implementation of AMD and Intel, which both expose the CPU as a device.

2.2.2 Programs and Kernels

Before a computation can run on an OpenCL device, it has to be compiled first. At the time the program which starts the computation (“host program”) is written, the number and kind of devices available at runtime might not be known. The binary code into which the computational code (“device program”) is compiled depends heavily on the actual device that runs it. Therefore, OpenCL implementations usually feature a compiler that is invoked from the host program whenever a new device code needs to be run on a specific device. This is much like a traditional just-in-time compiler that compiles abstract code to machine code just before it is executed for the first time, with the difference that in this case the abstract code is the source code itself.

Programs cannot be run directly on the device. A kernel must be created first. A kernel bundles the program, the function to be called and the arguments supplied to that function. Such a kernel is then submitted to a device for actual execution.

2.2.3 Buffers

One type of argument that can be supplied to a kernel is a buffer. Buffers represent storage space that is typically allocated in device memory. There are methods to copy from host memory to a buffer and to copy from a buffer to host memory. The former is referred to as a write to the device and the latter as a read from the device.

2.2.4 Work Items and Concurrency

When a kernel runs, it is typically executed on multiple work items. A work item represents one unit of work and can be thought of as a very lightweight thread. These work items account for the single-instruction-multiple-data (SIMD) parallelism in OpenCL and thus enable the kernel to run efficiently on massively concurrent architectures like GPUs.

Upon submission of a kernel to OpenCL, a handle is returned that, besides querying the execution state of the kernel, can be used as a dependency for other kernels. In this way, a kernel can already be submitted to OpenCL, but it will not start before all kernels that were given as dependencies have finished their execution. In addition to kernel executions, reads or writes to buffers can also be started asynchronously and also return a handle to be used in the same fashion as the kernel handles.

Using this dependency pattern, it is possible to start data transfers and kernels ahead of their actual execution and then wait for the last operation to finish, knowing that all dependencies have also finished before. This is typically used to start a transfer to a device, start a kernel that uses the data transferred, and then start a transfer of the modified data back from the device. In other words, OpenCL allows for specifying a directed acyclic graph of operations, each of them having dependencies that must finish before it can start.
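Schematically, such a chain of dependent operations could look as follows. The sketch uses hypothetical thin wrapper types (Device, Kernel, Buffer, EventObject) in the spirit of the ones used by the generated code in chapter 5; the methods StartWrite, StartKernel, StartRead and Wait and their signatures are assumptions made for illustration, not the actual OpenCL or Runtime Library API.

// submit transfer -> kernel -> transfer up front, then wait only on the last handle
static void RunChain(Device device, Kernel kernel, Buffer input, Buffer output,
                     float[] hostInput, float[] hostOutput, ulong[] globalRange)
{
    EventObject write  = device.StartWrite(input, hostInput);                        // no predecessors
    EventObject launch = device.StartKernel(kernel, globalRange, new[] { write });   // waits for the write
    EventObject read   = device.StartRead(output, hostOutput, new[] { launch });     // waits for the kernel
    read.Wait();   // all predecessors in the chain are then guaranteed to have finished
}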

2.3 StarPU

While OpenCL is a low-level API used for executing computations on devices, it lacks support for high-level demands such as load balancing or inter-device consistency of data.

StarPU [11] is a runtime system for task scheduling on heterogeneous architectures. It was developed at the University of Bordeaux in 2009. Although StarPU exists as a library it was not directly used in this thesis. Rather, a subset of its concepts was implemented and integrated into the Zonnon Runtime Library.

2.3.1 Codelets and Tasks

Choosing an algorithm to solve a given problem might depend on the device that executes it. In order to support scheduling among different heterogeneous devices, there has to be a heterogeneous set of algorithms that the scheduler can choose from. A StarPU codelet represents one implementation of such an algorithm. Multiple codelets are then grouped into a task. The task also contains a list of arguments to be supplied to the chosen codelet when its implementation is executed.

A task therefore represents the computation to be performed while a codelet is one way of computing it.

2.3.2 Dependencies

If a task computes a result that is later used by another task as an argument, there exists a dependency between these tasks. This means that the first task has to complete before the second task can start. These dependencies have to be stated explicitly when the task is created. A task is said to be ready when all its dependencies have completed. A dependency manager takes care of the ordering between tasks such that a task can only run when it is ready.

2.3.3 Time Estimation and Scheduling

In addition to the implementation, a codelet also contains information about the time needed to execute it. A codelet might be used for a range of devices, e.g. if they share the same programming interface or if the binary code is the same for all these devices. Therefore, the time information in a codelet is not just the time itself, but a function that takes a specific device as input and returns the estimated time that the execution of this codelet on that very device would take. Such functions can be arbitrarily complex, allowing many factors to be incorporated, such as the device clock speed, memory latencies or cache sizes.

As the tasks will usually outnumber the devices available for their execution, load balancing is needed to ensure all devices are used in an optimal way. When a task becomes ready, a scheduler decides which device will be used to execute the task and therefore also which codelet of the task is chosen. The decision depends on the scheduling policy used by the scheduler.

2.3.3.1 Heterogeneous Earliest Finish Time

The StarPU Research Report [12] describes a number of different scheduling policies and provides benchmarks for them. In the HEFT scheduling algorithm (Heterogeneous Earliest Finish Time [13]), each device has a queue that holds tasks that are ready and scheduled on this device. When a new ready task is scheduled, it is placed in the device queue that minimizes the expected finish time of that task. This takes into account the remaining execution time of the task that is currently running on the device as well as the expected durations of the tasks already in the queue.
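A compact sketch of this decision rule is shown below (an illustration only, not StarPU's actual implementation): for every device, the expected finish time of the new task is the time the device still needs for its current and queued work plus the estimated duration of the task on that device, and the task is placed where this value is smallest.

using System;
using System.Collections.Generic;

class ReadyTask { }

class DeviceQueue
{
    public string Name = "";
    public TimeSpan BusyUntil;                                    // expected time to drain current and queued work
    public Func<ReadyTask, TimeSpan> EstimateDuration = t => TimeSpan.FromSeconds(1);
}

static class Heft
{
    // place the task on the device where it is expected to finish first
    public static DeviceQueue Schedule(ReadyTask task, IEnumerable<DeviceQueue> devices)
    {
        DeviceQueue best = null;
        TimeSpan bestFinish = TimeSpan.MaxValue;
        foreach (DeviceQueue device in devices)
        {
            TimeSpan finish = device.BusyUntil + device.EstimateDuration(task);
            if (finish < bestFinish) { best = device; bestFinish = finish; }
        }
        if (best != null) best.BusyUntil = bestFinish;            // the task is appended to that queue
        return best;
    }
}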

2.3.4 Data Management and Consistency

Devices may have their own memory that is separate from the host memory. A task can be scheduled on a device, but the data the task uses might not yet be present on that device. In this case, it needs to be moved to the device before the task can start.

There may be multiple copies of the same data distributed over the device and host memories. StarPU runs a modified-shared-invalid (MSI) memory protocol to record which copies are up-to-date and guarantees a consistent view for all the tasks that access the data.
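The states of such a protocol can be sketched as follows (an illustration of the idea only; the names and transitions are simplified and not taken from StarPU):

// the three MSI states of one copy of a piece of data (in host or device memory)
enum CopyState { Modified, Shared, Invalid }

class TrackedCopy
{
    public CopyState State = CopyState.Invalid;

    public void OnLocalWrite()  { State = CopyState.Modified; }   // this copy becomes the only up-to-date one
    public void OnLocalRead()   { if (State == CopyState.Invalid) State = CopyState.Shared; }  // an up-to-date copy must be fetched first
    public void OnRemoteWrite() { State = CopyState.Invalid; }    // the data was modified elsewhere
}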


3 Concept

3.1 Using Zonnon for Mathematical Computations

Zonnon compiles to .NET and can fully interoperate with other .NET languages such as C#, Visual Basic and F#. It integrates into Visual Studio such that e.g. a C# project can reference a Zonnon project. This way the computations can be expressed in Zonnon Math, while a GUI that visualizes the results of the computation can be created using the Forms Designer.

In order to compile the computations using what is presented throughout this document, the compiler switch /compute must be supplied to the compiler. It can be given directly to the command line compiler or set in the property page of the Zonnon project when using Visual Studio. This makes it possible to switch the optimizations on and off and to easily compare optimized and non-optimized runtimes.

3.2 Compute Framework Architecture

The goal of this master thesis was to speed up the implementation of the Mathematical Data Types in Zonnon. For this purpose, we built a framework that supports running computations on accelerated hardware such as a GPU.

This framework, the Compute Framework, was integrated into the Zonnon language:

• Compiler. The parser generates the Zonnon intermediate representation (IR) from the Zonnon source file. The compiler maps the computations to OpenCL kernels and a CCI IR. The CCI compiler then compiles the CCI IR to .NET CIL code.
• Runtime Library. The compiled CCI IR uses the runtime library to create tasks for the computations, which then execute on a possibly heterogeneous set of hardware accelerators. The runtime library manages these tasks.

3.3 Assignments

The Zonnon compiler offloads expensive computations to specialized hardware. It does so by transforming assignments in the Zonnon source code into calls to a generated method that instructs the compute runtime system to run a computation at runtime.

Observe that for Zonnon assignments, the only side effect is the assignment itself, that is, the modification of the variable that is being assigned to (the assignee). More importantly, the result of this side effect can only be observed when accessing an element of the assignee later, in code which is distinct from the assignment itself. It is therefore not required that the effect of the computation happens immediately when the assignment is executed, but rather that the side effect, updating the value, happens before any access to the assignee that could observe it. This opens the possibility that the computation can run concurrently with the thread that started it.
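As an analogy, and not as the code the compiler actually generates, the same freedom can be expressed with the .NET task library: the assignment merely schedules the side effect, and the first later access to the assignee forces its completion.

using System;
using System.Threading.Tasks;

class DeferredAssignment
{
    static double[] d = new double[4];

    static void Main()
    {
        // "d := d + A * c" only schedules the side effect ...
        Task pending = Task.Run(() => ComputeAssignmentInto(d));

        // ... unrelated host work can proceed here concurrently ...

        // ... and the first later access to the assignee forces completion
        pending.Wait();
        Console.WriteLine(d[0]);
    }

    // stands in for the actual mathematical computation
    static void ComputeAssignmentInto(double[] target) { target[0] = 42; }
}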

Contrary to assignments, an expression is evaluated such that it yields a value. Computing this value must happen right when the expression is executed, as the result of the expression will be used immediately thereafter. An optimization akin to the case of assignments is not possible.


3.4 Introductory Example

Figure 2 shows an example compilation using the Compute Framework. The Zonnon source consists of a module and a procedure that contains three assignments. The parser takes this textual source code and transforms it into an abstract syntax tree (AST). The compiler transforms the assignments in two ways.

First, it generates OpenCL kernel source code. These kernels perform the same computations as the assignments do, but are optimized for execution in data‐parallel hardware. The compiler places these kernels as static string constants in their own CCI class node called “Kernels”.

Second, the compiler generates a CCI class node for each assignment. This class contains a procedure that starts the computation using one or more of the previously generated kernels. The compiler converts the assignment into a call that invokes this procedure; at runtime, the procedure starts the computation using the corresponding kernels.

Once those steps are complete, the procedure in the CCI intermediate representation (IR) corresponding to the Zonnon procedure does not contain any assignments anymore, but instead calls methods on the generated classes. As we can see in the Operation3 class, one assignment can be too complex for a single kernel, thus the compiler must split the assignment and generate kernels from the split parts. Operation 3 first starts Kernel1 and then Kernel3. Together they perform the computation of the assignment.

The Microsoft CCI compiler takes this CCI IR tree and compiles it to CIL code. As the kernels are source code they are not modified by the compilation but rather integrated as string constants into the generated .NET assembly. They will be compiled at runtime by the OpenCL compiler.

[Figure 2 (overview): four columns labelled Zonnon Source, Zonnon IR (AST), CCI IR and CIL Assembly. The Zonnon source declares module Computations with procedure Compute(A, F: Matrix; c, d: Vector): Vector containing the assignments b := c \A, d := d + A * c, e := F * b - d \F and return e. The parser produces the Zonnon IR; the statements are mapped to OpenCL (a Kernels class with Kernel1, Kernel2 and Kernel3) and to .NET (classes Operation1, Operation2 and Operation3, where Operation3 runs Kernel1 into a temporary and then Kernel3). Microsoft CCI compiles the CCI IR into a CIL assembly; a C# program that calls Compute is compiled by the C# compiler into its own CIL assembly.]

Figure 2: Compilation of a Zonnon and a C# source file.


4 Runtime Architecture

This section presents the architecture of the Compute Framework that was developed as the runtime system for MathZonnon and integrated into the Zonnon Runtime Library.

The runtime system is discussed here first although chronologically the compiler has to run before the runtime system is actually used. But since the compiler generates code that uses the runtime system it is easier to understand the process of compiling when the runtime is familiar.

Figure 3 shows the architecture of the runtime. Its parts are discussed in the next sections.

4.1 Accelerators

The driving forces behind the Compute Framework are Accelerators. An Accelerator is a device that performs computations. This can be a CPU, a GPU or any other kind of processor.

Each Accelerator can optionally have its own memory that is separate from the host memory. In this case data might have to be moved to the Accelerators’ memory before a computation that uses this data can start on the Accelerator.

Figure 3 shows an example runtime with five Accelerators: two of them are GPUs driven by OpenCL, and the remaining three represent CPU cores. Currently only OpenCL devices are implemented as Accelerators. They are assumed to have memory that is separate from the host, so in the remaining part of this document the case where an Accelerator shares memory with the host is not covered.

4.2 Data

When an array or a part of it is copied to an Accelerator and subsequently modified by a computation, it has to be copied back at some point in time before any host thread can access it. The same array might be used multiple times on the same Accelerator by subsequent computations. Data movement is typically slow compared to the time it takes to perform a computation, thus copying an array back to host memory each time a computation finishes is not efficient. The Compute Framework therefore only copies data back lazily, i.e. when it is actually accessed by a host thread.

Figure 3: Architecture of the Compute Framework runtime


In order to detect such accesses to the host array, each array used in the Compute Framework is wrapped in the type Data. The type Data maintains the reference to its host array as well as slices or full copies of it that reside in the memory of Accelerators. When a host thread wants to access the array, it must do so by calling the GetHostArray() method on the type Data, which then makes sure that the most recently modified copy is moved to the host array before returning the reference to it to the caller. The compiler that is generating this call must also make sure that the reference returned is not leaked. Otherwise a host thread could later access the array directly, bypassing the Data and any consistency guarantee would be lost.
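A simplified sketch of the Data type is given below. Only the constructor, GetHostArray() and the lazy copy-back behaviour follow the description in this and the following sections; the field layout and helper names are assumptions made for illustration.

using System;
using System.Collections.Generic;

class Data
{
    readonly Array hostArray;                                     // the wrapped {math} array
    readonly List<DeviceCopy> deviceCopies = new List<DeviceCopy>();

    public Data(Array hostArray) { this.hostArray = hostArray; }

    // called by generated code whenever a host thread needs the raw array
    public Array GetHostArray()
    {
        WaitForTasksUsingThisData();                              // block until running and waiting Tasks are done
        foreach (DeviceCopy copy in deviceCopies)
            if (copy.Modified) copy.CopyBackInto(hostArray);      // lazy copy-back of modified parts
        deviceCopies.Clear();                                     // pessimistic invalidation, see section 4.8
        return hostArray;
    }

    void WaitForTasksUsingThisData() { /* dependency bookkeeping omitted */ }
}

class DeviceCopy
{
    public bool Modified;
    public void CopyBackInto(Array host) { /* an OpenCL read from the device buffer would happen here */ }
}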

4.3 Tasks

Computations are grouped into Compute Tasks. It is typical to have one computation (e.g. a matrix multiplication) per Task, but there can also be more than one.

Compute Tasks are very much like StarPU tasks. They consist of one or more Codelets that enable the computation to run on different hardware types. Each Codelet provides a function that can be invoked to estimate the time it needs when executed on a given Accelerator.

Each task has an Argument List of Datas that the Codelets need to access when run. Each entry in this List is tagged either with Read, Write or ReadWrite to indicate the type of access needed, as well as an optional range that indicates which part of the underlying host array will be accessed.

The Compute Framework assumes that no two references to Data passed in the Argument List refer to the same Data object.

4.4 Dependencies

Unlike StarPU, Compute Tasks do not have explicit dependencies on each other. The ordering is inferred solely from the Argument List. Each Data runs a multiple-reader-one-writer protocol in order to orchestrate the Tasks that want to access it. Tasks that cannot yet access a Data or a part of it are dependent on the Tasks that are currently using the Data or that specific part. The Compute Framework puts a dependent Task into a wait state until all of its dependencies have finished.
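A minimal sketch of how this ordering can be derived from the Argument List alone is shown below (names and structure are illustrative, not the actual runtime code): each Data remembers the last writing Task and the readers since that write, and every new access collects the Tasks it has to wait for.

using System.Collections.Generic;

// placeholder for the real Compute Task type
class ComputeTask { }

// per-Data bookkeeping from which the implicit dependencies are derived
class DataAccessTracker
{
    ComputeTask lastWriter;
    readonly List<ComputeTask> readersSinceLastWrite = new List<ComputeTask>();

    // returns the Tasks the new access has to wait for
    public List<ComputeTask> Register(ComputeTask task, bool writes)
    {
        var waitFor = new List<ComputeTask>();
        if (lastWriter != null) waitFor.Add(lastWriter);          // every access waits for the last writer
        if (writes)
        {
            waitFor.AddRange(readersSinceLastWrite);              // a writer additionally waits for all readers
            readersSinceLastWrite.Clear();
            lastWriter = task;
        }
        else
        {
            readersSinceLastWrite.Add(task);                      // readers may run concurrently with each other
        }
        return waitFor;
    }
}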

4.5 Scheduling and Data Management

When all dependencies of a Task have finished, the Scheduler decides on which Accelerator to run the Task. This decision depends on three factors:

• Some Data arguments might already have data in the separate memory of an Accelerator. If this Accelerator is chosen, the Data does not have to be copied and the time it would take to copy it can be saved.
• An Accelerator can be busy computing another Task and there might also be other Tasks waiting in the queue for this Accelerator.
• The time estimation functions supplied by the Codelets might return varying results for different devices.


[Figure 4 (timeline): three rows labelled GPU 2, GPU 1 and Host. GPU 1 runs Task1(a, b, c), Task3(c, e, f) and Task2(f, g, h); GPU 2 runs Task2(c, d) and Task2(d, g); arrows mark transfers of a, b, c, d, e, g and h between the memories, and the host finally calls h.GetHostArray().]

Figure 4: A scheduling of tasks.

The Scheduler takes all this information into account and finds the Accelerator where the given task is expected to finish first.

In case some Datas referenced in the Task’s Argument List do not already have copies of the requested range on the chosen Accelerator, this data has to be moved there. Any missing Data may trigger a series of transfers from another, or even the same, Accelerator, because the most recently updated data might not be in the host array itself but on an Accelerator, lazily waiting to be copied back.

Figure 4 shows an example scheduling of tasks. The arrows represent data movement between devices. The yellow boxes indicate the size of data that has to be moved. Observe that when the host calls h.GetHostArray(), this thread is blocked until all the tasks that use h and their dependencies are finished.

4.6 Running Tasks

Once all data is present, the Task is submitted to the Accelerator. Each Accelerator maintains its own queue of Tasks that are ready to be started and has a driver that autonomously pops a Task off the queue, finds the codelet suitable for the type of the Accelerator and runs it. It is up to the implementation of the driver whether it allows executing multiple Codelets concurrently on the Accelerator.
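A sketch of such a driver loop is shown below (illustrative only; the actual drivers are part of the Runtime Library and the type names here are placeholders):

using System.Collections.Concurrent;
using System.Threading;

class ReadyComputeTask
{
    // selects the codelet matching the Accelerator type and runs it
    public System.Action<string> RunOn = acceleratorType => { };
}

class AcceleratorDriver
{
    readonly string acceleratorType;                              // e.g. "OpenCL"
    readonly BlockingCollection<ReadyComputeTask> readyQueue = new BlockingCollection<ReadyComputeTask>();

    public AcceleratorDriver(string type) { acceleratorType = type; }

    public void Submit(ReadyComputeTask task) { readyQueue.Add(task); }

    public void Start()
    {
        // one background thread per Accelerator pops ready Tasks and executes the matching codelet
        var worker = new Thread(() =>
        {
            foreach (ReadyComputeTask task in readyQueue.GetConsumingEnumerable())
                task.RunOn(acceleratorType);
        });
        worker.IsBackground = true;
        worker.Start();
    }
}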

4.7 Task Completion

When the Codelet that ran on the Accelerator has finished, the corresponding Task is marked as complete. This allows waiting Tasks that depend on the current Task to continue.

4.8 Concurrency Model

A task that is submitted to the Compute Framework will be prepared and run asynchronously from the thread that created it. This has multiple benefits:

• A host thread can perform computations (that are separate from the Compute Framework) at the same time as a Task is being prepared or is executing on an Accelerator.
• Multiple Tasks can be processed and executed simultaneously, possibly on different Accelerators.

Because the control is returned to the caller immediately after creating the Task, the host thread cannot rely on the task being finished and therefore can also not assume that a Data given as an argument to the Task is already updated.

The host thread has to call the method GetHostArray() in order to access the underlying host array of a Data (see 4.2). This not only allows copying Data back lazily but also enables waiting for all Tasks that are currently using or waiting for the Data. The host thread is blocked until these Tasks have finished and all modified parts of the Data have been copied back into the host array.

Accesses to the underlying host array of a Data are not distinguished further. At runtime there is no way to check whether the host thread modifies the array, therefore the Compute Framework pessimistically assumes that yielding the reference to the host array invalidates all copies of the Data possibly held on Accelerators.

This has one important side effect. When another Task later uses the same Data it will have to be copied from the host array due to the assumption described above. Any memory held on the Accelerators can effectively be freed after it has been copied back to the host array because it will become stale after the host array modification anyway. This allows for easy garbage collection on the Accelerators and lowers memory load significantly at the expense of having to re‐copy the Data later to any Accelerator even if the host thread did not modify the array.


5 Compilation Process

This section presents the work done in the Zonnon compiler. It describes the changes made to the compiler in order to generate code that uses the Compute Framework at runtime for computations.

5.1 Arrays

The Compute Framework manages storage with Data. Array types are converted to the type Data when they meet three conditions:

• The array type is tagged with the {math} modifier. It is up to the programmer to decide whether an array is of mathematical data type or just a normal array.
• The entity being typed by the array is a method parameter or a local variable.
• The element type of the array is a basic type (defined in [2], section 5.3.1).

As the Compute Framework does not allow aliasing the same Data object when giving arguments to a task (see 4.3), the compiler must make sure that there is at most one reference to each Data visible from the host thread. Otherwise, the programmer could accidentally use two aliases to the same Data in the same assignment which would not meet the aliasing requirement of the Compute Framework runtime system.

In [6], assignments to or from mathematical data types are specified to have value semantics, that is, the Data and its underlying array are copied rather than the reference to the Data. This prevents simple aliasing.

When assigning an array to a variable of runtime type Data, the compiler generates a call to the Data constructor, which takes the array as the only argument. If the compiler cannot be sure that there are no other references to the array being wrapped, it must generate code to copy the array first. If, for example, the array being wrapped has been newly created, no references to it can have been leaked, so in this case it is not copied.
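Schematically, the two cases look as follows (Data is the wrapper type from section 4.2; shared stands for an array that may still be referenced elsewhere):

// newly created array: no aliases can exist yet, so it is wrapped directly
float[] fresh = new float[1024];
Data a = new Data(fresh);

// possibly aliased array: a defensive copy is wrapped instead
Data b = new Data((float[])shared.Clone());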

The Data can also be used in contexts where the underlying host array is actually needed, e.g. in computations that are not submitted to the Compute Framework or when accessing a single element. In these cases, the compiler generates a call to the method GetHostArray() as detailed in section 4.2.

5.2 Method Calls

Parameters of array type are eligible for conversion to Data. Although [6] specifies that Mathematical Data Types have value semantics, it is impractical to copy a Data each time it is given to a method, as this includes copying the underlying array as well. Therefore a Data that is given as a parameter is not copied; the reference is copied instead.

Passing references to Datas to the callee can introduce aliasing if the same variable is given twice or more as an actual parameter. The callee will have two formal parameters and will therefore assume the two are distinct when they are actually aliases. Thus, the compiler does not allow using the same Data variable twice or more in the same method call.


5.3 Assignments

The compiler converts mathematical assignments to use the Compute Framework. It does so by generating a class from the assignment and replacing the latter with a call to the constructor of the generated class. For the inner nodes in the AST of the assignment, that is, for the operators, this constructor checks whether the sizes of the arguments used for each operator are valid. The exact check performed depends on the semantics of the actual operator. In case not all of the sizes match, the constructor throws an ArgumentException and no computation will be performed. The constructor also checks whether any indexer used in the assignment is out of range for the variable it is used on and throws an ArgumentOutOfRangeException in that case.

When these checks succeed, the constructor creates a new task from the codelet described below, a time estimation function, and the arguments used in the expression together with their indexers, if any. The indexers tell the Compute Framework which part of an array is needed and must therefore be present on the Accelerator before the codelet runs there. The task is then submitted to the framework runtime.

The generated class contains a codelet method. It is used to start the computations and will be invoked directly by the Compute Framework runtime once all data is ready on the Accelerator that the scheduler chose. The codelet requests the compilation of the kernels that it then uses for the computations; the generation of these kernels is discussed in the next sections. When the kernels are compiled, the codelet assigns arguments to them and starts them in the correct order, specifying earlier started kernels as dependencies for the later ones, thereby ensuring the correct order of execution. The codelet also registers a callback to a cleanup function that disposes the kernels and temporary buffers once all kernels have finished their execution.

5.3.1 Grouping Operators

At compile time, one or more kernels have to be generated that perform the computations of the assignment.

The straight‐forward way is to have one kernel for each operator and use temporary storage for intermediate results. This has the advantage that the set of kernels is static, as the set of operators is known up‐front. The kernels could even be integrated into the Runtime Library and the compiler does not need to touch them at compile time.

But this also has disadvantages. First, the runtime needs to allocate a lot of space on the device to hold the intermediate state. For complex expressions, there might not be enough space on the device to hold all of the intermediate results at one point in time.

Second, for each operator all data has to be loaded from global device memory before the computation and written back to global device memory afterwards. This memory is typically much slower than register accesses within a compute unit. This gets even worse due to the fact that most OpenCL GPU platforms have no caches between the different layers of the memory hierarchy that would help in compensating latencies. Therefore the one-kernel-per-operator strategy is simple to implement but not overly practical in terms of efficiency.


Observe that not all of the implemented operators require their own kernel; some can be combined to form a group of operators that are implemented in the same kernel. Consider an assignment like

A := B + C * D

and assume these are all matrices. The multiplication needs to be performed in a special matrix multiplication kernel that is suited for execution in a SIMD environment. The way it is written makes it impossible to combine it with another matrix multiplication. But the addition can actually be integrated into the matrix multiplication kernel quite easily. Observe, without going into the details of how the kernel works internally¹, that whenever an element of the expression C * D is computed, the corresponding element of B can directly be added and the result written to A. So there is no need to allocate an intermediary buffer, because each element of the intermediary result exists only for a short time in a register. This also works on subexpressions that are purely element wise:

A := B * (C + D)

Here, whenever the multiplication needs an element of the expression C + D, it can be computed on the fly instead of fetching it from an intermediate storage. Whether this actually makes the computation faster depends on how many times each element is accessed from the multiplication and the latency of memory reads/writes. In cases where it makes it slower this is a tradeoff between computational efficiency and storage space needed.

Out of the implemented operators, only the element wise ones can be integrated into another kernel easily. All the combinations of matrix/vector multiplications require their own kernel. In the rest of this document we will call the latter big operators, since they are somewhat bigger or more complex to implement in SIMD.

It remains to decide where to integrate an element wise operation if there are multiple big operators used as subexpressions and/or we are dealing with a big operator that has this element wise operator as a subexpression. Consider for example:

A := B * (C + D * E)

There are two matrix multiplications in this assignment, so we need to generate two kernels in any case. As we have seen above, integrating the plus into one of its arguments is practically cost free. Therefore the compiler decides in this example to generate kernels for the assignments

T := C + D * E
A := B * T

where T will be a buffer that is allocated on the device only while the codelet of this assignment is executed and is destroyed in the cleanup method.

1 The complete template can be found in Appendix A.1.


5.3.2 Assignment Target

The variable used as the target of the assignment might also be part of the expression. Consider an assignment like:

A := A * B

When one element of A * B is computed, it cannot be written to A directly. The original value of this element in A will be needed to compute other elements of the multiplication as well. Therefore the compiler will split this assignment into

T := A * B
A := T

where T will again be an extra allocated buffer.

Not all operators need this special treatment; and it turns out that the element wise operators are again special. Consider a similar but purely element wise assignment:

A := A + B

Here, each element of A is read only once, added to the corresponding element in B and then written to A again. Because each element is written right after the only time it is read, there is no need for temporary storage for the result; thus a single kernel can be generated directly from the above assignment.

5.3.3 Kernel Generation

Knowing how to group the operators into kernels, the compiler can generate the code for each kernel. Each group has at most one big operator that provides the template for the kernel. The complete mapping of big operators to templates can be found in Appendix A.

These templates contain a number of placeholders that need to be replaced in order to get the complete kernel. These placeholders make it possible to customize a kernel in order to integrate the other, element wise, operators of the group and form a compound kernel.

Please consult the kernel generation example in section 5.5 below for further details.

5.3.4 Kernel Reuse

Kernels need to be compiled at runtime for the specific device they are to be executed on. Each compilation takes a small amount of time and also uses space in the memory of the device. It is therefore desirable that kernels are reused. Since the kernels are available as source code at runtime, it makes sense to memoize the kernel compilation function. This way, instead of compiling the same source code again, the previously created program can be used again.
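A minimal sketch of such a memoized compilation function is shown below (the cache and the type names are illustrations, not the actual Runtime Library code):

using System.Collections.Generic;

class CompiledProgram { }

class ComputeDevice
{
    public string Id = "";
    public CompiledProgram CompileProgram(string source)
    {
        // the OpenCL compiler of the device's platform would be invoked here
        return new CompiledProgram();
    }
}

static class KernelCache
{
    static readonly Dictionary<string, CompiledProgram> cache = new Dictionary<string, CompiledProgram>();

    // same kernel source on the same device -> reuse the previously compiled program
    public static CompiledProgram GetOrCompile(ComputeDevice device, string source)
    {
        string key = device.Id + "\n" + source;
        CompiledProgram program;
        if (!cache.TryGetValue(key, out program))
        {
            program = device.CompileProgram(source);
            cache[key] = program;
        }
        return program;
    }
}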

In order to decrease the number of different kernels, the compiler is built such that it generates the same kernel source for operator groups that perform the same operations. As a consequence the names of the variables used in the expression are not used as identifiers in the kernel source. A numbering scheme is used instead.


5.3.5 Assignment Reuse

When multiple assignments are equal except for the names of the used variables, they can be compiled to use the same generated class. However, the impact on performance is not as significant as the benefit of reusing kernels. The driving force behind implementing it was that it reduces the amount of generated code, which is especially helpful when debugging the kernel generation code of the compiler.

5.4 Limitations

Only a subset of the operators and indexers described in the Zonnon Language Report [2] (section 10) can be converted to the Compute Framework. While it would certainly be possible to implement the full range of operators and indexers, the focus of this work was to design a scalable runtime system and mainly to implement a proof of concept.

The supported operators are:

• Element wise addition/subtraction/multiplication/division of arrays of the same rank (operators +, -, .*, ./)
• Multiplication of two matrices (operator *)
• Multiplication of a matrix with a vector (operator *)
• Multiplication of a vector with a matrix (operator *)

Only range indexers are supported. Furthermore, they must comply with the following conditions:

• Each dimension must consist of a single element (constant) or the unrestricted range (“..”).
• The resulting type after applying the indexer must be of array type, i.e. at least one dimension must be the unrestricted range.
• If a variable is used multiple times in the same assignment, the occurrences must either all not be indexed or use indexers that will evaluate to the same values at runtime. Different parts of the same array cannot be accessed in one assignment.

Certain infrastructure in the runtime system is not used to its full extent. This includes the Codelets, as the only Accelerator type that is currently ever targeted by a codelet is an OpenCL device. It would not have made sense to implement Codelet handling at all without having another device type in mind, such as also targeting the CPU. Code generation for the CPU, or rather for .NET, was already implemented in the predecessor of this work, and the intent was to also generate CPU Codelets, but as time was limited this never made it into the implementation stage.

The same holds for the OpenCL Codelet runtime estimation. While the infrastructure that uses these time estimation functions, namely the scheduler, is fully implemented, the function itself does not do any estimation; it simply reports that the codelet takes one second, no matter where it is executed.


5.5 Kernel Generation Example

Consider the following Zonnon code, consisting of local variable definitions and an assignment:

a, c, d : array {math} * of real{32};
B : array {math} *, * of real{32};
…
a := a + B * (c - d);

The big operator in the assignment is a matrix-vector multiplication, so we use the corresponding matrix-vector template:

kernel void Kernel(ulong n, ulong m{arguments})
{
    ulong globalSize = get_global_size(0);
    // for each row
    for(ulong row = get_local_id(0); row < n; row += globalSize)
    {
        // multiply and add corresponding elements
        {type} value = 0;
        for(ulong column = 0; column < m; column += 1)
        {
            #define Access(argument) argument[row * m + column]
            {type} left = {leftExpression};
            #undef Access
            #define Access(argument) argument[column]
            {type} right = {rightExpression};
            #undef Access
            value += left * right;
        }
        // write out result
        #define Access(argument) argument[row]
        {target} = {resultExpression};
        #undef Access
    }
}

We first replace {arguments} with the list of formal parameters that we need. As discussed in section 5.3.4, we do not use the names of the variables used in the assignment but take a numbering scheme to name the parameters. So we replace:

{arguments}  →  , global {type} * global0, global {type} * global1, global {type} * global2, global {type} * global3

(the leading comma attaches the parameters after ulong m in the template)

The way the compiler processes the variables is such that global0 to global3 correspond to a, B, c and d, respectively. As you can see, a replacement can introduce other placeholders that must subsequently also be replaced.

Next, we replace {type} with the type of all the variables used in the assignment. In OpenCL, real{32} is a float:

{type}  →  float

As we know that the target variable, a, is represented by global0, we can replace {target}:

{target}  →  Access(global0)

The Access macro used here has an important function: it separates the variable that is being accessed from the way it is accessed. The latter depends on the template used, while the former depends on the variable used in the assignment. As you can see in the template, Access is defined in all the places where a single element access to an array is needed.

The remaining placeholders {leftExpression}, {rightExpression} and {resultExpression} all have the same goal: they enable the integration of the element wise operators into the kernel. Their names correspond to their position relative to the big operator.

The left child of the big operator consists only of B, and B is represented by global1, thus we can replace:

{leftExpression}  →  Access(global1)

The right child is somewhat more complicated, as c - d is an expression rather than just a variable. We need to “lift” the expression into the Access macro:

{rightExpression}  →  Access(global2) - Access(global3)

The last step is to replace {resultExpression}. This expression is evaluated each time the big operator has calculated one element and needs to store it. The calculated value will be in a local variable named value. In our example, an element of a, here global0, must be added to value:

{resultExpression}  →  Access(global0) + value

After all these replacements, we finally get the fully generated kernel code:

kernel void Kernel(ulong n, ulong m,
                   global float * global0, global float * global1,
                   global float * global2, global float * global3)
{
    ulong globalSize = get_global_size(0);
    // for each row
    for(ulong row = get_local_id(0); row < n; row += globalSize)
    {
        // multiply and add corresponding elements
        float value = 0;
        for(ulong column = 0; column < m; column += 1)
        {
            #define Access(argument) argument[row * m + column]
            float left = Access(global1);
            #undef Access
            #define Access(argument) argument[column]
            float right = Access(global2) - Access(global3);
            #undef Access
            value += left * right;
        }
        // write out result
        #define Access(argument) argument[row]
        Access(global0) = Access(global0) + value;
        #undef Access
    }
}

The OpenCL compiler, which compiles this code at runtime, will first resolve the macro definitions. The resulting code that is actually compiled to device-dependent code is:

kernel void Kernel(ulong n, ulong m,
                   global float * global0, global float * global1,
                   global float * global2, global float * global3)
{
    ulong globalSize = get_global_size(0);
    // for each row
    for(ulong row = get_local_id(0); row < n; row += globalSize)
    {
        // multiply and add corresponding elements
        float value = 0;
        for(ulong column = 0; column < m; column += 1)
        {
            float left = global1[row * m + column];
            float right = global2[column] - global3[column];
            value += left * right;
        }
        // write out result
        global0[row] = global0[row] + value;
    }
}

5.6 CCI Generation Example

The generated kernel code is used by the generated code in the CCI IR. As the CCI IR is very much C#-like, we present the CCI nodes in textual form, although this source code never exists in this form during the compilation process.

All the generated kernels reside as string constants in a special class:

internal sealed class ComputeKernels
{
    public static String Kernel0 = "...";
}

Consider again the same variables and assignment as in the kernel generation example:

a, c, d : array {math} * of real{32};
B : array {math} *, * of real{32};
…
a := a + B * (c - d);

Figure 5 shows the abstract syntax tree of the above assignment.

[Figure 5: AST used in the example. The root := node assigns to a the result of a + node, whose children are a and a * node; the * node multiplies B with a - node, which subtracts d from c.]

In the following we walk through the class that the compiler generates from the assignment. This example class has some fields and delegate definitions which we will discuss when they are used later in the procedures.

internal class Operation0
{
    UInt64[] _kernel0Size;
    Kernel _kernel0;
    EventObject _eventObject0;

    delegate TimeSpan GetOpenCLTimeCallback(OpenCLComputeDevice device);
    delegate Object StartOpenCLCallback(
        OpenCLComputeDevice device,
        Buffer buffer0, Buffer buffer1, Buffer buffer2, Buffer buffer3
    );

The compiler creates a constructor that takes the variables used in the assignment. It checks them for being null. The parameters data0 to data3 correspond to a, B, c and d, respectively.

    public Operation0(Data data0, Data data1, Data data2, Data data3)
    {
        if (data0 == null) { throw new ArgumentNullException(); }
        UInt64[] size0 = data0.GetDimensions();
        if (data1 == null) { throw new ArgumentNullException(); }
        UInt64[] size1 = data1.GetDimensions();
        if (data2 == null) { throw new ArgumentNullException(); }
        UInt64[] size2 = data2.GetDimensions();
        if (data3 == null) { throw new ArgumentNullException(); }
        UInt64[] size3 = data3.GetDimensions();

The compiler has to make sure that the sizes of the arguments fit together according to the operations they are used in. When the compiler goes bottom up, it encounters a minus, the multiplication, a plus and finally the assign. Plus, minus and assign are element‐wise, so the arguments must have the exact same dimensions. For the multiplication, the second dimension of the first argument must be equal to the first (and only) dimension of the second argument. Since the multiplication is a big operation, the compiler stores the size information of the arguments in the field _kernel0Size that it will later use in the codelet.

if (!ComputeHelper.AreDimensionsEqual(size2, size3)) { throw new ArgumentException(); }

if (size1[1] != size2[0]) { throw new ArgumentException(); }

UInt64[] tempsize0 = new UInt64[1];

        tempsize0[0] = size1[0];
        _kernel0Size = new UInt64[2];
        _kernel0Size[0] = size1[0];
        _kernel0Size[1] = size1[1];

if (!ComputeHelper.AreDimensionsEqual(size0, tempsize0)) { throw new ArgumentException(); }

if (!ComputeHelper.AreDimensionsEqual(size0, size0)) { throw new ArgumentException(); }
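The checks above use ComputeHelper.AreDimensionsEqual. The actual helper belongs to the runtime library and is not reproduced here; the following minimal sketch only illustrates the semantics the generated code relies on (equal rank and equal extent in every dimension).

// Minimal sketch of the dimension comparison assumed by the generated checks.
// This is not the actual ComputeHelper implementation from the runtime library.
using System;

internal static class ComputeHelperSketch
{
    public static bool AreDimensionsEqual(UInt64[] left, UInt64[] right)
    {
        if (left.Length != right.Length) { return false; } // ranks must match
        for (int i = 0; i < left.Length; i++)
        {
            if (left[i] != right[i]) { return false; }     // every extent must match
        }
        return true;
    }
}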

Next, the compiler creates the Codelet and submits the Task, using delegates to specify the time-estimation and start functions and supplying the arguments.

        Codelet[] codelets = new Codelet[1];
        codelets[0] = new Codelet(
            OpenCLComputeDevice.Type,
            new GetOpenCLTimeCallback(GetOpenCLTime),
            new StartOpenCLCallback(StartOpenCL)
        );

        DataUse[] dataUses = new DataUse[4];
        dataUses[0] = new DataUse(data0, DataAccess.ReadWrite);
        dataUses[1] = new DataUse(data1, DataAccess.Read);
        dataUses[2] = new DataUse(data2, DataAccess.Read);
        dataUses[3] = new DataUse(data3, DataAccess.Read);


        ComputeManager.SubmitTask(codelets, dataUses);
    }

The compiler generates a simple time estimation function.

    TimeSpan GetOpenCLTime(OpenCLComputeDevice device)
    {
        return TimeSpan.FromSeconds(1);
    }

In the next step the compiler creates the Codelet start method. It takes the Accelerator that the scheduler chose and representations of the arguments on this Accelerator as parameters, and starts the computation. Here the compiler uses the field _kernel0Size again which it assigned in the constructor.

    EventObject StartOpenCL(
        OpenCLComputeDevice device,
        Buffer buffer0, Buffer buffer1, Buffer buffer2, Buffer buffer3
    )
    {
        UInt64[] kernel0GlobalRange =
            device.GetMatrixVectorMultiplicationGlobalSize(_kernel0Size);

        _kernel0 = KernelManager.GetKernelForProgram(ComputeKernels.Kernel0);
        _kernel0.SetValueArgument(0, _kernel0Size[0]);
        _kernel0.SetValueArgument(1, _kernel0Size[1]);
        _kernel0.SetGlobalArgument(2, buffer0);
        _kernel0.SetGlobalArgument(3, buffer1);
        _kernel0.SetGlobalArgument(4, buffer2);
        _kernel0.SetGlobalArgument(5, buffer3);

        EventObject[] kernel0predecessors = new EventObject[0];
        _eventObject0 = device.CommandQueue.StartKernel(
            _kernel0, kernel0GlobalRange, null, kernel0predecessors
        );
        device.CommandQueue.Flush();
        _eventObject0.RegisterCompletionCallback(
            new EventObjectCompletionCallback(CompleteOpenCL)
        );
        return _eventObject0;
    }

The compiler must keep the reference to the kernel that it obtains from the KernelManager in a field in order to be able to dispose it properly, together with the event object that represents a handle to the started kernel.

    void CompleteOpenCL(EventObject _)
    {
        _kernel0.Dispose();
        _eventObject0.Dispose();
    }
} // end of class Operation0

As a final step, the compiler can now convert the assignment to use the generated class. It also changes the types of the variables to Data.

Data a;
Data c;
Data d;
Data B;
...
new Operation0(a, B, c, d);

And with this last step the compiler is finished converting the assignment to the Compute Framework.

It might look strange that the constructed Operation0 object is not assigned to any variable. The Task that is created by the constructor references the instance of the generated class via the Codelet start delegate, so the garbage collector cannot reclaim this object before the computation has completed and the Task has been removed from the runtime system.
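This can be illustrated with a small, self-contained .NET example that is independent of the Compute Framework: a delegate that is bound to an instance method keeps a reference to its target object, so the object stays reachable for as long as the delegate itself is reachable.

// Stand-alone illustration (not Compute Framework code): a delegate bound to an
// instance method keeps its target object reachable for the garbage collector.
using System;

class Worker
{
    public int Run() { return 42; }
}

static class DelegateLifetimeDemo
{
    static void Main()
    {
        Func<int> callback = new Worker().Run;      // the Worker instance is stored nowhere else
        GC.Collect();                               // the instance survives: it is the delegate's target
        Console.WriteLine(callback.Target != null); // True
        Console.WriteLine(callback());              // 42
    }
}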


6 Experimental Results

To evaluate the benefits of using the Compute Framework, two computations were measured: a general matrix multiplication and a solution to the parallel factor analysis (PARAFAC, [14]) problem, specifically the alternating least squares algorithm.

These tests were performed on the following hardware:

- Intel® Core™ i7 930 CPU, four cores with hyper-threading at 2.8 GHz each
- 2 x ATI Radeon™ HD 6950 GPU with 880 MHz core clock and 1375 MHz memory clock. The BIOS of both GPUs was flashed to the HD 6970 firmware.

In all tests, these configurations are measured:

- Zonnon-CPU: implementation in Zonnon code, compiled without the /compute switch
- Zonnon-Compute: implementation in Zonnon code, compiled with the /compute switch
- Matlab: a comparable Matlab implementation running on the CPU

While on paper the performance of the GPUs is several times higher than that of the CPU, this performance can only be reached by highly data-parallel algorithms whose problem size amortizes the high cost of moving the data to GPU memory.
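A rough back-of-the-envelope calculation makes this concrete for the matrix multiplication benchmark below. Assuming the two single-precision n x n operands are transferred to the device and the result is transferred back, the ratio of computation to data movement is

\frac{2 n^{3}\ \text{flop}}{3 \cdot 4 n^{2}\ \text{bytes}} = \frac{n}{6}\ \frac{\text{flop}}{\text{byte}},

so for n = 16 only about 2.7 floating-point operations are performed per transferred byte, whereas for n = 8192 the ratio is roughly 1365. Only the large problems can hide the transfer cost behind useful work.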

A nice discussion of the topic can be found in a paper by Intel [15], where multiple algorithms running on optimized CPU and optimized GPU code are compared. The speedups described in the paper, which are around 2.5, are far lower than one would expect from the raw number of floating-point operations per second the devices are capable of.

As Matlab uses the Intel Math Kernel Library [16], a highly optimized library for Intel CPUs, we expect only a minor speedup, if any, from our not fully optimized GPU code.

Each measurement was repeated 5 times and the 3 lowest times were kept. The value presented in each table is the average of these 3 best times; the best and worst of them lie within 7% of the worst of the three. The source code that was used to obtain the results can be found in Appendix B.
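For illustration, the reported value can be derived from five raw timings as follows (the numbers in the array are made up):

// Illustration of how the reported value is derived from five raw timings (made-up numbers).
using System;
using System.Linq;

static class TimingReport
{
    static void Main()
    {
        double[] runs = { 105.0, 101.0, 118.0, 99.0, 131.0 };    // five raw measurements [ms]
        double[] best3 = runs.OrderBy(t => t).Take(3).ToArray(); // keep the 3 lowest times
        double reported = best3.Average();                       // value shown in the tables
        bool stable = (best3[2] - best3[0]) / best3[2] <= 0.07;  // best and worst within 7 %
        Console.WriteLine(reported + " ms, stable: " + stable);
    }
}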

6.1 Matrix Multiplication

General matrix multiplication with square matrices (n = 16):

Configuration     Runtime [µs]
Zonnon-CPU        193
Zonnon-Compute    2145
Matlab            70


General matrix multiplication with square matrices (n = 1024):

Configuration     Runtime [ms]
Zonnon-CPU        56925
Zonnon-Compute    113
Matlab            26

General matrix multiplication with square matrices (n = 8192):

Configuration     Runtime [ms]
Zonnon-CPU        > 600000 (10 min)
Zonnon-Compute    54247
Matlab            16370

6.2 Alternating Least Squares (ALS)

A tensor (multi-dimensional array) can be decomposed into factors; this is known as the PARAFAC problem. The alternating least squares algorithm can be used to decompose a three-dimensional array X into matrices A, B and C and weights W such that it satisfies

x_{ijk} = \sum_{r} a_{ir}\, b_{jr}\, c_{kr}\, w_{r} + \epsilon_{ijk}

where \epsilon_{ijk} represents the error of the decomposition. The algorithm iteratively minimizes this error. Iteration continues until the error is sufficiently small, where the required accuracy depends on the later use of the decomposed values.

The algorithm works by fixing two "dimensions", i.e. two of A, B and C, and then calculating the third one together with the weights. Because the algorithm is iterative, the measured values represent the time needed for a single iteration, that is, successively computing A, B and C once each while fixing the others.
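Concretely, one step of the implementation in Appendix B.2 (where A, B and C are called F1, F2 and F3) updates the first factor by solving the linear system

A_{1} = (F_{2}^{T} F_{2}) \circ (F_{3}^{T} F_{3}), \qquad
B_{1} = \sum_{k=1}^{N_3} X_{:,:,k}\, F_{2}\, \operatorname{diag}(F_{3}(k,:)), \qquad
F_{1} \leftarrow (A_{1} \backslash B_{1}^{T})^{T},

where \circ denotes the element-wise product. The columns of F_1 are then normalized and their norms become the weights W; the updates of F_2 and F_3 are analogous.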

The problem size can be characterized by four numbers N1, N2, N3 and r such that dim(X) = [N1, N2, N3], dim(A) = [N1, r], dim(B) = [N2, r], dim(C) = [N3, r] and dim(W) = [r].

For N1 = 32, N2 = 33, N3 = 34 and r = 512, the measured iteration times are:

Configuration     Runtime [ms]
Zonnon-CPU        68880
Zonnon-Compute    9986
Matlab            2309

To be able to compare Zonnon-CPU and Zonnon-Compute in more detail, that is, to see which parts dominate the speedup, the Zonnon code was instrumented to obtain finer-grained measurements (using in both cases only the numbers of the fastest execution):


Region               Zonnon-CPU [ms]   Zonnon-Compute [ms]   Rel. change
total                68789             9972                  - 86%
+ error              6                 229                   + 3710%
+ A                  2406              550                   - 77%
+ B                  45103             1393                  - 96%
+ diag               30                1286                  + 4170%
+ solve              6789              6802                  + 2%
+ normalizeVectors   7                 8                     ≈ 0
+ compose            14471             334                   - 96%
+ scaleVectors       2                 2                     0
+ diag               9                 282                   + 3030%

6.3 Discussion

For the larger matrix multiplications and for ALS, the introduction of the Compute Framework into Zonnon led to substantial speedups: roughly 500 for the matrix multiplication with n = 1024 and roughly 7 for ALS. For the small matrix multiplication (n = 16) the overhead of moving the data to and from the GPUs dominates, resulting in a slowdown of roughly 11.

When we compare the Zonnon results with those from Matlab, we immediately see that Matlab is faster in all test cases, by a factor of at least 3. It is indeed hard to keep up with Matlab and therefore with the Intel Math Kernel Library. There is still some potential to speed up the Zonnon computations: the OpenCL kernel templates we use are not heavily optimized, and the way the Compute Framework manages Tasks and the dependencies between them causes overhead that could be reduced.

We instrumented the Zonnon code in order to see which parts of the ALS program can be optimized by using the Compute Framework. Please consult Appendix B.2 to see which computations are performed in the respective parts.

The sections that obtain the largest speedup are clearly A, B and compose. Each of these sections repeatedly computes matrix multiplications. In Zonnon-CPU these matrix multiplications are implemented by a very simple algorithm that is neither cache-aware nor works block-wise. If an optimized algorithm were used instead, we suspect that the CPU version could even run faster than Zonnon-Compute.
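To illustrate what a cache-aware, block-wise CPU implementation means, the following C# sketch multiplies two n x n single-precision matrices tile by tile. It is only an illustration; it is neither the code the Zonnon compiler generates nor tuned for a particular cache size.

// Sketch of a cache-blocked (tiled) matrix multiplication C = A * B for n x n
// single-precision matrices stored row-major in flat arrays.
using System;

static class BlockedMatMul
{
    public static void Multiply(float[] A, float[] B, float[] C, int n)
    {
        const int block = 64; // tile size; a tuned implementation would match this to the cache
        Array.Clear(C, 0, C.Length);
        for (int i0 = 0; i0 < n; i0 += block)
        for (int k0 = 0; k0 < n; k0 += block)
        for (int j0 = 0; j0 < n; j0 += block)
        {
            // multiply one tile of A with one tile of B into the corresponding tile of C
            int iMax = Math.Min(i0 + block, n);
            int kMax = Math.Min(k0 + block, n);
            int jMax = Math.Min(j0 + block, n);
            for (int i = i0; i < iMax; i++)
            for (int k = k0; k < kMax; k++)
            {
                float a = A[i * n + k];
                for (int j = j0; j < jMax; j++)
                {
                    C[i * n + j] += a * B[k * n + j];
                }
            }
        }
    }
}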

The sections error and diag both experience a massive slowdown. These are the sections that access individual elements of the arrays without going through the Compute Framework. The arrays are wrapped in Data objects, and before the underlying host array can be read, all copies possibly held on Accelerators must be fetched back. To check whether such copies exist, the Data must be synchronized. This synchronization incurs some overhead, and when elements are accessed repeatedly, the overhead accumulates to the point where it significantly slows down these sections.
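The effect can be sketched as follows. The names DataSketch, Synchronize and hostArray are placeholders chosen for this illustration; they are not the actual members of the Data class.

// Hypothetical sketch of why single-element accesses on a Data wrapper are expensive.
class DataSketch
{
    readonly float[] hostArray;   // host-side copy of the wrapped array

    public DataSketch(float[] host) { hostArray = host; }

    public float GetElement(ulong index)
    {
        Synchronize();                // fetch back any copy held on an Accelerator first
        return hostArray[(int)index]; // only now is the host copy guaranteed to be up to date
    }

    void Synchronize()
    {
        // placeholder: in the real Data class this waits for dependent Tasks and
        // transfers device buffers back to the host if necessary
    }
}

Repeated element reads in the error and diag sections go through such a path again and again, so the per-access overhead accumulates into the slowdowns shown in the table above.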


7 Conclusion

The goal of this thesis was to speed up the mathematical computations in Zonnon. Based on the numbers above, we can say that this goal was achieved. When the operations reach a certain size, it makes sense to compile with the /compute switch.

The implemented Compute Framework scales up to a complicated algorithm like ALS, and we see no indication that a scalability limit has been reached. The generated kernels are already optimized by integrating the element-wise operators into the big operators. The CCI and OpenCL code generated by the compiler is readable and understandable.

With the Compute Framework we implemented a proof of concept of how the complexity of addressing specialized hardware can be hidden inside the compiler. We showed how the compiler can generate optimized, composed kernels directly from expressions.

7.1 Future Work

The Compute Framework and its integration into the Zonnon language can be improved in many ways. The following list contains areas that can be part of future work:

- Implement more/all of the mathematical operators: Only a small subset of the math operators was implemented. Extending this subset enables running many more algorithms using the Compute Framework.
- Allow different array base types: Integrating base types other than real{32} for arrays should be relatively straightforward for primitive types. When allowing object types to be used with the framework, care must be taken with user-defined operators; they might have to be converted to OpenCL code as well.
- Integrate .NET computations: When compiling without the /compute switch, the computations are generated to run on .NET. This facility can be used to integrate .NET computations into the Compute Framework, such that each Task is composed of two (or even more) Codelets.
- Implement time estimation: In order for the scheduling to be more efficient, codelets should include useful time estimation functions (see the sketch after this list).
- Optimize OpenCL kernels: There is some potential in making the Compute Framework faster by optimizing the OpenCL templates used to generate the kernels.
- Revise the runtime model: Lower the overhead of the framework runtime by measuring and optimizing the implementation.
- Incorporate constant arrays: Arrays can have constant sizes. If these are small enough, the compiler can decide not to convert operators that use them to the Compute Framework.
- Optimize scheduling: The compiler could analyze the control flow of a procedure and generate scheduling hints for the runtime, e.g. tasks that should be scheduled on the same device because other tasks later use their data.
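As an illustration of the time-estimation item, the constant estimate generated today (Section 5.6) could be replaced by a simple size-based estimate. The throughput value below is an assumed number; a real implementation would measure it per device.

// Sketch of a size-based time estimation for the matrix-vector kernel of the example class.
// The throughput constant is an assumption and should be calibrated per device.
TimeSpan GetOpenCLTime(OpenCLComputeDevice device)
{
    double operations = 2.0 * _kernel0Size[0] * _kernel0Size[1]; // one multiply-add per matrix element
    double assumedFlopsPerSecond = 50e9;                         // placeholder device throughput
    return TimeSpan.FromSeconds(operations / assumedFlopsPerSecond);
}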


7.2 Conclusive Statement

Working with the Zonnon language was a very interesting experience. The compiler differs from traditional compilers in that it uses CCI to do the heavy lifting of producing .NET code. The way this compiler is set up and implemented is easy to understand and therefore also easy to modify. I am happy to have contributed to Zonnon.


8 Bibliography

1. Accelerator. [Online] Microsoft Research. http://research.microsoft.com/en-us/projects/Accelerator/.
2. Gutknecht, Jürg. Zonnon Language Report. [Online] 2009. http://www.zonnon.ethz.ch/language/report.html.
3. Native Systems Group. [Online] ETH Zurich. http://nativesystems.inf.ethz.ch/.
4. Common Language Infrastructure (CLI). ECMA-335. [Online] http://www.ecma-international.org/publications/standards/Ecma-335.htm.
5. Common Compiler Infrastructure. [Online] Microsoft Research. http://research.microsoft.com/en-us/projects/cci/.
6. Gutknecht, Jürg, et al. Implementing Mathematical Data Types on Top of .NET. Brazilian Symposium on Programming Languages, 2010.
7. OpenCL - The open standard for parallel programming of heterogeneous systems. [Online] http://www.khronos.org/opencl/.
8. Khronos Group. [Online] http://www.khronos.org/.
9. NVidia Developer Zone: CUDA. [Online] http://developer.nvidia.com/category/zone/cuda-zone.
10. AMD Accelerated Parallel Processing (APP) (formerly ATI Stream). [Online] http://www.amd.com/stream.
11. Augonnet, Cédric, et al. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, Vol. 23, No. 2, pp. 187-198. John Wiley & Sons, Ltd., February 2011.
12. Augonnet, Cédric, et al. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In: Sips, Henk, Epema, Dick and Lin, Hai-Xiang (Eds.), Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 863-874.
13. Topcuoglu, H., Hariri, S. and Wu, Min-You. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 260-274, March 2002.
14. Beckmann, Christian. Parallel Factor Analysis (PARAFAC). [Online] http://www.fmrib.ox.ac.uk/analysis/techrep/tr04cb1/tr04cb1/node2.html.
15. Lee, Victor W., et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. Proceedings of the 37th Annual International Symposium on Computer Architecture, Saint-Malo, France, ACM, 2010, pp. 451-460.
16. Intel Math Kernel Library. [Online] http://software.intel.com/en-us/articles/intel-mkl/.


A Big Operator Kernel Templates

A.1 Matrix‐Matrix Multiplication

#define Element(matrix, width, n, m) (matrix)[(n) * (width) + (m)]

kernel void Kernel(
    ulong n, ulong m, ulong p,
    local {type} * localA, local {type} * localB{arguments}
)
{
    ulong localN = get_local_id(0);
    ulong localP = get_local_id(1);
    ulong blockSize = get_local_size(0);

    // each thread group may process 0..n (output) blocks,
    // depending on the problem size and global size
    for (ulong blockN = get_global_id(0) - localN; blockN < n; blockN += get_global_size(0))
    {
        for (ulong blockP = get_global_id(1) - localP; blockP < p; blockP += get_global_size(1))
        {
            ulong globalN = blockN + localN;
            ulong globalP = blockP + localP;
            {type} value = 0;

            // loop through (source) blocks
            for (ulong blockM = 0; blockM < m; blockM += blockSize)
            {
                // load the corresponding array elements
                if (globalN < n && (blockM + localP) < m)
                {
                    #define Access(matrix) Element(matrix, m, globalN, blockM + localP)
                    Element(localA, blockSize, localN, localP) = {leftExpression};
                    #undef Access
                }
                else
                {
                    Element(localA, blockSize, localN, localP) = 0;
                }
                if ((blockM + localN) < m && globalP < p)
                {
                    #define Access(matrix) Element(matrix, p, blockM + localN, globalP)
                    Element(localB, blockSize, localN, localP) = {rightExpression};
                    #undef Access
                }
                else
                {
                    Element(localB, blockSize, localN, localP) = 0;
                }
                // let all threads write an element
                barrier(CLK_LOCAL_MEM_FENCE);
                // sum up
                for (ulong localM = 0; localM < blockSize; localM++)
                {
                    value += Element(localA, blockSize, localN, localM)
                           * Element(localB, blockSize, localM, localP);
                }
                // let all threads read the elements
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            // write out sum
            if (globalN < n && globalP < p)
            {
                #define Access(matrix) Element(matrix, p, globalN, globalP)
                {target} = {resultExpression};
                #undef Access
            }
        }
    }
}


A.2 Matrix-Vector Multiplication

kernel void Kernel(ulong n, ulong m{arguments})
{
    ulong globalSize = get_global_size(0);
    // for each row
    for(ulong row = get_local_id(0); row < n; row += globalSize)
    {
        // multiply and add corresponding elements
        {type} value = 0;
        for(ulong column = 0; column < m; column += 1)
        {
            #define Access(argument) argument[row * m + column]
            {type} left = {leftExpression};
            #undef Access
            #define Access(argument) argument[column]
            {type} right = {rightExpression};
            #undef Access
            value += left * right;
        }
        #define Access(argument) argument[row]
        {target} = {resultExpression};
        #undef Access
    }
}

A.3 Vector-Matrix Multiplication

kernel void Kernel(ulong m, ulong p{arguments})
{
    ulong globalSize = get_global_size(0);
    // for each column
    for(ulong column = get_local_id(0); column < p; column += globalSize)
    {
        // multiply and add corresponding elements
        {type} value = 0;
        for(ulong row = 0; row < m; row += 1)
        {
            #define Access(argument) argument[row]
            {type} left = {leftExpression};
            #undef Access
            #define Access(argument) argument[row * p + column]
            {type} right = {rightExpression};
            #undef Access
            value += left * right;
        }
        #define Access(argument) argument[column]
        {target} = {resultExpression};
        #undef Access
    }
}


A.4 Element-Wise Copy

kernel void Kernel(ulong count{arguments})
{
    ulong size = get_local_size(0);
    ulong index = get_local_id(0);

    while (index < count)
    {
        #define Access(arr) arr[index]
        {target} = {resultExpression};
        #undef Access
        index += size;
    }
}


B Test Source Code

B.1 Matrix Multiplication

Zonnon:

module MatrixMultiplication;
import System.Diagnostics.Stopwatch as Stopwatch;

type Matrix = array {math} *, * of real {32};

procedure {public} compute;
var
    n : integer;
    A, B, C, D, E, F : Matrix;
    value : real {32};
    watch : Stopwatch;
begin
    watch := new Stopwatch;
    n := 1024;

    (* do not measure compute framework starting overhead *)
    A := new Matrix(1, 1);
    B := new Matrix(1, 1);
    C := A * B;
    value := C[0, 0];

    D := new Matrix(n, n);
    E := new Matrix(n, n);
    watch.Start;
    F := D * E;
    value := F[0, 0];
    watch.Stop();
    writeln(watch.Elapsed.ToString);
    readln;
end compute;

begin
end MatrixMultiplication.

Matlab:

n = 1024;
A = single(rand([n,n]));
B = single(rand([n,n]));
A(1,1)
B(1,1)
tic;
C = A * B;
C(1,1)
toc


B.2 Alternating Least Squares

Zonnon:

module AlternateLeastSquares;
import
    System.Random as Random,
    System.Math as Math,
    Measurement.Measure as Measure,
    AlternateLeastSquaresHost as Host;

type Scalar = real{32};
type Vector = array {math} * of real {32};
type Matrix = array {math} *, * of real{32};
type Tensor = array {math} *, *, * of real {32};

procedure {public} Solution(
    x : array *, *, * of real {32};
    var F1, F2, F3 : array *, * of real {32};
    var w : array * of real {32};
    r, maxit : integer;
    random : Random
);
var
    N1, N2, N3 : integer;
    mathx : Tensor;
    mathF1, mathF2, mathF3 : Matrix;
    mathw : Vector;
    i, j : integer;
begin
    N1 := len(x, 0);
    N2 := len(x, 1);
    N3 := len(x, 2);

F1 := new Matrix(N1, r); F2 := new Matrix(N2, r); F3 := new Matrix(N3, r);

    for i := 0 to r - 1 do
        for j := 0 to N1 - 1 do
            F1[j, i] := real{32}(random.NextDouble());
        end;

        for j := 0 to N2 - 1 do
            F2[j, i] := real{32}(random.NextDouble());
        end;

        for j := 0 to N3 - 1 do
            F3[j, i] := real{32}(random.NextDouble());
        end;
    end;

mathx := x; mathF1 := F1; mathF2 := F2; mathF3 := F3; mathw := new Vector(r);

SolutionInternal(mathx, mathF1, mathF2, mathF3, mathw, maxit);

F1 := mathF1; F2 := mathF2; F3 := mathF3; w := mathw; end Solution;

procedure SolutionInternal(
    x : Tensor;                 (* [N1, N2, N3] *)
    var F1, F2, F3 : Matrix;    (* [N1, r], [N2, r], [N3, r] *)
    w : Vector;                 (* [r] *)
    maxit : integer
);
var
    N1, N2, N3, r : integer;
    random : Random;
    i, it : integer;
    xnrm, relResidNrm : real {32};
    A1, B1, A2, B2, A3, B3, transpose, transposed, diagged : Matrix;
    F1t, F2t, F3t : Matrix;
    x1, err : Tensor;
    d : Vector;
    tmp : real{32};
begin

measure := new Measure;

N1 := len(F1, 0); N2 := len(F2, 0); N3 := len(F3, 0); r := len(w, 0);

xnrm := real{32}(Math.Sqrt(x +* x)); x1 := new Tensor(N1, N2, N3);

it := 0;

    loop
        measure.BeginPart("error");

err := x - x1;

relResidNrm := real{32}(real{32}(Math.Sqrt(err +* err)) / xnrm); writeln(relResidNrm);

it := it + 1;

measure.EndPart();

if (it > maxit) or (relResidNrm < 0.001) then exit; end;

measure.BeginPart("A"); F2t := !F2; F3t := !F3; A1 := (F2t * F2) .* (F3t * F3); tmp := A1[0, 0]; measure.EndPart();

        measure.BeginPart("B");
        B1 := new Matrix(N1, r);
        for i := 0 to N3 - 1 do
            d := F3[i, ..];
            diagged := diag(d);
            B1 := B1 + x[.., .., i] * F2 * diagged;
        end;
        tmp := B1[0, 0];
        measure.EndPart();

F1 := linSolve(A1, B1);

measure.BeginPart("normalizeVectors");


w := Host.normalizeVectors(F1); measure.EndPart();

measure.BeginPart("A"); F1t := !F1; A2 := (F3t * F3) .* (F1t * F1); tmp := A2[0,0]; measure.EndPart();

        measure.BeginPart("B");
        B2 := new Matrix(N2, r);
        for i := 0 to N1 - 1 do
            d := F1[i, ..];
            diagged := diag(d);
            B2 := B2 + (x[i, .., ..] * F3) * diagged;
        end;
        tmp := B2[0, 0];
        measure.EndPart();

F2 := linSolve(A2, B2);

measure.BeginPart("normalizeVectors"); w := Host.normalizeVectors(F2); measure.EndPart();

measure.BeginPart("A"); F1t := !F1; F2t := !F2; A3 := (F1t * F1) .* (F2t * F2); tmp := A3[0,0]; measure.EndPart();

        measure.BeginPart("B");
        B3 := new Matrix(N3, r);
        for i := 0 to N2 - 1 do
            d := F2[i, ..];
            diagged := diag(d);
            transpose := x[.., i, ..];
            transposed := !transpose;
            B3 := B3 + (transposed * F1) * diagged;
        end;
        tmp := B3[0, 0];
        measure.EndPart();

F3 := linSolve(A3, B3);

measure.BeginPart("normalizeVectors"); w := Host.normalizeVectors(F3); measure.EndPart();

x1 := cp3d_compose(F1, F2, F3, w);

measure.Output();

    end;
end SolutionInternal;

procedure linSolve(
    A : Matrix;    (* [n, n] *)
    B : Matrix     (* [m, n] *)
) : Matrix;        (* [m, n] *)
var
    result : Matrix;
begin
    measure.BeginPart("solve");


    result := real{32}(!(A \ !B));
    measure.EndPart();
    return result;
end linSolve;

procedure cp3d_compose(
    F1, F2, F3 : Matrix;    (* [N1, r], [N2, r], [N3, r] *)
    w : Vector              (* [r] *)
) : Tensor                  (* [N1, N2, N3] *);
var
    N1, N2, N3 : integer;
    x : Tensor;
    T : Matrix;
    d : Vector;
    i : integer;
    diagged : Matrix;
    F3t : Matrix;
    tmp : real{32};
begin
    measure.BeginPart("compose");
    N1 := len(F1, 0);
    N2 := len(F2, 0);
    N3 := len(F3, 0);

    x := new Tensor(N1, N2, N3);
    measure.BeginPart("scaleVectors");
    T := Host.scaleVectors(F1, w);
    measure.EndPart();
    for i := 0 to N1 - 1 do
        d := T[i, ..];
        diagged := diag(d);
        F3t := !F3;
        x[i, .., ..] := F2 * diagged * F3t;
    end;
    tmp := x[0, 0, 0];
    measure.EndPart();

    return x;
end cp3d_compose;

procedure diag(
    V : Vector
) : Matrix;
var
    value : Scalar;
    result : Matrix;
    i, n : integer;
begin
    measure.BeginPart("diag");
    n := len(V, 0);
    result := new Matrix(n, n);
    for i := 0 to n - 1 do
        value := V[i];
        result[i, i] := value;
    end;
    measure.EndPart();
    return result;
end diag;

var
    measure : Measure;

begin
end AlternateLeastSquares.


Matlab: cp3d_als.m

%
% Canonical decomposition for 3D tensors using alternating least squares
%
% [F,w] = cp3d_als(x,r,mintol,maxit)
%
% INPUT:
% x - input 3D tensor [N1,N2,N3]
% r - desired rank of the decomposition (number of the additive terms in the decomposition)
%
% OUTPUT:
% F - decomposition factor matrices {[N1,r], [N2,r], [N3,r]}
% w - weights attached to each additive term of the decomposition [r]
%
% Copyright (c) 2011 by Alexey Morozov, Physics in Medicine Group, University Hospital of Basel
%
function [F,w] = cp3d_als(x,r,mintol,maxit)

if nargin < 3
    mintol = 1e-4;
end
if nargin < 4
    maxit = 100;
end;

w = ones(r,1);

% random initialization
F = cell(3,1);
F{1} = single(rand(size(x,1),r));
F{2} = single(rand(size(x,2),r));
F{3} = single(rand(size(x,3),r));

% Frobenius norm of the tensor
xnrm = sqrt(sum(x(:).*x(:)));

relResidNrm = 2;
newRelResidNrm = 1;
it = 0;
while (it < maxit) && (newRelResidNrm < relResidNrm - 0.0001)
    tic;
    % solve alternatively for each factor while fixing the others
    for dim = 1:3
        [A,B] = prepareLinearSystem(dim,x,F);
        F{dim} = linSolve(A,B);

        % normalize all vectors of the factor matrix
        [F{dim},w] = normalizeVectors(F{dim});
    end;

    % compute relative norm of the error
    x1 = cp3d_compose(F,w);
    err = x - x1;
    relResidNrm = newRelResidNrm;
    newRelResidNrm = sqrt(sum(err(:).*err(:))) / xnrm;

    it = it + 1;

    fprintf('it %d: relResidNrm=%f\n',it,newRelResidNrm);
    toc
end
end

%
% Prepare linear system to solve for a factor with number dim
%
% INPUT:
% dim - factor number [1]
% x   - input tensor [N1,N2,N3]
% F   - factor matrices {[N1,r], [N2,r], [N3,r]}
%
% OUTPUT:
% A - system matrix [r,r] (symmetric, positive definite)
% B - multiple right hand sides [Ndim,r]
%
function [A,B] = prepareLinearSystem(dim,x,F)

r = size(F{1},2);
B = zeros(size(x,dim),r);

switch dim

    case 1

        % system matrix
        A = (F{2}'*F{2}) .* (F{3}'*F{3});

        % multiple right hand sides
        for k = 1:size(x,3)
            B = B + squeeze(x(:,:,k)) * F{2} * diag(F{3}(k,:));
        end

    case 2

        % system matrix
        A = (F{3}'*F{3}) .* (F{1}'*F{1});

        % multiple right hand sides
        for k = 1:size(x,1)
            B = B + squeeze(x(k,:,:)) * F{3} * diag(F{1}(k,:));
        end

    case 3

        % system matrix
        A = (F{1}'*F{1}) .* (F{2}'*F{2});

        % multiple right hand sides
        for k = 1:size(x,2)
            B = B + squeeze(x(:,k,:))' * F{1} * diag(F{2}(k,:));
        end
end

%% here 'squeeze' is used to remove singleton dimensions; for example
%% array of size [32,1,34] is 'squeezed' to array of size [32,34]

end

%
% Perform multiple linear solves
%
% INPUT:
% A - system matrix [N,N]
% B - multiple right hand sides [M,N]
%
% OUTPUT:
% C - solution array [M,N]
%
function C = linSolve(A,B)

% Do C = (A\B')'; % in case of general linear solve availability
% otherwise perform multiple iterative solves (e.g. conjugate gradient)
% NOTE that iterative solves can be done in PARALLEL!

C = zeros(size(B));
tol = 1e-8;
maxit = size(A,1); % maximal number of iterations for Krylov solvers with convergence guarantee

for k = 1:size(B,1)
    b = B(k,:)';
    [c,flag] = pcg(A,b,tol,maxit);
    assert(flag == 0,'CG failed to converge!');
    C(k,:) = c';
end
end

%
% Normalize vectors of a factor matrix
%
% INPUT:
% F - factor matrix [N,r]
%
% OUTPUT:
% F - normalized matrix [N,r]
% w - normalization coefficients for each vector [r,1]
%
function [F,w] = normalizeVectors(F)
r = size(F,2);
w = zeros(r,1);
for k = 1:r
    w(k) = sqrt(sum(F(:,k).^2)); % use Frobenius norm
    F(:,k) = F(:,k) ./ w(k);
end
end

cp3d_compose.m

%
% Compose factors of 3D Canonical decomposition to a tensor
%
% x = cp3d_compose(F,w)
%
% INPUT:
% F - decomposition factor matrices {[N1,r], [N2,r], [N3,r]}
% w - weights attached to each additive term of the decomposition [r]
%
% OUTPUT:
% x - composed tensor [N1,N2,N3]
%
% Copyright (c) 2011 by Alexey Morozov, Physics in Medicine Group, University Hospital of Basel
%
function x = cp3d_compose(F,w)
x = single(zeros(size(F{1},1), size(F{2},1), size(F{3},1)));
if nargin < 2 % no weight specified, use w=1
    for k = 1:size(F{1},1)
        x(k,:,:) = F{2} * diag(F{1}(k,:)) * F{3}';
    end
else
    T = scaleVectors(F{1},w); % multiply each vector of the first factor matrix by the weights
    for k = 1:size(F{1},1)
        x(k,:,:) = F{2} * diag(T(k,:)) * F{3}';
    end
end;
end

%
% Scale vectors of a factor matrix
%
% INPUT:
% F - factor matrix [N,r]
%
% OUTPUT:
% F - scaled matrix [N,r]
% w - weight coefficients for each vector [r,1]
%
function F = scaleVectors(F,w)
r = size(F,2);
for k = 1:r
    F(:,k) = F(:,k) .* w(k);
end
end

cp3d_test.m

% input tensor
x = single(rand(32,33,34));

% run the CP ALS algorithm
[F,w] = cp3d_als(x,512,1e-3,100);

% construct a tensor from the obtained decomposition
x1 = cp3d_compose(F,w);

% simple 1D visualization of the obtained result compared with the original data
plot(x(:))
hold on
plot(x1(:),'r')
plot(x(:)-x1(:),'k')


C Zonnon EBNF

In this appendix, the complete Zonnon EBNF is presented. Modifications that stem from the introduction of the mathematical data types are highlighted.

// 1. Program and program units CompilationUnit = { ProgramUnit "." }. ProgramUnit = ( Module | Definition | Implementation | Object).

// 2. Modules Module = module [ Modifiers ] ModuleName [ ImplementationClause ] ";" [ ImportDeclaration ] ModuleDeclarations ( UnitBody | end ) SimpleName. Modifiers = "{" IdentList "}". ModuleDeclarations = { SimpleDeclaration | NestedUnit ";" | ProcedureDeclaration | OperatorDeclaration ProtocolDeclaration | ActivityDeclaration }. NestedUnit = ( Definition | Implementation ). ImplementationClause = implements ImplementedDefinitionName { "," ImplementedDefinitionName }. ImplementedDefinitionName = DefinitionName | "[" "]". ImportDeclaration = import Import { "," Import } ";". Import = ImportedName [ as ident ]. ImportedName =( ModuleName | DefinitionName | ImplementationName | NamespaceName | ObjectName ). UnitBody = begin [ StatementSequence ] end.

// 3. Definitions Definition = definition [ Modifiers ] DefinitionName [ RefinementClause ] ";" [ ImportDeclaration ] DefinitionDeclarations end SimpleName. RefinementClause = refines DefinitionName. DefinitionDeclarations = { SimpleDeclaration | { ProcedureHeading “;” } | ProtocolDeclaration }. ProtocolDeclaration = protocol ProtocolName "=" "(" ProtocolSpecification ")" ";". ProtocolSpecification = [ Alphabet "," ] Grammar | Alphabet [ "," Grammar ]. Alphabet = TerminalSymbol { "," TerminalSymbol }. Grammar = Production { "," Production }. Production = ProductionName "=" Alternative. Alternative = ItemSequence { "|" ItemSequence }. ItemSequence = Item { Item }. Item = ( ["?"] TerminalSymbol | ProductionName | TypeName | Alternative | Group | Optional | Repetition). Group = "(" ItemSequence ")". Optional = "[" ItemSequence "]". Repetition = "{" ItemSequence "}". TerminalSymbol = number | ident | charConstant. ProductionName = ident.

// 4. Implementations Implementation = implementation [ Modifiers ] ImplementationName ";" [ ImportDeclaration ] Declarations ( UnitBody | end ) SimpleName.


// 5. Objects Object = object [ Modifiers ] ObjectName ObjectDefinition SimpleName. ObjectDefinition = [ FormalParameters ] [ ImplementationClause ] ";" [ ImportDeclaration ] { SimpleDeclaration | ProcedureDeclaration | ProtocolDeclaration | ActivityDeclaration } ( UnitBody | end ). ActivityDeclaration = activity ActivityName [ FormalParameters ] [ProcImplementationClause]";" Declarations ( UnitBody | end ) SimpleName.

// 6. Declarations Declarations = { SimpleDeclaration | ProcedureDeclaration }. SimpleDeclaration = ( const [ Modifiers ] { ConstantDeclaration ";" } | type [ Modifiers ] { TypeDeclaration ";" } | var [ Modifiers ] { VariableDeclaration ";" } ). ConstantDeclaration = ident "=" ConstExpression. ConstExpression = Expression. TypeDeclaration = ident "=" Type. VariableDeclaration = IdentList ":" Type.

// 7. Types Type = ( TypeName [ "{" Width "}" ] | EnumType | ArrayType | ProcedureType | InterfaceType | ObjectType | RecordType | ProtocolType ). Width = ConstExpression. ArrayType = array [ "{" math "}" ] Length { "," Length } of Type. Length = ( ConstExpression | "*" ). EnumType = "(" IdentList ")". ProcedureType = procedure [ ProcedureTypeFormals ]. ProcedureTypeFormals = "(" [ PTFSection { ";" PTFSection } ] ")" [ ":" FormalType ]. PTFSection = [ var ] FormalType { "," FormalType }. FormalType = { array "*" of } ( TypeName | InterfaceType ). InterfaceType = object [ PostulatedInterface ]. PostulatedInterface = "{" DefinitionName { "," DefinitionName } "}". ObjectType = object ObjectDefinition ident. RecordType = record { VariableDeclaration ";" } end ident. ProtocolType = activity [ "{" ProtocolName "}" ].

// 8. Procedures & operators ProcedureDeclaration = ProcedureHeading [ ProcImplementationClause ] ";" [ ProcedureBody ";" ]. ProcImplementationClause = implements ImplementedMemberName { "," ImplementedMemberName }. ImplementedMemberName = ( DefinitionName | "[" "]" ) "." MemberName. ProcedureHeading = procedure [ Modifiers ] ProcedureName [ FormalParameters ]. ProcedureBody = Declarations UnitBody SimpleName. FormalParameters = "(" [ FPSection { ";" FPSection } ] ")" [ ":" FormalType ]. FPSection = [ var ] ident { "," ident } ":" FormalType. OperatorDeclaration = operator [ Modifiers ] OpSymbol [ FormalParameters ] ";" OperatorBody ";". OperatorBody = Declarations UnitBody OpSymbol. OpSymbol = string. // A 1,2,3-character string; the set of possible symbols is restricted


// 9. Statements StatementSequence = Statement { ";" Statement }. Statement = [ Assignment | ProcedureCall | IfStatement | CaseStatement | WhileStatement | RepeatStatement | LoopStatement | ForStatement | await Expression | exit | return [ Expression { "," Expression } ] | BlockStatement | Send | Receive | SendReceive | LaunchActivity | AnonymousActivity ]. Assignment = Designator { "," Designator } ":=" Expression { "," Expression }. ProcedureCall = Designator. IfStatement = if Expression then StatementSequence { elsif Expression then StatementSequence } [ else StatementSequence ] end. CaseStatement = case Expression of Case { "|" Case } [ else StatementSequence ] end. Case = [ CaseLabel { "," CaseLabel } ":" StatementSequence ]. CaseLabel = ConstExpression [ ".." ConstExpression ]. WhileStatement = while Expression do StatementSequence end. RepeatStatement = repeat StatementSequence until Expression. LoopStatement = loop StatementSequence end. ForStatement = for ident ":=" Expression to Expression [ by ConstExpression ] do StatementSequence end. BlockStatement = do [ Modifiers ] [ StatementSequence ] { ExceptionHandler } [ CommonExceptionHandler ] [ TerminationHandler ] end. ExceptionHandler = on ExceptionName { "," ExceptionName } do StatementSequence. CommonExceptionHandler = on exception do StatementSequence. TerminationHandler = on termination do StatementSequence. Send = ActivityInstanceName [ "(" Designator { "," Designator } ")" ]. Receive = [ Designator { "," Designator } ":=" ] await [ ActivityInstanceName ]. SendReceive = Designator { "," Designator } ":=" Send. Accept = accept QualIdent {"," QualIdent}. LaunchActivity = new ActivityName [ "(" ActualParameters ")" ]. AnonymousActivity = activity ";" Declarations UnitBody.

// 10. Expressions
Expression = SimpleExpression [ ( "=" | "#" | "<" | "<=" | ">" | ">=" | ".=" | ".#" | ".<" | ".<=" | ".>" | ".>=" | in ) SimpleExpression ]
    | Designator implements DefinitionName
    | Designator is TypeName.
SimpleExpression = [ "+" | "-" ] Term { ( "+" | "-" | or ) Term }.
Term = Factor { ( "*" | "/" | div | mod | "&" | "+*" | ".*" | "./" | "\" ) Factor }.


Factor = number | CharConstant | string | nil | Set | Designator
    | new TypeName [ "(" ActualParameters ")" ]
    | new ActivityName [ "(" ActualParameters ")" ]
    | "(" Expression ")" | "~" Factor | "!" Factor | Factor "**" Factor.
Set = "{" [ SetElement { "," SetElement } ] "}".
SetElement = Expression [ ".." Expression ].
ExpressionArray = "[" ArrayFactor "]".
ArrayFactor = ExpressionArray { "," ExpressionArray } | Expression { "," Expression }.
ExpressionRange = Expression | Range.
Range = [ Expression ] ".." [ Expression ] [ "by" Expression ].
Designator = Instance
    | TypeName "(" Expression [ "," Size ] ")"                    // Conversion
    | Designator "^"                                              // Dereference
    | Designator "[" ExpressionRange { "," ExpressionRange } "]"  // Array element(s)
    | Designator "(" [ ActualParameters ] ")"                     // Function call
    | Designator "." MemberName.                                  // Member selector
Instance = ( self | InstanceName | DefinitionName "(" InstanceName ")" ).
Size = ConstantExpression.
ActualParameters = Actual { "," Actual }.
Actual = Expression [ "{" [ var ] FormalType "}" ].               // Argument with type signature

// 11. Constants
number = ( whole | real ) [ "{" Width "}" ].
whole = digit { digit } | digit { HexDigit } "H".
real = digit { digit } "." { digit } [ ScaleFactor ].
ScaleFactor = "E" [ "+" | "-" ] digit { digit }.
HexDigit = digit | "A" | "B" | "C" | "D" | "E" | "F".
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9".
CharConstant = '"' character '"' | "'" character "'" | digit { HexDigit } "X".
string = '"' { character } '"' | "'" { character } "'".
character = // Any character from the alphabet except the current delimiter character

// 12. Identifiers & names
ident = ( letter | "_" ) { letter | digit | "_" }.
letter = "A" | ... | "Z" | "a" | ... | "z" | // any other "culturally-defined" letter
IdentList = ident { "," ident }.
QualIdent = { ident "." } ident.
DefinitionName = QualIdent.
ModuleName = QualIdent.
NamespaceName = QualIdent.
ImplementationName = QualIdent.
ObjectName = QualIdent.
TypeName = QualIdent.
ExceptionName = QualIdent.
InstanceName = QualIdent.
ActivityInstanceName = QualIdent.
ProcedureName = ident.
ProtocolName = ident.
ActivityName = ident.
MemberName = ( ident | OpSymbol ).
SimpleName = ident.
