732A54 Big Data Analytics
Introduction to Parallel Computing
Christoph Kessler, IDA, Linköping University

Traditional Use of Parallel Computing: Large-Scale HPC Applications

• High Performance Computing (HPC)
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
  - E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, …
• Single-CPU computers and even today's multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → Clusters
  - Need scalable parallel algorithms
  - Need to exploit multiple levels of parallelism


More Recent Use of Parallel Computing: Big-Data Analytics Applications

• Big Data Analytics
  - Data access intensive (disk I/O, memory accesses)
    - Typically, very large data sets (GB … TB … PB … EB …)
  - Also some computational work for combining/aggregating data
  - E.g. data center applications, business analytics, click stream analysis, scientific data analysis, machine learning, …
  - Soft real-time requirements on interactive queries
• Single-CPU and multicore processors cannot provide such massive computation power and I/O bandwidth+capacity
• Aggregate LOTS of computers → Clusters
  - Need scalable parallel algorithms
  - Need to exploit multiple levels of parallelism
  - Fault tolerance

HPC vs Big-Data Computing

• Both need parallel computing
• Same kind of hardware – clusters of (multicore) servers
• Same OS family (Linux)
• Different programming models, languages, and tools:

                        HPC application                Big-Data application
  Prog. languages       Fortran, C/C++ (Python)        Java, Scala, Python, …
  Programming models    MPI, OpenMP, …                 MapReduce, Spark, …
  Libraries / storage   Scientific computing           Big-data storage/access:
                        libraries: BLAS, …             HDFS, …
  OS                    Linux                          Linux
  HW                    Cluster                        Cluster

→ Let us start with the common basis: parallel computer architecture

Parallel Computer Architecture Concepts

Classification of parallel computer architectures:
• by control structure
  - SISD, SIMD, MIMD
• by memory organization
  - in particular, shared memory vs. distributed memory
• by interconnection network topology


Classification by Control Structure

Classification by Memory Organization

(Figures: control-structure classification by operation streams; memory organization as distributed memory, e.g. a (traditional) HPC cluster, vs. shared memory, e.g. a multiprocessor (SMP) or a computer with a standard multicore CPU.)

Most common today in HPC and data centers: hybrid memory system
• Cluster (distributed memory) of hundreds or thousands of shared-memory servers, each containing one or several multi-core CPUs


Hybrid (Distributed + Shared) Memory

Interconnection Networks (1)

• Network = physical interconnection medium (wires, switches) + communication protocol
  (a) connecting cluster nodes with each other (DMS)
  (b) connecting processors with memory modules (SMS)

Classification
• Direct / static interconnection networks
  - connecting nodes directly to each other
  - Hardware routers (communication coprocessors) can be used to offload processors from most communication work
• Switched / dynamic interconnection networks


Interconnection Networks (2): Simple Topologies

(Figure: examples of simple topologies, e.g. a fully connected network of processors.)

Interconnection Networks (3): Fat-Tree Network

• Tree network extended for higher bandwidth (more switches, more links) closer to the root
  - avoids the bandwidth bottleneck at the root

• Example: Infiniband network (www.mellanox.com)


More about Interconnection Networks

• Hypercube, Crossbar, Butterfly, Hybrid networks … → TDDC78

• Switching and routing algorithms
• Discussion of interconnection network properties
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (worst-case impact of node or switch failure)
  - …

Example: Beowulf-class PC Clusters

• Built with off-the-shelf CPUs (Xeon, Opteron, …)

Cluster Example: Triolith (NSC, 2012 / 2013)

• A so-called capability cluster (fast network for parallel applications, not for just lots of independent sequential jobs)
• 1200 HP SL230 servers (compute nodes), each equipped with 2 Intel E5-2660 (2.2 GHz Sandy Bridge) processors with 8 cores each
  → 19200 cores in total
  → Theoretical peak performance of 338 Tflop/s
• Mellanox Infiniband network (fat-tree topology)

The Challenge

• Today, basically all computers are parallel computers!
  - Single-thread performance stagnating
  - Dozens of cores and hundreds of HW threads available per server
  - May even be heterogeneous (core types, accelerators)
  - Data locality matters
  - Large clusters for HPC and data centers require message passing
• Utilizing more than one CPU core requires thread-level parallelism
• One of the biggest software challenges: exploiting parallelism
  - Need LOTS of (mostly independent) tasks to keep cores/HW threads busy and overlap waiting times (cache misses, I/O accesses)
  - All application areas, not only traditional HPC
    - General-purpose, data mining, graphics, games, embedded, DSP, …
  - Affects HW/SW system architecture, programming languages, algorithms, data structures …
  - Parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies)
    - and thus more expensive and time-consuming

Can’t the compiler fix it for us?

• Automatic parallelization?
  - at compile time:
    - Requires static analysis – not effective for pointer-based languages
    - inherently limited – missing runtime information
    - needs programmer hints / rewriting ...
    - ok only for few benign special cases:
      - loop vectorization
      - extraction of instruction-level parallelism
  - at run time (e.g. speculative multithreading):
    - high overheads, not scalable

Insight

• Design of efficient / scalable parallel algorithms is, in general, a creative task that is not automatizable
• But some good recipes exist …
  - Parallel algorithmic design patterns → (see later)


The remaining solution …

• Manual parallelization!
  - using a parallel programming language / framework,
    - e.g. MPI message passing interface for distributed memory;
    - Pthreads, OpenMP, TBB, … for shared memory (a minimal OpenMP sketch follows after the next slide)
  - Generally harder, more error-prone than sequential programming,
    - requires special programming expertise to exploit the HW resources effectively
  - Promising approach: domain-specific languages/frameworks,
    - restricted set of predefined constructs doing most of the low-level stuff under the hood
    - e.g. MapReduce, Spark, … for big-data computing

Parallel Programming Model

• System-software-enabled programmer's view of the underlying hardware
• Abstracts from details of the underlying architecture, e.g. network topology
• Focuses on a few characteristic properties, e.g. memory model (message passing vs. shared memory)
→ Portability of algorithms/programs across a family of parallel architectures

  Programmer's view of the underlying system (language constructs, API, …) → Programming model
  Mapping(s) performed by the programming toolchain (compiler, runtime system, library, OS, …)
  Underlying parallel computer architecture
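For the shared-memory case mentioned above, a minimal OpenMP sketch (an assumed illustration, not taken from the slides): the reduction clause gives each thread a private partial sum that is combined at the end; the array contents and size are placeholders.

    // Minimal OpenMP sketch (assumed example): parallel sum with a reduction.
    // Compile e.g. with: g++ -O2 -fopenmp sum_omp.cpp
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> d(1000000, 1.0);       // placeholder data
        long n = (long)d.size();
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)    // each thread sums a chunk
        for (long i = 0; i < n; ++i)
            s += d[i];
        std::printf("sum = %f\n", s);
        return 0;
    }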


Design and Analysis of Parallel Algorithms – Textbook-style Introduction

Foster’s Method for Design of Parallel Programs (”PCAM”)

The PCAM steps:
• PROBLEM + algorithmic approach
• PARTITIONING → elementary tasks
• COMMUNICATION + SYNCHRONIZATION
• AGGLOMERATION → macrotasks
• MAPPING + SCHEDULING → processors P1, P2, P3, …
→ PARALLEL ALGORITHM DESIGN, then PARALLEL ALGORITHM ENGINEERING (implementation and adaptation for a specific (type of) parallel computer)

→ I. Foster: Designing and Building Parallel Programs. Addison-Wesley, 1995.

Parallel Computation Model = Programming Model + Cost Model

Parallel Cost Models

A Quantitative Basis for the Design of Parallel Algorithms


How to analyze sequential algorithms: the RAM (von Neumann) model for sequential computing

Cost Model

Basic operations (instructions):
- Arithmetic (add, mul, …) on registers
- Load
- Store
- Branch

Simplifying assumptions for time analysis:
- All of these take 1 time unit
- Serial composition adds time costs: T(op1; op2) = T(op1) + T(op2)


Analysis of sequential algorithms: RAM model (Random Access Machine)

Example: sequential sum of the n elements of an array d:

    s = d[0];
    for (i=1; i<n; i++)
        s = s + d[i];

(Figure: data flow graph, showing dependences (precedence constraints) between operations.)

The PRAM Model – a Parallel RAM

• PRAM variants → TDDD56, TDDC78

Remark

• The PRAM model is very idealized, extremely simplifying / abstracting from real parallel architectures:
  - The PRAM cost model has only 1 machine-specific parameter: the number of processors
→ Good for early analysis of parallel algorithm designs: a parallel algorithm that does not scale under the PRAM model does not scale well anywhere else!

A first parallel sum algorithm …

• Keep the sequential sum algorithm's structure / data flow graph.
• Giving each processor one task (load, add) does not help much:
  - All n loads could be done in parallel, but
  - processor i needs to wait for the partial result from processor i-1, for i = 1, …, n-1
(Figure: data flow graph over time, showing dependences (precedence constraints) between operations.)
→ Still O(n) time steps!

Divide&Conquer Parallel Sum Algorithm in the PRAM / Circuit (DAG) cost model

Recurrence equations for the parallel execution time:

    T(1) = O(1)
    T(n) = T(n/2) + O(1)   for n > 1      →   T(n) = O(log n)
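A short worked derivation (a sketch, not on the original slides), unrolling the time recurrence above and, for the parallel work, counting the n−1 additions:

    \begin{align*}
    T(n) &= T(n/2) + c = T(n/4) + 2c = \dots = T(1) + c\,\log_2 n = O(\log n),\\
    W(n) &= 2\,W(n/2) + c, \quad W(1) = c \;\Rightarrow\; W(n) = O(n).
    \end{align*}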


Recursive formulation of the DC parallel sum algorithm in some programming model

• Implementation e.g. in Cilk (shared memory):

    cilk int parsum ( int *d, int from, int to )
    {
        int mid, sumleft, sumright;
        if (from == to)
            return d[from];                           // base case
        else {
            mid = (from + to) / 2;
            sumleft  = spawn parsum ( d, from, mid );
            sumright = spawn parsum ( d, mid+1, to );
            sync;
            return sumleft + sumright;
        }
    }

    // The main program:
    main()
    {
        ...
        spawn parsum ( data, 0, n-1 );
        ...
    }

Fork-join execution style: a single task starts, tasks spawn child tasks for independent subtasks and synchronize with them.

Circuit / DAG model

• Independent of how the parallel computation is expressed, the resulting (unfolded) task graph looks the same.
• The task graph is a directed acyclic graph (DAG) G = (V, E)
  - Set V of vertices: elementary tasks (taking time 1 resp. O(1) each)
  - Set E of directed edges: dependences (partial order on tasks): (v1, v2) in E → v1 must be finished before v2 can start
• Critical path = longest path from an entry to an exit node
  - The length of the critical path is a lower bound for the parallel execution time
• Parallel time can be longer if the number of processors is limited
  → schedule tasks to processors such that dependences are preserved (by the programmer (SPMD execution) or the run-time system (fork-join execution))

For a fixed number of processors … ?

• Usually, p << n
• Requires scheduling the work to p processors:

(A) manually, at algorithm design time:
• Requires algorithm engineering
• E.g. stop the parallel divide-and-conquer e.g. at subproblem size n/p and switch to sequential divide-and-conquer (= task agglomeration)
• For parallel sum (see the sketch below):
  - Step 0. Partition the array of n elements into p slices of n/p elements each (= domain decomposition)
  - Step 1. Each processor calculates a local sum for one slice, using the sequential sum algorithm, resulting in p partial sums (intermediate values)
  - Step 2. The p processors run the parallel algorithm to sum up the intermediate values to the global sum.

(B) automatically, at run time:
• Requires a task-based runtime system with a dynamic scheduler
  - Each newly created task is dispatched at runtime to an available worker processor.
  - Load balancing (overhead):
    - Central task queue from which idle workers fetch the next task to execute
    - Local task queues + work stealing – idle workers steal a task from some other processor
(Figure: run-time scheduler dispatching tasks to worker threads, 1:1 pinned to cores.)
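A sketch of Steps 0–2 above in C++ with a fixed number p of threads (an assumed illustration, not code from the slides; the final combination of the p partial sums is done sequentially here, since p is small):

    #include <algorithm>
    #include <numeric>
    #include <thread>
    #include <vector>

    double parallel_sum(const std::vector<double>& d, int p) {
        long n = (long)d.size();
        std::vector<double> partial(p, 0.0);           // p intermediate values
        std::vector<std::thread> workers;
        long chunk = (n + p - 1) / p;                  // Step 0: p slices of ~n/p elements
        for (int t = 0; t < p; ++t) {
            workers.emplace_back([&, t] {
                long lo = t * chunk, hi = std::min(n, lo + chunk);
                for (long i = lo; i < hi; ++i)         // Step 1: sequential local sum
                    partial[t] += d[i];
            });
        }
        for (auto& w : workers) w.join();
        // Step 2: combine the p partial sums (done sequentially here)
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }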

Analysis of Parallel Algorithms

Performance metrics of parallel programs
• Parallel execution time
  - counted from the start time of the earliest task to the finishing time of the latest task
• Work – the total number of performed elementary operations
• Cost – the product of parallel execution time and #processors
• Speed-up
  - the factor by how much faster we can solve a problem with p processors than with 1 processor, usually in the range (0…p)
• Parallel efficiency = speed-up / #processors, usually in (0…1)
• Throughput = #operations finished per second
• Scalability
  - does the speed-up keep growing well also when #processors grows large?

Analysis of Parallel Algorithms

Asymptotic Analysis
• Estimation based on a cost model and the algorithm idea (pseudocode operations)
• Discuss behavior for large problem sizes, large #processors

Empirical Analysis
• Implement in a concrete parallel programming language
• Measure time on a concrete parallel computer

Parallel Time, Work, Cost

(Running example: the sequential sum computation  s = d[0]; for (i=1; i<n; i++) s = s + d[i];)


Parallel work, time, cost

Speedup


Speedup S(p) with p processors – the ratio of sequential to parallel execution time, S(p) = T(1) / T(p) – is usually in the range (0…p).


Amdahl’s Law: Upper bound on Speedup

Amdahl’s Law
• If a fraction f of the total work is inherently sequential (cannot be parallelized), then the speedup is bounded by
      S(p) ≤ 1 / ( f + (1 − f)/p ) ≤ 1/f,
  no matter how many processors p are used.


Proof of Amdahl’s Law
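A sketch of the standard argument (assuming, as above, that a fraction f of the total work is sequential and the rest parallelizes perfectly over p processors):

    \begin{align*}
    T(p) \;\ge\; f\,T(1) + (1-f)\,\frac{T(1)}{p}
    \quad\Rightarrow\quad
    S(p) \;=\; \frac{T(1)}{T(p)} \;\le\; \frac{1}{\,f + (1-f)/p\,} \;\le\; \frac{1}{f}.
    \end{align*}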

Towards More Realistic Cost Models

Modeling the cost of communication and data access


Memory Hierarchy – And The Real Cost of Data Access

Memory hierarchy levels (with typical capacities, transfer block sizes, access bandwidths and access latencies):
• Processor / CPU cores – each containing few (~32) general-purpose data registers
• L1 cache (on-chip) – small (~10 KB), small transfer blocks (~10..100 B), very high bandwidth, very fast (few cc)
• L2 cache (on-chip) – ~1 MB
• L3 cache (on-chip) – ~64 MB
• Primary storage: the computer's main memory (DRAM, off-chip) – ~64 GB, access latency > 100 cc
• Secondary storage (hard disk, SSD) – large (TB), large transfer blocks (KB), accessed via I/O, moderate bandwidth
• Network (e.g. other nodes in a cluster; internet, …), tertiary storage (tapes, …), cloud storage – low bandwidth, slow (ms to … s)

Modeling Communication Cost: Delay Model

• The cost of transferring a message of n data words is commonly modeled as t(n) = α + β·n, i.e. a startup time (latency) α plus a per-word transfer time β (inverse bandwidth).
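A small sketch (assumed illustration; the α and β values are placeholders, not measurements) of the linear cost model t(n) = α + β·n, showing why sending one large message amortizes the startup latency better than many small ones:

    // Linear ("delay") cost model for sending a message of n words:
    //   t(n) = alpha + beta * n   (startup latency + per-word transfer time).
    #include <cstdio>

    double msg_time(long n, double alpha, double beta) {
        return alpha + beta * n;
    }

    int main() {
        const double alpha = 1e-6;   // 1 us startup latency (assumed value)
        const double beta  = 1e-9;   // 1 ns per word        (assumed value)
        long total = 1000000;        // words to transfer in total
        double one_big = msg_time(total, alpha, beta);
        double many    = 1000 * msg_time(total / 1000, alpha, beta);
        std::printf("1 large message:     %g s\n", one_big);
        std::printf("1000 small messages: %g s\n", many);  // pays the latency 1000 times
        return 0;
    }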

Data Locality

• Memory hierarchy rationale: try to amortize the high access cost of lower levels (DRAM, disk, …) by caching data in higher levels for faster subsequent accesses
  - Cache miss – stall the computation, fetch the block of data containing the accessed address from the next lower level, then resume
  - More reuse of cached data (cache hits) → better performance
• Working set = the set of memory addresses accessed together in a period of computation
• Data locality = property of a computation: keeping the working set small during a computation
  - Temporal locality – re-access the same data element multiple times within a short time interval
  - Spatial locality – re-access neighbored memory addresses multiple times within a short time interval (see the sketch below)
• High latency favors larger transfer block sizes (cache lines, memory pages, file blocks, messages) for amortization over many subsequent accesses

Memory-bound vs. CPU-bound computation

• Arithmetic intensity of a computation = #arithmetic instructions (computational work) executed per accessed element of data in memory (after a cache miss)
• A computation is CPU-bound if its arithmetic intensity is >> 1.
  - The performance bottleneck is the CPU's arithmetic throughput
• A computation is memory-access bound otherwise.
  - The performance bottleneck is memory accesses; the CPU is not fully utilized
• Examples:
  - Matrix-matrix multiply (if properly implemented) is CPU-bound.
  - Array global sum is memory-bound on most architectures.
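A small sketch (assumed illustration) of spatial locality: with a row-major 2D array, row-by-row traversal reuses every fetched cache line, while column-by-column traversal touches a new cache line on almost every access and is therefore typically much slower on a memory-bound computation like this sum:

    #include <cstddef>
    #include <vector>

    // Sum an n x n matrix stored row-major in a flat vector.
    double sum_row_order(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)        // good spatial locality:
            for (std::size_t j = 0; j < n; ++j)    // consecutive addresses a[i*n + j]
                s += a[i*n + j];
        return s;
    }

    double sum_column_order(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t j = 0; j < n; ++j)        // poor spatial locality:
            for (std::size_t i = 0; i < n; ++i)    // stride-n accesses, ~one cache miss each
                s += a[i*n + j];
        return s;
    }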

Some Parallel Algorithmic Design Patterns

Data Parallelism

• Given:
  - One (or several) data containers x, z, … with n elements each, e.g. array(s) x = (x1,...,xn), z = (z1,…,zn), …
  - An operation f on individual elements of x, z, … (e.g. incr, sqrt, mult, ...)
• Compute: y = f(x) = ( f(x1), ..., f(xn) )
• Parallelizability: each data element defines a task
  - Fine-grained parallelism
  - Easily partitioned into independent tasks, fits very well on all parallel architectures
• Notation with higher-order function:
  - y = map ( f, x )
(Figure: elementwise application map(f, a, b) over two input arrays a and b.)
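A minimal sketch of the map pattern using the C++17 parallel algorithms (an assumed illustration, not from the slides; f is a simple squaring function here, and with GCC the parallel execution policy typically requires linking against TBB):

    #include <algorithm>
    #include <execution>
    #include <vector>

    // y = map(f, x): apply f independently to every element of x.
    std::vector<double> map_sqr(const std::vector<double>& x) {
        std::vector<double> y(x.size());
        std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                       [](double v) { return v * v; });   // f = square, one task per element
        return y;
    }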


Data-parallel Reduction

• Given:
  - A data container x with n elements, e.g. array x = (x1,...,xn)
  - A binary, associative operation op on individual elements of x (e.g. add, max, bitwise-or, ...)
• Compute: y = op_{i=1…n} xi = x1 op x2 op ... op xn
• Parallelizability: exploit the associativity of op
• Notation with higher-order function:
  - y = reduce ( op, x )

MapReduce (pattern)

• A Map operation with operation f on one or several input data containers x, …, producing a temporary output data container w, directly followed by a Reduce with operation g on w, producing the result y
• y = MapReduce ( f, g, x, … )
• Example: dot product of two vectors x, z:  y = Σi xi * zi
  - f = scalar multiplication, g = scalar addition
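A minimal sketch of reduce and of the dot-product MapReduce example using the C++17 parallel algorithms (an assumed illustration, with the same TBB-linking caveat as above):

    #include <execution>
    #include <functional>
    #include <numeric>
    #include <vector>

    // y = reduce(op, x) with op = +:
    double sum(const std::vector<double>& x) {
        return std::reduce(std::execution::par, x.begin(), x.end(), 0.0);
    }

    // y = MapReduce(f, g, x, z) for the dot product:
    // f = elementwise multiplication, g = addition.
    double dot(const std::vector<double>& x, const std::vector<double>& z) {
        return std::transform_reduce(std::execution::par,
                                     x.begin(), x.end(), z.begin(), 0.0,
                                     std::plus<>(),        // g: reduce operation
                                     std::multiplies<>()); // f: map operation
    }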

Task Farming

• Independent subcomputations f1, f2, ..., fm could be done in parallel and/or in arbitrary order, e.g.
  - independent loop iterations
  - independent function calls
• Scheduling (mapping) problem:
  - m tasks onto p processors
  - static (before running) or dynamic (see the sketch below)
  - Load balancing is important: the most loaded processor determines the parallel execution time
• Notation with higher-order function:
  - farm (f1, ..., fm) (x1,...,xn)
(Figure: a dispatcher distributes the tasks f1, f2, …, fm to worker processors P1, P2, P3, …; a collector gathers the results; Gantt chart of the execution over time.)
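A sketch (assumed illustration) of a task farm with dynamic scheduling: a shared atomic counter plays the role of the central task queue from which idle workers fetch the index of the next task to execute.

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Run m independent tasks on p worker threads; idle workers repeatedly
    // fetch the index of the next unprocessed task (dynamic load balancing).
    void farm(const std::vector<std::function<void()>>& tasks, int p) {
        std::atomic<std::size_t> next{0};                 // central "task queue"
        std::vector<std::thread> workers;
        for (int t = 0; t < p; ++t)
            workers.emplace_back([&] {
                for (;;) {
                    std::size_t i = next.fetch_add(1);    // grab the next task index
                    if (i >= tasks.size()) break;         // no tasks left
                    tasks[i]();                           // execute task f_i
                }
            });
        for (auto& w : workers) w.join();
    }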


Parallel Divide-and-Conquer

• (Sequential) Divide-and-conquer:
  - If the given problem instance P is trivial, solve it directly. Otherwise:
  - Divide: decompose problem instance P into one or several smaller independent instances of the same problem, P1, ..., Pk
  - For each i: solve Pi by recursion.
  - Combine the solutions of the Pi into an overall solution for P
• Parallel Divide-and-Conquer:
  - Recursive calls can be done in parallel.
  - Parallelize, if possible, also the divide and combine phases.
  - Switch to sequential divide-and-conquer when enough parallel tasks have been created.
• Notation with higher-order function (see the sketch below):
  - solution = DC ( divide, combine, istrivial, solvedirectly, n, P )

Example: Parallel Divide-and-Conquer

Example: parallel sum over an integer array x
• Exploit associativity: Sum(x1,...,xn) = Sum(x1,...,xn/2) + Sum(xn/2+1,...,xn)
• Divide: trivial, split array x in place
• Combine is just an addition.
• y = DC ( split, add, nIsSmall, addFewInSeq, n, x )
→ Data-parallel reductions are an important special case of DC.
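A sketch of the DC higher-order function as a C++ template (an assumed illustration mirroring the notation above; divide is assumed to return a vector of subproblems, combine to take a vector of subsolutions, and the cutoff depth of 4 is an arbitrary placeholder for "enough parallel tasks created"):

    #include <future>
    #include <vector>

    // Generic parallel divide-and-conquer skeleton.
    // P: problem type, S: solution type; the function parameters correspond to
    // divide / combine / istrivial / solvedirectly in the notation above.
    template <class P, class S,
              class Divide, class Combine, class IsTrivial, class SolveDirectly>
    S DC(Divide divide, Combine combine, IsTrivial istrivial,
         SolveDirectly solvedirectly, const P& prob, int depth = 0) {
        if (istrivial(prob))
            return solvedirectly(prob);
        std::vector<P> subproblems = divide(prob);
        std::vector<S> subsolutions;
        if (depth < 4) {                                  // spawn tasks near the top only
            std::vector<std::future<S>> futs;
            for (const P& sub : subproblems)
                futs.push_back(std::async(std::launch::async, [&, sub] {
                    return DC<P, S>(divide, combine, istrivial,
                                    solvedirectly, sub, depth + 1);
                }));
            for (auto& f : futs) subsolutions.push_back(f.get());
        } else {                                          // sequential below the cutoff
            for (const P& sub : subproblems)
                subsolutions.push_back(DC<P, S>(divide, combine, istrivial,
                                                solvedirectly, sub, depth + 1));
        }
        return combine(subsolutions);
    }

For the parallel sum, P could for instance be a pair of array indices, divide would split the index range in half, and combine would add the two partial sums.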


Pipelining …

• Pipelining applies a sequence of dependent computations/tasks (f1, f2, ..., fk) elementwise to a data sequence x = (x1, x2, x3, ..., xn)
  - For a fixed xj, fi(xj) must be computed before fi+1(xj)
  - … and fi(xj) before fi(xj+1), if the tasks fi have a run-time state
• Parallelizability: overlap the execution of all fi for k subsequent xj
  - time = 1: compute f1(x1)
  - time = 2: compute f1(x2) and f2(x1)
  - time = 3: compute f1(x3) and f2(x2) and f3(x1)
  - ...
  - Total time: O( (n+k) · maxi( time(fi) ) ) with k processors
  - Still requires a good mapping of the tasks fi to the processors for even load balancing – often a static mapping (done before running)
• Notation with higher-order function:
  - (y1,…,yn) = pipe ( (f1, ..., fk), (x1,…,xn) )

… Streaming

• Streaming applies pipelining to the processing of large (possibly infinite) data streams from or to memory, network or devices, usually partitioned into fixed-size data packets,
  - in order to overlap the processing of each data packet in time with the access of subsequent units of the data stream and/or the processing of preceding packets of data (see the sketch below).
(Figure: pipeline stages "read a packet of stream data" (f1), "process it" (f2, …), "write result" (fk) applied to packets x1, x2, x3, …)
• Examples:
  - Video streaming from network to display
  - Surveillance camera, face recognition
  - Network data processing, e.g. deep packet inspection
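A sketch (assumed illustration) of a two-stage pipeline over a stream of data packets: a producer stage "reads" packets and a consumer stage processes them, connected by a small thread-safe FIFO so that the two stages overlap in time and the packet order is preserved.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Channel {                       // simple thread-safe FIFO between stages
        std::queue<std::vector<int>> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;

        void push(std::vector<int> pkt) {
            { std::lock_guard<std::mutex> lk(m); q.push(std::move(pkt)); }
            cv.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m); closed = true; }
            cv.notify_all();
        }
        std::optional<std::vector<int>> pop() {   // empty result = end of stream
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || closed; });
            if (q.empty()) return std::nullopt;
            auto pkt = std::move(q.front()); q.pop();
            return pkt;
        }
    };

    int main() {
        Channel ch;
        std::thread producer([&] {                 // stage f1: read/generate packets
            for (int p = 0; p < 10; ++p)
                ch.push(std::vector<int>(1024, p));
            ch.close();
        });
        std::thread consumer([&] {                 // stage f2: process each packet
            while (auto pkt = ch.pop()) {
                long s = 0;
                for (int v : *pkt) s += v;
                std::printf("packet sum = %ld\n", s);
            }
        });
        producer.join();
        consumer.join();
        return 0;
    }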

… Stream Farming

• Combining the streaming and task farming patterns
• Independent streaming subcomputations f1, f2, ..., fm on each data packet
• Speed up the pipeline by parallel processing of subsequent data packets
• In most cases, the original order of the packets must be kept after processing
(Figure: packets x1, x2, x3 pass through a dispatcher, the parallel stages f1, f2, …, fm, and a collector. Image source: A. Ernstsson, 2016)

(Algorithmic) Skeletons

• Skeletons are reusable, parameterizable SW components with well-defined semantics for which efficient parallel implementations may be available.
• Inspired by higher-order functions in functional programming
• One or very few skeletons per parallel algorithmic paradigm
  - map, farm, DC, reduce, pipe, scan ...
• Parameterised in user code
  - Customization by instantiating a skeleton template with a user-provided function
• Composition of skeleton instances in program code, normally by sequencing + data flow
  - e.g. squaresum( x ) can be defined by { tmp = map( sqr, x ); return reduce( add, tmp ); }
  - For frequent combinations, one may define advanced skeletons, e.g.: mapreduce( sqr, add, x )

High-Level Parallel Programming with Skeletons

• Skeletons (constructs) implement (parallel) algorithmic design patterns
  + Abstraction, hiding complexity (parallelism and low-level programming)
  + Parallelization for free
  + Easier to analyze and transform
  ± Enforces structuring, restricted set of constructs
  – Requires complete understanding and rewriting of a computation
  – Available skeleton set does not always fit
  – May lose some efficiency compared to manual parallelization
• Idea developed in HPC (mostly in Europe) since the late 1980s.
• Many (esp. academic) frameworks exist, mostly as libraries
• Industry (also beyond the HPC domain) has adopted skeletons
  - map, reduce, scan appear in many modern parallel programming frameworks,
    - e.g., Intel Threading Building Blocks (TBB): par. for, par. reduce, pipe
    - NVIDIA Thrust
  - Google MapReduce (for distributed data mining applications)

SkePU [Enmyren, K. 2010]

• Skeleton programming library for heterogeneous multicore systems, based on C++
• Example: global sum in SkePU-2 [Ernstsson 2016]
  (Figure: SkePU-2 code for the global sum example with a user function "Add". Image source: A. Ernstsson, 2016)

Further Reading

On the PRAM model and design and analysis of parallel algorithms:
• J. Keller, C. Kessler, J. Träff: Practical PRAM Programming. Wiley Interscience, New York, 2001.
• J. JaJa: An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
• T. Cormen, C. Leiserson, R. Rivest: Introduction to Algorithms, Chapter 30. MIT Press, 1989, or a later edition.
• H. Jordan, G. Alaghband: Fundamentals of Parallel Processing. Prentice Hall, 2003.
• A. Grama, G. Karypis, V. Kumar, A. Gupta: Introduction to Parallel Computing, 2nd Edition. Addison-Wesley, 2003.

On skeleton programming, see e.g. our publications on SkePU: http://www.ida.liu.se/labs/pelab/skepu

Questions for Reflection

• Model the overall cost of a streaming computation with a very large number N of input data elements on a single processor
  (a) if implemented as a loop over the data elements running on an ordinary memory hierarchy with hardware caches (see above)
  (b) if the computation for a data packet is overlapped with the transfer/access of the next data packet
      (b1) if the computation is CPU-bound
      (b2) if the computation is memory-bound
• Which property of streaming computations makes it possible to overlap computation with data transfer?
• Can every data-parallel computation be streamed?
• What are the performance advantages and disadvantages of large vs. small packet sizes in streaming?
• Why should servers in data centers running I/O-intensive tasks (such as disk/DB accesses) get many more tasks to run than they have cores?
• How would you extend the skeleton programming approach to computations that operate on secondary storage (file/DB accesses)?
