Programming with CUDA
Programming with CUDA
Jens K. Mueller
[email protected]
Department of Mathematics and Computer Science
Friedrich-Schiller-University Jena
Tuesday 31st May, 2011

Today's lecture: Analyzing Parallel Algorithms

Sequential Machine Models

Well-established sequential models of computability:
- Turing machine
- Random Access Machine (RAM)
- µ-recursive functions
- λ-calculus

All these notions of computability are Turing equivalent.
Church hypothesis: "Every function that is computable is computable by a Turing machine."

Parallel Machine Models

- Directed acyclic graph
- Network models: CPUs directly connected to each other
- Parallel Random Access Machine (PRAM): RAMs connected only via an infinite shared memory
  - Exclusive Read Exclusive Write (EREW)
  - Concurrent Read Exclusive Write (CREW)
  - Concurrent Read Concurrent Write (CRCW)
    - Common CRCW
    - Arbitrary CRCW
    - Priority CRCW
    - Associative CRCW
    - ...

An algorithm analysis made for one model may not carry over to another model.

Algorithm Analysis

- Running-time analysis
- Space analysis
Both are measured as functions of the input size.
- Worst case
- Average case
- Best case
- Amortized: average time per operation over a worst-case sequence of operations

Big O Notation

- f(n) ∈ O(g(n)): there exist constants c > 0 and n0 such that f(n) ≤ c · g(n) for all n > n0.
  f is asymptotically bounded above by g (up to a constant factor).
- f(n) ∈ Ω(g(n)): there exist constants c > 0 and n0 such that f(n) ≥ c · g(n) for all n > n0.
  f is asymptotically bounded below by g (up to a constant factor).

Big O Notation (cont.)

- f(n) ∈ Θ(g(n)): there exist constants c1, c2 > 0 and n0 such that c1 · g(n) ≤ f(n) ≤ c2 · g(n) for all n > n0.
  f is asymptotically bounded above and below by g (up to constant factors).
  Shorthand for: f(n) ∈ O(g(n)) and f(n) ∈ Ω(g(n)).

Parallel RAM

- Infinitely many processors
- Infinite memory
- Constant-time random memory access
- Processors run with a synchronized clock
- Instructions: Read, Compute, Write
- Running-time analysis depends on the input size and on the number of processors

Running Time

Definition (Sequential Running Time): Given a problem P, T is the running time of the best known sequential algorithm that solves P.

Definition (Parallel Running Time): Given a problem P, Tp is the running time of the best known parallel algorithm that solves P using p processors.

Parallel Speedup

Definition (Parallel Speedup): Sp := T / Tp, where T is the running time of the best known sequential algorithm and Tp is the running time of a parallel algorithm using p processors.

Lemma: Sp ≤ p. The parallel speedup can never be larger than the number of processors. The ideal speedup is p (linear speedup: n times faster with n times more processors).

Parallel Efficiency

Definition (Parallel Efficiency): The efficiency of a parallel algorithm is Ep := Sp / p.

Corollary: Ep ≤ 1.

Parallel Cost

Definition (Parallel Cost): The cost of a parallel algorithm is Cp := p · Tp, i.e., the product of the parallel running time and the number of processors.

Lemma: The cost of a parallel algorithm is at least the sequential running time: T ≤ Cp.

Parallel Work

Definition (Parallel Work): The work W of a parallel algorithm is the total number of operations executed over all processors.

Lemma: W ≤ Cp.

Proof: W = Σ_{i=1..Tp} Wi ≤ Σ_{i=1..Tp} p = p · Tp = Cp, since at each time step i at most Wi ≤ p operations are executed.

Work Optimality

Definition (Work optimal): A parallel algorithm is work optimal if W = T.

Cost Optimality

Definition (Cost optimal): A parallel algorithm is cost optimal if Cp = T.

Corollary: For a cost-optimal parallel algorithm, Sp = p and Ep = 1. Cost optimality implies linear speedup.

Example (Summation)
- Sequential summation is cost optimal since C1 = 1 · T = 1 · n = O(n).
- Parallel summation using a binary tree (see the sketch below) is not cost optimal since Cn = n · log n = O(n · log n) ≠ O(n).
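To make the binary-tree summation from the example concrete, the following sketch shows one common way to implement it as a CUDA reduction kernel. The kernel name reduceSum, the block size of 256, and the host-side summation of the per-block partial sums are illustrative assumptions, not code from this lecture.

// Minimal sketch of a tree-based parallel summation (assumed code).
// Each block reduces its chunk in shared memory in O(log blockDim.x) steps;
// the per-block partial sums are then added on the host.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reduceSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread; pad out-of-range threads with 0.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Binary-tree reduction in shared memory (blockDim.x must be a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)                         // thread 0 writes the block's partial sum
        out[blockIdx.x] = sdata[0];
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));

    float *h_in = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;    // the sum should equal n
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    float *h_out = new float[blocks], sum = 0.0f;
    cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b) sum += h_out[b];
    std::printf("sum = %.0f\n", sum);

    cudaFree(d_in); cudaFree(d_out);
    delete[] h_in; delete[] h_out;
    return 0;
}

With one thread per element this is the p = O(n), Tp = O(log n) tree pattern analyzed above: each block performs log2(blockDim.x) shared-memory steps, so the cost is O(n · log n) as noted in the example.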
Sorting

- T = O(n · log n)
- Cost-optimal parallel algorithms:
  - p = O(n) and Tp = O(log n)
  - p = O(n / log n) and Tp = O(log² n)
- Using p = O(n²) processors and Tp = O(1) is not cost optimal.
- Cost optimality accounts for the number of processors used. This matters when fewer processors are available.

Brent's Theorem

Theorem (Brent): A parallel algorithm using p processors with time Tp and work W can be executed on p′ < p processors in time Tp′ = Tp + ⌊W / p′⌋.

Brent's Theorem allows changing the number of processors for a given parallel algorithm.

Example (Parallel Reduction using a binary tree): Tp = O(log n). Set p′ = O(n / log n). Then Tp′ = O(log n) + ⌊O(n) / O(n / log n)⌋ = O(log n) + O(log n) = O(log n). The parallel reduction is then cost optimal, i.e., Cp′ = O(log n) · O(n / log n) = O(n).

Amdahl's Law

Assume T = 1 and that a fraction x of the sequential algorithm can be parallelized using p processors. Then Tp = (1 − x) + x/p, where (1 − x) is the sequential part and x/p is the parallel part. The overall speedup is 1 / ((1 − x) + x/p); see the numeric sketch at the end.

- Based on a fixed input size

Amdahl's Law (cont.)

Figure: Amdahl's Law. Speedup as a function of the number of processors (1 to 65536) for parallel fractions x = 0.5, 0.6, 0.7, 0.8, 0.9.

Questions?
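As a closing illustration of the Amdahl's Law curves above, the short host-only sketch below tabulates the speedup 1 / ((1 − x) + x/p) for a few values of x and p; the chosen values and the program itself are assumptions for illustration, not part of the lecture.

// Minimal sketch: tabulate Amdahl's Law speedups for assumed values of x and p.
#include <cstdio>

int main()
{
    const double fractions[] = {0.5, 0.7, 0.9};       // parallelizable fraction x
    const int processors[] = {4, 64, 1024, 65536};    // number of processors p

    for (double x : fractions)
        for (int p : processors) {
            double speedup = 1.0 / ((1.0 - x) + x / p);   // Amdahl's Law
            std::printf("x = %.1f  p = %6d  speedup = %7.2f\n", x, p, speedup);
        }
    return 0;
}

Even with 65536 processors the speedup stays below 1 / (1 − x), e.g. at most 2 for x = 0.5 and at most 10 for x = 0.9, which is the ceiling the curves in the figure approach.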