Parallel System Performance: Evaluation & Scalability

Parallel System Performance: Evaluation & Scalability
EECC756 - Shaaban, lec #9, Spring 2013 (4-23-2013)
References: Parallel Computer Architecture, Chapter 4; Parallel Programming, Chapter 1, handout #1.

• Factors affecting parallel system performance:
  – Algorithm-related, parallel-program-related, and architecture/hardware-related factors.
• Workload-driven quantitative architectural evaluation:
  – Select applications or a suite of benchmarks to evaluate the architecture on a real or simulated machine.
  – From the measured performance results compute performance metrics: speedup, system efficiency, redundancy, utilization, and quality of parallelism.
  – Resource-oriented workload scaling models describe how the speedup of a parallel computation behaves under a specific constraint:
    1. Problem constrained (PC): fixed-load model.
    2. Time constrained (TC): fixed-time model.
    3. Memory constrained (MC): fixed-memory model.
• Parallel performance scalability, for a given parallel system and a given parallel computation/problem/algorithm:
  – Definition. Informally, the ability of parallel system performance to increase with increased problem size and system size.
  – Conditions of scalability.
  – Factors affecting scalability.

Parallel Program Performance

• The goal of parallel processing is to maximize speedup for a fixed problem size:

      Speedup = Time(1) / Time(p) ≤ Sequential Work / Max over any processor (Work + Synch Wait Time + Comm Cost + Extra Work)

  The parallel time is set by the most heavily loaded processor; the synchronization wait, communication, and extra-work terms are parallelization overheads. (A small numerical sketch of this bound follows the factors list below.)
• Speedup is maximized by:
  1. Balancing computation and overheads across processors, so that every processor has the same amount of work and overhead.
  2. Minimizing communication cost and the other overheads associated with each step of parallel program creation and execution.
• Parallel performance scalability: for a given parallel system and parallel computation/problem/algorithm, achieve good speedup for the parallel application on the parallel architecture as the problem size and machine size (number of processors) are increased; in other words, continue to achieve good parallel performance ("speedup") as the sizes of the system and problem grow. (A more formal treatment of scalability comes later.)

Factors Affecting Parallel System Performance

• Parallel algorithm-related (i.e., inherent parallelism):
  – Available concurrency and its profile, dependency graph, uniformity, patterns.
  – Complexity and predictability of computational requirements.
  – Required communication/synchronization, its uniformity and patterns.
  – Data size requirements.
• Parallel program-related:
  – Partitioning: decomposition and assignment to tasks.
    • Parallel task grain size.
    • Communication-to-computation (C-to-C) ratio, a measure of inherent communication.
  – Programming model used.
  – Orchestration, for a given partition:
    • Cost of communication/synchronization.
    • Resulting data/code memory requirements, locality, and working-set characteristics.
  – Mapping and scheduling: dynamic or static.
• Hardware/architecture-related:
  – Total CPU computational power available and the number of processors.
  – Parallel programming model support, e.g., support for a shared address space vs. message passing.
  – Architectural interactions and artifactual "extra" communication.
  – Communication network characteristics: scalability, topology, etc.
  – Memory hierarchy properties.
  (Refined from the factors in Lecture #1.)
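To make the speedup bound above concrete, here is a minimal sketch (not part of the original lecture) that evaluates the bound from per-processor time breakdowns. The ProcTime class, the speedup_bound helper, and the numbers are illustrative assumptions, not course code.

```python
# Minimal sketch: fixed-problem-size speedup bound from per-processor breakdowns.
# Hypothetical timing numbers; follows
#   Speedup = Time(1)/Time(p) <= Sequential Work / max_p(Work + Synch + Comm + Extra)

from dataclasses import dataclass

@dataclass
class ProcTime:
    work: float    # useful computation time on this processor
    synch: float   # synchronization wait time
    comm: float    # communication cost
    extra: float   # extra (redundant) work introduced by parallelization

    @property
    def total(self) -> float:
        # The processor finishes after doing its work plus all of its overheads.
        return self.work + self.synch + self.comm + self.extra

def speedup_bound(sequential_work: float, procs: list[ProcTime]) -> float:
    """Parallel time is set by the slowest (most loaded) processor, so speedup
    is bounded by sequential work over the maximum per-processor total."""
    return sequential_work / max(p.total for p in procs)

# Illustrative numbers for p = 4 processors sharing 100 units of sequential work:
procs = [ProcTime(25, 2, 3, 1), ProcTime(24, 3, 3, 1),
         ProcTime(26, 1, 2, 1), ProcTime(25, 2, 4, 1)]
bound = speedup_bound(100.0, procs)
print(f"speedup <= {bound:.2f}, efficiency <= {bound / len(procs):.2f}")
```

With perfectly balanced work and no overheads every per-processor total would equal 25 and the bound would reach the ideal speedup of 4; the overhead terms pull it below that.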
Parallel Performance Metrics Revisited

• Degree of Parallelism (DOP): for a given time period, the number of processors actually executing a particular parallel program on a specific parallel computer. The observed concurrency profile gives, at any instant, DOP = min(software parallelism, hardware parallelism).
• Average parallelism A. Given:
  – maximum (software) parallelism m,
  – n homogeneous processors,
  – computing capacity of a single processor \Delta (computations per second),
  the total amount of work (instructions or computations) performed between times t_1 and t_2 is

      W = \Delta \int_{t_1}^{t_2} DOP(t)\,dt,   or as a discrete summation   W = \Delta \sum_{i=1}^{m} i\,t_i,

  where t_i is the total time during which DOP = i and \sum_{i=1}^{m} t_i = t_2 - t_1 (the execution time). The average parallelism A is

      A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} DOP(t)\,dt,   or in discrete form   A = \frac{\sum_{i=1}^{m} i\,t_i}{\sum_{i=1}^{m} t_i}.

Example: Concurrency Profile of a Divide-and-Conquer Algorithm

• Execution observed from t_1 = 2 to t_2 = 27; peak parallelism m = 8.
• The area under the concurrency profile (DOP versus time) equals the total number of computations, i.e., the total work W.
• Average parallelism:

      A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) / (5 + 3 + 4 + 6 + 2 + 2 + 3) = 93 / 25 = 3.72

[Figure: concurrency profile of the algorithm, DOP on the vertical axis versus time from t_1 to t_2.]

Asymptotic Speedup

• Assume hardware parallelism exceeds software parallelism (more processors n than the maximum software DOP m, i.e., n = ∞ or n > m), the problem size is fixed, and all parallelization overheads and extra work are ignored. With \Delta the computing capacity of a single processor, t_i the total time during which DOP = i, and W_i = \Delta\,i\,t_i the total work performed with DOP = i:
  – Execution time with one processor:                 T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} W_i / \Delta
  – Execution time with unlimited processors:          T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} W_i / (i\,\Delta)
  – Asymptotic speedup:                                S_\infty = T(1) / T(\infty) = \left(\sum_{i=1}^{m} W_i\right) \Big/ \left(\sum_{i=1}^{m} W_i / i\right)
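The average-parallelism and asymptotic-speedup formulas can be checked numerically against the divide-and-conquer profile above. The sketch below uses illustrative helper names (not course code); it reproduces A = 93/25 = 3.72 and shows that, with overheads ignored, the asymptotic speedup equals the average parallelism.

```python
# Minimal sketch: average parallelism and asymptotic speedup from a discrete
# concurrency profile {i: t_i}, where t_i is the total time spent at DOP = i.
# Helper names are illustrative assumptions, not from the lecture.

def average_parallelism(profile: dict[int, float]) -> float:
    # A = (sum_i i*t_i) / (sum_i t_i)
    return sum(i * t for i, t in profile.items()) / sum(profile.values())

def asymptotic_speedup(profile: dict[int, float]) -> float:
    # With W_i = DELTA*i*t_i:  T(1) = sum_i W_i/DELTA and T(inf) = sum_i W_i/(i*DELTA),
    # so DELTA cancels and S_inf = (sum_i i*t_i) / (sum_i t_i).
    t_one = sum(i * t for i, t in profile.items())   # T(1), in units of DELTA work
    t_inf = sum(profile.values())                    # T(infinity)
    return t_one / t_inf

# Divide-and-conquer example: t_1 = 2 to t_2 = 27, peak DOP m = 8.
profile = {1: 5, 2: 3, 3: 4, 4: 6, 5: 2, 6: 2, 8: 3}
print(average_parallelism(profile))   # 93 / 25 = 3.72
print(asymptotic_speedup(profile))    # also 3.72: S_inf = A when overheads are ignored
```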
Phase Parallel Model of an Application

• Consider a sequential program of size s consisting of k computational phases C_1 … C_k, where each phase C_i has degree of parallelism DOP = i (so k is the maximum DOP).
• Let the single-processor execution time of phase C_i be T_1(i). The total single-processor execution time is

      T_1 = \sum_{i=1}^{k} T_1(i).

• Ignoring overheads, the execution time on n processors is

      T_n = \sum_{i=1}^{k} T_1(i) / \min(i, n).

• Group all overheads into a lump-sum term h(s, n) = T_{interact} + T_{par}, where T_{interact} = synchronization time + communication cost, T_{par} = extra work due to parallelization, s is the problem size, and n is the number of processors. Accounting for these total overheads, the parallel execution time becomes

      T_n = \sum_{i=1}^{k} T_1(i) / \min(i, n) + h(s, n).

• If k = n and f_i is the fraction of sequential execution time spent with DOP = i, then π = {f_i | i = 1, 2, …, n} is the parallelism-degree probability distribution (the DOP profile) for maximum DOP = n. Ignoring overheads (h(s, n) = 0), the speedup is

      S(n) = S(\infty) = \frac{T_1}{T_n} = \frac{1}{\sum_{i=1}^{n} f_i / i}.

Harmonic Mean Speedup for an n-Execution-Mode Multiprocessor System

[Figure 3.2, page 111; see handout.]

Parallel Performance Metrics Revisited: Amdahl's Law

• Harmonic mean speedup (with i the number of processors used in a mode and f_i the fraction of sequential execution time with DOP = i):

      S(n) = \frac{T_1}{T_n} = \frac{1}{\sum_{i=1}^{n} f_i / i}.

• In the case π = {f_i | i = 1, 2, …, n} = (α, 0, 0, …, 1−α), the system runs sequential code (DOP = 1) with probability α and utilizes all n processors with probability 1−α, with no other execution mode used. Keeping the problem size fixed and ignoring overheads (h(s, n) = 0), this gives Amdahl's law:

      S_n = \frac{1}{\alpha + (1-\alpha)/n},   with   S_n \to 1/\alpha as n \to \infty.

  Under these conditions the best achievable speedup is upper-bounded by 1/α, where α is the sequential fraction (the fraction with DOP = 1).

Parallel Performance Metrics Revisited: Efficiency, Cost, Utilization, Redundancy, Quality of Parallelism

• System efficiency. Let O(n) be the total number of unit operations performed by an n-processor system (each operation takes one time unit, so O(n) is the total parallel work on n processors) and T(n) the parallel execution time in unit time steps:
  – In general T(n) < O(n), since more than one operation can be performed per unit time when several processors are active. Assume T(1) = O(1), the work on one processor.
  – Speedup factor: S(n) = T(1) / T(n). Ideally T(n) = T(1)/n, giving the ideal speedup S(n) = n.
  – Parallel system efficiency for an n-processor system: E(n) = S(n)/n = T(1) / [n T(n)]. With ideal speedup S(n) = n, the ideal efficiency is E(n) = n/n = 1.
• Cost. The processor-time product, or cost, of a computation is

      Cost(n) = n\,T(n) = n\,T(1)/S(n) = T(1)/E(n).

  – The cost of sequential computation on one processor (n = 1) is simply T(1).
  – A cost-optimal parallel computation on n processors has cost proportional to T(1): with ideal speedup S(n) = n and E(n) = 1, Cost(n) = T(1).
• Redundancy: R(n) = O(n)/O(1). Ideally, with no overheads or extra work, O(n) = O(1) and R(n) = 1.
• Utilization: U(n) = R(n) E(n) = O(n) / [n T(n)], assuming T(1) = O(1).
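As a closing illustration, the harmonic-mean speedup, Amdahl's law, and the derived efficiency and cost metrics can be computed directly from a DOP profile. This is a sketch under the same no-overhead assumption h(s, n) = 0; the function names and sample numbers are assumptions for illustration, not course code.

```python
# Minimal sketch: harmonic-mean speedup from a DOP profile pi = {i: f_i},
# Amdahl's law as its two-mode special case, and the derived metrics
# E(n) = S(n)/n and Cost(n) = n*T(n) = T(1)/E(n).  Overheads h(s, n) = 0.

def harmonic_mean_speedup(f: dict[int, float]) -> float:
    # S(n) = 1 / sum_i (f_i / i), f_i = fraction of sequential time at DOP = i.
    assert abs(sum(f.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return 1.0 / sum(fi / i for i, fi in f.items())

def amdahl_speedup(alpha: float, n: int) -> float:
    # Profile (alpha, 0, ..., 0, 1 - alpha): sequential with fraction alpha,
    # all n processors otherwise.  S -> 1/alpha as n -> infinity.
    return 1.0 / (alpha + (1.0 - alpha) / n)

def efficiency(speedup: float, n: int) -> float:
    return speedup / n                  # ideal value is 1

def cost(t1: float, speedup: float, n: int) -> float:
    return n * t1 / speedup             # cost-optimal when proportional to T(1)

n, alpha = 16, 0.1
s = amdahl_speedup(alpha, n)                            # 1/(0.1 + 0.9/16) ~ 6.4
print(s, efficiency(s, n), cost(100.0, s, n))           # ~6.4, ~0.4, ~250.0
print(harmonic_mean_speedup({1: alpha, n: 1 - alpha}))  # same ~6.4 for this profile
```

For this profile the Amdahl upper bound 1/α = 10 is approached only slowly: even with n = 16 processors the speedup is about 6.4 and the efficiency has already dropped to roughly 40%.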