UNIVERSITY OF CALIFORNIA, IRVINE
Cross-System Runtime Prediction of Parallel Applications on Multi-Core Processors
DISSERTATION
submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
by
Scott W Godfrey
Dissertation Committee: Professor Amelia Regan, Chair; Professor Michael Dillencourt; Professor Emeritus Dennis Volper
2016
© 2016 Scott W Godfrey
DEDICATION
“Lead, follow, or get out of the way.” -Joe [76]
“It’s only after we’ve lost everything that we’re free to do anything.” -Tyler Durden [20]
“The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.” -Bertrand Russell (attributed)
“The degree of one’s emotions varies inversely with one’s knowledge of the facts.” -Bertrand Russell (attributed)
“If you’re going through hell, keep on going.” -Unknown
To my family, committee, friends, employers, and all those who have supported my endeavors unquestioningly and uncompromisingly, I salute you as I depart from this fantastic technicolor fantasyland called ‘academia’. You and all those who have acted, and those who will act, in the name of justice and righteousness are the true heroes of the world.
ROSEBUD
TABLE OF CONTENTS
Page
LIST OF FIGURES vii
LIST OF TABLES ix
ACKNOWLEDGMENTS x
CURRICULUM VITAE xi
ABSTRACT OF THE DISSERTATION xiii
1 Introduction 1
2 Review of Related Literature 6
3 Modern Parallel Hardware Technology 7
3.1 Flynn’s Taxonomy ...... 7
3.2 Development of CMP Multi-Core ...... 8
3.3 Intel Core2 Architecture ...... 11
3.4 Intel i7 Architecture ...... 11
3.5 Hyperthreading/Hardware Threads ...... 12
4 Parallel Performance Models 16
4.1 Models of Parallel Computation ...... 16
4.1.1 Amdahl’s Law ...... 16
4.1.2 Gustafson’s Law ...... 17
4.1.3 Unification of Amdahl’s and Gustafson’s Laws ...... 18
4.1.4 Parallel Speedup ...... 18
4.1.5 Roofline ...... 19
4.2 Algorithmic Computational Models ...... 20
4.3 Drawbacks in the Modern Era ...... 22
4.3.1 Runtime Variability, Performance Uncertainty, and Noise ...... 25
4.3.2 System Symmetry ...... 26
4.3.3 Lack of Hierarchy ...... 27
4.3.4 Continuous Functions ...... 27
5 Operating System Process/Thread Scheduler Effects 29
5.1 Process Affinity ...... 32
5.2 Thread Affinity ...... 32
5.3 Thread Placement ...... 32
5.4 Affinity in Practice ...... 33
5.5 Variation in Performance ...... 33
5.6 Experimental Thread Affinity Effects ...... 34
6 Structure of a Parallel Application 40
6.1 Composition ...... 40
6.2 Decomposition ...... 46
6.3 Parallelization (from the literature) ...... 47
7 Parallel Benchmarking 49
7.1 Relationship between HYDRA benchmark types ...... 52
7.1.1 Concurrent Processes (independent memory address spaces) ...... 52
7.1.2 Concurrent Threads (common memory address space) ...... 53
7.1.3 Individual application task parallel computation ...... 53
7.2 Benchmarking Protocols ...... 55
8 Modular Performance Model 60
8.1 Hardware Parameters ...... 61
8.2 Algorithm Bandwidth ...... 62
8.3 Software Parts ...... 63
8.3.1 Amdahl’s Law ...... 63
8.3.2 Modularity ...... 63
8.3.3 Task Parallelism ...... 64
8.4 Hardware Parts ...... 66
8.4.1 Main Memory Bandwidth ...... 66
8.4.2 Sequential Boost ...... 67
8.4.3 “Virtual” Core Efficiency ...... 67
8.4.4 Lx Space Contention ...... 68
8.4.5 Lx Space Sharing ...... 70
8.5 Contentious Parts ...... 71
8.5.1 H3 Parallel Mutex, Simple ...... 73
8.5.2 H3 Parallel Mutex, Parameterized ...... 73
8.5.3 H2 Sequential Mutex, Parameterized ...... 74
8.5.4 H2 Thread Mutex, Parameterized ...... 74
8.5.5 H1,H2 Model Extension ...... 74
8.6 Operating System Parts ...... 76
8.6.1 Thread Placements ...... 76
8.6.2 Probabilities and Structure of Migrations ...... 79
8.6.3 The Cost of Migrations ...... 83
8.7 Performance Model Implementation ...... 87
9 Experimental Applications 88
9.1 3D Finite-Difference Numerical Integration (FDI) ...... 90
9.1.1 Application Characteristics ...... 90
9.2 3D Surface Reconstruction (SRA) ...... 91
9.2.1 Application Characteristics ...... 92
10 Experimental Toolset 93
10.1 Development Tools ...... 93
10.1.1 Prometheus: Combinatoric Build ...... 93
10.1.2 Ilithyia: Code Generation ...... 94
10.2 Logistics Tools ...... 96
10.2.1 Iris: Distribution and Collection ...... 96
10.2.2 Ponos: Automated Benchmarking ...... 96
10.3 Analysis Tools ...... 97
10.3.1 Pandora: Model Fitting and Cross-Prediction ...... 97
11 Error Analysis 99
11.1 Relevance ...... 99
11.2 Outlier Rejection ...... 100
11.3 Error Metrics and Characterization ...... 100
11.3.1 Total Squared Error, Mean Squared Error ...... 101
11.3.2 Total Absolute Error, Mean Absolute Error ...... 101
11.3.3 Mean Absolute Relative Error ...... 102
11.3.4 Mean Weighted Absolute Relative Error ...... 102
11.3.5 Prediction Methodology ...... 105
12 Optimization 106
12.1 Types of Optimization ...... 106
12.2 Optimization Strategy ...... 107
12.3 Solution Methodology ...... 108
13 Cross-Prediction 112
13.1 Methods and Error Measures ...... 112
13.2 Complications, Caveats, and Limitations ...... 115
14 Predictive Outcomes 117
14.1 Architecture Representation ...... 117
14.2 Model Decomposition ...... 118
14.3 Curve-Fitting Experimental Data ...... 118
14.3.1 Best Fit on Model Parts ...... 118
14.3.2 Best Fit on Model Properties ...... 119
14.3.3 Best Fit on Model ...... 120
14.4 Cross-Prediction ...... 125
14.4.1 Cross-Prediction on Model Parts ...... 125
14.4.2 Cross-Prediction on Model ...... 125
15 Conclusions 138
16 Opportunities for Future Work 142
Bibliography 144
A Data Fitting Results 151
A.1 Fitting Errors Per Model ...... 151
A.1.1 Fitting Errors Per Model, All Data ...... 151
A.1.2 Fitting Errors Per Model, FDI ...... 154
A.1.3 Fitting Errors Per Model, SRA ...... 157
A.2 Fitting Errors, Per Part ...... 160
A.2.1 Fitting Errors, Per Part, Aggregate, Per Architecture ...... 160
A.2.2 Fitting Errors, Per Part, FDI, Per Architecture ...... 164
A.2.3 Fitting Errors, Per Part, SRA, Per Architecture ...... 168
A.3 Fitting Errors, Per Property ...... 172
A.3.1 Fitting Errors, Per Property, Aggregate, Per Architecture ...... 172
A.3.2 Fitting Errors, Per Property, FDI, Per Architecture ...... 191
A.3.3 Fitting Errors, Per Property, SRA, Per Architecture ...... 210
B Cross-Prediction Results 229
B.1 Cross Prediction Relative Errors, Per Part ...... 229
B.1.1 Cross Prediction Relative Errors, All Data, Per Part ...... 230
B.1.2 Cross Prediction Relative Errors, Per Part, FDI ...... 231
B.1.3 Cross Prediction Relative Errors, Per Part, SRA ...... 232
B.2 Cross Prediction Relative Errors, Per Model ...... 233
B.2.1 Cross Prediction Relative Errors, All Data ...... 233
B.2.2 Cross Prediction Relative Errors, FDI, per Architecture ...... 244
B.2.3 Cross Prediction Relative Errors, SRA, per Architecture ...... 253
LIST OF FIGURES
Page
1.1 Computational Scheme ...... 3
1.2 Predictive System Architecture ...... 4
3.1 Intel i7 cache structure ...... 14
3.2 Intel Core 2 cache structure ...... 14
3.3 AMD FX cache structure ...... 15
3.4 Intel Xeon E5335 cache structure ...... 15
4.1 Multi-platform parallel performance comparisons ...... 24
5.1 CPU Utilization 4/8, no affinity control ...... 30
5.2 CPU Utilization 5/8, no affinity control ...... 31
5.3 CPU Utilization 7/8, no affinity control ...... 31
5.4 Core 2 Duo Thread Affinity Effects ...... 35
5.5 Core 2 Quad Thread Affinity Effects ...... 36
5.6 Core i7-4820K Thread Affinity Effects ...... 36
5.7 Core i7-4700MQ Thread Affinity Effects ...... 37
5.8 Core i7-4720HQ Thread Affinity Effects ...... 37
5.9 Core i7-3930K Thread Affinity Effects ...... 38
6.1 Parallel program structure ...... 41
6.2 Parallel contention ...... 42
6.3 Data structure shapes in memory ...... 43
6.4 Wood chipper ...... 44
6.5 CNC router ...... 44
7.1 HYDRA configurations and structure ...... 51
7.2 HYDRA relationships ...... 52
7.3 HYDRA mutexes ...... 54
7.4 HYDRA 3 sample results ...... 56
7.5 HYDRA 1 sample results ...... 57
7.6 HYDRA 1 and 3 composite samples ...... 58
7.7 HYDRA 1 and 3 mean and normalized data ...... 59
8.1 Parallel task blocks ...... 65
8.2 Cache bandwidth partitioning ...... 69
8.3 Thread assignment notation ...... 77
8.4 State migration transition counts ...... 80
8.5 Isomorphic thread configurations ...... 81
8.6 Thread migrations ...... 81
8.7 Heteromorphic state transitions ...... 82
11.1 HYDRA 1 and 3 weighting ...... 104
12.1 Model part-property mapping ...... 110
12.2 Model part-part relations ...... 110
12.3 Model part-property relations ...... 111
14.1 Model Part Representation, Top 25 Best Fit ...... 121
14.2 Model Part Representation, Top 50 Best Fit ...... 122
14.3 Model Part Representation, Top 75 Best Fit ...... 123
14.4 Model Part Representation, Top 100 Best Fit ...... 124
14.5 Predictive Model Complexity ...... 127
14.6 Model Part Representation, Top 25*Archs Cross Prediction ...... 129
14.7 Model Part Representation, Top 50*Archs Cross Prediction ...... 130
14.8 Model Part Representation, Top 75*Archs Cross Prediction ...... 131
14.9 Model Part Representation, Top 100*Archs Cross Prediction ...... 132
14.10 Model Part Representation, Top 12 BEST Cross Prediction ...... 137
LIST OF TABLES
Page
3.1 Intel architectures ...... 13
3.2 AMD architectures ...... 13
8.1 HYDRA mutexes ...... 72
8.2 HYDRA processor counts ...... 72
14.1 Cross Prediction BEST Models, *denotes complete non-MCS groups ...... 135
A.1 Comprehensive Model Fitting Errors (MWARE) ...... 152
A.2 Comprehensive Model Fitting Errors (MWARE) ...... 154
A.3 Comprehensive Model Fitting Errors (MWARE) ...... 157
B.4 Cross Prediction comprehensive relative error ...... 234
B.5 Cross Prediction per model relative error arch Core2-Arch to Core2-Arch ...... 236
B.6 Cross Prediction per model relative error arch Core2-Arch to i7-Arch ...... 238
B.7 Cross Prediction per model relative error arch i7-Arch to Core2-Arch ...... 240
B.8 Cross Prediction per model relative error arch i7-Arch to i7-Arch ...... 242
B.9 Cross Prediction per model relative error arch Core2-Arch to Core2-Arch ...... 245
B.10 Cross Prediction per model relative error arch Core2-Arch to i7-Arch ...... 247
B.11 Cross Prediction per model relative error arch i7-Arch to Core2-Arch ...... 249
B.12 Cross Prediction per model relative error arch i7-Arch to i7-Arch ...... 251
B.13 Cross Prediction per model relative error arch Core2-Arch to Core2-Arch ...... 254
B.14 Cross Prediction per model relative error arch Core2-Arch to i7-Arch ...... 256
B.15 Cross Prediction per model relative error arch i7-Arch to Core2-Arch ...... 258
B.16 Cross Prediction per model relative error arch i7-Arch to i7-Arch ...... 260
ACKNOWLEDGMENTS
I would like to thank my third PhD advisor, Amelia Regan, who has stood by my side and has acted with sterling merit, support, and credibility – magnitudes above her predecessors. To my advancement committee and especially to my PhD committee, Michael Dillencourt and Dennis Volper, who, together, have allowed me to move on and to close this book of my life.
Thanks to Lorenzo Valdevit for the early years of financial support and computer usage.
Thanks to MSC Software Corporation for the supplementary computational support needed to finalize this work in an expedient manner.
Thanks to Bill Fisher, Dennis Volper, and Quicksilver Software, Inc. for access to computing hardware and many healthy exchanges over the years and years to come.
CURRICULUM VITAE
Scott W Godfrey
EDUCATION
Doctor of Philosophy in Computer Science, 2016, University of California, Irvine (Irvine, California)
Master of Science in Computer Science, 2014, University of California, Irvine (Irvine, California)
Master of Science in Aerospace and Mechanical Engineering, 2010, University of California, Irvine (Irvine, California)
Bachelor of Science in Aerospace and Mechanical Engineering, 2009, University of California, Irvine (Irvine, California)
Associate of Science in Mathematics, 2007, Orange Coast College (Costa Mesa, California)
RESEARCH EXPERIENCE
Graduate Student Researcher, 2010–2015, University of California, Irvine (Irvine, California)
Technology Transfer Intern (Intellectual Property), 2013–2014, University of California, Irvine, Office of Technology Alliances (Irvine, California)
TEACHING EXPERIENCE
Teaching Assistant, Reader, 2011–2016, University of California, Irvine (Irvine, California)
PROFESSIONAL EXPERIENCE
Software Performance Engineer, Parallel Architect, 2014–2016, MSC Software Corporation (Newport Beach, California)
Consulting Software Engineer, Parallel Architect, 2011–2013, HRL Laboratories, LLC (Malibu, California)
Senior Software Engineer, Senior Technical Lead, 1999–2014, Quicksilver Software, Inc. (Irvine, California)
REFEREED JOURNAL PUBLICATIONS
“Compressive Strength of Hollow Microlattices: Experimental Characterization, Modeling and Optimal Design,” Journal of Materials Research, 2013
“MEMS resonant load cells for micro-mechanical test frames: Feasibility study and optimal design,” Journal of Micromechanics and Microengineering, 2010
REFEREED CONFERENCE PUBLICATIONS
“A novel modeling platform for characterization and optimal design of micro-architected materials,” 2012 AIAA Structural Dynamics and Materials Conference, Apr 2012
ABSTRACT OF THE DISSERTATION
Cross-System Runtime Prediction of Parallel Applications on Multi-Core Processors
By
Scott W Godfrey
Doctor of Philosophy in Computer Science
University of California, Irvine, 2016
Professor Amelia Regan, Chair
Prediction of the performance of parallel applications is a concept useful in several domains of software operation. In the commercial world, it’s often useful to be able to anticipate how an application will perform on a customer’s machine with a minimal burden to the user. In the same spirit, it’s in the best interest of a user/consumer of computational software to operate it as efficiently as possible. In the super-computing/distributed computing world, being able to anticipate the performance of an application on a set of compute-nodes allows one to more optimally select the set of nodes to execute on. In a large-scale shared computing environment where parallel computational jobs are assigned resources and scheduled for execution, being able to do so optimally can improve overall throughput by decreasing contention. In all cases, being able to anticipate the ideal degree of parallelism to invoke during execution (and to have reasonable expectations for what can be achieved) will lead to more optimal use of all resources involved. For any of this to be possible, a good model (or models) is required which can not only capture an application’s performance on one machine but also predict its behavior on another.
Here, we present a large family of performance models composed of discrete parts, all as combinatoric variations on Amdahl’s Law. We establish a protocol involving thorough benchmarking of the application on a known system. A protocol is also established for the collection of meaningful machine architecture and performance information for the known and target machines. With the resulting high quality models and a single execution of the application on the target system we are able to closely predict its parallel behavior.
We posit that computational applications in need of this kind of treatment are sufficiently sophisticated and, especially in the case of commercial applications, most likely black boxes; we therefore avoid any need to analyze the applications statically and rely expressly on the parallel runtimes of individual executions. The protocols and methods can be implemented by any skilled developer on conceivably any parallel platform without the need for specialized API’s, hardware diagnostic support, or any manner of reverse-engineering of the applications of interest.
Chapter 1
Introduction
The availability and ubiquity of modern parallel processors has led to parallel implementa- tions for many applications. Many applications which are now subject to parallel processing on desktop multi-core systems bear little resemblance to the kinds of applications tradition- ally run on large-scale supercomputers in either form or function.
The need, or rather the opportunity, to schedule parallel tasks or operate parallel applications arises on many occasions. The problem of scheduling parallel tasks presents itself in a manifold of variations on the same general principle: having some quantity of independent tasks to perform and some quantity of resources which allow for multiplexed operation. These tasks may be packed inside an application, all under the hood, or else realized individually.
Scheduling can be performed with a full spectrum of knowledge about the applications and architectures, ranging from blind execution to having perfect knowledge. Modern operating systems have their own internal task schedulers which are preemptive but are entirely blind. Adding knowledge about applications and architectures into the scheduling equation can improve performance. Therefore, many applications embed their own specific scheduler on top of existing infrastructures [12]. However, obtaining the appropriate information and organizing it in an actionable manner can be difficult. Analytical models, which can view a program or the underlying system at a higher level of abstraction than measurement or simulation techniques, can therefore play a complementary role to those methods [1].
The simplified performance models traditionally implemented in most task schedulers are too simple, even simpler than the simplest model we present here, Amdahl’s Law, to make reasonable predictions of runtime and effective use of resources.
While other notions of performance like scalability and speedup may seem interesting, they have little tangible meaning for real-world work and they typically rely on runtime analysis derivatives. “Execution time is by far the most important measure of interest. Therefore performance prediction should be in terms of execution.” [95]. With rare exception, no matter what we’re doing or how we are going about it, we’re always working to minimize runtime in some way even if there are secondary goals to balance.
Here, we present a family of performance models with increasing complexity which are developed based on Amdahl’s Law. Variables in the performance model are either fully abstract quantities (generic parameters of a curve-fit), or quantities inferred as parameters or invariants of the application. Constants pertaining to qualities of the host machine are also used. Integral variables may be specified to select a best-fit value from an array of possibilities. For example, memory speed may be read/write, random/sequential, and pertain to L1/L2/L3/main memory.
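As a concrete illustration of these three kinds of quantities, a minimal sketch follows (hypothetical names and values only, not those of the actual implementation described in later chapters):

from dataclasses import dataclass, field

# Hypothetical sketch of the three kinds of model quantities described above.
@dataclass
class ModelParameters:
    # Fully abstract quantities: generic curve-fit parameters.
    abstract: dict = field(default_factory=lambda: {"a0": 1.0, "a1": 0.0})
    # A quantity inferred as an invariant of the application,
    # e.g. the sequential fraction s of Amdahl's Law.
    sequential_fraction: float = 0.1
    # Constants pertaining to the host machine (measured, not fitted).
    machine: dict = field(default_factory=lambda: {"mem_bw_GBps": 12.0, "cores": 4})
    # Integral variable: index selecting which measured memory speed applies
    # (e.g. 0..3 for L1/L2/L3/main memory, read or write, random or sequential).
    mem_level: int = 3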
These performance models are built for the purpose of cross-predicting parallel application performance over a range of machines with multi-core processors of different architectures. Inferences are made about the underlying applications through fitting the models to experimental data obtained from a variety of machines and architectures (figure 1.1, figure 1.2). It is well known that the characterization of a parallel machine is more complex than a uniprocessor, because of the interaction among processors [74]. Our models are appropriate for single machines which, of course, are the building blocks of multi-computers. Methods to predict scalability accurately are necessary in order to improve throughput and overall efficiency on large-scale machines [9].
Figure 1.1: Explained in detail later, our predictive system relies on application benchmark data, machine-specific information and benchmarks, and varying combinations of model parts to generate predictive models. An optimizer provides solutions for fitting the particular predictive models to the benchmark data, and the resulting fitted models are used for performing cross-predictions between machines.
The primary contributions of our work are that, from our family of models, we can:
1: Infer from local benchmarks the parametrically averaged structure of the application being evaluated
2: Predict the runtime of said application on a target machine of similar architecture and, arising from high quality runtime predictions,
3: Determine the ideal processor count for operating a parallel application on a target machine.
Figure 1.2: The overall system is simple to understand schematically. Parallel software applications and parallel computer systems are fed into a benchmarking system which generates benchmark data for the applications operating on the machines and also benchmark data which specifically characterizes the machines, providing a basis for inter-relationships. All benchmark data is fed into a statistical analysis and optimization system with the modular performance models. This system performs curve-fitting of the models to the data and also evaluates cross-predictions between machines. All outcomes are statistically evaluated and ranked according to least mean error to output a set of validated models.
Systems considered here are specifically shared-memory SMP’s with applications architected with either explicit threads or task-parallel API’s like OpenMP [94]. Network communication and message passing interfaces like MPI[46] are not considered for models at this level.
Users, be they end-users or the actual developers, generally know little about the performance profile of the applications they run and also the machines they operate on. Consequently, they often cannot accurately predict the best number of processors to use, leading to application slowdown and reduced throughput. Knowing how best to operate an application is difficult. The ideal number of processors to use varies with both the application and the specific machine under consideration, and sometimes even the data being evaluated. Predicting the parallel efficiency of applications without first executing them is an enormous challenge [9].
It is, of course, necessary to collect performance data on an application and architectural performance information about the machines it will operate on. Without machine-specific information, cross-prediction will be infeasible and, at best, a matter of luck. Ideally, we will be able to minimize the required information. Rosas and Barnes (2011) both try ‘small’ core counts on what would otherwise be large machines which may not afford extensive testing or ready availability of a large number of cores. They report that low core count runs provide enough information on the fundamental behavior of parallel code and that several program executions on a small subset of the processors are all that is necessary to predict execution time on larger numbers of processors [9]. However, given the complexity of modern processors, we wonder if this approach yields sufficient information for a high quality prediction.
Chapter 2
Review of Related Literature
Because the topic presented in this dissertation is multi-faceted, we choose to present our literature review inline, with discussion in the relevant chapters. We discuss modern parallel processors in Chapter 3, parallel performance models in Chapter 4, operating system in Chapter 5, parallel applications in Chapter 6, performance modeling in Chapter 8, error analysis and metrics in Chapter 11, and optimization in Chapter 12.
Chapter 3
Modern Parallel Hardware Technology
3.1 Flynn’s Taxonomy
Under Flynn’s Taxonomy, the applications we consider fall into the task-parallel (the fork-join model) and multiple-instruction/multiple-data (MIMD) classifications. We also consider multiple-program/multiple-data (MPMD) scenarios operating on separate cores of a common processor. There may be internal, local aspects of any application which may be compiled under the single-instruction/multiple-data (SIMD) data-parallel paradigm, but this is a small part of the applications of interest to our research. Applications compiled with architecture-specific SIMD instruction targeting are necessarily restricted in the machines and architectures they can operate on, and so we don’t break this out as a separate detail; it is a low-level implementation matter. Some experimental applications in this project are compiled in this way and are correspondingly restricted.
Processors of interest to us are those which are general computation main system processors with small numbers of cores on uniform memory-access architectures (UMA), which may involve one or more separate processors such as canonical symmetric multi-processors (SMP). Non-uniform memory access architectures (NUMA) with multiple processor sockets internally networked with separate memory attached to each socket are outside the scope of this work.
3.2 Development of CMP Multi-Core
In 2006, multi-core processors were widely adopted with the advent of Intel Core-2 chips. Currently a broad variety of chips are available from Intel (see table 3.1 for examples), AMD (see table 3.2 for examples), Samsung, Qualcomm, etc., and it’s nearly impossible to acquire a computer without hardware parallel processing through normal consumer channels. Earlier, symmetric multi-processor (SMP) chips and systems existed where every processor was physically identical, separate, and mostly independent of all the others. SMP systems were generally only available through a small number of vendors targeting specific high-performance markets, as operating system and application support was also quite unusual. Modern multi-core chips are characterized by more shared on-chip resources, particularly the multi-level memory cache hierarchy. Shared resources have led to lower and less predictable performance than with older architectures; the cost and complexity is dramatically reduced, however. Architectural designs vary in core count, cache hierarchy size, cache hierarchy depth, and cache coherence and eviction policies.
Since 2006, the number of processors and hardware threads has increased, the processor has absorbed the memory controller (Northbridge chip), and the cache hierarchy has gotten larger and deeper with L3 cache becoming standard and L4 cache coming into the market recently. These advances boost performance, but the gap between memory bandwidth and processor speed (popularly referred to by various names, including “The Memory Wall”) is generally regarded as the single largest factor limiting scalability in parallel applications running on modern processors.
Memory Wall: Due to shared resources in the memory hierarchy, multi-core applications tend to be limited by off-chip bandwidth. [40].
Bandwidth to main memory: While main memory access is supported with a multi-level caching system, it is regarded as the chief limiting factor to performance on modern computers. Other shared resources throughout the system do not typically have as extreme adverse effects on high performance computational systems, but any contentious aspect leads to performance degradation in a parallel computing environment.
Shared memory bandwidth has a negative effect on concurrently executed applications as each application makes unique demands on the memory system. As the operating system schedules alternate execution of applications, large portions of the cache hierarchy must be disrupted to accommodate new tasks and shared between them.
Shared memory bandwidth penalizes parallel applications due to the progressive starvation of increasing parallelism. Parallel applications already suffer from asymptotic speedup due to fractional sequentialization, as demonstrated by Amdahl’s Law (see, for example, [53]). In general, researchers present a consistent message about the state of technology today.
For example, [40] and [93], both operating with large-scale cluster supercomputers composed of multi-core nodes, argue that the Memory Wall is a reality. While Simon [93] specifically assesses application performance on several large-scale computers, Diamond [40] identifies that almost every aspect of the memory hierarchy being shared, be it L3 capacity or off-chip main memory bandwidth, has negative implications for performance. Diamond finds that making full use of a typical quad-core processor is a difficult and rare event and expresses concern for practical utilization of larger-scale chips promised for the near future (circa 2011). To this day, quad-core processors (which, with the addition of SMT hardware threads, become 8-thread machines) are probably the most common performance processors on the market, with few forays into conventional processors with many more cores.
[18], [52], and [109] discuss the fact that the memory wall is real and contention is a huge issue. Gupta [52] performs some very nice experiments physically altering the structure of their computer in order to evaluate two different memory bandwidths for a range of applications and shows higher scalability distinctly tied to increased available bandwidth.
Williams [109] works with floating-point intensive applications and works to not only charac- terize their performance, but to improve it as well. They use as a benchmark the theoretical limit of floating point performance on the particular machine and determine the amount of bandwidth necessary to achieve such. The limiting factor is, of course, the actual bandwidth available on the system. Each application variation is evaluated for its floating point per- formance and bandwidth consumption to establish where in the world of real and feasible performance it lies. Inspired by Williams [109], Chatzopoulos [18] works with statically and dynamically obtained application data in an attempt to determine on-chip and off-chip de- mand to estimate scalability. They find the ratio of on- to off-chip demand to be essentially meaningful.
Interestingly, Sun [96] argues that the memory wall is real, but not such a big issue, in theory. Through some manipulation of Amdahl’s and Gustafson’s Laws and the utilization of some assumptions not valid for current designs, he asserts that whole system architecture needs to be addressed, focused primarily on the memory hierarchy, in order for multi-core performance to improve.
3.3 Intel Core2 Architecture
The first commercial release of 64-bit processors from Intel was in the Core2 product line which arrived in 2006. Multi-core Core2 processors were either from the Core2-Duo [26] or Core2-Quad [34] product lines with either two or four cores in the package. The memory hierarchy here is quite simple, with an L1 cache exclusive to each core and L2 cache shared between each pair of cores on the die. The processors were designed with two cores per die and one or two dies per processor package for the Duo and Quad configurations. No L3 cache was present on these chips. See figure 3.2.
3.4 Intel i7 Architecture
Following the Core2 series of processors, several branded product lines were introduced serving different markets: low-end, mainstream, and high-end/business. These were the i3, i5, and i7 [33] series processors, correspondingly, and were distinctly different from the Xeon [25] series of server processors targeting high performance workstation and server markets. i7 processors came to market in 2008.
With a multi-level cache hierarchy, different levels of the cache are shared by different processor cores. In the case of Intel i7 processors, a single L3 cache is shared by every processor core within the processor package (typically four or six) and each processor core contains an L2 and L1 cache. The L1 cache is then shared between its two logical cores or hardware (Hyper-)threads (two per core with current designs), also known as simultaneous multi-threading (SMT) [Hyper-threading is an Intel proprietary technology]. Each L1 cache is split into two equal parts to serve for data and instructions separately. See figure 3.1.
3.5 Hyperthreading/Hardware Threads
Hardware threads inside a core may not always operate concurrently and instead operate alternately and opportunistically based on the availability of dependent information in the cache hierarchy. While one thread is waiting for a fetch from memory, the other may compute so long as it has the required resources. There are more and less optimal placements for threads on the processor, but the operating system, despite knowing the structure of the processors, often is unable to capitalize on the structure. Noteworthy is that the hardware threads themselves are not different from each other except with regards to the way they are paired and their opportunity to co-execute. They are physically indistinct and are essentially separate execution contexts with substantial portions of the core shared between them. Contrast the relationships of hardware threads to L1 cache in figure 3.1 versus figures 3.2, 3.3, and 3.4.
Realistically, because of opportunistic utilization of resources, processors do not operate symmetrically, despite their physical geometric symmetry. In current architectures, individual hardware threads are identical. For purposes of notation here, where a core has two hardware threads, if only one is active with a software thread it will be considered to be a ‘complete’ core (n_c). When two hardware threads are active in the same core with software threads, one thread will be considered ‘complete’ and the other ‘virtual’ (n_v), with the pair ‘shared’ (n_s). In principle, these are either scheduled or yielding to the other, flipping the notion of complete and virtual between the two. The system performance is to some degree an average of the performance of the two threads. Scogland et al. emphasize that processors are physically symmetric in hardware, and circumstances of execution then lead to substantial asymmetry in behavior [91].
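As an illustration of this notation only, the following small helper (hypothetical, assuming two hardware threads per physical core and counting exactly as stated above) tallies the three quantities for a given placement:

def count_thread_kinds(threads_per_core):
    """Count 'complete' (n_c), 'virtual' (n_v), and 'shared' (n_s) entries for a
    placement, given the number of active software threads (0, 1, or 2) on each
    physical core. A lone thread on a core counts as complete; a doubly occupied
    core contributes one complete thread, one virtual thread, and one shared pair."""
    n_c = n_v = n_s = 0
    for occupancy in threads_per_core:
        if occupancy == 1:
            n_c += 1
        elif occupancy == 2:
            n_c += 1
            n_v += 1
            n_s += 1
    return n_c, n_v, n_s

# Example: 6 software threads packed onto a 4-core, 8-hardware-thread processor.
print(count_thread_kinds([2, 2, 1, 1]))  # (4, 2, 2)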
Computer Name | Intel | Speed | Cores | HT’s | P L1 | P L2 | P L3 | #L1(d+i), 2, 3
CoyoteTango | Core 2 Quad Q8200 [34] | 2.33GHz | 4 | 4 | 256kB | 4MB | NA | 8/2/0
RomeoBlue | Core 2 Duo E7400 [26] | 2.80GHz | 2 | 2 | 128kB | 3MB | NA | 4/1/0
Ares | Core i7-860 [33] | 2.80GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
Styx | Core i7-4820K [32] | 3.70GHz | 4 | 8 | 256kB | 1MB | 10MB | 8/4/1
ChernoAlpha | Core i7-4720HQ [31] | 2.60GHz | 4 | 8 | 256kB | 1MB | 6MB | 8/4/1
Xerxes | Core i7-3930K [29] | 3.20GHz | 6 | 12 | 384kB | 1.5MB | 12MB | 12/6/1
L09473-1 | Core i7-4700MQ [30] | 2.40GHz | 4 | 8 | 256kB | 1MB | 6MB | 8/4/1
QSI-PC | Core i7-2700K [27] | 3.50GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
L09473-2 | Core i7-2820QM [28] | 2.30GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
CrimsonTyphoon | Core i7-860 [33] | 2.80GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
YourMom | Xeon E5335 [25] X2 | 2.00GHz | 8 | 8 | 512kB | 16MB | NA | 16/4/0
Table 3.1: Some Intel-based architectures used for experiments and comparative consideration.
Computer Name | AMD | Speed | Cores | HT’s | P L1 | P L2 | P L3 | #L1(d+i), 2, 3
StrikerEureka | FX-8350 [24] | 4.00GHz | 4 | 8 | 640kB [128,512] | 16MB | 8MB | 16/8/1
MCMA | Opteron 6134 X4 | 2.30GHz | 32 | 32 | 4MB | 16MB | 80MB | 64/32/8
Table 3.2: AMD-based architectures for comparative consideration.
Figure 3.1: Typical structure of an Intel i7 processor. Four cores (sometimes more) including L1 and L2 cache, each with two hardware threads sharing L1 cache. All cores share a common L3 cache.
Figure 3.2: Intel Core 2 architecture predates the i7. Each core maintains its own L1 cache and pairs of cores share L2 with no L3 at all. The Core 2 Quad consists of two identical processing units in one CPU package.
Figure 3.3: The AMD FX series of processors boasts a memory hierarchy with substantially less contention. L1 and L2 caches are exclusive to each core while L3 is shared.
Figure 3.4: The Intel Xeon E5335 is more reminiscent (and contemporary to) the Core 2 architectures. L1 caches are each dedicated to independent cores with L2 shared between core pairs. L3 is not present.
Chapter 4
Parallel Performance Models
4.1 Models of Parallel Computation
(Practical Domain)
4.1.1 Amdahl’s Law
Amdahl’s Law [4] describes the most basic concept of parallelism by taking a fixed application and distributing the computation portion of it (the parallelizable part, p) over n separate resources (processors). It is generally expressed as:
T_p = T_s (s + p/n),    s + p = 1,   p ∈ [0, 1]
where T_s is the sequential runtime of an application and T_p is the parallel runtime when operated on n processors. As s increases, the opportunity for parallel performance diminishes. Amdahl’s Law deals with what are considered to be ‘fixed-size’ problems, which are expected to get faster with the assignment of more computational resources. Fixed-size problems are prevalent in domains where the size of the computation is either limited by the size of the machines available (and machines with more processors/cores do not necessarily accommodate proportionately more system memory) or are already solved/solvable to a degree which does not demand higher resolution or more work.
As discussed in a recent survey paper by Al-Babtain et al., Amdahl’s Law finds itself extended in a variety of ways to consider different multi-core architectures [3]. These extensions are oriented more towards making hypothetical architectural decisions, rather than working in the software domain and understanding performance.
In application, Shi found in 1996 that the sequential and parallel fractions are not practically obtainable and that they generally neglect further overheads involved in parallelization [92], which, at this point in history, may substantially include behavior of the particular computer architecture and not just software mechanisms.
The types of applications we consider fall under the ‘fixed-size’ problem domain and we use Amdahl’s Law as the starting point for our work.
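For illustration, a minimal sketch of Amdahl’s Law as used here, evaluating the predicted parallel runtime and speedup of a fixed-size problem (the numbers are illustrative only):

def amdahl_runtime(t_seq, p, n):
    """Predicted runtime of a fixed-size problem with parallelizable fraction p
    (s = 1 - p) when run on n processors, per Amdahl's Law: T_p = T_s (s + p/n)."""
    s = 1.0 - p
    return t_seq * (s + p / n)

t_seq, p = 100.0, 0.95          # illustrative: 100 s sequential runtime, 95% parallel
for n in (1, 2, 4, 8, 16):
    t_n = amdahl_runtime(t_seq, p, n)
    print(f"n={n:2d}  T_p={t_n:6.2f} s  speedup={t_seq / t_n:5.2f}")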
4.1.2 Gustafson’s Law
Gustafson’s Law [53] describes a concept of parallelism as follows. As additional computa- tional resources are applied to an application, the size of the parallel computational portion increases proportionately. Therefore, a problem is better solved (higher resolution, smaller error, etc.), but in essentially the same time. These types of problems are referred to as ‘fixed-time’ problems and are prevalent in large-scale computational environments such as weather prediction which are under practical constraints for the availability and utilization of their outcomes. Through the progressive improvement of computer hardware, a particular problem may in practice transition from being a fixed-time to a fixed-size problem.
17 Substantial controversy seems to occur in the literature between these two laws depending on what type of parallel computation is used. Arguments are posed as to which is right while the applicability of these laws depends entirely on the details of the applications at hand.
4.1.3 Unification of Amdahl’s and Gustafson’s Laws
Shi argues that the two laws are essentially the same [92]. Juurlink and Meenderinck [60] attempt to compromise between Amdahl and Gustafson with an enhancement for asymmetric and dynamic multi-cores. Hill and Marty [56] extend Amdahl’s concept with some basic models for more sophisticated multi-core designs. Gunther’s Universal Scalability Law (USL) [51] was developed to unify the two models.
4.1.4 Parallel Speedup
The concept of parallel speedup exists as an evaluation of how much faster an application becomes with the utilization of additional computational resources. Generally the expression Speedup = T(1)/T(n) is relied upon, with T(n) expressing the runtime of an application on n resources, but not without general controversy over the T(1) term. “Since parallel implementations may introduce computations that are unnecessary with respect to serial implementations, T(1) is the time required to execute the task on a single processor using the ‘best’ serial implementation.” [13] Whether T(1) is a sequential application or else a parallel application operated on one thread is highly circumstantial. Serial implementations may not generally exist for general parallel applications for any number of reasons, including budget, time, and lack of further utility. Herein, T(1) means a parallel application operated on a single thread or processor, physically the same executable as used for T(n).
The expression for speedup derived from Amdahl’s Law is Speedup = 1/(s + p/n). The result is diminishing returns through added parallelism and assumes a fixed problem size. Gustafson’s Law resolves to Scaled Speedup = N + (1 − N)s, where N corresponds to both the processor count and the problem size, which vary together. Counter to Amdahl’s speedup, continual improvement is achieved through the corresponding increase of problem size.
It’s worth noting that while speedup is an interesting metric, information is lost. “The continued reliance on speedup as the primary measure of performance may be attributed to the use of execution time as the unit of measure. Using time as a measure of work has several drawbacks. First, it varies with the computer used. Second, it is simply a statistic which does not provide any insight about the algorithm [implementation].” [13]. By predicting runtime we can always use that to generate scalability. If we were to focus only on scalability prediction, the real-world connection, i.e. how long will it actually run, would be missed.
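For contrast, a small sketch evaluating both speedup expressions for an illustrative sequential fraction (the values are arbitrary, chosen only to show the diverging trends):

def amdahl_speedup(s, n):
    """Fixed-size speedup: 1 / (s + (1 - s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_scaled_speedup(s, n):
    """Fixed-time (scaled) speedup: n + (1 - n) * s."""
    return n + (1 - n) * s

s = 0.05  # illustrative sequential fraction
for n in (2, 4, 8, 16, 64):
    print(f"n={n:3d}  Amdahl={amdahl_speedup(s, n):6.2f}  "
          f"Gustafson={gustafson_scaled_speedup(s, n):7.2f}")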
4.1.5 Roofline
Prinslow emphasizes the notion of computational intensity as a major point of interest, discussing how program blocks may be either compute-bound or memory-bound [83]. Williams, Watterman, and Patterson develop the Roofline model motivated towards the diagnosis and improvement of parallel applications [109]. Roofline formalizes this into a performance-analysis framework for optimizing implementations. Roofline relies on measures of memory bandwidth and also computational power (generally giga-flops [GFLOPS]) and their ratio: ‘Operational Intensity’. Roofline allows one to characterize an application as being clearly memory- or compute-bound. Nugteren and Corporaal bring Roofline into the theoretical domain and extend it for the analysis of algorithms [79].
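A minimal sketch of the Roofline bound itself, assuming only the two machine measurements the model requires; the peak figures below are illustrative, not measured values:

def roofline_attainable_gflops(operational_intensity, peak_gflops, peak_bw_gb_s):
    """Attainable performance under the Roofline model:
    min(peak compute, operational intensity x memory bandwidth)."""
    return min(peak_gflops, operational_intensity * peak_bw_gb_s)

peak_gflops, peak_bw = 100.0, 25.0    # illustrative machine measurements
for oi in (0.25, 1.0, 4.0, 16.0):     # FLOPs per byte moved to/from DRAM
    bound = roofline_attainable_gflops(oi, peak_gflops, peak_bw)
    kind = "memory-bound" if bound < peak_gflops else "compute-bound"
    print(f"OI={oi:5.2f} flop/byte  attainable={bound:6.1f} GFLOPS  ({kind})")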
Mega-flops (MFLOPS) [59], millions of floating-point operations per second, has been a very traditional measure for the performance capabilities of computational hardware and a reasonable target for any piece of software to achieve on such a platform. This measure seems to be less often reported but does appear in recent literature in comparisons of architecture performance [11]. The meaningfulness of such a measure is less relevant when the performance of tasks which are specifically non-numerical, or even transitioning away from floating-point, is at hand. It’s also one of the easiest measures to abuse when making performance claims [7]. Frigo and Johnson measure runtime but then express their results in MFLOPS for a wide range of software (where the internal operations are unknown) with the caveat that “The MFLOPS measure should, thus, be viewed as a convenient scaling factor rather than as an absolute indicator of CPU performance” [47].
MIPS (Millions of Instructions Per Second) is another hardware-centric performance metric often encountered. It’s simply a measure of how many machine instructions are processed per second. A drawback of Roofline is that it specifically relies on some performance quality like MIPS or FLOPS. Frigo and Johnson observe that “...there is no longer any clear connection between operation counts and speed, thanks to the complexity of modern computers.”
4.2 Algorithmic Computational Models
(Theoretical Domain)
Algorithms exist simply as concepts and have no tangible performance measure, only theoretical. Theoretical performance is expressed asymptotically, in big ‘O’ notation, as a notion of time relative to the size of the input while the input approaches infinity. Big ‘O’ notation drops all but the most significant terms in the expression and also drops all constant coefficients, yielding typically quite simple expressions. To measure actual performance of an algorithm it must be implemented in a language and operated on an actual computer. We refer to Schatzman: “...we should note that computer languages are neither fast nor slow − only implementations can truly be associated with speeds” [89].
The Random-Access Machine (RAM) [42] is a model for analyzing algorithms on an ideal sequential machine. Parallel Random-Access Machine (PRAM) [43] is an extension to RAM for parallel algorithms on ideal shared-memory machines. The LogP machine model [38] is also a parallel machine model for distributed systems. Bulk Synchronous Parallel (BSP) [102] is another parallel model including more substantial communication concepts for distributed systems. These models are all focused on algorithm analysis on abstract machines. With this focus, they are language independent and know nothing of actual real technologies but rely on their parametric shapes.
Other more advanced models of parallel computation exist which fall into the theoretical domain of system modeling and algorithm performance optimization. Valiant invents the Multi-BSP model, derived from the BSP model, for aiding in the design of ‘portable algorithms’ which may be simply adjusted in a predictable way at compilation or implementation time for ‘optimality’ [103]. Here, both the algorithms and hardware are necessarily white-box and grey-box entities respectively (contents are well or sufficiently known), so predictive capabilities are both preemptive and restricted to situations with explicit knowledge. No attempt is made at system identification, so the model lacks retroaction. Variations in implementation are also outside the scope here.
Where time complexity (the ‘big-O’ notation and its relatives) is used for assessing algorithms, Chellappa et al. describe the serious problems that exist when trying to use any kind of algorithm-oriented model for any kind of actual prediction on real machines: “The O-notation neglects constants and lower order terms; for example, O(n^3 + 100n^2) = O(5n^3). Hence it is only suited to describe the performance trend but not the actual performance itself. Further, it makes a statement only about the asymptotic behavior, i.e. the behavior as n goes to infinity. Thus it is in principle possible that an O(n^3) algorithm performs better than an O(n^2) algorithm for all practically relevant input sizes n.” [19] n, of course, is a very finite quantity due to the structure of real hardware. Further, they observe two orders of magnitude difference in runtime over four different parallel implementations of matrix multiply, each requiring exactly 2n^3 operations, and conclude that correlating actual runtime to time complexity is unlikely. [19]
Singh notes that theoretical models of (parallel) computing like RAM and PRAM are useful for algorithmic analysis but not much else [94].
Regarding algorithmic analysis, Crovella concurs “Although analysis provides the concep- tual tools to predict parallel program performance, most previous work in analysis has not been directly used by programmers to predict performance of real applications for two rea- sons:”, “...alternative implementations of a program may often have the same asymptotic performance function, yet differ in important ways in the values of the associated constants”, “The work required in developing an analytic model can greatly outweigh the effort in sim- ply implementing and measuring a proposed alternative program structure”[37]. The same variation may be true with just the performance of compilers or interpreters for a given codebase.
4.3 Drawbacks in the Modern Era
Except for Roofline and its variants, which are specifically developed to address the constraining effect of memory bandwidth on modern processors, the other models suffer substantially in their applicability and portability in modeling the performance of applications on modern systems. Any method for modeling an algorithm will show wide variation when used for an actual application due to high-level language choice, data structure selection, library/API, and compiler/interpreter effects. With the further complexities of machine hardware, there is little opportunity for such simple models to provide accurate predictions for real applications.
Pre-multi-core, it was already known that neither Amdahl’s Law nor Gustafson’s Law was sufficient to identify invariants of the application such as the sequential (parallel) part [70]. Into the multi-core era, it has become quite clear that, because machines differ, speedups obtained on one machine may not translate to speedups on others [103]. See figure 4.1.
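The fitting exercise reported in Figure 4.1 below can be outlined as follows; this is a sketch with placeholder measurements (not the data of Figure 4.1), finding the parallel part p that minimizes the mean absolute error between Amdahl’s Law and normalized runtimes:

import numpy as np
from scipy.optimize import minimize_scalar

def amdahl_normalized(p, n):
    """Normalized runtime t(n)/t(1) predicted by Amdahl's Law."""
    return (1.0 - p) + p / n

def fit_parallel_fraction(n_values, t_norm):
    """Find p in [0, 1] minimizing the mean absolute error (MAE) against
    measured normalized runtimes, as in Figure 4.1."""
    def mae(p):
        return np.mean(np.abs(amdahl_normalized(p, n_values) - t_norm))
    result = minimize_scalar(mae, bounds=(0.0, 1.0), method="bounded")
    return result.x, result.fun

# Placeholder measurements (NOT the data of Figure 4.1): normalized runtimes
# t(n)/t(1) observed at several thread counts on one machine.
n_values = np.array([1, 2, 4, 8], dtype=float)
t_norm = np.array([1.00, 0.55, 0.33, 0.24])
p, err = fit_parallel_fraction(n_values, t_norm)
print(f"best-fit parallel fraction p = {p:.3f}, MAE = {err:.4f}")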
Figure 4.1: The normalized runtimes t_n = t(n)/t(1) for a parallel solid meshing application are presented for cross-system comparison. Two different machines of similar vintage but different architecture are operated: ‘Yourmom’ is a dual-socket SMP Intel Xeon E5335 [25] and ‘Ares’ a first-generation Intel i7-860 [15][33]. Ares doubles for two platforms with Hyperthreading both enabled (4 cores, 8 hardware-threads) and disabled (4 cores, 4 hardware-threads). The theoretical performance curve of a perfectly parallel application (no sequential part) is also presented. With no variation in either the application or its dataset, nontrivial differences in the actual performance on real hardware are revealed. Where the core concept of Amdahl’s Law, t(n) = t(1)(s + p/n), is describing the invariant proportion of the software which is either sequential or parallel, this demonstrates that Amdahl’s Law alone is not predictive of software performance on modern hardware, nor can it identify the invariants of the implementation. Using the runtime data for these three platforms and minimizing the mean absolute error (MAE) in finding the invariant parallel part p ∈ [0, 1] across that data, we obtain the following results:
Ares (HT off): p = 0.768004, MAE = 0.0022506
Ares (HT on): p = 0.728, MAE = 0.0107833
Yourmom: p = 0.984998, MAE = 0.000749495
If all results are evaluated simultaneously: Composite, p = 0.768004, MAE = 0.0705001.
Evaluation with Yourmom reveals nearly perfectly parallel behavior by both visual inspection and also the derived parallel portion p = 0.984998 : 98.48%. On the other hand, evaluation on either Ares variant suggests substantial deficiency in parallelism with p = 0.768004 : 76.80%. Clearly, the architectural differences between the machines are the cause of the performance variation and a more sophisticated expression is necessary.
4.3.1 Runtime Variability, Performance Uncertainty, and Noise
Variation in any quantity makes individual instances of that quantity more difficult to predict. Barnes observes that operational noise in a system causes random execution time variability which leads to reduced accuracy of scalability models [9]. If the magnitude of variation is sufficiently large, the quality of any discrete prediction will lose its meaning. Barnes notes that significant variability in runtime leads to overall difficulty and reduced accuracy for performance and scalability prediction [9].
Performance variation can arise from several causes. Operating system and kernel opera- tion during parallel computation perturbs program runtimes[78]. A multitude of different services in the operating system will have different perturbing effects with varying duration and magnitude. Even a well-written application in a controlled environment will realize perturbation.
Hardware effects in complicated systems are increasingly interdependent. Hennessy finds that performance-helpful aspects of modern processors may not be universal improvements. For example, microarchitectural features aimed at a specific program behavior could negatively impact some applications [55].
Even before we had access to multi-core machines it was clear that, while the main factors in the performance of parallel programs were the computational workload and the communication required between processes, contention for shared resources and the associated synchronization constructs caused further delays. Delays due to hardware disproportionately impact parallel rather than sequential code due to the specialized synchronization requirements that arise [1].
Program behavior may be unpredictable and unrepeatable due to memory behavior, execution skew across several processors, and measurements which disturb the actual performance [40]. Increasingly popular in parallel development are parallel computing libraries and API’s, often built directly into the compiler, which often do not apply basic concepts like process and thread affinity. While these libraries and API’s have broadened the accessibility of parallelism to many new applications and programmers, they can also introduce further causes for variation.
Necessarily, the variation in the ‘true’ runtimes leads to probabilistic models for prediction [58]. Scalar models are therefore limited and aren’t particularly valuable. To feed and generate a probabilistic model, multiple data points are required. Kramer and Ryan highlight the variability in execution time on distributed systems based on statistically significant performance evaluations on each system using a variety of applications [64].
When cross-prediction is the goal and a probabilistic model is not used, loose bounds seem to be the outcome. Mendes shows upper and lower bounds of factors of nearly 1/4 and 3 [74]. The following year Mendes and Reed show improved upper and lower bounds each consistently within factors of 1/2 and 2 from observed results [75]. In a practical sense, a performance model with no architectural information and no independent machine performance information will yield no viable avenue for cross-prediction. Baker et al. worked with performance optimization on large distributed systems and observed “A general solution is not possible without taking into account the specific target architecture.” [8]. One cannot simply curve-fit analytic expressions.
4.3.2 System Symmetry
Processor hardware is physically symmetric in the geometric sense, but not so much in practice [91]. Studies such as Sun and Chen [96] rely on this symmetry and therefore risk presenting overly optimistic estimates of performance. Most models lack any notion of contention or memory hierarchy, and assume infinite (or at least sufficient) memory bandwidth, implying, intrinsically, that applications are compute-bound. With modern systems and applications, this seems more often a safe assumption for sequential applications. Parallel applications are, of course, more demanding on the system.
4.3.3 Lack of Hierarchy
Flat memory approximation ignores the speed advantage of things cached close to the pro- cessor cores and the slowdown of things stored beyond main memory (disk, network, etc.) [54]. This assumption is also characteristic in the long history of parallel task scheduling in the literature. Performance effects relating to the memory hierarchy may lead to opportu- nities for super-linear speedups even if super-linear behavior is impossible on homogeneous (and symmetric) systems [94].
4.3.4 Continuous Functions
The performance models typically presented in the literature are expressly continuous func- tions. Implicit is the assumption that the parallel work available for computation is infinitely divisible. This assumption is also characteristic of parallel task scheduling in the literature. This assumption is more appropriate to very fine-grained parallelism, balanced work, and unperturbed execution but is not applicable to coarse-grained or task parallelism which are quite prevalent in contemporary systems. Where modern task-parallel API’s are used, even fine-grained loop parallelism is broken into larger task blocks to reduce the overhead of the parallel system.
Both of our implementation case studies involve computations on a voxel space from the beginning to end of an experimental process. Our first case is a time-evolving numerical integration and our second case is a transformation on that data. Both cases are spatially parallel with no opportunity for temporal parallelism [81]. Parallelism is applied, generally, across the x-axis in the 3D space, resulting in a very coarse-grained task parallelization as seen above, which is favorable, or at least minimally antagonistic, to OpenMP [101]. All computations are performed in a synchronous manner with no task communication, with the exception of some trade-offs in feature selection considered later. Tasks are assigned to computational threads dynamically, so communication with the internal parallel API task scheduler is implicit.
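For illustration only, a sketch of the same per-x-slab decomposition follows (the experimental applications themselves are compiled task-parallel codes using OpenMP; the function and sizes here are placeholders):

from concurrent.futures import ThreadPoolExecutor

def process_slab(x):
    """Stand-in for the real per-slab work: operate on the y-z plane at index x."""
    # ... numerical integration or surface-reconstruction work would go here ...
    return x

def run_timestep(nx, num_threads):
    """Coarse-grained task parallelism across the x-axis of the voxel space.
    Slabs are handed to worker threads dynamically, mirroring dynamic task
    assignment by the parallel runtime."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(process_slab, range(nx)))

run_timestep(nx=256, num_threads=4)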
Chapter 5
Operating System Process/Thread Scheduler Effects
Multi-tasking is fully preemptive in modern operating systems. Application processes are allocated processor resources dynamically and scheduled slices of time according to some prioritization or fairness criteria (figures 5.1, 5.2, and 5.3 show the transient nature of this scheduling on Microsoft Windows 10). Processes and their threads may be constrained to particular resources with corresponding affinity masks. Generally, they are created unconstrained for greatest flexibility in scheduling [35].
The migration of threads across different processors causes performance problems as a result of processor architecture. Memory performance is already known to be a limiting factor for modern systems; the cache hierarchy exists to bridge the disparity in performance between processors and system memory. When threads are migrated from hardware thread to hardware thread (core to core), extra work must be performed to flush dirty information from the old core's cache (and possibly the new core's) and then refill the cache on the new core with data to serve the new process as well as the necessary machine instructions. Migration has varying effects depending on the structural relationship between cores. The cache flush may reach down to L1, L2, or L3 depending on the destination; refilling always occurs up to L1, of course. There is potential for reusing existing cached memory depending on the circumstances, but reuse of cached information can only occur for threads sharing a memory address space (i.e. in the same process).

Figure 5.1: CPU history shows steady 50% utilization with 4/8 threads running. Despite the steady usage history, all cores show almost random activity. Affinity is clearly absent.
Figure 5.2: CPU history shows steady 79% utilization (Windows 10 sometimes overstates this quantity) with 5/8 threads running. Activity is unsteady across all cores, but fuller than in Figure 5.1 with 4/8 active.
Figure 5.3: CPU history shows steady 100% utilization (Windows 10 sometimes overstates this quantity) with 7/8 threads running. Activity is unsteady, but still more regular than Figure 5.2 with 5/8 active.
5.1 Process Affinity
Process affinity describes the set of processors in a system on which a particular process may be executed. By setting an application's processor affinity, a process may be constrained to a subset of all available processors as it executes on a system. Affinity may be set externally by the OS, by some other agent acting through the OS, or internally by the application as a recommendation for a subset of the processors made available to it by the OS. If unconstrained, the OS is free to schedule and migrate the process amongst different processors (the OS always has ultimate authority on this matter regardless of how an application configures itself).
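As a minimal sketch only (assuming a Windows environment, consistent with the Windows 10 figures above, and an illustrative mask value), an application can volunteer a process-affinity recommendation as follows:

    #include <windows.h>
    #include <iostream>

    // Constrain the current process to logical processors 0-3. The mask is a
    // recommendation; the OS retains ultimate authority over placement.
    int main() {
        DWORD_PTR mask = 0x0F;  // bits 0-3 -> logical processors 0,1,2,3
        if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
            std::cerr << "SetProcessAffinityMask failed: " << GetLastError() << "\n";
        // ... run the workload under the constrained affinity ...
        return 0;
    }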
5.2 Thread Affinity
Thread affinity describes the set of processors in a system on which a particular thread of execution within a process may be scheduled to execute. Thread affinity is necessarily a subset of process affinity and, again, serves as a strong suggestion to the OS for where to place the thread, but the OS retains ultimate authority on the matter [36].
5.3 Thread Placement
While migrations result in negative performance effects, processor allocations (the placement of threads on processors), even without migrations, will exhibit irregular performance. Even with symmetric and identical cores there exists some degree of contention between them through concurrent access to shared levels of the memory hierarchy. The operating system may or may not consider processor architecture for performance effects, and the application developer may or may not consider it either. Consideration given by the application or the operating system may or may not complement the other, or even be suitable across a range of architectures. Best-case, worst-case, and probabilistic average-case performance can be modeled based on these conflicts.
5.4 Affinity in Practice
While configuring thread affinity should lead to fewer harmful thread migrations during scheduling, the system thread scheduler necessarily has less freedom in scheduling. There exists opportunity, increasing with system load or decreased load balance, for negative performance effects. Where communication or synchronization is required between application threads, spurious deferral of one thread leads to a chain reaction of deferral for dependent (waiting) threads and the temporary idling of those resources. The coarser the parallelism (the larger the tasks), the larger the impact. Fine-grained parallelism with smaller tasks will have shorter idle intervals at the expense of greater task-management overhead. Not all developers set (or know of) affinity, so threads are free for migration. Parallelism is often left fully managed by task-parallel APIs such as OpenMP [39], Intel Cilk Plus [67], Intel TBB [82], etc. Affinity is not universally set by the system.
5.5 Variation in Performance
The variety of possibilities in thread placement and migration, and the dynamic nature of migration alone, can lead to substantial variation in runtime (wall-clock time from start to end including system overhead) for a real application. Transient (and especially continuous) effects within the operating system will only contribute negatively. Any change of execution context will require some degree of flush and fill in the memory hierarchy.
Some substantial efforts have been made toward characterizing and modeling performance in the presence of noise on large-scale computer systems [9]. The degree and variety of transient events, and their effects on individual user-level workstations with off-the-shelf operating systems in real operating environments (home computers, academic and industrial workstations), are innumerable and exceed low-level noise. To counteract transience and also capture variation, our benchmarking relies on a statistically significant number of measurements, outlier rejection, and averaging. Transient events could include, but are not limited to, passive and active user activity (e-mail, Internet, application usage, etc.), system maintenance (updates, disk maintenance), system security (anti-virus, anti-malware, etc.), actual malware (but hopefully not), device drivers, scheduled and recurring events, etc. These events may or may not be detectable.
5.6 Experimental Thread Affinity Effects
This experiment is on a small numerical kernel (a toy) which accesses memory either randomly (upper curves) or sequentially (lower curves). The kernel is run with incrementally more parallel threads in one process, each performing the same work (runtime should remain constant with increased concurrency; this is a ‘fixed-time’ simulation in reference to Gustafson's Law). The memory footprint is intentionally small so as not to emphasize bandwidth limitations.
OpenMP is used for parallelism. Through thread affinity, threads are mapped to the hardware architecture: threads #0,#1,#2,#3,#4,#5,#6,#7 generally map to processors 0,2,4,6,1,3,5,7, where processor pairs 0,1 and 2,3 and 4,5 and 6,7 share L1/L2 cache on the Intel i7 architecture, with specific variations described (figures 5.6, 5.7, 5.8, and 5.9). Where architectures have more or fewer processors (Intel Core2), the pattern is of course compensated appropriately (figures 5.4 and 5.5). Thread affinity is considered in a variety of patterns (a sketch of one such mapping follows the pattern descriptions below):
Figure 5.4: Runtime is plotted versus thread count for various thread affinity assignment experiments. Showing negligible variation in runtime, the Core 2 Duo architecture, lacking SMT technology, is essentially ambivalent to these thread affinity experiments.
‘OFF’ indicates normal system behavior where the OS is free to schedule threads on ANY core.
‘ON’ indicates a strict affinity of one thread per core filling the architecture: #0,#1,#2,#3,#4,#5,#6,#7 mapping to processors 0,2,4,6,1,3,5,7.
‘HALF’ indicates that the first four threads follow the strict mapping of ‘ON’ and the latter four are unconstrained as in ‘OFF’.
‘SMT’ indicates threads map as pairs: #0,#4 to 0,1; #1,#5 to 2,3; #2,#6 to 4,5; #3,#7 to 6,7.
‘HALF SMT’ indicates the first four threads map according to ‘SMT’ and the latter four are unconstrained as in ‘OFF’.
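For illustration only (this is not the dissertation's benchmarking tool), the sketch below shows how an ‘ON’-style mapping could be applied on Windows with OpenMP: software thread k is pinned to a single logical processor following the 0,2,4,6,1,3,5,7 pattern assumed for an 8-logical-processor i7.

    #include <windows.h>
    #include <omp.h>

    // Pin each OpenMP thread to one logical processor per the 'ON' pattern.
    // The mapping array is illustrative and architecture-specific.
    void pin_threads_on_pattern() {
        static const int map[8] = {0, 2, 4, 6, 1, 3, 5, 7};
        #pragma omp parallel num_threads(8)
        {
            int k = omp_get_thread_num();
            if (k < 8) {
                DWORD_PTR mask = static_cast<DWORD_PTR>(1) << map[k];
                SetThreadAffinityMask(GetCurrentThread(), mask);
            }
            // ... per-thread kernel work would follow here ...
        }
    }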
Figure 5.5: Runtime is plotted versus thread count for various thread affinity assignment experiments. Showing negligible variation in runtime, the Core 2 Quad architecture, lacking SMT technology, is essentially ambivalent to these thread affinity experiments.
Figure 5.6: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4820K shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Figure 5.7: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4700MQ shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Figure 5.8: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4720HQ shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Figure 5.9: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-3930K shows almost flat (but slightly deteriorating) performance up to six threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Architectures considered are either SMT-based (figures 5.6, 5.7, 5.8, and 5.9) or non-SMT (figures 5.4 and 5.5). Non-SMT machines are essentially ambivalent to the matter of thread affinity, as demonstrated: all plots overlay with no outstanding variation. SMT-based architectures all behave similarly to one another, but markedly differently than the non-SMT architectures. Of the five affinity designs considered, only three distinct plots generally arise, because several designs are logically equivalent at the OS scheduler level (minor variation leads to a fourth plot).
For affinity OFF, runtime continually increases with increasing thread count (resource contention and migration are implied). With all other options, runtime is flat for up to half of the maximum thread count (no contention is implied, which agrees with the code). As illustrated previously in figure 3.1, beyond half of the maximum thread count, cache contention for L1 and L2 physically becomes a factor between paired threads on SMT architectures (L3 is always in contention) and extends runtimes accordingly.
Chapter 6
Structure of a Parallel Application
6.1 Composition
As described by Amdahl's Law, a parallel application can be most simply considered as an application having some portion which is entirely sequential and the remainder which is parallelizable. Often, the parallel portion is portrayed as a single block of infinitely divisible work, an abstraction which is quite far from reality. The parallel portions are often composed of multiple parallel sections, each separated by sequential blocks (see figure 6.1). Each parallel section is composed of one or more tasks, and each task may itself be parallelized, or the tasks may be collectively processed simultaneously using the task-parallel model.
Beyond just the sequential and parallel portions, Sun and Ni describe applications composed of computational blocks, each with varying degrees of parallelism and demand within an application [97]. These may result from algorithmic limitations, implementation limitations, or data-size limitations (not enough work to spread around). With any block engaging less than all parallel resources, a reduction of the average parallelism or ‘degree of parallelism’ (DOP) [58] of the application results.
Figure 6.1: Parallel software may be structured with a wide variety of patterns. Frequently encountered patterns which negatively affect parallel performance involve sequential computation. Alternating parallel and sequential sections allow for setup/teardown/transition between parallel parts where parallel operation is inconvenient or even impossible to organize. Parallel sections may also include critical sections which moderate access to contended resources. Access is restricted while any thread is occupying the resource, requiring all other threads to wait until the resource is relinquished.
Threads of execution within a parallel section may require access to resources, software or hardware, which can be accessed by only one thread at a time (i.e. exclusively and therefore sequentially). While memory may be read simultaneously from multiple threads, simultaneous reading and writing is a problem. Any resource in memory subject to being updated or written is subject to this kind of constraint. Protection of mutually exclusive-use resources may be through a variety of constructs and mechanisms, with the net assemblies often regarded as ‘critical sections’ or ‘mutexes’ (see figures 6.1 and 6.2).
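As a minimal, illustrative sketch (not drawn from our case studies), the fragment below shows the canonical shape: independent parallel work followed by an update of a shared accumulator that is serialized behind a critical section.

    #include <omp.h>

    // The critical section makes the exclusive-use resource (the shared total)
    // explicit; every thread must wait its turn to perform the update.
    double sum_of_squares(const double* x, long n) {
        double total = 0.0;
        #pragma omp parallel for
        for (long i = 0; i < n; ++i) {
            double contribution = x[i] * x[i];  // uncontended parallel work
            #pragma omp critical
            total += contribution;              // contended, sequentialized update
        }
        return total;
    }

In practice this particular pattern would be written with a reduction clause; the critical section is retained here only to make the contended resource visible.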
The range of problems available in computing is both wide and deep, as computers serve a multitude of needs in a multitude of environments and disciplines. Some problems are of a quality that can be solved in fixed time and scaled in complexity accordingly (Gustafson). Some problems are of fixed size and do not scale any further for any practical reason (Amdahl). Worlton converges similarly [111]. Some computations are fundamentally small kernels whose demand comes from continually streaming memory while operating on small, predictable local blocks (PDEs, FDI, etc.); at no time is the whole problem addressed in its entirety.
Figure 6.2: The existence of critical sections alone is not a performance liability. The performance impact of any particular critical section is a factor of the contention for its usage (concurrency) and the proportion of the parallel section it occupies, which is in turn a factor of both its physical size and the frequency of its usage in the parallel section. A large section called infrequently may be just as disruptive as a small section called constantly. All parallel portions may end up suspended, stacked front to back in a ‘convoy’, during periods of high demand. Catastrophically, the worst case leaves the entire parallel section sequentialized. Realistically, there also exists some overhead for the sequentialization which extends the parallel runtime beyond that of the equivalent sequential application.
Some computations involve operations on graphs, databases, and other similar data structures which are more distributed and less regular in memory. They suffer from poor locality of reference and a low degree of predictability, with essentially random access patterns. Matrix multiplication and FFT are well-known and important examples here and are more tunable to target particular architectures through decomposition [103]. “Data intensive irregular applications that rely on pointer based data structures, such as graphs, are harder to optimize due to their intrinsic usage of pointers to access data and to their less-predictable pattern of data access.” [44]. Not all of these applications can be decomposed, or else they are not decomposable with reasonable overhead in either time or space.
Parallel work tends to occur on several forms of data (see figure 6.3). Data may be configured in blocks which are carefully arranged for the application and operated on with multiple threads simultaneously; mathematical operations like FFT and matrix multiplication often fall into this category (see figure 6.5). Data also may be arbitrarily complicated structures which are not simplifiable, are scattered throughout memory, and have irregular access patterns; 3D rendering, mesh manipulation, databases, and irregular graph or unstructured-grid processing may be this way. Data may also come in streams occupying sequential and contiguous memory (see figure 6.4). Streams may represent spatial or temporal information; audio, video, and highly structured problems like finite-difference integration are of this type. We specifically address this type of work. Outputs of parallel work may, of course, take on similar structures, not necessarily congruent to the input, but also not necessarily entirely separate structures.

Figure 6.3: Different data structures used by different algorithms may have dramatically different representations in memory. The algorithms in operation will normally access just portions of the data structure, and they may be highly tuned for efficient operation (or not). Pictured is the boundary of all memory (purple) and the memory occupied by a data structure (green) with pointers (black). The moving window of L3 or L2 cache is abstracted as the cyan square. Memory accesses may be very stationary, as in the blocked/partitioned model. In the streaming case, accesses may be highly sequential with new data “sliding in” as older data “slides out”. For an irregular or graph-type structure, memory accesses may be irregular, erratic, and seemingly random. Of course, multiple data structures may be simultaneously accessed, particularly for input and output or multiple structures for either, further complicating the situation.
Problems of the streaming type may be less concerned with the size of a multi-level cache relative to the size of the problem, so long as the data can stream. In the latter case, where streaming cannot be capitalized upon, the size of the cache at different levels, and ultimately its speed, may be of more critical importance. An alternate approach described by Badur et al. (streaming is referred to as ‘naïve’) suggests that parallel operation on L2-sized blocks is a higher-performance option, but demonstrates it to be no more than 13% better on ‘large’ problems [6]. This might require very careful targeting due to the reality of private L1 caches and the prospect of sharing conflicts.

Figure 6.4: The wood-processing equivalent of streaming-type algorithms, the chipper-shredder has a fixed throat size (L2/L3 cache) and processes material from front to back (main memory), with ‘results’ expelled as rapidly as material is ingested. Only a small amount of material relative to the total workload is processed in the throat at one time, but it is completely processed. [110]

Figure 6.5: The CNC milling machine or router exists in contrast to the chipper-shredder. It has a large parameterized workspace (partitioned or tiled data) which is operated on until the product is finished and exchanged when complete; in a degenerate condition the workspace (like the cache) may hold the entire workload [99].
6.2 Decomposition
Regardless of the internal complexities of a parallel application, it is generally necessary to view it more simply and abstractly. Even a parallel application with a simple structure may have strong data dependencies leading to exceedingly difficult static analysis. Referring back to Amdahl's Law, the application can be viewed as aggregates of sequential and parallel operation, with the behavior of those aggregates statistically averaged by necessity. More internal detail can lead to a more detailed model (and hopefully more accurate predictions) of outward behaviors. The worst case is that no internal details are available and the application is a complete black box. This is especially pertinent to commercial off-the-shelf applications or systems which are so opaque or otherwise complicated (or which have strongly data-dependent behavior) that static analysis is utterly infeasible.
The parallel part of an application will most likely be a collection of Np discrete tasks to be performed. Here, we assume that the tasks are identical in nature, at least in the average case, or are so scheduled for work on threads internally through a load balancer. While multiple parallel and sequential portions are in general possible in the operation of an application, we treat them as bulk terms for the total of the sequential and parallel parts and assume a similar task load per parallel portion, at least on average. When Np is small compared to the number of processors n applied to the application, or n is not a numerical factor of Np, the apparent parallelism of the system or its parallel performance may suffer due to the remainder of work left over for various n, resulting in an apparent ‘load imbalance’. If Np is large, the parallel part may be approximated as continuously divisible, as the effect of the remainder on the outcome is small.
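As a hedged illustration of the remainder effect (assuming Np identical tasks of equal duration t and no other overheads, an idealization not asserted elsewhere in this work), the parallel time behaves in a stepwise rather than continuous fashion:

    $$T_{par}(n) \;\approx\; \left\lceil \frac{N_p}{n} \right\rceil t,
    \qquad \text{e.g. } N_p = 10,\ n = 4:\quad
    T_{par} \approx \lceil 10/4 \rceil\, t = 3t
    \;\text{ rather than the ideal } 10t/4 = 2.5t.$$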
6.3 Parallelization (from the literature)
Some variation in terminology exists across different disciplines within the parallel scheduling literature, so here we disambiguate. The scheduling literature was strongest during the 1980s and 1990s, focusing mostly on supercomputing applications and system sharing in large-scale computing environments, which were then the prevalent parallel computing environments. In terms of parallel scheduling, a ‘task’ (user-requested) is specifically an application requiring both resource allocation and also scheduling of time on a system for operation. ‘Resources’ are characteristically processors and, to a lesser extent, system memory.
A sequential task is any task which may strictly be run using only a single resource. A parallel task is simply a task which may be operated on concurrent resources [50]. The quantity of resources which may be utilized may be fixed (predetermined) or bounded (upper or lower). A parallel task with a fixed resource requirement is historically regarded as simply ‘Parallel’ (which is ambiguous contemporarily) or, specifically, ‘Rigid’ [12]. A parallel task which may be configured to operate on a fixed quantity of resources, generally at startup, is regarded as ‘Moldable’. Moldable tasks may accept an arbitrary number of processors or particular allocations conforming to particular constraints (e.g. powers or multiples of two or four). A ‘Malleable’ task is a task which may have its resources dynamically reassigned during operation [77]. Malleable tasks are substantially more complicated, requiring collaboration between the operating environment and the application: when executive control is asserted by the operating environment, an application must either be notified of such a change or otherwise make regular interrogations of the system to observe it, neither of which is common in contemporary practice. Generally, applications are free to operate on a system and assume dominance and priority over resources unless specifically configured otherwise a priori.
We concern ourselves explicitly with moldable parallel tasks which will accept arbitrary configurations. In general, the majority of real-world tasks (task-parallel parallelism) are of (or can be made to be of) this type. Where sufficient degrees of architectural control exist, we consider that a single moldable task may be decomposed internally into a chain of dependent tasks, some of which may be moldable parallel tasks and some strictly sequential.
Chapter 7
Parallel Benchmarking
We seek to avoid invasive measurement methodologies because of difficulties arising from the unbounded complexity of the applications we may be interested in analyzing: black-box code cannot be instrumented, the complexity of an application may be arbitrarily large, instrumentation itself is potentially unsuitable for potential end-users (“...the user is required to have statistical expertise that is not common to parallel programmers.” [73]), source code may be unavailable or inordinately complicated, and hardware instrumentation may be unavailable on certain processors. We therefore actively avoid performance counters and other mechanisms for quantifying ‘symptoms’ of system operation. Several styles of benchmarks are considered for use depending on the application being worked with.
In order to model the parallel structure of an existing application from runtime information, sufficient data is required to fulfill the basic requirements of the numerical model. A model consisting of n parameters requires at least n + 1 data values in order to fit. For a typical moldable parallel application, operable on [1 : n] processors, not more than n data points are available by simply clocking regular execution runs, which can, at best, help to characterize the parallel part. Through concurrent operation, not just parallel operation, more information becomes available and the sequential part also becomes characterizable.
‘HYDRA’ is a self-developed benchmarking tool built into our data collection and data analysis experimental applications Ponos and Pandora. HYDRA collects several benchmark types according to application structure (see figure 7.1):
HYDRA-1 (H1) benchmarks are timings of concurrent executions as separate sequential processes. Any application can be operated in this manner.
HYDRA-2 (H2) benchmarks, similar to H1, are timings of concurrent sequential execution threads inside a single process. Special application design is required for this; threadsafe DLLs are highly appropriate.
HYDRA-3 (H3) benchmarks are timings of individual parallel executions. Any moldable parallel application can be operated in this manner. Applications which automatically set their degree of parallelism (no user control) cannot be used to provide adequate data.
To characterize an application sufficiently, H3 and either H1 or H2 data is required. For this thesis we work with H1 and H3 specifically, describing the characteristics of H2 for completeness.
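HYDRA itself is not reproduced here; the following is only a minimal sketch of how an H1-style measurement could be driven externally, assuming a hypothetical sequential target executable named target_app.exe and wall-clock timing of the full process span.

    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // Launch k concurrent copies of a sequential target application and time
    // the total span externally, including process startup and shutdown.
    double time_concurrent_processes(const std::string& cmd, int k) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> launchers;
        for (int i = 0; i < k; ++i)
            launchers.emplace_back([&cmd] { std::system(cmd.c_str()); });
        for (auto& t : launchers) t.join();
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(end - start).count();
    }

    int main() {
        const std::string target = "target_app.exe";  // hypothetical target
        for (int k = 1; k <= 8; ++k)
            std::cout << k << " concurrent processes: "
                      << time_concurrent_processes(target, k) << " s\n";
    }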
Figure 7.1: The structural relationship between the several types of HYDRA benchmarks is illustrated here. HYDRA-1 benchmarks are composed of separate processes (green box) each containing a single thread of execution (red box). The thread executes both the sequential work (yellow) and parallel work (blue) in a sequential manner. HYDRA-2 benchmarks are substantially similar to HYDRA-1 in that single threads perform identical work concurrently, with the exception that the threads are contained in a single shared-memory process. HYDRA-3 is the most unique of the three, but also the most conventional with regard to the concept of a parallel application. Like HYDRA-2, all threads are contained in a single process, but parallel work is divided across a group of threads. This is the canonical parallel work model described by Amdahl's Law.
Figure 7.2: In order to model with different types of benchmarks, the relationships between them must be known. HYDRA 3 benchmarks are expected to be timings of the operation of a parallel kernel inside an application. Explicitly sequential portions may or may not be included. HYDRA 2 benchmarks consist of the time to completely operate an application thread including startup and shutdown. Timings are also produced internally. HYDRA 1 benchmarks are the total execution time of separate system processes. Timings are performed by an external application and include process startup and shutdown.
7.1 Relationship between HYDRA benchmark types
7.1.1 Concurrent Processes (independent memory address spaces)
HYDRA-1 benchmarks are measured externally by a driving application and therefore represent the total span of the execution including process startup and shutdown (figure 7.2). Relative to the H3 benchmark, an additional sequential portion may need to be added to represent the extended head and tail together. H1 benchmarking lacks intra-process exclusive resource contention, but contention may exist at the OS level and is expected at the hardware level through resource sharing. Data is collected for each of [1 : nmax] concurrent processes.
52 7.1.2 Concurrent Threads (common memory address space)
HYDRA-2 benchmarks are operated as HYDRA-1 except that threads in a single shared-memory host process are used rather than independent processes (figure 7.2). Within a single process more resource conflicts arise, including mutexes in the application and within dependent libraries. These resource restrictions may lead not only to delays in the parallel parts, but to extended runtimes of the sequential parts too. Contention in the parallel part may be in addition to that of H3 parallel contention (figure 7.3). The worst case is that all concurrent threads are convoyed completely and run entirely sequentially (potentially with further overheads). More complicated models are required to account for this kind of behavior. For these benchmarks to be possible, the application must be built for multi-threaded calls and those calls must be fully threadsafe. This is a higher standard than necessary for either H1 or H3 and is relatively unusual. H2 sequential application threads are operated concurrently for [1 : nmax] threads at a time.
7.1.3 Individual application task parallel computation
HYDRA-3 benchmarks may be subject to some degree of exclusive resource contention in the parallel portion; sequential operation is uncontested and free of this by definition (figure 7.3). H3 benchmarks may represent the operation of the overall application (with the value reported by an external controller) or a meaningful subset of it, such as a computational kernel (reported by the application itself). H3 benchmarks are able to ascertain information about the performance qualities of the parallel part(s) of an application. No such information can be extracted about the sequential part; H1 and H2 fill this information in.
Incremental variations or hybrids between each benchmark style are possible, especially between H1/H3 or H2/H3, such as concurrent multi-threaded executions. Two simultaneous instances of 4-threaded applications and four simultaneous instances of 2-threaded applications are just a couple of such variations. It is not expected that these would yield any additional information beyond that of the existing types.

Figure 7.3: Contention for software resources inside the same memory address space in a parallel application may only exist where there are simultaneous demands placed on that resource. For HYDRA-3, meaningful contention can only exist during parallel operation of the parallel portion of the application. For HYDRA-2, parallelism is not expressed directly and arises through concurrency instead. Contention may occur in the sequential portion of the application, in the parallel portion (same mechanism as H3), and also in the parallel portion but through alternate mechanisms.
7.2 Benchmarking Protocols
It is well known that parallel applications rarely scale perfectly on their own. This is attributable both to qualities of the algorithm and to the operating environment (OS scheduling and system architecture), and it is captured with H3 benchmarks. H1 benchmarks are free of algorithmic influences, so deficient scaling there is entirely an environmental matter. H2 benchmarks are similar to H1, but may manifest other software artifacts as described above.
HYDRA collects a statistically significant number of executions per configuration. For H3, applications are operated exclusively on n = [1 : nmax] threads for no less than 10 executions for every n. Configurations are run in random order to avoid any memorization or cache advantage. Outliers are incrementally rejected until the quotas are full. Each benchmark generates a single run time for a particular number of processors, and runtime generally decreases with increasing processor count. See figure 7.4.
For H1 and H2, each application implementation is either sequential or parallel, supporting up to nmax instances. The run times with increasing concurrency are observed only to deteriorate (lengthen), and the variation among runtimes increases as well (see figure 7.5). We aim to work with, but not capture, the increasing variation and deviation at this time. Benchmarks are run on sequential applications (processes), one application at a time, concurrently for each of [1 : nmax] processors and not less than 10n trials for each. For H1 these instances are separate processes and exist in separate memory spaces, while for H2 each instance is a separate thread in a common memory space.
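The exact rejection rule used by HYDRA is not reproduced here; the sketch below shows one plausible averaging step under an assumed rule (drop samples more than two standard deviations from the mean, then re-average).

    #include <cmath>
    #include <numeric>
    #include <vector>

    // Average a set of runtimes after rejecting outliers (assumed 2-sigma rule).
    double robust_mean(const std::vector<double>& runtimes) {
        double mean = std::accumulate(runtimes.begin(), runtimes.end(), 0.0)
                      / runtimes.size();
        double var = 0.0;
        for (double r : runtimes) var += (r - mean) * (r - mean);
        double sd = std::sqrt(var / runtimes.size());
        std::vector<double> kept;
        for (double r : runtimes)
            if (std::fabs(r - mean) <= 2.0 * sd) kept.push_back(r);
        if (kept.empty()) return mean;  // degenerate case: no rejection possible
        return std::accumulate(kept.begin(), kept.end(), 0.0) / kept.size();
    }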
The relationships between H1 and H3 benchmark results can be seen in figures 7.6 and 7.7.
Figure 7.4: H3 benchmarks are performed at least 10 times for operation on each of 1 to nmax software threads. Variation is observed and the mean value of each is used following statistical rejection of outliers. Runtime generally decreases with increasing processor count up to a certain point which is application and system specific. Here, regression appears to begin at 7 threads on a 12-thread machine and behavior becomes less predictable following that.
Figure 7.5: Both H1 and H2 benchmarks are executed on 1 to nmax concurrent instances either in a single process (H2, shared memory space) or multiple processes (H1, separate memory space). With increasing concurrency there exists increasing contention for hardware resources such as memory bandwidth. Run times are seen to drift longer and also exhibit greater variation with increasing concurrency. H1 is pictured; H2 results would appear quite similar.
Figure 7.6: Both H1 and H3 plotted together reveal their relationships to each other. Starting at nearly the same origin with n=1, increasing thread count on H3 follows the lower curve while for H1 the upper curve is developed. If H2 results were available they should fall somewhere between the two curves, but likely near to the H1 curve for most circumstances.
Figure 7.7: Here, the preceding H1 and H3 data are group-wise averaged (mean) and presented as curve-connected data. H1 data is normalized according to the number of processes. The difference between curves at one processor is (or should be) indicative of the characteristic difference between benchmark types. H3 data may exhibit some logical contention between parallel threads, manifesting as suboptimal scaling. H1 data, on the other hand, is essentially contention free. Should logical contention exist in H3, the curves would deviate from each other progressively more as process/thread count increases. Because we are able to assert that no logical contention exists in H1 (not a software problem) and we know that no contention exists in this H3 data, we can conclude the sub-optimal parallelism is strictly a hardware performance matter.
Chapter 8
Modular Performance Model
In this section we develop our comprehensive parallel performance model. Our model is horizontally decomposed with terms corresponding separately to the application, its runtime environment, and the layers of the hardware architecture [73]. We examine our experimental applications on multiple multi-core parallel machines [61], some of which exhibit structural variation in CPU memory architectures. Some information about the specific machines being operated is necessary. Information about algorithm and operating system behavior is inferred by the models (with one exception). Where called out, $H(x)$ is the Heaviside step function:
$$H(x) = \begin{cases} 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
The parameter n is used circumstantially and interchangeably as a result of conventional constraints. n software threads may be specified (with variation) for an application to operate with. Each software thread will at all times be able to run on at least one core or hardware thread. At no time will n exceed nmax (oversubscription), and at no time will more software threads be affinity-constrained to a core than the number of hardware threads it physically contains.
8.1 Hardware Parameters
There are some requisite system architectural parameters and also some basic performance qualities needed to support the model. For hardware structure we collect:
Number of cores, nc, and hardware threads per core, TPC. TPC > 1 implies the sharing of L1 cache; TPC = 1 or 2 here and for all known architectures at this time. The number of ‘virtual’ cores (as some tend to think of them), those subject to significant contention, is nv = nc(TPC − 1). The total number of logical cores or hardware threads available is nmax = nc + nv = nc · TPC. (A worked instance follows this list.)
L1, L2, L3 cache presence, whether they are shared and to what degree, and their individual (not collective) physical size at level x: ALx. L0 is the notation used for main memory, allowing additional cache levels to be extended beyond L3 without misunderstanding. The L4 caches present on newer processors just starting to become available are not explicitly considered here, but further extension of the concepts is not difficult and should come quite naturally.
Collected but unused are cache line size and NUMA node counts. NUMA architectures with more than one node are beyond the scope of this work. Cache line size is invariant on these systems so no parameterization on that quantity is required or possible.
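As an illustrative worked instance (parameter values chosen to match a representative four-core SMT i7 and a non-SMT Core 2 Quad, respectively, rather than any one benchmarked machine):

    $$n_c = 4,\ TPC = 2:\quad n_v = n_c(TPC-1) = 4,\quad n_{max} = n_c \cdot TPC = 8$$
    $$n_c = 4,\ TPC = 1:\quad n_v = 0,\quad n_{max} = 4$$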
For hardware performance we collect the approximate average memory bandwidth per level in the memory hierarchy by accessing a large volume of data, 50 MB, from a block sized to fit the cache level. At each level Lx, testing is performed for combinations of read (I) and write (O) with sequential (S) and random (R) access. Simon and McGalliard performed benchmarking of the memory hierarchy and also operated the same benchmarks in a concurrent manner to demonstrate contention [93]. Correspondingly, for each of our four combinations we measure the bandwidth for [1 : nmax] affinity-locked threads operating into both a common block (CB) and independent blocks (IB) in shared memory: $V_{\{I,O\},\{S,R\},\{CB,IB\},Lx,n}$. CB measurements are used for H3-based modeling while IB measurements are appropriate to H1 and H2.
The choices between S and R, and between CB and IB, are free variables of the modular model, while I and O are chosen according to the usage.
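For illustration, a minimal single-threaded sketch in the spirit of this measurement is given below: a sequential read (I, S) over an independent block sized to a target cache level, repeated until roughly 50 MB has been accessed. The 64-byte stride is an assumed cache-line size; the full measurement additionally sweeps write, random-access, common-block, and multi-threaded affinity-locked variants.

    #include <chrono>
    #include <cstddef>
    #include <vector>

    // Approximate sequential read bandwidth for a block sized to one cache level.
    double sequential_read_bandwidth(std::size_t block_bytes) {
        const std::size_t total_bytes = 50u * 1024u * 1024u;   // total volume to access
        std::vector<char> block(block_bytes, 1);
        volatile long long sink = 0;                            // defeat optimization
        auto start = std::chrono::steady_clock::now();
        std::size_t touched = 0;
        while (touched < total_bytes) {
            for (std::size_t i = 0; i < block.size(); i += 64)  // one read per assumed cache line
                sink += block[i];
            touched += block.size();
        }
        auto end = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(end - start).count();
        return static_cast<double>(touched) / seconds;          // bytes per second
    }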
8.2 Algorithm Bandwidth
Algorithmic input and output memory demands are very similar to the memory read velocity $V_{I,n}$ and write velocity $V_{O,n}$, with units $[\mathrm{bytes/second}]$. For each we consider the possibility of linear and random access according to algorithmic and data-structure properties. $V_{I,n}$ and $V_{O,n}$ derive directly from the system benchmarks $V_{I,\{S,R\},\{CB,IB\},L0,n}$ and $V_{O,\{S,R\},\{CB,IB\},L0,n}$. The selection of R or S and CB or IB is a parametric matter for model fitting.
Per-thread bandwidth is some interpolation between $V_{I,n}$ and $V_{O,n}$. $\theta$ is the split in bandwidth between input and output, $\theta \in (0, 1)$. Therefore:
$$M_{T,n} = V_{I,n}\,\theta + V_{O,n}\,(1 - \theta).$$
θ is a free variable of the modular model.
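As a purely illustrative numerical instance (the velocities below are assumed values, not measurements from this work): taking $V_{I,n} = 10\ \mathrm{GB/s}$, $V_{O,n} = 6\ \mathrm{GB/s}$, and $\theta = 0.75$,

    $$M_{T,n} = V_{I,n}\,\theta + V_{O,n}\,(1-\theta) = 10(0.75) + 6(0.25) = 9\ \mathrm{GB/s}.$$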
8.3 Software Parts
8.3.1 Amdahl’s Law