UNIVERSITY OF CALIFORNIA, IRVINE

Cross-System Runtime Prediction of Parallel Applications on Multi-Core Processors

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

Computer Science

by

Scott W Godfrey

Dissertation Committee:
Professor Amelia Regan, Chair
Professor Michael Dillencourt
Professor Emeritus Dennis Volper

2016

© 2016 Scott W Godfrey

DEDICATION

“Lead, follow, or get out of the way.” -Joe [76]

“It’s only after we’ve lost everything that we’re free to do anything.” -Tyler Durden [20]

“The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.” -Bertrand Russell (attributed)

“The degree of one’s emotions varies inversely with one’s knowledge of the facts.” -Bertrand Russell (attributed)

“If you’re going through hell, keep on going.” -Unknown

To my family, committee, friends, employers, and all those who have supported my endeavors unquestioningly and uncompromisingly, I salute you as I depart from this fantastic technicolor fantasyland called ‘academia’. You and all those who have acted, and those who will act, in the name of justice and righteousness are the true heroes of the world.

ROSEBUD

TABLE OF CONTENTS

Page

LIST OF FIGURES vii

LIST OF TABLES ix

ACKNOWLEDGMENTS x

CURRICULUM VITAE xi

ABSTRACT OF THE DISSERTATION xiii

1 Introduction 1

2 Review of Related Literature 6

3 Modern Parallel Hardware Technology 7
3.1 Flynn's Taxonomy ...... 7
3.2 Development of CMP Multi-Core ...... 8
3.3 Intel Core2 Architecture ...... 11
3.4 Intel i7 Architecture ...... 11
3.5 Hyperthreading/Hardware Threads ...... 12

4 Parallel Performance Models 16
4.1 Models of Parallel Computation ...... 16
4.1.1 Amdahl's Law ...... 16
4.1.2 Gustafson's Law ...... 17
4.1.3 Unification of Amdahl's and Gustafson's Laws ...... 18
4.1.4 Parallel ...... 18
4.1.5 Roofline ...... 19
4.2 Algorithmic Computational Models ...... 20
4.3 Drawbacks in the Modern Era ...... 22
4.3.1 Runtime Variability, Performance Uncertainty, and Noise ...... 25
4.3.2 System Symmetry ...... 26
4.3.3 Lack of Hierarchy ...... 27
4.3.4 Continuous Functions ...... 27

5 Operating System / Scheduler Effects 29
5.1 Process Affinity ...... 32
5.2 Thread Affinity ...... 32
5.3 Thread Placement ...... 32
5.4 Affinity in Practice ...... 33
5.5 Variation in Performance ...... 33
5.6 Experimental Thread Affinity Effects ...... 34

6 Structure of a Parallel Application 40
6.1 Composition ...... 40
6.2 Decomposition ...... 46
6.3 Parallelization (from the literature) ...... 47

7 Parallel Benchmarking 49
7.1 Relationship between HYDRA benchmark types ...... 52
7.1.1 Concurrent Processes (independent memory address spaces) ...... 52
7.1.2 Concurrent Threads (common memory address space) ...... 53
7.1.3 Individual application task parallel computation ...... 53
7.2 Benchmarking Protocols ...... 55

8 Modular Performance Model 60
8.1 Hardware Parameters ...... 61
8.2 Algorithm Bandwidth ...... 62
8.3 Software Parts ...... 63
8.3.1 Amdahl's Law ...... 63
8.3.2 Modularity ...... 63
8.3.3 ...... 64
8.4 Hardware Parts ...... 66
8.4.1 Main Memory Bandwidth ...... 66
8.4.2 Sequential ...... 67
8.4.3 "Virtual" Core Efficiency ...... 67
8.4.4 Lx Space Contention ...... 68
8.4.5 Lx Space Sharing ...... 70
8.5 Contentious Parts ...... 71
8.5.1 H3 Parallel Mutex, Simple ...... 73
8.5.2 H3 Parallel Mutex, Parameterized ...... 73
8.5.3 H2 Sequential Mutex, Parameterized ...... 74
8.5.4 H2 Thread Mutex, Parameterized ...... 74
8.5.5 H1,H2 Model Extension ...... 74
8.6 Operating System Parts ...... 76
8.6.1 Thread Placements ...... 76
8.6.2 Probabilities and Structure of Migrations ...... 79
8.6.3 The Cost of Migrations ...... 83
8.7 Performance Model Implementation ...... 87

9 Experimental Applications 88
9.1 3D Finite-Difference Numerical Integration (FDI) ...... 90
9.1.1 Application Characteristics ...... 90
9.2 3D Surface Reconstruction (SRA) ...... 91
9.2.1 Application Characteristics ...... 92

10 Experimental Toolset 93
10.1 Development Tools ...... 93
10.1.1 Prometheus: Combinatoric Build ...... 93
10.1.2 Ilithyia: Code Generation ...... 94
10.2 Logistics Tools ...... 96
10.2.1 Iris: Distribution and Collection ...... 96
10.2.2 Ponos: Automated Benchmarking ...... 96
10.3 Analysis Tools ...... 97
10.3.1 Pandora: Model Fitting and Cross-Prediction ...... 97

11 Error Analysis 99
11.1 Relevance ...... 99
11.2 Outlier Rejection ...... 100
11.3 Error Metrics and Characterization ...... 100
11.3.1 Total Squared Error, Mean Squared Error ...... 101
11.3.2 Total Absolute Error, Mean Absolute Error ...... 101
11.3.3 Mean Absolute Relative Error ...... 102
11.3.4 Mean Weighted Absolute Relative Error ...... 102
11.3.5 Prediction Methodology ...... 105

12 Optimization 106
12.1 Types of Optimization ...... 106
12.2 Optimization Strategy ...... 107
12.3 Solution Methodology ...... 108

13 Cross-Prediction 112
13.1 Methods and Error Measures ...... 112
13.2 Complications, Caveats, and Limitations ...... 115

14 Predictive Outcomes 117
14.1 Architecture Representation ...... 117
14.2 Model Decomposition ...... 118
14.3 Curve-Fitting Experimental Data ...... 118
14.3.1 Best Fit on Model Parts ...... 118
14.3.2 Best Fit on Model Properties ...... 119
14.3.3 Best Fit on Model ...... 120
14.4 Cross-Prediction ...... 125
14.4.1 Cross-Prediction on Model Parts ...... 125
14.4.2 Cross-Prediction on Model ...... 125

15 Conclusions 138

16 Opportunities for Future Work 142

Bibliography 144

A Data Fitting Results 151
A.1 Fitting Errors Per Model ...... 151
A.1.1 Fitting Errors Per Model, All Data ...... 151
A.1.2 Fitting Errors Per Model, FDI ...... 154
A.1.3 Fitting Errors Per Model, SRA ...... 157
A.2 Fitting Errors, Per Part ...... 160
A.2.1 Fitting Errors, Per Part, Aggregate, Per Architecture ...... 160
A.2.2 Fitting Errors, Per Part, FDI, Per Architecture ...... 164
A.2.3 Fitting Errors, Per Part, SRA, Per Architecture ...... 168
A.3 Fitting Errors, Per Property ...... 172
A.3.1 Fitting Errors, Per Property, Aggregate, Per Architecture ...... 172
A.3.2 Fitting Errors, Per Property, FDI, Per Architecture ...... 191
A.3.3 Fitting Errors, Per Property, SRA, Per Architecture ...... 210

B Cross-Prediction Results 229
B.1 Cross Prediction Relative Errors, Per Part ...... 229
B.1.1 Cross Prediction Relative Errors, All Data, Per Part ...... 230
B.1.2 Cross Prediction Relative Errors, Per Part, FDI ...... 231
B.1.3 Cross Prediction Relative Errors, Per Part, SRA ...... 232
B.2 Cross Prediction Relative Errors, Per Model ...... 233
B.2.1 Cross Prediction Relative Errors, All Data ...... 233
B.2.2 Cross Prediction Relative Errors, FDI, per Architecture ...... 244
B.2.3 Cross Prediction Relative Errors, SRA, per Architecture ...... 253

LIST OF FIGURES

Page

1.1 Computational Scheme ...... 3
1.2 Predictive System Architecture ...... 4

3.1 Intel i7 cache structure ...... 14
3.2 Intel Core 2 cache structure ...... 14
3.3 AMD FX cache structure ...... 15
3.4 Intel Xeon E5335 cache structure ...... 15

4.1 Multi-platform parallel performance comparisons...... 24

5.1 CPU Utilization 4/8, no affinity control ...... 30
5.2 CPU Utilization 5/8, no affinity control ...... 31
5.3 CPU Utilization 7/8, no affinity control ...... 31
5.4 Core 2 Duo Thread Affinity Effects ...... 35
5.5 Core 2 Quad Thread Affinity Effects ...... 36
5.6 Core i7-4820K Thread Affinity Effects ...... 36
5.7 Core i7-4700MQ Thread Affinity Effects ...... 37
5.8 Core i7-4720HQ Thread Affinity Effects ...... 37
5.9 Core i7-3930K Thread Affinity Effects ...... 38

6.1 Parallel program structure ...... 41
6.2 Parallel contention ...... 42
6.3 Data structure shapes in memory ...... 43
6.4 Wood chipper ...... 44
6.5 CNC router ...... 44

7.1 HYDRA configurations and structure ...... 51
7.2 HYDRA relationships ...... 52
7.3 HYDRA mutexes ...... 54
7.4 HYDRA 3 sample results ...... 56
7.5 HYDRA 1 sample results ...... 57
7.6 HYDRA 1 and 3 composite samples ...... 58
7.7 HYDRA 1 and 3 mean and normalized data ...... 59

8.1 Parallel task blocks ...... 65
8.2 Cache bandwidth partitioning ...... 69

8.3 Thread assignment notation ...... 77
8.4 State migration transition counts ...... 80
8.5 Isomorphic thread configurations ...... 81
8.6 Thread migrations ...... 81
8.7 Heteromorphic state transitions ...... 82

11.1 HYDRA 1 and 3 weighting ...... 104

12.1 Model part-property mapping ...... 110
12.2 Model part-part relations ...... 110
12.3 Model part-property relations ...... 111

14.1 Model Part Representation, Top 25 Best Fit ...... 121
14.2 Model Part Representation, Top 50 Best Fit ...... 122
14.3 Model Part Representation, Top 75 Best Fit ...... 123
14.4 Model Part Representation, Top 100 Best Fit ...... 124
14.5 Predictive Model Complexity ...... 127
14.6 Model Part Representation, Top 25*Archs Cross Prediction ...... 129
14.7 Model Part Representation, Top 50*Archs Cross Prediction ...... 130
14.8 Model Part Representation, Top 75*Archs Cross Prediction ...... 131
14.9 Model Part Representation, Top 100*Archs Cross Prediction ...... 132
14.10 Model Part Representation, Top 12 BEST Cross Prediction ...... 137

LIST OF TABLES

Page

3.1 Intel architectures ...... 13
3.2 AMD architectures ...... 13

8.1 HYDRA mutexes ...... 72
8.2 HYDRA processor counts ...... 72

14.1 Cross Prediction BEST Models, *denotes complete non-MCS groups . . . . 135

A.1 Comprehensive Model Fitting Errors (MWARE) ...... 152
A.2 Comprehensive Model Fitting Errors (MWARE) ...... 154
A.3 Comprehensive Model Fitting Errors (MWARE) ...... 157

B.4 Cross Prediction comprehensive relative error ...... 234
B.5 Cross Prediction per model relative error, Core2-Arch to Core2-Arch ...... 236
B.6 Cross Prediction per model relative error, Core2-Arch to i7-Arch ...... 238
B.7 Cross Prediction per model relative error, i7-Arch to Core2-Arch ...... 240
B.8 Cross Prediction per model relative error, i7-Arch to i7-Arch ...... 242
B.9 Cross Prediction per model relative error, Core2-Arch to Core2-Arch ...... 245
B.10 Cross Prediction per model relative error, Core2-Arch to i7-Arch ...... 247
B.11 Cross Prediction per model relative error, i7-Arch to Core2-Arch ...... 249
B.12 Cross Prediction per model relative error, i7-Arch to i7-Arch ...... 251
B.13 Cross Prediction per model relative error, Core2-Arch to Core2-Arch ...... 254
B.14 Cross Prediction per model relative error, Core2-Arch to i7-Arch ...... 256
B.15 Cross Prediction per model relative error, i7-Arch to Core2-Arch ...... 258
B.16 Cross Prediction per model relative error, i7-Arch to i7-Arch ...... 260

ACKNOWLEDGMENTS

I would like to thank my third PhD advisor, Amelia Regan, who has stood by my side and has acted with sterling merit, support, and credibility – magnitudes above her predecessors. To my advancement committee and especially to my PhD committee, Michael Dillencourt and Dennis Volper, who, together, have allowed me to move on and to close this book of my life.

Thanks to Lorenzo Valdevit for the early years of financial support and computer usage.

Thanks to MSC Software Corporation for the supplementary computational support needed to finalize this work in an expedient manner.

Thanks to Bill Fisher, Dennis Volper, and Quicksilver Software, Inc. for access to computing hardware and many healthy exchanges over the years and years to come.

CURRICULUM VITAE

Scott W Godfrey

EDUCATION
Doctor of Philosophy in Computer Science, University of California, Irvine (2016)
Master of Science in Computer Science, University of California, Irvine (2014)
Master of Science in Aerospace and Mechanical Engineering, University of California, Irvine (2010)
Bachelor of Science in Aerospace and Mechanical Engineering, University of California, Irvine (2009)
Associate of Science in Mathematics, Orange Coast College, Costa Mesa, California (2007)

RESEARCH EXPERIENCE
Graduate Student Researcher, University of California, Irvine (2010–2015)
Technology Transfer Intern (Intellectual Property), University of California, Irvine, Office of Technology Alliances (2013–2014)

TEACHING EXPERIENCE
Teaching Assistant, Reader, University of California, Irvine (2011–2016)

PROFESSIONAL EXPERIENCE
Software Performance Engineer, Parallel Architect, MSC Software Corporation, Newport Beach, California (2014–2016)
Consulting Software Engineer, Parallel Architect, HRL Laboratories, LLC, Malibu, California (2011–2013)
Senior Software Engineer, Senior Technical Lead, Quicksilver Software, Inc., Irvine, California (1999–2014)

REFEREED JOURNAL PUBLICATIONS
"Compressive Strength of Hollow Microlattices: Experimental Characterization, Modeling and Optimal Design," Journal of Materials Research (2013)
"MEMS resonant load cells for micro-mechanical test frames: Feasibility study and optimal design," Journal of Micromechanics and Microengineering (2010)

REFEREED CONFERENCE PUBLICATIONS
"A novel modeling platform for characterization and optimal design of micro-architected materials," 2012 AIAA Structural Dynamics and Materials Conference (April 2012)

ABSTRACT OF THE DISSERTATION

Cross-System Runtime Prediction of Parallel Applications on Multi-Core Processors

By

Scott W Godfrey

Doctor of Philosophy in Computer Science

University of California, Irvine, 2016

Professor Amelia Regan, Chair

Prediction of the performance of parallel applications is useful in several domains of software operation. In the commercial world, it is often valuable to be able to anticipate how an application will perform on a customer's machine with minimal burden to the user. In the same spirit, it is in the best interest of a user or consumer of computational software to operate it as efficiently as possible. In the super-computing world, being able to anticipate the performance of an application on a set of compute nodes allows one to select a better set of nodes to execute on. In a large-scale shared computing environment where parallel computational jobs are assigned resources and scheduled for execution, being able to do so optimally can improve overall throughput by decreasing contention. In all cases, being able to anticipate the ideal degree of parallelism to invoke during execution (and to have reasonable expectations for what can be achieved) will lead to better use of all resources involved. For any of this to be possible, a good model (or models) is required which can not only capture an application's performance on one machine but also predict its behavior on another.

Here, we present a large family of performance models composed of discrete parts, all as combinatoric variations on Amdahl's Law. We establish a protocol involving thorough benchmarking of the application on a known system. A protocol is established for the collection

xiii of meaningful machine architecture and performance information for the known and target machines. With the resulting high quality models and a single execution of the application on the target system we are able to closely predict its parallel behavior.

We assume that computational applications in need of this kind of treatment are sufficiently sophisticated and, especially in the case of commercial applications, most likely black boxes. We therefore avoid any static analysis of the applications and rely expressly on parallel runtimes of individual executions. The protocols and methods can be implemented by any skilled developer on conceivably any parallel platform without the need for specialized APIs, hardware diagnostic support, or any manner of reverse-engineering of the applications of interest.

Chapter 1

Introduction

The availability and ubiquity of modern parallel processors has led to parallel implementations of many applications. Many applications now subject to parallel processing on desktop multi-core systems bear little resemblance, in either form or function, to the kinds of applications traditionally run on large-scale systems.

The need, or rather the opportunity, to schedule parallel tasks or operate parallel applications arises on many occasions. The problem of scheduling parallel tasks presents itself in many variations on the same general principle: having some quantity of independent tasks to perform and some quantity of resources which allow for multiplexed operation. These tasks may be packed into a single application, hidden under the hood, or else realized individually.

Scheduling can be performed with a full spectrum of knowledge about the applications and architectures ranging from blind execution to having perfect knowledge. Modern operating systems have their own internal task schedulers which are preemptive but are entirely blind. Adding knowledge about applications and architectures into the scheduling equation can improve performance. Therefore, many applications embed their own specific scheduler on

top of existing infrastructures [12]. However, obtaining the appropriate information and organizing it in an actionable manner can be difficult. Analytical models, which can view a program or the underlying system at a higher level of abstraction than measurement or simulation techniques, can therefore play a complementary role to those methods [1].

The simplified performance models traditionally implemented in most task schedulers are too simple (simpler even than the simplest model we present here, Amdahl's Law) to make reasonable predictions of runtime or effective use of resources.

While other notions of performance, like efficiency and speedup, may seem interesting, they have little tangible meaning for real-world work, and they are typically derivatives of runtime analysis anyway. "Execution time is by far the most important measure of interest. Therefore performance prediction should be in terms of execution." [95]. With rare exception, no matter what we are doing or how we go about it, we are always working to minimize runtime in some way, even if there are secondary goals to balance.

Here, we present a family of performance models of increasing complexity, developed based on Amdahl's Law. Variables in the performance model are either fully abstract quantities (generic parameters of a curve fit) or quantities inferred as parameters or invariants of the application. Constants pertaining to qualities of the host machine are also used. Integral variables may be specified to select a best-fit value from an array of possibilities. For example, memory speed may be read/write, random/sequential, and pertain to L1/L2/L3/main memory.
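For concreteness, the simplest member of such a family, Amdahl's Law, can be fitted to measured runtimes with a closed-form least-squares estimate of the serial fraction. This is only a minimal sketch, not the dissertation's actual fitting toolset; the function names and sample data are hypothetical:

```python
# Amdahl's Law runtime model: T(n) = T1 * (s + (1 - s) / n),
# where s is the (abstract) serial fraction and T1 the 1-core runtime.

def amdahl_runtime(t1, s, n):
    """Predicted runtime on n cores under Amdahl's Law."""
    return t1 * (s + (1.0 - s) / n)

def fit_serial_fraction(runtimes):
    """Least-squares estimate of s from a {core count: runtime} dict.
    With T1 known, T(n)/T1 = s*(1 - 1/n) + 1/n is linear in s."""
    t1 = runtimes[1]
    num = den = 0.0
    for n, t in runtimes.items():
        x = 1.0 - 1.0 / n      # coefficient of s
        y = t / t1 - 1.0 / n   # observation minus the fixed 1/n part
        num += x * y
        den += x * x
    return num / den

measured = {1: 100.0, 2: 60.0, 4: 40.0, 8: 30.0}  # illustrative data
s = fit_serial_fraction(measured)                  # s == 0.2 here
```

The richer models in this family add further parts to the same curve-fit machinery, but the fitting principle is identical.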

These performance models are built for the purpose of cross-predicting parallel application performance over a range of machines with multi-core processors of different architectures. Inferences are made about the underlying applications by fitting the models to experimental data obtained from a variety of machines and architectures (figure 1.1, figure 1.2). It is well known that the characterization of a parallel machine is more complex than a

uniprocessor, because of the interaction among processors [74]. Our models are appropriate for single machines which, of course, are the building blocks of multi-computers. Methods to predict scalability accurately are necessary in order to improve throughput and overall efficiency on large-scale machines [9].

Figure 1.1: Explained in detail later, our predictive system relies on application benchmark data, machine-specific information and benchmarks, and varying combinations of model parts to generate predictive models. An optimizer provides solutions for fitting the particular predictive models to the benchmark data, and the resulting fitted models are used for performing cross-predictions between machines.

The primary contributions of our work is that from our family of models we can:

1: Infer from local benchmarks the parametrically averaged structure of the application being evaluated

2: Predict the runtime of said application on a target machine of similar architecture and, arising from high quality runtime predictions,

3: Determine the ideal processor count for operating a parallel application on a target machine.

Figure 1.2: The overall system is simple to understand schematically. Parallel software applications and parallel computer systems are fed into a benchmarking system, which generates benchmark data for the applications operating on the machines and also benchmark data which specifically characterizes the machines, providing a basis for inter-relationships. All benchmark data is fed into a statistical analysis and optimization system with the modular performance models. This system performs curve-fitting of the models to the data and also evaluates cross-predictions between machines. All outcomes are statistically evaluated and ranked according to least mean error to output a set of validated models.

Systems considered here are specifically shared-memory SMPs with applications architected with either explicit threads or task-parallel APIs like OpenMP [94]. Network communication and message-passing interfaces like MPI [46] are not considered for models at this level.

Users, be they end-users or the actual developers, generally know little about the performance profile of the applications they run or of the machines they operate on. Consequently, they often cannot accurately predict the best number of processors to use, leading to application slowdown and reduced throughput. Knowing how best to operate an application is difficult. The ideal number of processors varies with both the application and the specific machine under consideration, and sometimes even with the data being evaluated. Predicting the parallel efficiency of applications without first executing them is an enormous challenge [9].
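To make the goal concrete: once a fitted model predicts runtime as a function of processor count, choosing the ideal count reduces to a small search. This sketch uses a hypothetical model (an Amdahl term plus a linear contention penalty; all parameter values are illustrative, not measured):

```python
def predicted_runtime(n, t1=100.0, s=0.05, c=1.5):
    """Hypothetical fitted model: Amdahl's Law term plus a linear
    contention penalty c*n standing in for shared-resource effects."""
    return t1 * (s + (1.0 - s) / n) + c * n

# The search space is just 1..(logical core count), so brute force
# over the predictions is sufficient.
best_n = min(range(1, 17), key=predicted_runtime)
```

With these illustrative parameters the penalty term overtakes the parallel gain at 8 threads, so running wider than that would slow the application down, which is exactly the kind of decision the cross-prediction system is meant to support.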

It is, of course, necessary to collect performance data on an application and architectural performance information about the machines it will operate on. Without machine-specific

information, cross-prediction will be infeasible and, at best, a matter of luck. Ideally, we will be able to minimize the required information. Rosas and Barnes (2011) both try 'small' core counts on what would otherwise be large machines, which may not afford extensive testing or ready availability of a large number of cores. They report that low core-count runs provide enough information on the fundamental behavior of parallel code, and that several program executions on a small subset of the processors are all that is necessary to predict execution time on larger numbers of processors [9]. However, given the complexity of modern processors, we wonder if this approach yields sufficient information for a high-quality prediction.

Chapter 2

Review of Related Literature

Because the topic presented in this dissertation is multi-faceted, we choose to present our literature review inline, with discussion in the relevant chapters. We discuss modern parallel processors in Chapter 3, parallel performance models in Chapter 4, operating system and scheduler effects in Chapter 5, parallel applications in Chapter 6, performance modeling in Chapter 8, error analysis and metrics in Chapter 11, and optimization in Chapter 12.

Chapter 3

Modern Parallel Hardware Technology

3.1 Flynn’s Taxonomy

Considering Flynn's Taxonomy, the applications we consider fall into the task-parallel (fork-join) and multiple-instruction/multiple-data (MIMD) classifications. We also consider multiple-program/multiple-data (MPMD) scenarios operating on separate cores of a common processor. There may be internal, local aspects of any application compiled under the single-instruction/multiple-data (SIMD) data-parallel paradigm, but this is a small part of the applications of interest to our research. Applications compiled with architecture-specific SIMD instruction targeting are necessarily restricted in the machines and architectures they can operate on, but we don't break this out as a separate detail; it is a low-level implementation matter. Some experimental applications in this project are compiled in this way and are correspondingly restricted.

Processors of interest to us are general-computation main system processors with small numbers of cores on uniform memory-access (UMA) architectures, which may involve one or more separate processors, as in canonical symmetric multi-processors (SMP). Non-uniform memory-access (NUMA) architectures, with multiple processor sockets internally networked and separate memory attached to each socket, are outside the scope of this work.

3.2 Development of CMP Multi-Core

In 2006, multi-core processors were widely adopted with the advent of Intel Core2 chips. Currently, a broad variety of chips are available from Intel (see table 3.1 for examples), AMD (see table 3.2 for examples), Samsung, Qualcomm, and others, and it is nearly impossible to acquire a computer without hardware parallel processing through normal consumer channels. Earlier, symmetric multi-processor (SMP) chips and systems existed where every processor was physically identical, separate, and mostly independent of all the others. SMP systems were generally available only through a small number of vendors targeting specific high-performance markets, as operating system and application support were also quite unusual. Modern multi-core chips are characterized by more shared on-chip resources, particularly the multi-level memory cache hierarchy. Shared resources have led to lower and less predictable performance than with older architectures; the cost and complexity are dramatically reduced, however. Architectural designs vary in core count, cache hierarchy size, cache hierarchy depth, and eviction policies.

Since 2006, the number of processors and hardware threads has increased, the processor has absorbed the memory controller (Northbridge chip), and the cache hierarchy has gotten larger and deeper with L3 cache becoming standard and L4 cache coming into the market recently. These advances boost performance, but the gap between memory bandwidth and

processor speed (popularly referred to by various names, including "The Memory Wall") is generally regarded to remain the single largest crippling factor to a high level of scalability in parallel applications running on modern processors.

Memory Wall: Due to shared resources in the memory hierarchy, multi-core applications tend to be limited by off-chip bandwidth. [40].

Bandwidth to main memory: While main memory access is supported by a multi-level caching system, main memory bandwidth is regarded as the chief limiting factor to performance on modern computers. Other shared resources throughout the system do not typically have such extreme adverse effects on high-performance computational systems, but any contentious aspect leads to performance degradation in a shared environment.

Shared memory bandwidth has a negative effect on concurrently executed applications as each application makes unique demands on the memory system. As the operating system schedules alternate execution of applications, large portions of the cache hierarchy must be disrupted to accommodate new tasks and shared between them.

Shared memory bandwidth penalizes parallel applications through the progressive starvation of increasing parallelism. Parallel applications already suffer from asymptotic speedup due to fractional sequentialization, as demonstrated by Amdahl's Law (see, for example, [53]). In general, researchers present a consistent message about the state of technology today.

For example, [40] and [93], both operating with large-scale cluster supercomputers composed of multi-core nodes, argue that the Memory Wall is a reality. While Simon [93] specifically assesses application performance on several large-scale computers, Diamond [40] identifies that almost every shared aspect of the memory hierarchy, be it L3 capacity or off-chip main memory bandwidth, has negative implications for performance. Diamond finds that making full use of a typical quad-core processor is difficult and rare, and expresses concern for the practical utilization of the larger-scale chips promised for the near future (circa 2011). To

this day, quad-core processors (which with SMT hardware threads appear as 8-thread machines) are probably the most common performance processors on the market, with few forays into conventional processors with many more cores.

[18], [52], and [109] discuss the fact that the memory wall is real and contention is a huge issue. Gupta [52] performs some very nice experiments physically altering the structure of their computer in order to evaluate two different memory bandwidths for a range of applications and shows higher scalability distinctly tied to increased available bandwidth.

Williams [109] works with floating-point-intensive applications, aiming not only to characterize their performance but to improve it as well. They use as a benchmark the theoretical limit of floating-point performance on the particular machine and determine the amount of bandwidth necessary to achieve it. The limiting factor is, of course, the actual bandwidth available on the system. Each application variation is evaluated for its floating-point performance and bandwidth consumption to establish where in the world of real and feasible performance it lies. Inspired by Williams [109], Chatzopoulos [18] works with statically and dynamically obtained application data in an attempt to determine on-chip and off-chip demand to estimate scalability. They find the ratio of on- to off-chip demand to be essentially meaningful.

Interestingly, Sun [96] argues that the memory wall is real, but not such a big issue, in theory. Through some manipulation of Amdahl’s and Gustafson’s Laws and the utilization of some assumptions not valid for current designs, he asserts that whole system architecture needs to be addressed, focused primarily on the memory hierarchy, in order for multi-core performance to improve.

3.3 Intel Core2 Architecture

The first commercial release of 64-bit processors from Intel was in the Core2 product line, which arrived in 2006. Multi-core Core2 processors were either from the Core2 Duo [26] or Core2 Quad [34] product lines, with either 2 or 4 cores in the package. The memory hierarchy here is quite simple, with an L1 cache exclusive to each core and an L2 cache shared between each pair of cores on the die. The processors were designed with two cores per die and one or two dies per processor package for the Duo and Quad configurations. No L3 cache was present on these chips. See figure 3.2.

3.4 Intel i7 Architecture

Following the Core2 series of processors, several branded product lines were introduced serving different markets: low-end, mainstream, and high-end/business. These were the i3, i5, and i7 [33] series processors, respectively, and were distinctly different from the Xeon [25] series of server processors targeting high-performance workstation and server markets. i7 processors came to market in 2008.

With a multi-level cache hierarchy, different levels of the cache are shared by different processor cores. In the case of Intel i7 processors, a single L3 cache is shared by every processor core within the processor package (typically four or six), and each processor core contains an L2 and an L1 cache. The L1 cache is then shared between the core's two logical cores or hardware (Hyper-)threads (two per core with current designs), also known as simultaneous multi-threading (SMT); Hyper-Threading is an Intel proprietary technology. Each L1 cache is split into two equal parts to serve data and instructions separately. See figure 3.1.
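On Linux, these sharing relationships can be read directly from sysfs rather than from vendor datasheets. A minimal sketch (Linux-only paths; on other systems it simply returns an empty list):

```python
import glob
import os

def cache_topology(cpu=0):
    """Read cache level, type, size, and sharing for one logical CPU
    from Linux sysfs. Returns [] where /sys is unavailable."""
    entries = []
    base = f"/sys/devices/system/cpu/cpu{cpu}/cache"
    for idx in sorted(glob.glob(os.path.join(base, "index*"))):
        def read(name):
            with open(os.path.join(idx, name)) as f:
                return f.read().strip()
        # shared_cpu_list names every logical CPU sharing this cache,
        # e.g. an i7 L1 lists the two sibling hardware threads and the
        # L3 lists every logical CPU in the package.
        entries.append((read("level"), read("type"), read("size"),
                        read("shared_cpu_list")))
    return entries

for level, kind, size, shared in cache_topology():
    print(f"L{level} {kind:12s} {size:>8s} shared by CPUs {shared}")
```

Run on an i7-class machine, the L1/L2 entries list only the two sibling hardware threads while the L3 entry lists all logical CPUs, mirroring figure 3.1.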

11 3.5 Hyperthreading/Hardware Threads

Hardware threads inside a core may not always operate concurrently; instead they operate alternately and opportunistically based on the availability of dependent information in the cache hierarchy. While one thread is waiting for a fetch from memory, the other may compute so long as it has the required resources. There are more and less optimal placements for threads on the processor, but the operating system, despite knowing the structure of the processors, is often unable to capitalize on it. Noteworthy is that the hardware threads themselves are not different from each other except with regard to the way they are paired and their opportunity to co-execute. They are physically indistinct and are essentially separate execution contexts with substantial portions of the core shared between them. Contrast the relationships of hardware threads to L1 cache in figure 3.1 versus figures 3.2, 3.3, and 3.4.
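Where predictable placement matters, an application can pin itself instead of relying on the scheduler. A sketch using Python's wrapper over the Linux affinity calls; the assumption that even logical CPU ids are the first sibling of each core is purely illustrative, since the true pairing must be read from the CPU topology:

```python
import os

def pin_to_cpus(cpus):
    """Restrict this process to a set of logical CPUs (Linux only).
    Returns the effective affinity set, or None where the call is
    unsupported or denied."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, cpus)       # 0 means "this process"
        return os.sched_getaffinity(0)
    except OSError:
        return None

# Illustrative assumption: even ids are the first hardware thread of
# each physical core, so this avoids co-scheduling onto siblings.
pinned = pin_to_cpus(range(0, os.cpu_count() or 1, 2))
```

Pinning one software thread per physical core in this way sidesteps the opportunistic co-execution described above, which is exactly the experimental control applied in the affinity studies of Chapter 5.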

Realistically, because of opportunistic utilization of resources, processors do not operate symmetrically, despite their physical geometric symmetry. In current architectures, individual hardware threads are identical. For purposes of notation here, where a core has two hardware threads, if only one is active with a software thread it will be considered a ‘complete’ core (n_c). When two hardware threads are active in the same core with software threads, one thread will be considered ‘complete’ and the other ‘virtual’ (n_v), with the pair ‘shared’ (n_s). In principle, these are either scheduled or yielding to the other, flipping the notion of complete and virtual between the two. The system performance is to some degree an average of the performance of the two threads. Scogland et al. emphasize how physically symmetric processors are present in hardware, and circumstances of execution then lead to substantial asymmetry in behavior [91].
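As a concrete reading of this notation, a small helper (hypothetical; the function name and the fill-cores-first placement assumption are ours, not the dissertation's) can count the complete, virtual, and shared designations for a given number of active software threads:

```python
def classify_threads(threads, cores):
    """Classify active software threads under the complete/virtual/shared
    notation, assuming threads fill empty cores before doubling up and
    each core offers two hardware threads."""
    n_v = max(0, threads - cores)  # second thread in a core is 'virtual'
    n_c = threads - n_v            # one thread per occupied core is 'complete'
    n_s = n_v                      # each complete/virtual pair is 'shared'
    return n_c, n_v, n_s
```

For six threads on a four-core part, this yields four complete threads, two virtual threads, and two shared pairs.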

Computer Name    Intel                   Speed    Cores  HT's  P L1   P L2   P L3  #L1(d+i),2,3
CoyoteTango      Core 2 Quad Q8200 [34]  2.33GHz  4      4     256kB  4MB    NA    8/2/0
RomeoBlue        Core 2 Duo E7400 [26]   2.80GHz  2      2     128kB  3MB    NA    4/1/0
Ares             Core i7-860 [33]        2.80GHz  4      8     256kB  1MB    8MB   8/4/1
Styx             Core i7-4820K [32]      3.70GHz  4      8     256kB  1MB    10MB  8/4/1
ChernoAlpha      Core i7-4720HQ [31]     2.60GHz  4      8     256kB  1MB    6MB   8/4/1
Xerxes           Core i7-3930K [29]      3.20GHz  6      12    384kB  1.5MB  12MB  12/6/1
L09473-1         Core i7-4700MQ [30]     2.40GHz  4      8     256kB  1MB    6MB   8/4/1
QSI-PC           Core i7-2700K [27]      3.50GHz  4      8     256kB  1MB    8MB   8/4/1
L09473-2         Core i7-2820QM [28]     2.30GHz  4      8     256kB  1MB    8MB   8/4/1
CrimsonTyphoon   Core i7-860 [33]        2.80GHz  4      8     256kB  1MB    8MB   8/4/1
YourMom          Xeon E5335 [25] X2      2.00GHz  8      8     512kB  16MB   NA    16/4/0

Table 3.1: Some Intel-based architectures used for experiments and comparative consideration.

Computer Name    AMD              Speed    Cores  HT's  P L1             P L2  P L3  #L1(d+i),2,3
StrikerEureka    FX-8350 [24]     4.00GHz  4      8     640kB [128,512]  16MB  8MB   16/8/1
MCMA             Opteron 6134 X4  2.30GHz  32     32    4MB              16MB  80MB  64/32/8

Table 3.2: AMD-based architectures for comparative consideration.

Figure 3.1: Typical structure of an Intel i7 processor. Four cores (sometimes more) including L1 and L2 cache, each with two hardware threads sharing L1 cache. All cores share a common L3 cache.

Figure 3.2: The Intel Core 2 architecture predates the i7. Each core maintains its own L1 cache and pairs of cores share L2, with no L3 at all. The Core 2 Quad consists of two identical processing units in one CPU package.

Figure 3.3: The AMD FX series of processors boasts a memory hierarchy with substantially less contention. L1 and L2 caches are exclusive to each core while L3 is shared.

Figure 3.4: The Intel Xeon E5335 is more reminiscent of (and contemporary to) the Core 2 architectures. L1 caches are each dedicated to independent cores with L2 shared between core pairs. L3 is not present.

Chapter 4

Parallel Performance Models

4.1 Models of Parallel Computation

(Practical Domain)

4.1.1 Amdahl’s Law

Amdahl’s Law [4] describes the most basic concept of parallelism by taking a fixed application and distributing the computation portion of it (the parallelizable part, p) over n separate resources (processors). It is generally expressed as:

T_p = T_s (s + p/n),   s + p = 1,   p ∈ [0, 1]

Where T_s is the sequential runtime of an application and T_p is the parallel runtime of that application when operated on n processors; as s increases, the opportunity for parallel performance diminishes. Amdahl's Law deals with what are considered ‘fixed-size’ problems, which are expected to get faster with the assignment of more computational resources. Fixed-size problems are prevalent in domains where the size of the computation is either limited by the size of the machines available (and machines with more processors/cores do not necessarily accommodate proportionately more system memory) or is already solved/solvable to a degree which does not demand higher resolution or more work.
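As a minimal sketch, Amdahl's Law as written above can be evaluated directly (the function name is ours):

```python
def amdahl_runtime(t_s, p, n):
    """Predicted parallel runtime under Amdahl's Law:
    T_p = T_s (s + p/n), with s + p = 1 and p in [0, 1]."""
    assert 0.0 <= p <= 1.0 and n >= 1
    s = 1.0 - p  # the sequential fraction
    return t_s * (s + p / n)
```

For example, a 100-second application with p = 0.75 run on four processors is predicted to take 100 × (0.25 + 0.75/4) = 43.75 seconds.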

As discussed in a recent survey paper by Al-Babtain et al., Amdahl’s Law finds itself extended in a variety of ways to consider different multi-core architectures [3]. These extensions are oriented more towards making hypothetical architectural decisions, rather than working in the software domain and understanding performance.

In practice, Shi found in 1996 that the sequential and parallel fractions are not practically obtainable and that they generally neglect further overheads involved in parallelization [92], which, at this point in history, may substantially include behavior of the particular computer architecture and not just software mechanisms.

The types of applications we consider fall under the ‘fixed-size’ problem domain and we use Amdahl’s Law as the starting point for our work.

4.1.2 Gustafson’s Law

Gustafson’s Law [53] describes a concept of parallelism as follows. As additional computational resources are applied to an application, the size of the parallel computational portion increases proportionately. Therefore, a problem is better solved (higher resolution, smaller error, etc.), but in essentially the same time. These types of problems are referred to as ‘fixed-time’ problems and are prevalent in large-scale computational environments, such as weather prediction, which are under practical constraints for the availability and utilization of their outcomes. Through the progressive improvement of computing hardware, a particular problem may in practice transition from being a fixed-time to a fixed-size problem.

Substantial controversy occurs in the literature between these two laws depending on what type of parallel computation is used. Arguments are posed as to which is right, while the applicability of these laws depends entirely on the details of the applications at hand.

4.1.3 Unification of Amdahl’s and Gustafson’s Laws

Shi argues that the two laws are essentially the same [92]. Juurlink and Meenderinck [60] attempt to compromise between Amdahl and Gustafson with an enhancement for asymmetric and dynamic multi-cores. Hill and Marty [56] extend Amdahl's concept with some basic models for more sophisticated multi-core designs. Gunther's Universal Scalability Law (USL) [51] was developed to unify the two models.

4.1.4 Parallel Speedup

The concept of parallel speedup exists as an evaluation of how much faster an application becomes with the utilization of additional computational resources. Generally the expression Speedup = T(1)/T(n) is relied upon, with T(n) expressing the runtime of an application on n resources, but not without general controversy over the T(1) term. “Since parallel implementations may introduce computations that are unnecessary with respect to serial implementations, T(1) is the time required to execute the task on a single processor using the ‘best’ serial implementation.”[13] It is highly circumstantial whether T(1) is a sequential application or a parallel application operated on one thread. Serial implementations may not exist for general parallel applications for any number of reasons, including budget, time, and lack of further utility. Herein, T(1) for us means a parallel application operated on a single thread or processor, physically the same executable as used for T(n).

The expression for speedup derived from Amdahl's Law is: Speedup = 1/(s + p/n). The result is diminishing returns through added parallelism under an assumed fixed problem size. Gustafson's Law resolves to Scaled Speedup = N + (1 − N)s, where N corresponds to both the processor count and the problem size, which vary together. Counter to Amdahl's speedup, continual improvement is achieved through the corresponding increase of problem size.
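The two speedup expressions can be compared side by side in a short sketch (function names are ours):

```python
def amdahl_speedup(s, n):
    """Fixed-size speedup: 1 / (s + p/n) with p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_scaled_speedup(s, n):
    """Fixed-time (scaled) speedup: N + (1 - N) s."""
    return n + (1 - n) * s
```

With a 10% sequential part on ten processors, Amdahl's speedup saturates at 1/0.19 ≈ 5.26, while Gustafson's scaled speedup remains near 9.1 because the problem grows with the machine.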

It is worth noting that while speedup is an interesting metric, information is lost. “The continued reliance on speedup as the primary measure of performance may be attributed to the use of execution time as the unit of measure. Using time as a measure of work has several drawbacks. First, it varies with the computer used. Second, it is simply a statistic which does not provide any insight about the algorithm [implementation].” [13] By predicting runtime we can always use that to generate scalability. If we were to focus only on scalability prediction, the real-world connection, i.e. how long the application will actually run, would be missed.

4.1.5 Roofline

Prinslow emphasizes the notion of computational intensity as a major point of interest, discussing how program blocks may be either compute-bound or memory-bound [83]. Williams, Waterman, and Patterson developed the Roofline model, motivated towards the diagnosis and improvement of parallel applications [109]. Roofline formalizes this into a performance-analysis framework for optimizing implementations. Roofline relies on measures of memory bandwidth and computational power (generally giga-flops [GFLOPS]) and their ratio: ‘operational intensity’. Roofline allows one to characterize an application as being clearly memory- or compute-bound. Nugteren and Corporaal bring Roofline into the theoretical domain and extend it for the analysis of algorithms [79].
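The core of the Roofline model reduces to a single min() expression; a minimal sketch, assuming peak compute in GFLOPS and memory bandwidth in GB/s (the function name and units are ours):

```python
def roofline_attainable_gflops(peak_gflops, bandwidth_gbs, operational_intensity):
    """Attainable performance = min(peak compute, bandwidth x operational
    intensity), where operational intensity is FLOPs per byte of memory
    traffic.  Below the 'ridge point' the kernel is memory-bound; above
    it, compute-bound."""
    return min(peak_gflops, bandwidth_gbs * operational_intensity)
```

For a hypothetical machine with 100 GFLOPS peak and 25 GB/s of bandwidth, a kernel at 2 FLOPs/byte is memory-bound at 50 GFLOPS, while one at 8 FLOPs/byte reaches the compute roof.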

Mega-flops (MFLOPS) [59], millions of floating-point operations per second, has been a very traditional measure of the performance capabilities of computational hardware and a reasonable target for any piece of software to achieve on such a platform. This measure seems to be less often reported but does appear in recent literature in comparisons of architecture performance [11]. The meaningfulness of such a measure is less relevant when the tasks at hand are specifically non-numerical, or even transitioning away from floating-point. It is also one of the easiest measures to abuse when making performance claims [7]. Frigo and Johnson measure runtime but then express their results in MFLOPS for a wide range of software (where the internal operations are unknown) with the caveat that “The MFLOPS measure should, thus, be viewed as a convenient scaling factor rather than as an absolute indicator of CPU performance” [47].

MIPS (Millions of Instructions Per Second) is another hardware-centric performance metric often encountered. It’s simply a measure of how many machine instructions are processed per second. A drawback of Roofline is that it specifically relies on some performance quality like MIPS or FLOPS. Frigo and Johnson observe that “...there is no longer any clear connection between operation counts and speed, thanks to the complexity of modern computers.”

4.2 Algorithmic Computational Models

(Theoretical Domain)

Algorithms exist simply as concepts and have no tangible performance measure, only theoretical. Theoretical performance is expressed asymptotically, in big ‘O’ notation, as a notion of time relative to the size of the input while the input approaches infinity. Big ‘O’ notation drops all but the most significant terms in the expression and also drops all constant coefficients, yielding typically quite simple expressions. To measure actual performance of an algorithm it must be implemented in a language and operated on an actual computer. We refer to Schatzman: “...we should note that computer languages are neither fast nor slow − only implementations can truly be associated with speeds” [89].

The Random-Access Machine (RAM) [42] is a model for analyzing algorithms on an ideal sequential machine. Parallel Random-Access Machine (PRAM) [43] is an extension to RAM for parallel algorithms on ideal shared-memory machines. The LogP machine model [38] is also a parallel machine model for distributed systems. Bulk Synchronous Parallel (BSP) [102] is another parallel model including more substantial communication concepts for distributed systems. These models are all focused on algorithm analysis on abstract machines. With this focus, they are language independent and know nothing of actual real technologies but rely on their parametric shapes.

Other more advanced models of parallel computation exist which fall into the theoretical domain of system modeling and algorithm performance optimization. Valiant invents the Multi-BSP model, derived from the BSP model, for aiding in the design of ‘portable algorithms’ which may be simply adjusted in a predictable way at compilation or implementation time for ‘optimality’ [103]. Here, the algorithms and hardware are necessarily white-box and grey-box entities, respectively (contents are well or sufficiently known), so predictive capabilities are both preemptive and restricted to situations with explicit knowledge. No attempt is made at system identification, so the model lacks retroaction. Variations in implementation are also outside the scope here.

Where time complexity (the ‘big-O’ notation and its relatives) is used for assessing algorithms, Chellappa et al. describe the serious problems that exist when trying to use any kind of algorithm-oriented model for any kind of actual prediction on real machines: “The O-notation neglects constants and lower order terms; for example, O(n^3 + 100n^2) = O(5n^3). Hence it is only suited to describe the performance trend but not the actual performance itself. Further, it makes a statement only about the asymptotic behavior, i.e. the behavior as n goes to infinity. Thus it is in principle possible that an O(n^3) algorithm performs better than an O(n^2) algorithm for all practically relevant input sizes n.”[19] In practice, n is a very finite quantity due to the structure of real hardware. Further, they observe two orders of magnitude difference in runtime over four different parallel implementations of matrix multiply, each requiring exactly 2n^3 operations, and conclude that correlating actual runtime to time complexity is unlikely [19].
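Chellappa et al.'s point can be illustrated numerically with assumed (hypothetical) operation-count constants: the asymptotically worse algorithm wins for every input size below the crossover.

```python
def cost_cubic(n):
    """Hypothetical O(n^3) algorithm with a small constant: 2 n^3 ops."""
    return 2 * n ** 3

def cost_quadratic(n):
    """Hypothetical O(n^2) algorithm with a large constant: 1000 n^2 ops."""
    return 1000 * n ** 2

# 2 n^3 < 1000 n^2  <=>  n < 500, so the cubic algorithm is cheaper
# for every input size below the crossover at n = 500.
CROSSOVER = 500
```

If practically relevant inputs never exceed a few hundred elements, the “worse” cubic algorithm is the right choice, exactly the situation big-O cannot reveal.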

Singh notes that theoretical models of (parallel) computing like RAM and PRAM are useful for algorithmic analysis but not much else [94].

Regarding algorithmic analysis, Crovella concurs: “Although analysis provides the conceptual tools to predict parallel program performance, most previous work in analysis has not been directly used by programmers to predict performance of real applications for two reasons:”, “...alternative implementations of a program may often have the same asymptotic performance function, yet differ in important ways in the values of the associated constants”, “The work required in developing an analytic model can greatly outweigh the effort in simply implementing and measuring a proposed alternative program structure” [37]. The same variation may be true of just the performance of compilers or interpreters for a given codebase.

4.3 Drawbacks in the Modern Era

Except for Roofline and its variants, which are specifically developed to address the constraining effect of memory bandwidth on modern processors, the other models suffer substantially in their applicability and portability when modeling the performance of applications on modern systems. Any method for modeling an algorithm will show wide variation when used for an actual application due to high-level language choice, data structure selection, library/API, and compiler/interpreter effects. With the further complexities of machine hardware, there is little opportunity for such simple models to provide accurate predictions for real applications.

Pre-multi-core, it was already known that neither Amdahl's Law nor Gustafson's Law was sufficient to identify invariants of the application such as the sequential (parallel) part [70]. Into the multi-core era, it has become quite clear that because machines differ, speedups obtained on one machine may not translate to speedups on others [103]. See figure 4.1.

Figure 4.1: The normalized runtimes t_n = t(n)/t(1) for a parallel solid meshing application are presented for cross-system comparison. Two machines of similar vintage but different architecture are operated: ‘Yourmom’ is a dual-socket SMP Intel Xeon E5335 [25] and ‘Ares’ a first-generation Intel i7-860 [15][33]. Ares doubles for two platforms, with Hyperthreading both enabled (4 cores, 8 hardware threads) and disabled (4 cores, 4 hardware threads). The theoretical performance curve of a perfectly parallel application (no sequential part) is also presented.

With no variation in either the application or its dataset, nontrivial differences in the actual performance on real hardware are revealed. Where the core concept of Amdahl's Law, t(n) = t(1)(s + p/n), describes the invariant proportion of the software which is either sequential or parallel, this demonstrates that Amdahl's Law alone is not predictive of software performance on modern hardware, nor can it identify the invariants of the implementation. Using the runtime data for these three platforms and minimizing the mean absolute error (MAE) in finding the invariant parallel part p ∈ [0, 1] across that data, we obtain the following results:

Ares (HT off), p = 0.768004, MAE = 0.0022506
Ares (HT on), p = 0.728, MAE = 0.0107833
Yourmom, p = 0.984998, MAE = 0.000749495
If all results are evaluated simultaneously: Composite, p = 0.768004, MAE = 0.0705001

Evaluation with Yourmom reveals nearly perfectly parallel behavior by both visual inspection and the derived parallel portion p = 0.984998 (98.50%). On the other hand, evaluation on either Ares variant suggests substantial deficiency in parallelism with p = 0.768004 (76.80%). Clearly, the architectural differences between the machines are the cause of the performance variation and a more sophisticated expression is necessary.
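The fitting procedure used for the figure can be sketched as a dense grid search over p minimizing mean absolute error (a simplified reconstruction, not the dissertation's actual code; the grid step is our assumption):

```python
def fit_parallel_fraction(runtimes, step=1e-3):
    """Fit the invariant parallel part p in t(n)/t(1) = (1-p) + p/n by
    minimizing mean absolute error over a dense grid on [0, 1].
    `runtimes` maps thread count n -> normalized runtime t(n)/t(1)."""
    best_p, best_mae = 0.0, float("inf")
    p = 0.0
    while p <= 1.0:
        mae = sum(abs((1.0 - p) + p / n - t)
                  for n, t in runtimes.items()) / len(runtimes)
        if mae < best_mae:
            best_p, best_mae = p, mae
        p += step
    return best_p, best_mae
```

Feeding it synthetic data generated from p = 0.8 recovers that value to within the grid resolution, with near-zero MAE.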

4.3.1 Runtime Variability, Performance Uncertainty, and Noise

Variation in any quantity makes individual instances of that quantity more difficult to predict. Barnes observes that operational noise in a system causes random execution-time variability, leading to overall difficulty and reduced accuracy for performance and scalability prediction [9]. If the magnitude of variation is sufficiently large, the quality of any discrete prediction will lose its meaning.

Performance variation can arise from several causes. Operating system and kernel operation during parallel computation perturbs program runtimes [78]. A multitude of different services in the operating system will have different perturbing effects with varying duration and magnitude. Even a well-written application in a controlled environment will experience perturbation.

Hardware effects in complicated systems are increasingly interdependent. Hennessy finds that performance-helpful aspects of modern processors may not be universal improvements. For example, microarchitectural features aimed at a specific program behavior could negatively impact some applications [55].

Even before we had access to multi-core machines, it was clear that the main factors in the performance of parallel programs were the computational workload, the communication required between processes, and contention for shared resources, with the associated synchronization constructs causing further delays. Delays due to hardware disproportionately impact parallel rather than sequential code due to the specialized synchronization requirements that arise [1].

Program behavior may be unpredictable and unrepeatable due to memory behavior, execution skew across several processors, and measurements which disturb the actual performance [40]. Increasingly popular in parallel development are parallel computing libraries and API's, often built directly into the compiler, which often do not apply basic concepts like process and thread affinity. While these libraries and API's have broadened the accessibility of parallelism to many new applications and programmers, they can also introduce further causes of variation.

Necessarily, the variation in the ‘true’ runtimes leads to probabilistic models for prediction [58]. Scalar models are therefore limited and aren’t particularly valuable. To feed and generate a probabilistic model, multiple data points are required. Kramer and Ryan highlight the variability in execution time on distributed systems based on statistically significant performance evaluations on each system using a variety of applications [64].

When cross-prediction is the goal and a probabilistic model is not used, loose bounds seem to be the outcome. Mendes shows upper and lower bounds of factors of nearly 4 and 1/3 [74]. The following year, Mendes and Reed show improved upper and lower bounds, each consistently within factors of 2 and 1/2 of observed results [75]. In a practical sense, a performance model with no architectural information and no independent machine performance information will yield no viable avenue for cross-prediction. Baker et al. worked with performance optimization on large distributed systems and observed “A general solution is not possible without taking into account the specific target architecture.” [8]. One cannot simply curve-fit analytic expressions.

4.3.2 System Symmetry

Processor hardware is physically symmetric in the geometric sense, but not so much in practice [91]. Studies such as Sun and Chen [96] rely on this symmetry and therefore risk presenting overly optimistic estimates of performance. Most models lack any notion of contention or memory hierarchy, and assume infinite (or at least sufficient) memory bandwidth, implying, intrinsically, that applications are compute-bound. With modern systems and applications, this seems a safe assumption more often for sequential applications. Parallel applications are, of course, more demanding on the system.

4.3.3 Lack of Hierarchy

The flat-memory approximation ignores the speed advantage of things cached close to the processor cores and the slowdown of things stored beyond main memory (disk, network, etc.) [54]. This assumption is also characteristic of the long history of parallel task scheduling in the literature. Performance effects relating to the memory hierarchy may lead to opportunities for super-linear speedups, even though super-linear behavior is nominally impossible on homogeneous (and symmetric) systems [94].

4.3.4 Continuous Functions

The performance models typically presented in the literature are expressly continuous functions. Implicit is the assumption that the parallel work available for computation is infinitely divisible. This assumption is also characteristic of parallel task scheduling in the literature. It is appropriate to very fine-grained parallelism, balanced work, and unperturbed execution, but not applicable to coarse-grained or task parallelism, which are quite prevalent in contemporary systems. Where modern task-parallel API's are used, even fine-grained loop parallelism is broken into larger task blocks to reduce the overhead of the parallel system.

Both of our implementation case studies involve computations on a voxel space from the beginning to the end of an experimental process. Our first case is a time-evolving numerical integration and our second case is a transformation on that data. Both cases are spatially parallel with no opportunity for temporal parallelism [81]. Parallelism is applied, generally, across the x-axis in the 3D space, resulting in a very coarse-grained task parallelization as seen above, which is favorable, or at least minimally antagonistic, to OpenMP [101]. All computations are performed in a synchronous manner with no task communication, with the exception of some trade-offs in feature selection considered later. Tasks are assigned to computational threads dynamically, so communication with the internal parallel API task scheduler is implicit.
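The coarse-grained x-axis decomposition described above can be sketched as a simple partition of voxel indices into task blocks (the function name and chunk size are illustrative, not taken from the case-study code):

```python
def x_axis_tasks(nx, chunk):
    """Partition voxel-space x indices [0, nx) into coarse-grained task
    blocks, as a dynamic scheduler would hand them to worker threads.
    Returns half-open (start, end) index ranges."""
    return [(start, min(start + chunk, nx)) for start in range(0, nx, chunk)]
```

Each block is an independent slab of the voxel space; a dynamic scheduler simply hands the next unclaimed block to whichever thread becomes idle.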

Chapter 5

Operating System Process/Thread Scheduler Effects

Multi-tasking is fully preemptive in modern operating systems. Application processes are allocated processor resources dynamically and scheduled slices of time according to some prioritization or fairness criteria (figures 5.1, 5.2, and 5.3 show the transient nature of this scheduling on Microsoft Windows 10). Processes and their threads may be constrained to particular resources with corresponding affinity masks. Generally, they are created unconstrained for greatest flexibility in scheduling [35].

The migration of threads across different processors causes performance problems as a result of processor architecture. Memory performance is already known to be a limiting factor for modern systems. The cache hierarchy has come into existence to bridge the disparity in performance between processors and system memory. When threads are migrated from hardware thread to hardware thread (core to core), extra work must be performed to flush dirty information from the old core cache (and possibly the new core) and then refill the cache on the new core with data to serve the new process as well as the necessary machine instructions. Migration has varying effects depending on the structural relationship between cores. The cache flush may occur down to L1, L2, or L3 depending on the destination. Refilling always occurs up to L1, of course. There is the potential for reusing existing cached memory depending on the circumstances. Reuse of cached information can only occur for threads sharing a memory address space (i.e. in the same process).

Figure 5.1: CPU history shows steady 50% utilization with 4/8 threads running. Despite the steady usage history, all cores show almost random activity. Affinity is clearly absent.

Figure 5.2: CPU history shows steady 79% utilization (Windows 10 sometimes overstates this quantity), with 5/8 threads running. Activity is unsteady across all cores, but fuller than in Figure 5.1 with 4/8 active.

Figure 5.3: CPU history shows steady 100% utilization (Windows 10 sometimes overstates this quantity) with 7/8 threads running. Activity is unsteady, but still more regular than Figure 5.2 with 5/8 active.

5.1 Process Affinity

Process affinity describes the set of processors in a system on which a particular process may be executed. By setting an application's processor affinity, a process may be constrained to a subset of all available processors as it is executed on a system. Affinity may be set externally by the OS, by some other agent acting through the OS, or internally by the application as a recommendation for a subset of the processors made available to it by the OS. If unconstrained, the OS is free to schedule and migrate the process amongst different processors (the OS always has ultimate authority on this matter regardless of how an application configures itself).

5.2 Thread Affinity

Thread affinity describes the set of processors in a system on which a particular thread of execution within a process may be scheduled to execute. Thread affinity is necessarily a subset of process affinity and, again, serves as a strong suggestion to the OS for where to place the thread, but the OS retains ultimate authority on the matter. [36]
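On Linux, this kind of affinity request can be made through os.sched_setaffinity; a hedged sketch (the helper name is ours, the API is Linux-specific, and, as noted above, the OS retains final authority):

```python
import os

def pin_process_to(cpus):
    """Request that the calling process run only on the given CPUs.
    Uses the Linux-only os.sched_setaffinity; other platforms expose
    different APIs (e.g. SetProcessAffinityMask on Windows)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cpus))  # pid 0 = the calling process
        return os.sched_getaffinity(0)      # the affinity actually granted
    return None                             # affinity API unavailable here
```

Reading the affinity back after setting it reflects the document's point that the mask is a strong suggestion: the set the kernel reports is the set that actually binds.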

5.3 Thread Placement

While migrations result in negative performance effects, processor allocations (the placement of threads on processors), even without migrations, will exhibit irregular performance. Even with symmetric and identical cores, there exists some degree of contention between them through concurrent access to shared levels in the memory hierarchy. The operating system may or may not consider processor architecture for performance effects. The application developer may or may not consider processor architecture either. Consideration given by the application or operating system may or may not complement each other or even be suitable across a range of architectures. Best-, worst-, and probabilistic average-case performance can be modeled based on these conflicts.

5.4 Affinity in Practice

While configuring thread affinity should lead to fewer harmful thread migrations during scheduling, the system thread scheduler necessarily has less freedom in scheduling. There exists opportunity, increasing with system load or decreased load balance, for negative performance effects. Where communication or synchronization is required between application threads, spurious deferral of one thread leads to a chain reaction of deferral for dependent (waiting) threads, and the temporary idling of those resources. The coarser the parallelism (the larger the tasks), the larger the impact. Fine-grained parallelism with smaller tasks will have shorter idle intervals at the expense of greater task management overhead. Not all developers set (or know of) affinity, so threads are free for migration. Parallelism is often left fully managed by task-parallel API's such as OpenMP [39], Intel Cilk Plus [67], Intel TBB [82], etc. Affinity is not universally set by the system.

5.5 Variation in Performance

The variety of possibilities in thread placement and migration, and the dynamic nature of migration alone, can lead to substantial variation in runtime (wall-clock time from start to end, including system overhead) for a real application. Transient (and especially continuous) effects within the operating system will only contribute negatively. Any change of execution context will require some degree of flush and fill in the memory hierarchy.

Some substantial efforts have been made toward characterizing and modeling performance in the presence of noise on large-scale computer systems [9]. The degree and variety of transient events, and their effects on individual user-level workstations with off-the-shelf operating systems in real operating environments (home computers, academic, and industrial workstations), are innumerable and exceed low-level noise. To counteract transience and also capture variations, our benchmarking will rely on a statistically significant number of measurements, outlier rejection, and averaging. Transient events could include, but are not limited to, passive and active user activity (e-mail, Internet, application usage, etc.), system maintenance (updates, disk maintenance), system security (anti-virus, anti-malware, etc.), actual malware (but hopefully not), device drivers, scheduled and recurring events, etc. These events may or may not be detectable.
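The measure-many, reject-outliers, average approach can be sketched with a simple symmetric trimmed mean (a sketch only; the dissertation's actual outlier-rejection rule may differ):

```python
def robust_mean(samples, trim_fraction=0.1):
    """Average repeated runtime measurements after symmetric outlier
    rejection: sort, drop the extreme trim_fraction from each end,
    then take the mean of what remains."""
    ordered = sorted(samples)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)
```

A single transient event (a 100-second run among one-second runs, say) is discarded before averaging rather than dragging the estimate upward.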

5.6 Experimental Thread Affinity Effects

This experiment uses a small numerical kernel (a toy) which accesses memory either randomly (upper curves) or sequentially (lower curves). The kernel is run with incrementally more parallel threads in one process, each performing the same work (runtime should remain constant with increased concurrency; this is a ‘fixed-time’ simulation in reference to Gustafson's Law). The memory footprint is intended to be small and therefore not emphasize bandwidth limitations.

OpenMP is used for parallelism. Through thread affinity, threads are mapped to the hardware architecture, where threads #0,#1,#2,#3,#4,#5,#6,#7 generally map to processors 0,2,4,6,1,3,5,7 and processors 0,1 and 2,3 and 4,5 and 6,7 share L1,L2 cache on the Intel i7 architecture, with specific variations described (figures 5.6, 5.7, 5.8, and 5.9). Where architectures have more or fewer processors (Intel Core2) the pattern, of course, is adjusted appropriately (figures 5.4 and 5.5). Thread affinity is considered in a variety of patterns:

Figure 5.4: Runtime is plotted versus thread count for various thread affinity assignment experiments. Showing negligible variation in runtime, the Core 2 Duo architecture, lacking SMT technology, is essentially ambivalent to these thread affinity experiments.

‘OFF’ indicates normal system behavior, where the OS is free to schedule threads on any core.
‘ON’ indicates a strict affinity of one thread per core filling the architecture: threads #0,#1,#2,#3,#4,#5,#6,#7 mapping to processors 0,2,4,6,1,3,5,7.
‘HALF’ indicates that the first four threads follow the strict mapping of ‘ON’ and the latter are unconstrained as in ‘OFF’.
‘SMT’ indicates threads map as pairs: #0,#4 to 0,1; #1,#5 to 2,3; #2,#6 to 4,5; #3,#7 to 6,7.
‘HALF SMT’ indicates the first four threads map according to ‘SMT’ and the latter are unconstrained as in ‘OFF’.
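The strict ‘ON’ visitation order (physical cores first, then their SMT siblings) generalizes to other core counts; a small sketch, where the parameterization over SMT width is our assumption:

```python
def strict_affinity_order(cores, smt=2):
    """Generate the 'ON' processor visitation order: one hardware thread
    per physical core first (even logical processors under the 0,1-per-core
    numbering used here), then the sibling hardware threads."""
    logical = cores * smt
    # First pass: processors 0, 2, 4, ...; second pass: 1, 3, 5, ...
    return list(range(0, logical, smt)) + list(range(1, logical, smt))
```

For a four-core SMT part this reproduces the 0,2,4,6,1,3,5,7 order above, and for a two-core part it compensates to 0,2,1,3.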

Figure 5.5: Runtime is plotted versus thread count for various thread affinity assignment experiments. Showing negligible variation in runtime, the Core 2 Quad architecture, lacking SMT technology, is essentially ambivalent to these thread affinity experiments.

Figure 5.6: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4820K shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.

Figure 5.7: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4700MQ shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.

Figure 5.8: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4720HQ shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.

Figure 5.9: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-3930K shows almost flat (but slightly deteriorating) performance up to six threads when affinity is properly configured. Performance beyond that point shows marked degradation.

Architectures considered are either SMT-based (figures 5.6, 5.7, 5.8, and 5.9) or non-SMT (figures 5.4 and 5.5). Non-SMT machines are essentially ambivalent to the matter of thread affinity as demonstrated, and all plots overlay with no outstanding variation. SMT-based architectures all behave similarly to one another, but markedly differently from the non-SMT architectures. Of the five affinity designs considered, only three plots generally arise due to logical duplicity at the OS scheduler level (minor variation leads to a fourth plot).

For affinity OFF, runtime continually increases with increasing thread count (resource contention and migration are implied). With all other options, runtime is FLAT for up to half of the maximum threads (no contention is implied, which agrees with the code). As illustrated previously in figure 3.1, after half of the maximum threads, cache contention for L1 and L2 physically becomes a factor between paired threads on SMT architectures (L3 is always in contention) and extends runtimes accordingly.

Chapter 6

Structure of a Parallel Application

6.1 Composition

As described by Amdahl’s Law, a parallel application can be most simply considered as an application having some portion which is entirely sequential and a remainder which is parallelizable. Often, the parallel portion is portrayed as a single block of infinitely divisible work, an abstraction which is quite far from reality. The parallel portions are often composed of multiple parallel sections, each separated by sequential blocks (see figure 6.1). Each parallel section is composed of one or more tasks, and each task may itself be parallelized or the tasks may be collectively processed simultaneously using the task-parallel model.

Beyond just the sequential and parallel portions, Sun and Ni describe applications composed of computational blocks each with varying degrees of parallelism and demand within an application [97]. These may result from algorithmic limitations, implementation limitations, or data size limitations (not enough work to spread around). With any block engaging less than all parallel resources, a reduction of the average parallelism or the ‘degree of parallelism’ (DOP) [58] of the application results.

Figure 6.1: Parallel software may be structured with a wide variety of patterns. Frequently encountered patterns which negatively affect parallel performance involve sequential computation. Alternating parallel and sequential sections allow for setup/teardown/transition between parallel parts where parallel operation is inconvenient or even impossible to organize. Parallel sections may also include critical sections which moderate access to contended resources. Access is restricted while any thread is occupying the resource, requiring all other threads to wait until the resource is relinquished.

Threads of execution within a parallel section may require access to resources, software or hardware, which can be accessed by only one thread at a time (i.e. sequentially). While memory may be simultaneously read from multiple threads, simultaneous reading and writing is a problem. Any resource in memory subject to being updated or written to is subject to this kind of constraint. Protection of mutually exclusive-use resources may be through a variety of constructs and mechanisms, with the net assemblies often regarded as ‘critical sections’ or ‘mutexes’ (see figures 6.1 and 6.2).
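The serializing effect of such a construct can be illustrated with a minimal sketch; Python’s threading.Lock stands in here for whatever mutex mechanism an application actually uses:

```python
# Illustrative sketch (not from the dissertation): a critical section
# serializes otherwise-parallel threads via a lock.
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        # independent parallel work would happen here ...
        with lock:          # critical section: one thread at a time
            counter += 1    # protected update of a shared resource

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

The lock guarantees the correct total, but every increment passes through a sequential bottleneck, which is exactly the convoying risk described for figure 6.2.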

The range of problems available in computing is both wide and deep, as computers serve a multitude of needs in a multitude of environments and disciplines. Some problems can be solved in fixed time and scaled in complexity accordingly (Gustafson). Some problems are of fixed size and do not scale any further for any practical reason (Amdahl). Worlton converges similarly [111]. Some computations are fundamentally small kernels whose demand comes from continually streaming memory while operating on small, predictable local blocks (PDEs, FDI, etc.). At no time is the whole problem addressed in its entirety.

Figure 6.2: The existence of critical sections alone is not a performance liability. The performance impact of any particular critical section is a function of the contention for its usage (concurrency) and the proportion of the parallel section it occupies, which is in turn a function of both its physical size and the frequency of its usage in the parallel section. A large section called infrequently may be just as disruptive as a small section called constantly. All parallel portions may end up suspended, stacked front to back in a ‘convoy’ during periods of high demand. Catastrophically, the worst case leaves the entire parallel section sequentialized. Realistically, there also exists some overhead for the sequentialization which extends the parallel runtime beyond the equivalent sequential application.

Some computations involve operations on graphs, databases, and other similar data structures which are more distributed and less regular in memory. They suffer from poor locality of reference and a low degree of predictability, with essentially random access patterns. Matrix multiplication and FFT are well known and important examples here and are more tunable to target particular architectures through decomposition [103]. “Data intensive irregular applications that rely on pointer based data structures, such as graphs, are harder to optimize due to their intrinsic usage of pointers to access data and to their less-predictable pattern of data access.” [44]. Not all of these applications can be decomposed, or else they are not decomposable with reasonable overhead in either time or space.

Parallel work tends to occur on several forms of data (see figure 6.3). Data may be configured in blocks which are carefully arranged for the application and operated on with multiple threads simultaneously. Mathematical operations like FFT and matrix multiplication often fall into this category (see figure 6.5). Data may also be arbitrarily complicated structures which are not simplifiable, are scattered throughout memory, and have irregular access patterns. 3D rendering, mesh manipulation, databases, and irregular graph or unstructured grid processing may be this way. Data may also come in streams occupying sequential and contiguous memory (see figure 6.4). Streams may represent spatial or temporal information. Audio, video, and highly structured problems like finite-difference integration are of this type. We specifically address this type of work. Outputs of parallel work may, of course, take on similar structures, not necessarily congruent to the input, but also not necessarily entirely separate structures.

Figure 6.3: Different data structures used by different algorithms may have dramatically different representations in memory. The algorithms in operation will normally access just portions of the data structure and may be highly tuned for efficient operation (or not). Pictured is the boundary of all memory (purple) and the memory occupied by a data structure (green) with pointers (black). The moving window of L3 or L2 cache is abstracted as the cyan square. Memory accesses may be very stationary as in the blocked/partitioned model. In the streaming case, accesses may be highly sequential with new data “sliding in” as older data “slides out”. For an irregular or graph-type structure, memory accesses may be irregular, erratic, and seemingly random. Of course, multiple data structures may be simultaneously accessed, particularly for input and output or multiple structures for either, further complicating the situation.

Problems of the streaming type may be less concerned with the size of a multi-level cache relative to the size of the problem so long as the data can stream. Where streaming cannot be so capitalized upon, the size of the cache at different levels, and ultimately its speed, may be of more critical importance. An alternate approach described by Badur et al. (streaming is referred to as ‘naïve’) suggests parallel operation on L2-sized blocks to be a higher-performance option, but demonstrates it to be no more than 13% better on ‘large’ problems [6]. This might require very careful targeting due to the reality of private L1 caches and the prospect of sharing conflicts.

Figure 6.4: The wood-processing equivalent of streaming-type algorithms, the chipper-shredder has a fixed throat size (L2, L3 cache) and processes material from front to back (main memory) with ‘results’ expelled as rapidly as material is ingested. Only a small amount of material relative to the total workload is processed in the throat at one time, but it is completely processed. [110]

Figure 6.5: The CNC milling machine or router exists in contrast to the chipper-shredder. It has a large parameterized workspace (partitioned or tiled data) which is operated on until the product is finished and exchanged when complete; in a degenerate condition the workspace (like the cache) may hold the entire workload [99].

6.2 Decomposition

Regardless of the internal complexities of a parallel application, it’s generally necessary to view it more simply and abstractly. Even a parallel application with a simple structure may have strong data dependencies leading to exceedingly difficult static analysis. Referring back to Amdahl’s Law, the application can be viewed as aggregates of sequential and parallel operation, with the behavior of those aggregates statistically averaged by necessity. More internal details can lead to a more detailed model (and hopefully more accurate predictions) of outward behaviors. The worst case is that no internal details are available and the application is completely a black box. This is especially pertinent to commercial off-the-shelf applications or systems which are so opaque or otherwise complicated (or which have strongly data-dependent behavior) that static analysis is utterly infeasible.

The parallel part of an application will most likely be a collection of Np discrete tasks to be performed. Here we assume that the tasks are identical in nature, at least in the average case, or are so scheduled for work on threads internally through a load balancer. While multiple parallel and sequential portions are possible in the operation of an application, we treat them as bulk terms for the total of the sequential and parallel parts and assume a similar task load per parallel portion, at least on average. When Np is small compared to the number of processors n applied to the application, or n is not a factor of Np, the apparent parallelism of the system or its parallel performance may suffer due to the remainder of work left over for various n, resulting in an apparent ‘load imbalance’. If Np is large, the parallel part may be approximated as continuously divisible, as the effect of the remainder on the outcome is small.
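The load-imbalance effect of the remainder can be quantified with a simple sketch (illustrative only; it assumes equal-cost tasks processed in rounds of n):

```python
import math

def apparent_efficiency(Np, n):
    """Fraction of n-thread capacity actually used when Np discrete
    tasks are processed in rounds of n (illustrative measure)."""
    rounds = math.ceil(Np / n)
    return Np / (n * rounds)

# 100 tasks: n = 10 divides evenly, n = 8 leaves a 4-task remainder round.
print(apparent_efficiency(100, 10))  # 1.0
print(apparent_efficiency(100, 8))   # ≈ 0.96
```

As Np grows, the final partial round shrinks relative to the whole and the efficiency approaches 1, which is the “continuously divisible” approximation above.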

6.3 Parallelization (from the literature)

Some variation in terminology exists across different disciplines within the parallel scheduling literature, so here we disambiguate. The scheduling literature was strongest during the ’80s and ’90s, focusing mostly on supercomputing applications and system sharing in large-scale computing environments, which were then the prevalent parallel computing environments. In terms of parallel scheduling, a ‘task’ (user-requested) is specifically an application requiring both resource allocation and also the scheduling of time on a system for operation. ‘Resources’ are characteristically processors and, to a lesser extent, system memory.

A sequential task is any task which may strictly be run using only a single resource. A parallel task is simply a task which may be operated on concurrent resources [50]. The quantity of resources which may be utilized may be fixed (predetermined) or bounded (upper or lower). A parallel task with a fixed resource requirement is historically regarded as simply ‘Parallel’ (which is ambiguous contemporarily) or, specifically, ‘Rigid’ [12]. A parallel task which may be configured to operate on a fixed quantity of resources, generally at startup, is regarded as ‘Moldable’. Moldable tasks may accept an arbitrary number of processors or particular allocations conforming to particular constraints (e.g. powers or multiples of two or four). A ‘Malleable’ task is a task which may have its resources dynamically reassigned during operation [77]. Malleable tasks are substantially more complicated, requiring collaboration between the operating environment and the application. When executive control is asserted by the operating environment, an application must either be notified of such a change or otherwise make regular interrogations of the system to observe it; neither is common in contemporary practice. Generally, applications are free to operate on a system and assume dominance and priority over resources unless specifically configured a priori.

We concern ourselves explicitly with moldable parallel tasks which will accept arbitrary configurations. In general, the majority of real-world tasks (task-parallel parallelism) are, or can be, of this type. Where sufficient degrees of architectural control exist, we consider that a single moldable task may be decomposed internally into a chain of dependent tasks, some of which may be moldable parallel tasks and some strictly sequential.

Chapter 7

Parallel Benchmarking

We actively avoid performance counters and other mechanisms for quantifying ‘symptoms’ of system operation, seeking to avoid invasive measurement methodologies due to difficulties arising from the unbounded complexity of the applications we may be interested in analyzing: black-box code cannot be instrumented; the complexity level of an application may be arbitrarily large; instrumentation itself is potentially unsuitable for potential end-users (“...the user is required to have statistical expertise that is not common to parallel programmers.” [73]); source code may be unavailable or inordinately complicated; and hardware instrumentation may be unavailable on certain processors. Several styles of benchmarks are considered for use depending on the application being worked with.

In order to model the parallel structure of an existing application from runtime information, sufficient data is required to fulfill the basic requirements of the numerical model. A model consisting of n parameters requires at least n + 1 data values in order to fit. For a typical moldable parallel application, operable on [1 : n] processors, no more than n data points are available by simply clocking regular execution runs, which can, at best, help to characterize the parallel part. Through concurrent operation, not just parallel operation, more information becomes available and the sequential part also becomes characterizable.

‘HYDRA’ is a self-developed benchmarking tool built into our data collection and data analysis experimental applications Ponos and Pandora. HYDRA collects several benchmark types according to application structure (see figure 7.1):

HYDRA-1 (H1) benchmarks are timings of concurrent executions as separate sequential processes. Any application can be operated in this manner.

HYDRA-2 (H2) benchmarks, similar to H1, are timings of concurrent sequential execution threads inside a single process. Special application design is required for this; threadsafe DLLs are highly appropriate.

HYDRA-3 (H3) benchmarks are timings of individual parallel executions. Any moldable parallel application can be operated in this manner. Applications which automatically set their degree of parallelism (no user control) cannot be used to provide adequate data.

To characterize an application sufficiently, H3 and either H1 or H2 data is required. For this thesis we work with H1 and H3 specifically, describing the characteristics of H2 for completeness.
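For concreteness, an H1-style driver can be sketched as follows (a hypothetical Python driver for illustration; the actual HYDRA tooling is built into Ponos and Pandora and is not reproduced here):

```python
# Hypothetical H1-style driver: launch k copies of a sequential program
# as separate processes and time the span until all complete.
import subprocess
import sys
import time

def h1_benchmark(cmd, k):
    """Time k concurrent executions of `cmd` as independent processes."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(k)]
    for p in procs:
        p.wait()
    # External timing: includes process startup and shutdown, as with H1.
    return time.perf_counter() - start

if __name__ == "__main__":
    # Toy sequential workload: sum some numbers in a child interpreter.
    workload = [sys.executable, "-c", "print(sum(range(10**5)))"]
    for k in (1, 2, 4):
        print(k, round(h1_benchmark(workload, k), 3))
```

Because the child is an arbitrary executable, any application can be benchmarked this way; an H2 driver would instead spawn threads inside one process, and an H3 driver would pass a thread count to a single moldable run.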

Figure 7.1: The structural relationship between the several types of HYDRA benchmarks are illustrated here. HYDRA-1 benchmarks are composed of separate processes (green box) each containing a single thread of execution (red box). The thread executes both the sequential work (yellow) and parallel work (blue) in a sequential manner. HYDRA-2 benchmarks are substantially similar to HYDRA-1 in that single threads perform identical work concurrently with the exception that the threads are contained in a single shared-memory process. HYDRA-3 is the most unique of the three, but also the most conventional with regard to the concept of a parallel application. Like HYDRA-2, all threads are contained in a single process but parallel work is divided across a group of threads. This is the canonical parallel work model described by Amdahl’s Law.

Figure 7.2: In order to model with different types of benchmarks, the relationships between them must be known. HYDRA 3 benchmarks are expected to be timings of the operation of a parallel kernel inside an application. Explicitly sequential portions may or may not be included. HYDRA 2 benchmarks consist of the time to completely operate an application thread including startup and shutdown. Timings are also produced internally. HYDRA 1 benchmarks are the total execution time of separate system processes. Timings are performed by an external application and include process startup and shutdown.

7.1 Relationship between HYDRA benchmark types

7.1.1 Concurrent Processes (independent memory address spaces)

HYDRA-1 benchmarks are measured externally by a driving application and therefore represent the total span of the execution including process startup and shutdown (figure 7.2). Relative to the H3 benchmark, an additional sequential portion may need to be added to represent the extended head and tail together. H1 benchmarking lacks intra-process exclusive resource contention, but contention may exist at the OS level and is expected at the hardware level through resource sharing. Data is collected for each of [1 : nmax] concurrent processes.

7.1.2 Concurrent Threads (common memory address space)

HYDRA-2 benchmarks are operated as HYDRA-1 except that threads in a single shared-memory host process are used rather than independent processes (figure 7.2). Within a single process more resource conflicts arise, including mutexes in the application and within dependent libraries. These resource restrictions may lead not only to delays of the parallel parts, but to extended runtimes of the sequential parts too. Contention in the parallel part may be in addition to that of H3 parallel contention (figure 7.3). Worst-case, all concurrent threads are convoyed completely and run entirely sequentially (potentially with further overheads). More complicated models are required to account for this kind of behavior. For these benchmarks to be possible, the application must be built for multi-threaded calls and those calls must be fully threadsafe. This is a higher standard than necessary for either H1 or H3 and is relatively unusual. H2 sequential application threads are operated concurrently for [1 : nmax] threads at a time.

7.1.3 Individual application task parallel computation

HYDRA-3 benchmarks may be subject to some degree of exclusive resource contention in the parallel portion; sequential operation is uncontested and free of this by definition (figure 7.3). H3 benchmarks may represent the operation of the overall application (with the value reported by an external controller) or a meaningful subset of it such as a computational kernel (reported by the application itself). H3 benchmarks are able to ascertain information about the performance qualities of the parallel part(s) of an application. No such information can be extracted about the sequential part; H1 and H2 fill this information in.

Incremental variations or hybrids between each benchmark style are possible, especially between H1/H3 or H2/H3, such as concurrent multi-threaded executions. Two simultaneous instances of 4-threaded applications and four simultaneous instances of 2-threaded applications are just a couple of such variations. It is not expected that these would yield any additional information beyond that of the existing types.

Figure 7.3: Contention for software resources inside the same memory address space in a parallel application may only exist where there are simultaneous demands placed on that resource. For HYDRA-3, meaningful contention can only exist during parallel operation of the parallel portion of the application. For HYDRA-2, parallelism is not expressed directly and arises through concurrency instead. Contention may occur in the sequential portion of the application, in the parallel portion (same mechanism as H3), and also in the parallel portion using alternate mechanisms.

7.2 Benchmarking Protocols

It’s well known that parallel applications rarely scale perfectly on their own. This is attributable to both qualities of the algorithm and the operating environment (OS scheduling and system architecture). This is captured with H3 benchmarks. H1 benchmarks are free of algorithmic influences, and deficient scaling is entirely an environmental matter. H2 benchmarks are similar to H1, but may manifest other software artifacts as described above.

HYDRA collects a statistically significant number of executions per configuration. For H3, applications are operated exclusively on n = [1 : nmax] threads for no fewer than 10 executions for every n. Configurations are run in random order to avoid any memorization or cache advantage. Outliers are incrementally rejected until quotas are full. Each benchmark generates a single run time for a particular number of processors, and runtime generally decreases with increasing processor count. See figure 7.4.

For H1 and H2, each application implementation is either sequential or parallel, supporting up to nmax instances. Run times are observed only to deteriorate (lengthen) with increasing concurrency, and run-to-run variation increases as well (see figure 7.5). We aim to work amid, but not to capture, this increasing variation and deviation at this time. Benchmarks are run on sequential applications (processes), one application at a time, concurrently for each of [1 : nmax] processors with no fewer than 10n trials for each. For H1 these instances are separate processes in separate memory spaces, while for H2 each instance is a separate thread in a common memory space.
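The per-configuration statistics can be sketched as follows (illustrative only; the exact rejection rule used by HYDRA is not specified in this text, so a simple sigma-based filter is assumed):

```python
import statistics

def robust_mean(times, k=1.5):
    """Mean runtime after incrementally rejecting outliers beyond
    k standard deviations (an assumed, illustrative rejection rule)."""
    samples = list(times)
    while len(samples) > 2:
        mu = statistics.mean(samples)
        sd = statistics.pstdev(samples)
        keep = [t for t in samples if sd == 0 or abs(t - mu) <= k * sd]
        if len(keep) == len(samples):
            break  # converged: no more outliers
        samples = keep
    return statistics.mean(samples)

# Five trials of one configuration; the 5.70 s run (a transient event)
# is rejected before averaging.
print(robust_mean([1.02, 0.99, 1.01, 1.00, 5.70]))
```

This is the per-configuration reduction; in the full protocol it is applied independently for every thread/process count in [1 : nmax].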

The relationships between H1 and H3 benchmark results can be seen in figures 7.6 and 7.7.

Figure 7.4: H3 benchmarks are performed at least 10 times for operation on each of 1 to nmax software threads. Variation is observed and the mean value of each is used following statistical rejection of outliers. Runtime generally decreases with increasing processor count up to a certain point which is application and system specific. Here, regression appears to begin at 7 threads on a 12-thread machine and behavior becomes less predictable following that.

Figure 7.5: Both H1 and H2 benchmarks are executed on 1 to nmax concurrent instances either in a single process (H2, shared memory space) or multiple processes (H1, separate memory space). With increasing concurrency there exists increasing contention for hardware resources such as memory bandwidth. Run times are seen to drift longer and also exhibit greater variation with increasing concurrency. H1 is pictured, H2 results would appear quite similar.

Figure 7.6: Both H1 and H3 plotted together reveal their relationships to each other. Starting at nearly the same origin with n=1, increasing thread count on H3 follows the lower curve while for H1 the upper curve is developed. If H2 results were available they should fall somewhere between the two curves, but likely near to the H1 curve for most circumstances.

Figure 7.7: Here, the preceding H1 and H3 data are group-wise averaged (mean) and presented as curve-connected data. H1 data is normalized according to the number of processes. The difference between curves at one processor is (or should be) indicative of the characteristic difference between benchmark types. H3 data may exhibit some logical contention between parallel threads manifesting as suboptimal scaling. H1 data, on the other hand, is essentially contention free. Should logical contention exist in H3, the curves would deviate from each other progressively more as process/thread count increases. Because we are able to assert that no logical contention exists in H1 (not a software problem) and we know that no contention exists in the H3 data, we can conclude the sub-optimal parallelism is strictly a hardware performance matter.

Chapter 8

Modular Performance Model

In this section we develop our comprehensive parallel performance model. Our model is horizontally decomposed with terms corresponding separately to the application, its runtime environment, and the layers of the hardware architecture [73]. We examine our experimental applications on multiple multi-core parallel machines [61], some of which exhibit structural variation in CPU memory architectures. Some information about the specific machines being operated is necessary. Information about algorithm and operating system behavior is inferred by the models (with one exception). Where called out, $H(x)$ is the Heaviside step function:

$$H(x) = \begin{cases} 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$

The parameter n is used circumstantially and somewhat interchangeably as a result of notational convention. n software threads may be specified (with variation) for an application to operate with. Each software thread will at all times be able to run on at least one core or hardware thread. At no time will n exceed nmax (oversubscription), and at no time will more software threads be affinity-constrained to a core than the number of hardware threads it physically contains.

8.1 Hardware Parameters

There are some requisite system architectural parameters and also some basic performance qualities needed to support the model. For hardware structure we collect:

Number of cores, nc, and hardware threads per core, TPC. TPC > 1 implies the sharing of L1 cache; TPC = 1 or 2 here and for all known architectures at this time. The number of ‘virtual’ cores (as some people tend to think of them), those subject to significant contention, is nv = nc(TPC − 1). The total number of logical cores or hardware threads available is nmax = nc + nv = nc · TPC.

L1, L2, L3 cache presence, whether they are shared and to what degree, and their individual (not collective) physical size at level x: ALx. L0 is the notation used for main memory, allowing additional cache levels to be extended beyond L3 without misunderstanding. L4 caches, present on newer processors just starting to become available, are not explicitly considered here, but further extension of the concepts is not difficult and should come quite naturally.

Collected but unused are cache line size and NUMA node counts. NUMA architectures with more than one node are beyond the scope of this work. Cache line size is invariant on these systems so no parameterization on that quantity is required or possible.
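The thread-count bookkeeping above reduces to a few lines (an illustrative sketch; the 6-core, 2-way-SMT example values correspond to an i7-3930K-class part):

```python
def hw_threads(nc, tpc):
    """Derive virtual-core and logical-core counts from the physical
    core count nc and hardware threads per core tpc."""
    nv = nc * (tpc - 1)   # 'virtual' cores subject to heavy contention
    nmax = nc * tpc       # total logical cores / hardware threads
    return nv, nmax

print(hw_threads(6, 2))  # (6, 12)
```

Note that nmax = nc + nv = nc · TPC holds by construction, and TPC = 1 (non-SMT Core2 parts) gives nv = 0.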

For hardware performance we collect approximate average memory bandwidth per level in the memory hierarchy by accessing a large volume of data, 50MB, from a block sized to fit the cache level. At each level, Lx, testing is performed for combinations of read (I) and write (O) with sequential (S) and random (R) access. Simon and McGalliard performed benchmarking of the memory hierarchy and also operated the same benchmarks in a concurrent manner to demonstrate contention [93]. Correspondingly, for each of our four combinations we measure the bandwidth for [1 : nmax] affinity-locked threads operating into both a common block (CB) and independent blocks (IB) in shared memory: $V_{\{I|O\},\{S|R\},\{CB|IB\},Lx,n}$. CB measurements are used for H3-based modeling while IB are appropriate to H1 and H2.

The choices between S and R and CB and IB are free variables in the modular model while I and O are chosen according to the usage.

8.2 Algorithm Bandwidth

Algorithmic input and output memory demands are modeled by memory read velocity $V_{I,n}$ and write velocity $V_{O,n}$, with units [bytes/second]. For each we consider the possibility of linear and random access according to algorithmic and data structure properties. $V_{I,n}$ and $V_{O,n}$ derive directly from the system benchmarks $V_{I,\{S|R\},\{CB|IB\},L0,n}$ and $V_{O,\{S|R\},\{CB|IB\},L0,n}$. The selection of S or R and CB or IB is a parametric matter for model fitting.

Per-thread bandwidth is some interpolation between $V_{I,n}$ and $V_{O,n}$. $\theta$ is the split in bandwidth between I/O, $\theta \in (0, 1)$. Therefore:

$$M_{T,n} = V_{I,n}\,\theta + V_{O,n}\,(1 - \theta)$$

θ is a free variable of the modular model.
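As a numerical sketch of the interpolation (the velocity values here are hypothetical, not measured data):

```python
def per_thread_bandwidth(v_in, v_out, theta):
    """M_{T,n} = V_{I,n}*theta + V_{O,n}*(1-theta): per-thread bandwidth
    as an interpolation between read and write velocities."""
    assert 0.0 < theta < 1.0  # theta is the I/O split, theta in (0, 1)
    return v_in * theta + v_out * (1.0 - theta)

# Hypothetical velocities in bytes/second for some thread count n;
# theta = 0.75 describes a read-dominated kernel.
print(per_thread_bandwidth(12e9, 8e9, 0.75))  # 11000000000.0
```

In fitting, the same θ is held as a free variable while the underlying velocities are swapped among the S/R and CB/IB benchmark variants.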

8.3 Software Parts

8.3.1 Amdahl’s Law

Amdahl’s Law [4], $T_{p,n} = T_s\left(s + \frac{p}{n}\right)$, serves as an excellent starting point for the composition of our model as we are interested in a fixed-size rather than a fixed-time [53] scaling model. $T_{p,n}$ represents the execution time of a parallel application on n processors. $T_s$ is the time to operate sequentially on one resource, $T_s \equiv T_{p,1}$. Rosas, Gimenez, and Labarta also develop a model based on Amdahl’s Law [88]. Anything more sophisticated would require variability in problem size, which then demands analysis of the internal algorithm complexity, concurrent and otherwise, which is extracurricular to our intent. p is a free variable in the model (s is constrained as s = (1 − p)).
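As a minimal sketch of this fixed-size model (the Ts and p values are hypothetical, chosen only for illustration):

```python
def amdahl_runtime(Ts, p, n):
    """T_{p,n} = Ts * (s + p/n) with s = 1 - p (fixed-size scaling)."""
    s = 1.0 - p
    return Ts * (s + p / n)

# A 75%-parallel application with a 10-second sequential baseline:
print(amdahl_runtime(10.0, 0.75, 1))  # 10.0
print(amdahl_runtime(10.0, 0.75, 8))  # 3.4375
```

Fitting inverts this: given measured (n, runtime) pairs from H3 benchmarks, p (and Ts) are recovered by regression.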

8.3.2 Modularity

Rather than a monolithic and complicated function, our model is modular, with each additional term introduced to the model as a perturbation of the most basic form as appropriate for simplicity of composition. The net outcome is a modular system which exists as a combinatoric field of possibilities from all the various perturbations. Through modularity we seek to avoid inherent overfitting of source data leading to deficiencies in cross-prediction. These are considered in bulk later.

The extent of Amdahl’s Law involves parameterizing the degree to which an application is parallel and sequential. “...determining the amount of time a parallel computation spends in sequential execution can be nontrivial.” [71] This observation was made in 2000, quite some time before complicated multi-core architectures came into being, which only make the problem more difficult.

63 Working with H3 benchmarks, which represent canonical parallel applications, the parallel and sequential portions are parameterized:

T_p,n = T_s (s + p/n), s + p = 1, p ∈ [0, 1] → T_s ((1 − p) + p/n), p ∈ [0, 1]

p is intentionally a property of the application and should, in principle, be invariant of its circumstances of operation.

8.3.3 Task Parallelism

Accommodating that the “average parallelism” across parallel tasks may be less than fully parallel [97], we can break the application work into blocks (figure 8.1). With N_p individual tasks which may be operated in a parallel manner (generally they are queued using some mechanism and operated on opportunistically), the number of fully parallel blocks which can engage all n threads is B_Fp = ⌊N_p/n⌋, and the number of remaining leftover (partially parallel) tasks τ_Pp is the same as the number of threads able to operate on them, n_Pp: τ_Pp = n_Pp = mod(N_p, n) = N_p − n·B_Fp. The division of parallel work between the fully (p_Fp) and partially (p_Pp) parallel portions is p_Fp = n·B_Fp/N_p and p_Pp = 1 − p_Fp. Amdahl's Law can be extended:

T_p,n = T_s (s + p·p_Fp/n + p·p_Pp/n_Pp) = T_s (s + p (p_Fp/n + p_Pp/n_Pp))

Although N_p could be treated as a free variable, because we have this knowledge about our applications, and it is not unreasonable to know this kind of information for many different kinds of applications, we treat it as a constant with N_p = 100.
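A sketch of the task-parallel extension under the fixed task count N_p = 100 (names are illustrative):

```python
def task_parallel_time(Ts, s, Np=100, n=12):
    """Amdahl's Law extended for discrete tasks: Np tasks split into
    fully parallel blocks of n plus a partially parallel remainder."""
    p = 1.0 - s
    B_Fp = Np // n               # fully parallel blocks engaging all n threads
    n_Pp = Np - n * B_Fp         # leftover tasks = threads serving them
    p_Fp = (n * B_Fp) / Np       # share of parallel work run fully parallel
    p_Pp = 1.0 - p_Fp
    partial = p * p_Pp / n_Pp if n_Pp else 0.0
    return Ts * (s + p * p_Fp / n + partial)
```

For N_p = 100 and n = 12, B_Fp = 8 and n_Pp = 4, so 96% of the parallel work runs on all 12 threads and the remainder on 4.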

Figure 8.1: When a group of discrete tasks is distributed among a number of threads n, there often results some number of leftover tasks n_Pp which cannot be operated on using all n threads. Here, 65 tasks are divided among n = 12 threads, resulting in B_Fp = 5 blocks of 12 tasks each for fully parallel operation and n_Pp = 5 leftover tasks for partially parallel operation.

8.4 Hardware Parts

8.4.1 Main Memory Bandwidth

As Gustafson suggests, 15 years ahead of the complex caches of multi-core chips, variable access speeds in the memory hierarchy (at the time meaning CPU cache, main memory (L0), and disk rather than L1, L2, ..., L0) lead to different performance outcomes for (for him) different problem sizes [54]. Assuming sufficient main memory to avoid interaction with disk storage mechanisms (generally true and expected for high-performance applications), multi-core processors have no fewer than three layers of memory for general problem sizes; as we model high-performance software, we assume we will not be operating off of disk. Simon and McGalliard take an interest in the performance of different levels of multi-core cache and note that the effective latency and speed of main memory is unchanged with multiple cores, but the bandwidth is shared proportionately [93].

V_MAX,Lx is the fastest of all per-thread memory bandwidth measurements for independent blocks in random/sequential access: V_MAX,Lx = max_{n=1..nmax} (n · max(V_I,Lx,n, V_O,Lx,n)).

“Off-chip bandwidth is now recognized as a first order bottleneck with multi-core chips...” [40]. Memory bandwidth is known to be a limiting performance factor on modern processors. We consider an amount of bandwidth demand on a per-thread basis (parameter of software) and an amount of main memory bandwidth available on each system (parameter of hardware). The saturating bandwidth factor is min(1, V_MAX,L0/(M_T,n·n)), where M_T,n is defined in section 8.2. Amdahl's Law can be extended:

T_p,n = T_s (s + p/n_eff), n_eff = min(1, V_MAX,L0/(M_T,n·n)) · n
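A sketch of the saturating-bandwidth extension (illustrative names and numbers; M_T,n is the θ-interpolated per-thread demand from section 8.2):

```python
def neff_bandwidth(n, V_I_n, V_O_n, theta, V_max_L0):
    """Effective processor count once aggregate per-thread demand
    saturates main memory (L0) bandwidth."""
    M_T_n = V_I_n * theta + V_O_n * (1.0 - theta)    # per-thread demand
    saturation = min(1.0, V_max_L0 / (M_T_n * n))    # saturating factor
    return saturation * n

# 8 threads each demanding 3 GB/s against a 12 GB/s memory system:
# aggregate demand of 24 GB/s exceeds 12 GB/s, so n_eff = 4.0
n_eff = neff_bandwidth(8, 4.0, 2.0, 0.5, 12.0)
```

Below saturation the factor clamps to 1 and n_eff = n, recovering plain Amdahl behavior.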

8.4.2 Sequential Boost

Processors may have various technologies [17] for boosting single core speeds for sequential workloads which may become active for single computational threads, but then become dormant for multi-thread workloads.

Here, all matters being relative, we penalize the parallel portion rather than boost the sequential. seq_n represents the relative penalty for any section operating with n > 1 threads. This applies, in principle, to both the parallel and sequential portions, with effect contingent on the number of processors operating in that portion. This effect will become better understood later. We have T_p,n = T_s (s + p/n_eff), where:

n_eff = n (1 − H(n − 1) + H(n − 1)·seq_n), seq_n ∈ (0, 1)

seqn is a free parameter of the modular model.

8.4.3 “Virtual” Core Efficiency

8.4.3.1 Generic Latency

Hyper-threads, SMT, and hardware threads, all being secondary hardware threads in a core, or 'virtual' cores (TPC > 1), where supplementary logical cores or hardware threads are incorporated into each core, are known to be less efficient due to design factors and resource contention, so we expand T_p,n = T_s (s + p/n_eff), where n_eff = n_c + e·n_v. The virtual core efficiency e ∈ [0, 1] relates the application and platform together here. e is a free parameter in the modular model.

8.4.3.2 Effective Latency

“Previous work has shown that many parallel applications can be performance limited by available memory bandwidth; any additional threads spend their time waiting for memory accesses rather than computing.” [52]. To lift the ambiguity of meaning for e, we create an expression e = η^(−B) with a new software parameter B ∈ [0, 1] and a hardware-derived parameter η = (M_T,n·n)/V_O,n. V_O,n is defined in section 8.2. Virtual cores generally become active as other cores stall due to latency of requests to main memory. As greater demand is placed on the memory system, latency will rise. The total bandwidth demand of all threads is considered relative to the main memory, L0. As latency rises, efficiency falls. Where η > 1, e < 1 (parallel efficiency reduced). Where η < 1, e > 1 (parallel efficiency enhanced).

B is a free parameter of the model.
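A sketch of the effective-latency form of virtual core efficiency (values illustrative):

```python
def virtual_core_efficiency(M_T_n, n, V_O_n, B):
    """e = eta**(-B): eta compares total bandwidth demand M_T,n * n
    against the benchmarked output bandwidth V_O,n."""
    eta = (M_T_n * n) / V_O_n
    return eta ** (-B)

e_reduced = virtual_core_efficiency(2.0, 4, 4.0, 1.0)   # eta = 2.0 > 1, e = 0.5
e_enhanced = virtual_core_efficiency(1.0, 2, 4.0, 1.0)  # eta = 0.5 < 1, e = 2.0
```

B = 0 ignores latency entirely (e = 1); B = 1 applies the full inverse of the demand ratio.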

8.4.4 Lx Space Contention

For each level x where Lx exists, as the total thread count n* sharing that particular cache resource (not the entire level) increases, competition for that Lx space increases. As competition increases, performance decreases for threads operating on the related hardware threads.

Population is a mixture of both input φ_I,n (data and instructions) and output (data) φ_O,n (see figure 8.2). V_O,Lx,n and V_I,Lx,n are defined in section 8.2. The proportion of cache space dedicated to read (φ_I,n) and write (φ_O,n) operation follows:

φ_I,n* = (V_I,Lx,n*·θ) / (V_I,Lx,n*·θ + V_O,Lx,n*·(1 − θ)), φ_O,n* = 1 − φ_I,n*

Figure 8.2: Population of the cache hierarchy will exist as a blend of input data, output data, and also executable code. Cache population can essentially be divided by some proportion θ between input and output (left). For multiple threads of computation (right), some degree of sharing may occur between threads. Here we express sharing as a proportion φ of the input portion θ: θφ. Sharing may be conceptually more general.

Cache coherency generally requires that information present in a higher level (L1) also exist in all lower levels (L2, L3) down to main memory (L0). This assumption may not be true of all architectures.

The necessary space in cache for a thread to operate remains constant, but the cache size available per thread essentially decreases with an increased thread count. Caches are expected to be fully utilized by all active dependent processors as a feature of memory controller prefetch mechanisms. Regardless of the degree of contention, in the absence of information sharing, population of a level Lx cache with a size in bytes A_Lx is divided into input A_I,Lx,n* and output A_O,Lx,n*:

AI,Lx,n∗ = ALxφI,n∗ ,AO,Lx,n∗ = ALxφO,n∗

These values are normalized per thread:

φ′_I,n* = φ_I,n*/n*, φ′_O,n* = φ_O,n*/n*

A′_I,Lx,n* = A_Lx·φ′_I,n*, A′_O,Lx,n* = A_Lx·φ′_O,n*

Where n_x is the number of threads sharing a particular cache, the performance penalty factor must tend towards zero as n_x increases, and start at 1 where n_x = 1. It can then be expressed ρ_x = β^(n_x − 1), β ∈ (0, 1).

T_p,n = T_s (s + p/n_eff), where for cache levels which are fully shared amongst all software threads, n_eff = ρ_x·n is readily expressible with n_x = n. Otherwise the particular distribution of threads must be known. This we consider later.

β and θ are free parameters of the modular model.
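A sketch of the contention penalty, written in the decaying form the surrounding text requires (ρ_x = 1 at n_x = 1, tending toward zero as n_x grows); the β value here is illustrative:

```python
def contention_penalty(n_x, beta):
    """Lx space-contention penalty rho_x for n_x threads sharing a cache,
    beta in (0, 1): 1 for a lone thread, decaying as sharing increases."""
    return beta ** (n_x - 1)

# fully shared cache: n_x = n, so n_eff = rho_x * n
n = 8
n_eff = contention_penalty(n, 0.9) * n
```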

8.4.5 Lx Space Sharing

The ratio of input to output is expected to be constant; however, if tasks being performed in parallel are similar, they very likely share instructions and may possibly share input data. Considering the streaming nature of applications we model, space required for output is expected to be an exclusive stream. ω ∈ [0, 1] is the portion of thread input which is in exclusive use by that thread. MDOS_Lx is the maximum degree of sharing of a cache according to the hardware structure. n* ∈ [0, MDOS_Lx] is the degree of sharing of the particular cache level and varies throughout the hierarchy and from core to core. The input part of the cache bandwidth is then θ_I,n* = ((n*·ω + (1 − ω))·θ) / ((n*·ω + (1 − ω))·θ + n*·(1 − θ)). The output part is then the remainder θ_O,n* = 1 − θ_I,n*. (1 − ω) represents the common thread-shared parts of the data. No actual sharing can occur where only one thread is present, but it must still be accounted for in the event of thread migration.

Space is partitioned φ_I,n* = (V_I,n*·θ_I,n*) / (V_I,n*·θ_I,n* + V_O,n*·θ_O,n*), φ_O,n* = 1 − φ_I,n*.

Per thread values then follow:

Input: θ′_I,n* = (ω·θ) / ((n*·ω + (1 − ω))·θ + n*·(1 − θ)), φ′_I,n* = (V_I,n*·θ′_I,n*) / (V_I,n*·θ_I,n* + V_O,n*·θ_O,n*)

Shared: θ′_S,n* = ((1 − ω)·θ) / ((n*·ω + (1 − ω))·θ + n*·(1 − θ)), φ′_S,n* = (V_I,n*·θ′_S,n*) / (V_I,n*·θ_I,n* + V_O,n*·θ_O,n*)

Output: θ′_O,n* = θ_O,n*/n*, φ′_O,n* = φ_O,n*/n*

As n_x increases, bandwidth and the population of the cache increase disproportionately towards output. The corresponding proportion of space remains:

AI,Lx,n∗ = ALxφI,n∗ ,AO,Lx,n∗ = ALxφO,n∗

Per thread:

A′_I,Lx,n* = A_Lx·φ′_I,n*, A′_S,Lx,n* = A_Lx·φ′_S,n*, A′_O,Lx,n* = A_Lx·φ′_O,n*

The penalty is reduced as an effect of the shared space:

ρ_x = β^((n_x − 1)(1 − φ′_S,n_x)), β ∈ (0, 1)

T_p,n = T_s (s + p/n_eff), where for cache levels which are fully shared amongst all software threads, n_eff = ρ_x·n is readily expressible with n_x = n. Otherwise the particular distribution of threads must be known. This we consider later.

ω, θ, and β are free parameters of the modular model.
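A sketch combining the sharing terms into the softened penalty (all inputs illustrative; the decaying exponent follows the penalty behavior the text describes):

```python
def sharing_penalty(n_star, omega, theta, V_I, V_O, beta):
    """Contention penalty softened by the shared input fraction phi'_S.
    omega: exclusive share of thread input; theta: input/output split."""
    denom = (n_star * omega + (1 - omega)) * theta + n_star * (1 - theta)
    theta_I = ((n_star * omega + (1 - omega)) * theta) / denom
    theta_O = 1.0 - theta_I
    theta_S = ((1 - omega) * theta) / denom          # thread-shared input part
    phi_S = (V_I * theta_S) / (V_I * theta_I + V_O * theta_O)
    return beta ** ((n_star - 1) * (1.0 - phi_S))
```

With ω = 1 (no sharing) this collapses to the plain contention penalty β^(n* − 1); with a single thread it is 1 regardless of the other parameters.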

8.5 Contentious Parts

Due to the shared memory address spaces present in H2 and H3 benchmarks and the independent memory spaces of H1 (independent address spaces on shared-memory machines), software mechanisms controlling contention will only be present and active in the cases of H2 and H3. Several types of mutexes are modeled here.

HYDRA\Mutex   S     P     T           S     P     T
H1            No    No    No      →   No    No    No
H2            Yes+  No    Yes+*   →   Yes   No    No
H3            No    Yes*  No      →   No    Yes   No

Table 8.1: Applicability of individual mutex types according to each HYDRA type. (+, ∗) indicates potential ambiguity between pairs which may be unresolvable in computation (left) and their mappings to practical implementation (right).

HYDRA\Mutex   S   P   T         \Section   Sequential   Parallel,Full   Parallel,Partial
H1            n   n   n                    n            n               n
H2            1   n   1                    n            n               n
H3            1   1   n,nPp                1            n               nPp

Table 8.2: The number of threads in operation per mutex type (left) and computational section (right) according to each HYDRA type. n_Pp is established in section 8.3.3.

We consider the P type of mutex, which is the classic expected mutex occurring in canonical parallel code, represented by the H3 benchmark. The S type occurs from concurrent execution of sequential code in a shared memory space and relates to H2 benchmarks. The T mutex type would correspond to mechanisms which may be active in sequentially operated parallel parts of the application in shared memory space. Attempting to model the T type may be ambiguous with respect to either P or S types, so although we consider it conceptually, we exclude it from the model (table 8.1).

Hardware-based sequentializing events may also affect H1 and H2. Spurious drive I/O or network variation from machine to machine may complicate prediction (not measured) and is expected NOT to occur in H3 benchmarks, which more deliberately measure kernel time.

Through concurrent operation, the sequential portions of the application will interfere with each other in the memory hierarchy in much the same way the parallel parts in the H3 benchmark do. Further, the sequentially operated parallel part in H1 will be free of mutex locking that may present itself in H3, but further locks may present themselves in H2. The sequential portion can be similarly analyzed and the model further complicated.

8.5.1 H3 Parallel Mutex, Simple

Parallel threads may conflict when accessing mutually exclusive (mutex) and other critical sections of the parallel parts and sequentialize their operation. p_m,p ∈ [0, 1] represents the portion of the parallel section (or the relative size) of the mutex which is removed from parallel computation and added to the sequential. p_m,p is intentionally a property of the application, but is non-deterministic here. p_m,p, as a scalar, is ambiguous as a part of s, so although simple we cannot use this expression except as a basis for a more complicated model.

T_p = T_s (s + p·p_m,p + (p − p·p_m,p)/n)

The factor ‘p · ...’ in both cases can be modified for the task-parallel form of the expression by scaling it according to the full and partial parallel parts, appropriately: ‘p·p_Fp · ...’ and ‘p·p_Pp · ...’.

8.5.2 H3 Parallel Mutex, Parameterized

Mutex conflict is a function of the total thread count, n. The mutex section has a certain size relative to the parallel part p, p′_m,p, and is active only when n > 1 software threads are active. The duration of the runtime is extended through sequentialization: p_m,p = p′_m,p · e^(p·p′_m,p·k_p·n). The likelihood of sequentialization is a function of n, the size p·p′_m,p, and the relative proportion p′_m,p. k_p ∈ [0, 1] is a normalizing term relating to the frequency of call. Sequentializing effects of the mutex must vanish for situations where n = 1. The size is deducted from the parallel part and added outside of it as a penalty factor.

T_p = T_s (s + p·p_m,p·H(n − 1) + (p − p·p′_m,p·H(n − 1))/n_eff)

p′_m,p and k_p are both free parameters of the modular model.
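A sketch of the parameterized H3 mutex model (names illustrative; H is the Heaviside step, zero at n = 1 so the mutex effect vanishes for a single thread):

```python
import math

def h3_mutex_time(Ts, s, p, pm_prime, k_p, n, n_eff):
    """Extended Amdahl form: the inflated penalty p_m,p is added as
    sequential time while the raw size p'_m,p leaves the parallel part."""
    H = 1.0 if n > 1 else 0.0
    pm = pm_prime * math.exp(p * pm_prime * k_p * n)   # p_m,p
    return Ts * (s + p * pm * H + (p - p * pm_prime * H) / n_eff)
```

At n = 1 this reduces to T_s(s + p) = T_s; as n grows, the exponential inflates the sequentialized share.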

8.5.3 H2 Sequential Mutex, Parameterized

In the case of H2 benchmarks, mutex conflicts may occur between the sequential parts of independent application threads. The phrasing and form of section 8.5.2 apply exactly to the s part, with the corresponding terms denoted s_m,s and s′_m,s.

s′_m,s and k_s are both free parameters of the modular model corresponding to the prior section.

8.5.4 H2 Thread Mutex, Parameterized

Under H2, parallel portions (operated sequentially) may also interfere with each other by accessing global resources unrelated to the parallel operation mutex. With representation identical to section 8.5.2, this is ambiguous with regard to H3 critical sections, unresolvable, and not included in the model.

8.5.5 H1,H2 Model Extension

All parameters are likewise extended to H1 and H2 results. Parameters pertaining to sequential-portion mutex parts are restricted to H2 (shared memory, threads, section 8.5.3) versus H1 (private memory, processes) and do not appear for H3. The terms s_m,s and s′_m,s originate from section 8.5.3 while p_m,p and p′_m,p originate from section 8.5.2.

n is specified per sequential section and parallel section: (1, n) for H3 and (n, n) for H1, H2, according to table 8.2. The new Amdahl-derived equation becomes:

T_p = T_s (s·s_m,s·H(n_seq − 1) + (s − s·s′_m,s·H(n_seq − 1))/n_eff,sequential) + T_s (p·p_m,p·H(n_par − 1) + (p − p·p′_m,p·H(n_par − 1))/n_eff,parallel)

H1 and H2 results are reported as normalized runtimes rather than elapsed time due to the bulk but sequential nature of their operation, as can be observed in figures 7.6 and 7.7. T_s,H* represents the averaged sequential runtimes observed. Where H* corresponds to H2 or H1, the average value is then normalized by n: T_p,H*/n. T_p,H* represents T_p for whichever HYDRA type H* is being worked with.

To compensate for the measurement boundaries between HYDRA types (figure 7.2), the total expression must be extended by the delta relative to a fixed reference, which we choose to be the H3 type: (T_s,H* − T_s,H3)/n_seq

Accommodating the variation in benchmarking between H3 and H1,H2:

T_p,H* = (T_s,H* − T_s,H3)/n_seq + T_s,H3 (s·s_m,s·H(n_seq − 1) + (s − s·s′_m,s·H(n_seq − 1))/n_eff,sequential) + T_s,H3 (p·p_m,p·H(n_par − 1) + (p − p·p′_m,p·H(n_par − 1))/n_eff,parallel)

Operating system scheduler terms are common to both parallel and sequential portions. Sequential boost also becomes a common parameter. The parallel part is strictly a property of H3.

8.6 Operating System Parts

Performance is a function of contention for use of the machine resources. An optimal distribution of threads on the processor will minimize contention; a suboptimal distribution will emphasize contention through populating shared resources. Between the best and the worst-case situations there exists a range of intermediate assignments. The aggregate can be expressed probabilistically. The act of migrating threads from processor to processor only worsens any particular configuration.

The concept of best, worst, or probabilistic arrangement of software threads in the hardware is a free enumeration of the modular model.

8.6.1 Thread Placements

Here, we generate three models for placement schedules on a single process: best, worst, and probabilistic. For illustration purposes, an H3 application with 5 threads on 4 cores with TPC = 2 is demonstrated (figure 8.3).

We use bitmasks to show software threads being populated into hardware threads in a processor. ncp is the number of cores (nc) populated with at least one software thread and nvp is the number of cores with two software threads (nv) populated. nvariant is the number of software threads for which we are uncertain of their location.

For the best distribution using n = 5 software threads, the minimal number of software threads share cores: 10, 10, 10, 11 (reference figure 8.3 top)

n_cp = min(n_c, n), n_vp = n − n_cp, n_variant = 0

Figure 8.3: Two essentially orthogonal notations are used for describing the distribution of threads throughout a particular machine. They are more or less convenient for conceptualizing and computing variations in performance. Mapping back and forth between both notations is possible. If (despite symmetry) cores are thought of as 'primary' and 'secondary' or 'complete' and 'virtual', the n_c and n_v notation may make more sense. n_c indicates the sum of unshared cores and also shared pairs while n_v indicates the number of shared pairs. Alternately, we can consider the number of fully populated pairs P, partially populated pairs S (singles), and unpopulated pairs Z (zeros).

For worst-case, the maximal number of software threads operate on shared cores: 11, 11, 10, 00 (reference figure 8.3 bottom)

n_cp = ⌊n/TPC⌋ + mod(n, TPC), n_vp = n − n_cp, n_variant = 0

In the case of this particular ideal distribution, there exist 18 performance-equivalent configurations, 'isomorphs'. For the worst-case distribution, there exist 16 equivalent configurations. For an application with unconstrained process and thread affinity, the OS may freely migrate amongst any of these possible configurations, both isomorphic and heteromorphic. The weighted average performance for all possible configurations can be considered with a model, first by figuring the limits and the variability: combining the two prior distributions to ascertain the minimum and maximum values for n_cp and n_vp as well as the number of software threads contributing to variation (n_variant):

n_cp,max = min(n_c, n), n_cp,min = ⌊n/TPC⌋ + mod(n, TPC)

n_vp,max = n − n_cp,min, n_vp,min = n − n_cp,max

n_variant = n_cp,max − n_cp,min = n_vp,max − n_vp,min
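A sketch computing these placement limits (names illustrative):

```python
def placement_limits(n, n_c, tpc):
    """Best/worst-case populated-core counts for n software threads
    on n_c cores with tpc hardware threads per core."""
    ncp_max = min(n_c, n)             # best case: spread across cores
    ncp_min = n // tpc + n % tpc      # worst case: pack pairs first
    nvp_max = n - ncp_min
    nvp_min = n - ncp_max
    n_variant = ncp_max - ncp_min
    return ncp_min, ncp_max, nvp_min, nvp_max, n_variant

# the example in the text: 5 threads on a 4-core, TPC = 2 processor
print(placement_limits(5, 4, 2))   # → (3, 4, 1, 2, 1)
```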

The OS scheduler may dynamically load balance/migrate threads and will react according to the applied system load, which may be aggravated by transient system events. Full subscription is a machine where n = n_max; we do not consider situations beyond full subscription due to the expected negative performance implications there for highly computational code.

8.6.2 Probabilities and Structure of Migrations

When process and thread affinity is not configured there is opportunity to migrate software threads between processor cores. For a migration to occur one must first have a thread and second have a location to move the thread to. Working with a perfect and fully-subscribed system where n = n_max, there should exist no opportunity for migration at all. Realistically, the number of threads on a typical system far exceeds n_max and the real circumstance is substantially imperfect. Transient events may occur requiring non-application threads to receive priority. As these threads rise and fade, application threads become suppressed and then resume, but may also move around as a result.

Based on prior examples, the OS thread migration mechanism is assumed to be a random discrete process that knows little to nothing about the performance implications of the processor architecture and absolutely nothing about the application software. It can then be conveniently represented with a Markov Chain. We require that only one thread can be migrated from an occupied core to an unoccupied core at a time and that two or more threads may not have their placements simultaneously exchanged.

For a 4-core architecture where TPC = 2 (a typical i7) and the maximum number of threads is n_max = 8, threads may be populated per core in pairs P, singles S, or cores may remain empty Z. For n threads, there exist n_variant threads which may contribute to either P or S populations. The number of heteromorphic placement configurations for each value of n is simply #configurations = n_variant + 1.

The opportunities to migrate from one configuration to another can be approximated as #migrate ≈ n(n_max − n). For TPC > 1, some movements are missed by this expression, but it serves as a lower bound. Errors for this expression can be as high as 66% (see figure 8.4).

Based on prior notation, P = n_vp, S = n_cp − n_vp, Z = n_max/TPC − n_cp.

Figure 8.4: The number of opportunities for threads to migrate between processors on a system is trivially proportionate to both the number of threads n and the number of vacant processors n_max − n (baseline). Only migrations to and from a core exist (there are no migrations within a core). Where multiple threads may exist in a single core, several types of performance affecting migrations may occur. From simply having a thread count it is often indeterminate what configuration the threads actually consist in and multiple variants exist (here, 0,1,2) depending on the thread count. The migration opportunities for each variant are equivalent to the baseline expression over a restricted domain. The sum of migrations for all variants describes the migration profile for the particular architecture.

By moving one thread at a time, four particular types of movements are possible with three different outcomes (figure 8.6). By expansion, we mean the separating of paired threads into two individual threads. By collapse we mean the combination of two separate threads into a single pair. By stay we mean the movements of threads which result in another isomorphic configuration (figure 8.5). The opportunities for movement from each configuration are:

#expand = (2P ) (2Z)

#stay = (2P ) S + (2Z) S + S

#collapse = S (S − 1)

To propose generalization of these expressions, with TPC = 2, constants of value 2 would be exchanged for TPC, but doing so yields an incomplete model. For TPC = 1, P values are

Figure 8.5: Isomorphic configurations are those which demonstrate equivalent performance on the machine architecture. Isomorphism considers the degree that the hardware is populated (combinatoric) but not the arrangement of software work within that population (it is not permutative). With regard to the differences between software work threads, the model assumes uniformity and is otherwise agnostic.

Figure 8.6: Thread migrations can only occur by a thread moving from a core where it is active to a core where it is not active. Migrations come in several varieties, some of which result in another isomorphic configuration while others result in heteromorphic ones.

Figure 8.7: Transitions between configurations or states are not numerically obvious. Any one state may transition to one or more heteromorphic states or none at all. Similarly, transition to an isomorph may or may not be possible. The probability of making any particular transition (including isomorphs) is the ratio of possible means to make the particular migration to the total number of possible transitions from a particular state (including isomorphs).

inappropriate as are #collapse and self-migrations in the #stay term. For TPC ≥ 3 (beyond current architectures), additional terms and interactions are required for triples and higher.

#migrate = #expand + #stay + #collapse ≈ n (nmax − n)

The probabilities for migration from any state to an adjacent one are the ratio of the specific opportunity to the total opportunities (figure 8.7):

p_expand = #expand / #migrate

p_stay = #stay / #migrate

p_collapse = #collapse / #migrate

The exception here exists when n = n_max: #migrate = 0. We force the probabilities to conform:

pexpand = 0, pstay = 1, pcollapse = 0

The probabilities can be filled into a square state transition matrix P of size #configurations representing the regular Markov chain. The limiting probability vector, the vector of probabilities w describing the average time spent in each state on a sufficiently long timeline, can be derived from the equation wP^m = w for an integer m ≥ 1. Alternately, P^m for some m ≫ 1 may be evaluated and converge at some precision level. Each row of P^m will be identical and equivalent to w. Each component of w corresponds to a particular configuration c and is the likelihood p_n,c that that configuration will exist.

From w and the configurations represented, the weighted average performance can be calculated. ñ_eff is the vector of processor effectivities corresponding to each of the #configurations and also to w. n_eff is the scalar product of the two vectors: n_eff = w · ñ_eff.

For these calculations, n_cp = n_cp,min + [0, #configurations) and correspondingly n_vp = n_vp,min + (#configurations, 0].
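The chain and its limiting vector can be sketched directly; this toy example takes n = 5 on a 4-core, TPC = 2 machine, where the two configurations are (P, S, Z) = (2, 1, 1) and (1, 3, 0), and all names are illustrative:

```python
def counts(P, S, Z):
    """Migration opportunity counts for one configuration (TPC = 2)."""
    expand = (2 * P) * (2 * Z)
    stay = (2 * P) * S + (2 * Z) * S + S
    collapse = S * (S - 1)
    return expand, stay, collapse

def transition_matrix(configs):
    """Row-stochastic matrix; configs ordered by increasing populated
    cores, so expansion steps forward and collapse steps back."""
    m = len(configs)
    T = [[0.0] * m for _ in range(m)]
    for i, (P, S, Z) in enumerate(configs):
        e, st, c = counts(P, S, Z)
        tot = e + st + c
        if tot == 0:            # n = nmax: no migration possible
            T[i][i] = 1.0
            continue
        T[i][i] = st / tot
        if i + 1 < m:
            T[i][i + 1] = e / tot
        if i > 0:
            T[i][i - 1] = c / tot
    return T

def stationary(T, iters=200):
    """Limiting probabilities w via repeated application of w P = w."""
    w = [1.0] + [0.0] * (len(T) - 1)
    for _ in range(iters):
        w = [sum(w[i] * T[i][j] for i in range(len(T))) for j in range(len(T))]
    return w

w = stationary(transition_matrix([(2, 1, 1), (1, 3, 0)]))
print(w)   # roughly [0.4286, 0.5714], i.e. (3/7, 4/7)
```

Both rows total n(n_max − n) = 15 migration opportunities, matching the approximation in the text.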

8.6.3 The Cost of Migrations

While the performance of isomorphs is identical, the costs of migrations vary. The costs to expand or collapse will be uniform, as only one type of alteration is present per movement for TPC = 2. Three different movements are possible for #stay: S exchanges position within its pair, S moves into either position of an empty pair, or one thread from a P moves into another S, leaving a new S and generating a new P. To determine the average costs of migration, the costs of each type of migration between and within isomorphs are needed as well. The number of each movement type is available from the terms of each enumeration: #expand, #stay, and #collapse.

The number of isomorphs is equal to the number of #stay movements plus the base state: #iso = 1 + #stay. The number of each type of movement is given by the individual terms of #stay: #stay,ps = (2P)S, #stay,sz = (2Z)S, and #stay,s = S. Their relative probabilities for each stay movement are then p_n,c,stay,ps = #n,c,stay,ps/#n,c,stay, p_n,c,stay,sz = #n,c,stay,sz/#n,c,stay, and p_n,c,stay,s = #n,c,stay,s/#n,c,stay.

The cost to migrate a thread involves two primary operations: eviction of dirty data (output) from the origin down the hierarchy to the nearest common level (least-common ancestor, LCA) and population of the cache layer back up to the destination processor.

The corresponding times for each migration (expected to be handled sequentially with no overlap) are then t_n,c,stay,ps, t_n,c,stay,sz, and t_n,c,stay,s, resulting in the average migration time for an isomorphic configuration, t̄_n,c,stay:

t̄_n,c,stay = p_n,c,stay,ps·t_n,c,stay,ps + p_n,c,stay,sz·t_n,c,stay,sz + p_n,c,stay,s·t_n,c,stay,s

Overall, the average migration time across heteromorphic and isomorphic configurations, t̄_n,c, is:

t̄_n,c = p_n,c,stay·t̄_n,c,stay + p_n,c,collapse·t̄_n,c,collapse + p_n,c,expand·t̄_n,c,expand

Migrations may occur at some frequency f. The time fraction per second spent (wasted) in migration, t̂_migrate,n, is then t̂_migrate,n = f · Σ_{c=0}^{n_variant} p_n,c·t̄_n,c. p_n,c is established in section 8.6.2. Where n = n_max, the possible migrations are zero and t̂_migrate,n = 0, necessarily.

The total time for an extended computation, T_n,total, is then: T_n,total = T_n + T_n·t̂_migrate,n = (1 + t̂_migrate,n)·T_n.

Relating to Amdahl’s Law which is the basis for our formulation,

T_p,n = T_s (s·s_m,s + (s − s·s_m,s)/n_eff,sequential + p·p_m,p + (p − p·p_m,p)/n_eff,parallel)

Both T_p and T_s are real experimental values and require normalization:

T_p,n/(1 + t̂_migrate,n) = (T_s/(1 + t̂_migrate,1))·(...)

Rearrangement yields:

T_p,n = (T_s/(1 + t̂_migrate,1))·(1 + t̂_migrate,n)·(...)

Because n varies within an execution, expansion and distribution of (1 + t̂_migrate,n) into the terms of (...), parameterized according to the number of processors active on each term, yields:

T_p = (T_s/(1 + t̂_migrate,1))·(s·s_m,s + p·p_m,p)·(1 + t̂_migrate,1) + ...
(T_s/(1 + t̂_migrate,1))·(1 + t̂_migrate,n_sequential)·(s − s·s_m,s)/n_eff,sequential + ...
(T_s/(1 + t̂_migrate,1))·(1 + t̂_migrate,n_parallel)·(p − p·p_m,p)/n_eff,parallel

f is a free variable in the modular model.

8.6.3.1 Evict only Dirty Output Data

Because benchmark I/O speeds to different caches reflect the I/O time in the hierarchy up to that level and not just that level itself, some compensation is required. The increase in area between cache levels must also be considered. For every cache level Lx, write its output portion to the next level cache Lx+1. For every level past L1, remove the time to write to that cache level and add the time accumulated to write the prior level Lx−1.

evictToL(xx) = Σ_{x=1}^{xx} [ θ̄_O,n·A_O,Lx/V_O,Lx+1 + (evictToL(x − 1) − θ̄_O,n·A_O,Lx/V_O,Lx)·H(x − 1) ]

8.6.3.2 Populate only Clean Input Data

Population follows exactly with eviction.

populateFromL(xx) = Σ_{x=1}^{xx} [ θ̄_I,n·A_I,Lx/V_I,Lx+1 + (populateFromL(x − 1) − θ̄_I,n·A_I,Lx/V_I,Lx)·H(x − 1) ]

8.6.3.3 Average Migration Cost

The eviction level Lx is the least-common ancestor (LCA) for both cores involved in the migration. Lx is symmetric on the flush and read sides. The LCA for all possible movements is, however, not necessarily equal. This can be observed in the i7 architecture (figure 3.1) and also the Core 2 Quad (figure 3.2). The difference in costs is generally distinguished by movements inside the core (no or negligible cost) or to adjacent cores. For more advanced architectures like NUMA, this could even involve additional costs between processors.

With the extensive quantity of isomorphs that may be possible, particular configurations cannot be accepted into the model. Instead, the average migration cost amongst all possible destinations is required:

migrate = evictToLx + populateFromLx

8.7 Performance Model Implementation

The described performance model is not particularly straightforward to implement and, in fact, is written to describe a variety of models ranging from elementary (Amdahl's Law) to complicated. Each variation on the model is composed of different parts (a part being an expressive component) while each part activates particular properties (a property being an optimizable variable: a discrete enumeration, integer, or floating-point value) (reference Figure 12.1). Different parts depend on others: some parts are required in combination with others while some are mutually exclusive (reference Figure 12.2).

Two different notations have been established for conceptualizing software thread placement within the processor architecture: P, S, Z and nc, nv, ns. Unfortunately, these notations are essentially orthogonal and, although their relationships are somewhat elementary, the complication of an actual implementation is non-trivial when facing some of the more intricate details proposed here. Consequently, we have found that decomposing each part of the model further into a set of penalty factors applied to each active core in the system is the best approach to resolve the matter (an inactive processor is penalized with a factor of 0). Hence, we generally state that the effective processor count, n_eff, is the sum over all processors (proc) of the product of their associated performance penalty factors (penalty_{proc,m}):

$$n_{\text{eff}} = \sum_{proc \in n_{\max}} \prod_{m \in \text{parts}} \text{penalty}_{proc,m}$$
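The sum-of-products reduction above can be sketched directly. A minimal illustration follows; the particular penalty values and the two-part structure are hypothetical, chosen only to show the mechanics:

```python
import math

def effective_processor_count(penalties):
    """penalties: dict mapping each processor id to its list of per-part
    penalty factors (an inactive processor carries a factor of 0)."""
    # n_eff = sum over processors of the product of that processor's factors.
    return sum(math.prod(parts) for parts in penalties.values())

# Hypothetical 4-core machine: two fully active cores, one core under a
# 0.8 contention penalty, and one inactive core (factor 0).
penalties = {
    0: [1.0, 1.0],
    1: [1.0, 1.0],
    2: [1.0, 0.8],
    3: [0.0, 1.0],
}
n_eff = effective_processor_count(penalties)  # 1 + 1 + 0.8 + 0 = 2.8
```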

Chapter 9

Experimental Applications

In order to test and validate our performance model, we require software with which we can perform computational experiments. We require the ability to collect runtime results corresponding to the H1, H2, and H3 benchmark types, which also demands that the application allow external control of its parallelism.

An application is a piece of software designed to perform a particular task. A variant of that application is an implementation with some feature(s) (variant features) perturbed such that it is just slightly different from its closest relatives. Each variant performs essentially equivalent work for equivalent (not necessarily identical) results. We work with two particular applications and a large family of variant applications for each one.

Generally, these applications are capable of ‘embarrassingly parallel’ implementations and could be implemented with data-parallel rather than task-parallel logic on a SIMD architecture. Here, each computational task is entirely independent of the others in principle but in practice, disregarding any sequentializing properties of the hardware, this may be more or less realizable due to conventionally sequential services such as memory management and I/O. Some implementation exercises are explored, internally, to look at different degrees of contention as possible trade-offs.

Each application contains a single parallel computational kernel which is composed of several discrete parallel processing blocks with intervening synchronization points. For these experiments the applications are hard-coded to 100 tasks each (100 represents a fixed problem size which is sufficiently large to be useful in real usage). The number of tasks is carried over into the actual models as the only ‘knowable’ internal quantity, supporting the models as a constant Np = 100 (section 8.3.3); all other parameters are arrived at through curve fitting.

“Applications that use weak scaling increase the total working set size proportionately as the number of processors increases, while the working set for each processor remains constant” [9]. Fienup works specifically with weak-scaling applications in his constant memory per-processor model (CMP) [45].

Weak scaling implies that essentially independent tasks are solved in parallel and the memory for them is not shared. In contrast, both our experimental applications are implemented with strong scaling, with some superficial exceptions. A common data structure is operated on by all computational threads, and each generates a portion of the results for the whole problem being processed. Owing to the geometry of the problems being worked with, input data may be shared between processing tasks, with variation in locality and operation.

9.1 3D Finite-Difference Numerical Integration (FDI)

Cahn-Hilliard Equation, Spinodal Decomposition

Numerical Computational Kernel

Our first experiment is an implementation of the Cahn-Hilliard equation for the purposes of modeling spinodal decomposition as described by Sun et al. [98]. Various minor formulaic variations exist across a variety of publications, and some include details of the numerical values fed into the equation and enough evidence in their results so as to be reasonably reproducible [66] [86] [84].

The Cahn-Hilliard equation is a general 4th-order partial differential equation (PDE) which we specifically apply in 3D space for the purpose of studying mechanical structures arising from diffusion processes which may be modeled by spinodal decomposition:

$$\frac{\partial u}{\partial t} = \nabla^2 \left[ \frac{df(u)}{du} - \theta^2 \nabla^2 u \right]$$

Because it is very general, versions may be imagined in other higher- and lower-dimensional spaces. Rashed demonstrates its usefulness for image processing in 2D for sharpening and cleanup purposes [85].

We implement the equation as a classical finite-difference scheme [2] over a cubic voxel space fixed at 100 cells on a side for experimental purposes and integrated over time. Finite-difference integration is a fundamental and useful technique; its performance opportunities on several different machine architectures have been specifically studied elsewhere [11].
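A single explicit finite-difference time step of this form can be sketched as follows. This is a minimal illustration, not the dissertation's scheme [2]: it assumes periodic boundaries, the common double-well free energy f(u) = u⁴/4 − u²/2 (so df/du = u³ − u), and an explicit Euler update with arbitrary step sizes:

```python
import numpy as np

def laplacian(u, h=1.0):
    # 7-point finite-difference Laplacian on a periodic 3D grid.
    return (sum(np.roll(u, s, axis=a) for a in range(3) for s in (-1, 1))
            - 6.0 * u) / h**2

def cahn_hilliard_step(u, dt=1e-4, theta=1.0, h=1.0):
    # Explicit Euler update of du/dt = lap(df/du - theta^2 * lap(u)).
    # The double-well f(u), dt, theta, and h here are illustrative choices.
    mu = u**3 - u - theta**2 * laplacian(u, h)
    return u + dt * laplacian(mu, h)

# 100-cell cubic voxel space seeded with small random fluctuations.
rng = np.random.default_rng(0)
u = 0.01 * rng.standard_normal((100, 100, 100))
u = cahn_hilliard_step(u)
```

Note that with periodic boundaries the discrete Laplacian sums to zero, so the update conserves total concentration, a useful sanity check on any implementation.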

9.1.1 Application Characteristics

The FDI implementation is a numerical kernel where all work performed is through floating-point mathematical operations. Because the method is purely mathematical, the operations performed in the kernel are invariant with regard to the data being operated on. FDI is therefore deterministic: even when the input data is different, the operation of the application is not only consistent, it is identical [1]. FDI is also regular: it could be implemented in a data-parallel manner (SIMD) but we implement it using a task model (SPMD); there are no uncertainties about it [69]. 1,224 implemented variants of this application were generated for experimentation.

9.2 3D Iso-parametric Surface Extraction

Surface Reconstruction (SRA)

Logical Kernel

Given a 3D rectangular volume, considered to be a continuum, approximated with X · Y · Z regularly spaced voxel data points describing some attribute U of the volume, the task is to extract an iso-parametric surface through that space on U and express it as a triangle mesh structure.

Just as 2D digital images (composed of pixels) may arise from a broad variety of sources and have a broad variety of uses, 3D images (composed of voxels) exist and may be generated through physical systems like Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scanners, or computational systems involving PDEs and utilizing schemes such as, but not limited to, finite-difference integration (FDI). The 3D images we refer to here are specifically spatially regular on a structured grid. This contrasts them with other forms of 3D images such as stereoscopic image pairs [62] and holograms [23], although they could be used to generate either.

The markets of interest for this kind of information are broad and varied, including medical and environmental science, materials engineering, mechanical engineering, and defense. Just as visualizing a 2D image requires the transformation of the stored data in some meaningful way, normally as colors on a display/monitor, visualization or utilization of 3D data requires some transformation of the stored information; 3D information is less often as conceptually simple as a color, so the transformations are more obscure. We are not interested in visualization as solids [68] [57]. We are also not reconstructing solid objects [16], though the ideas are closely related. For our purposes, we aim to extract polygonal surfaces representing continuous-valued surfaces throughout the 3D volume [104] [14] for the purposes of post-processing with Finite-Element Analysis (FEA).

9.2.1 Application Characteristics

SRA contains a logical kernel. While mathematics governs a portion of the time spent in the kernel, the path through the kernel varies with the data, and a substantial portion of it is composed of comparisons, conditional branches, and loop structures. Memory access and mapping between data structures is also a strong component here. SRA execution is therefore non-deterministic, i.e. data-dependent execution behavior is present. Overall the application is deterministic (the results output are well-defined) but the computational kernel makes logical decisions based on the local data. While the code executed is substantially similar regardless of the data (no disparate sub-programs are being executed) and the computational result is fully deterministic, the ordering of output data is potentially whimsical (based on the timing and scheduling of different parallel tasks) and the volume of output data for a fixed input quantity does vary [1]. SRA is also a regular application: the data dependency of the volume of output data can be sidestepped as the quantity is bounded in a deterministic manner. 78,234 implemented variants have been generated for study but only a fraction of them were actually utilized.

Chapter 10

Experimental Toolset

An array of software tools was developed, each specialized for the tasks required. Experimental tools fall into three categories: development, logistics, and analysis. Tools within a category are collaborative.

10.1 Development Tools

Development tools are responsible for the management and manipulation of source code and the generation of experimental applications. Two tools were developed in this category.

10.1.1 Prometheus: Combinatoric Build

“According to one legend Prometheus created mankind out of clay and water. When Zeus mistreated man, Prometheus stole fire from the gods, gave it to man, and taught him many useful arts and sciences.”[107]

Unlike conventional build systems, which are tailored to generating a small number of specific applications from a sourcebase, Prometheus is designed more abstractly to produce many variant applications from a sourcebase. Prometheus is a combinatoric build tool and takes in the BuildControl file to serve as its operational guide. BuildControl describes a variety of features within the software and methods for inducing/producing variations of those features in the software product. The variations of features in an application are viewed as an implementation space, a multi-dimensional integer or enumeration space. Each particular application design can be represented by a unique integer vector in this space.

Prometheus fully reads and digests the BuildControl file and builds a model of the implementation space for the application. The digested information is output into a Manifest file for consumption by other tools which need only high-level details of the structure and not the particular methods for arriving at them. Prometheus is therefore responsible for processing all the possible build states for the application, controlling the compilers, linkers, and code generators, and generating all the various combinatoric code configurations. It can control a variety of commercial C++ compilers including Microsoft Visual Studio, Intel, PGI, and some GCC variants. In principle, its design is abstract and can be bent to control any toolchain.

10.1.2 Ilithyia: Code Generation

“...Eileithyia was the goddess of childbirth; and the divine helper of women in labour has an obvious origin in the human midwife...” [105]

Our Ilithyia is a specialized generative programming tool [41] which Prometheus may invoke ahead of compilation. C preprocessor macro programming is one method for describing and permitting variations in an application. C++ template programming is another, incrementally more modern, programming concept motivated toward the flexible reuse of code. Both methods are individually versatile and distinct, with their own limitations in application which can be realized after limited experimentation or experience implementing actual code. “Though conventional macros can describe many interesting linguistic abstractions, they are not powerful enough for many other generative-programming tasks.” [65] Correspondingly, when operating in the domain of C++, macro programming is not entirely compatible with more advanced and recent language features such as template programming.

On the low end of complexity, Ilithyia sidesteps these issues by operating at the pre-C-preprocessor level, enabling designs which would otherwise be impossible due to incompatibility between the preprocessor and templates. On the high end, Ilithyia may contain any number of generative programming code processors which may, informed by abstract tags in the source code, manipulate the code by adding and removing content either explicitly or implicitly. The degree of complexity for any code processor is unbounded. Current implementations include features supporting different parallel APIs (OpenMP, Intel Cilk Plus, sequential, Intel TBB, QuickThread), voxel space data structures with boundary conditions, and 3D finite-difference integration and looping.

From the input source files, Ilithyia parses source dependencies, then acquires, replicates, and remaps the entire dependent source hierarchy on a per-build basis into the project build directory. Ilithyia is a text-based system. Consequently, no limitation on language selection is intrinsic to the system, but our implementation is specific to C/C++, so modules and functional descriptions are written relative to that. Further, because no language is specified, no language processing specifically occurs in Ilithyia, which makes it distinct from source-to-source compilers operating in similar problem domains that appear in the literature [80] [5]. Correspondingly, we generally sidestep portability issues through highly specialized targeting [100].

10.2 Logistics Tools

Logistics tools are responsible for distribution and deployment of experimental applications, operation of those applications to produce benchmark results, and the collection of results.

10.2.1 Iris: Distribution and Collection

“Iris is the personification of the rainbow and messenger of the gods. She travels with the speed of wind from one end of the world to the other...” [72][106].

Iris is a multi-mode application designed for the distribution and management of execution experiments. It operates in both client and server roles and communicates flexibly with either an FTP server, a local USB drive, or a local hard drive for transport purposes. Iris manages the synchronization of all experimental information onto the media such that executables are current and benchmark results are up to date. The Iris Client role handles the download of experiments and the upload of results. The Iris Server role uploads experiments to the server and downloads results. Iris supports self-upgrades and automatic restarts as well as managed control of the benchmarking module Ponos.

10.2.2 Ponos: Automated Benchmarking

“Ponos was the god of hard labor and toil...” [48].

Ponos is an automated benchmarking application. It is responsible for managing execution of all the applications and performing statistical review of their output (each application we worked with was responsible for writing out its own H3 kernel benchmarks). Ponos also handles concurrent benchmarking of the applications (H1). Concurrent same-memory-space multi-thread operation (H2) is not supported but could be made possible. Ponos reads system information for memory and processor structure and uses this for guiding benchmarking, including benchmarking of concurrent memory performance.

Ponos uses randomized execution on groups of applications as a method for focusing the collection of benchmarks. Groups of applications may be as small as 100 applications or as large as the whole contingent available. Results are accumulated breadth-wise and across all executables within a group until a minimum of 10 results are recorded for each 1 ≤ n ≤ nmax with outliers statistically rejected [49].

Ponos also handles H1 benchmarking, operating each application on m = [1 : n] concurrent executions for a total of 10m executions each. This is done because experiments show that runtime variation on these architectures is higher, and runtimes longer, when subject to greater concurrency. Consequently, benchmarking on larger processors is increasingly more expensive using this method.

Ponos is directly called by Iris for fully automated operation but may also operate independently.

10.3 Analysis Tools

Analysis tools are responsible for the post-analysis of benchmark results to generate outcomes.

10.3.1 Pandora: Model Fitting and Cross-Prediction

“According to the myth, Pandora opened a jar releasing all the evils of humanity leaving only Hope inside once she had closed it again. She opened the jar out of simple curiosity and not as a malicious act.” [108]

Pandora is an automated analysis system. Pandora maintains several graph structures depicting relationships between the applications being studied. BuildControl is the input which establishes the structure of the implementation space for building the graph. Execution results from Ponos, collected from all machines participating in the experiments, are also read in and fill in the performance data on the graph.

Pandora performs statistical analysis on the collective performance results. All optimization and curve fitting against performance results is hosted inside Pandora. This is where the actual performance modeling takes place.

Processing of curve fits is a burdensome responsibility of Pandora and can be handled in both breadth-first and depth-first manners. Depth-first processing generates and solves all performance model variants on a per-application basis, emphasizing completeness of model comparisons. Characteristically, the simpler models are faster to compute and optimize, so breadth-first processing yields results more quickly; it emphasizes the diversity of results for each model (having the simpler, faster models first) and greater statistical significance.

In the interest of distributed operation, curve-fitting was performed in-situ. Each machine participating in the experiments was responsible for performing all of its own curve-fits and storing the outcomes. Collection of optimization data onto portable media proved to be the optimal paradigm for non-networked machines. To generate a large body of data, we operate in both breadth- and depth-first methods, alternately, for both FDI and SRA application sets.

Cross-prediction of results between machines and models was performed upon collection with the most powerful and largest (memory) machine available.

Chapter 11

Error Analysis

11.1 Relevance

“A metric is predictive if it measures invariants that can be used to anticipate behavior on configurations that were not explicitly measured. ...a metric is relevant if what it quantifies is applicable to the intended objectives of parallel application development.” [70]

We consider that a model applied to an application is predictive if it leads to small errors, relative to the native model error, in cross-prediction. The individual variables are not of particular importance as there may exist numerical dependence in the expressions which may lead to multiple representations. Comparison between the constituent parts is, therefore, of no particular utility. A model is considered to be comprehensive or ideal if it is predictive on all applications.

11.2 Outlier Rejection

As has been previously described, variations in execution time are a real effect of the systems to be modeled and also a random effect of the experimental environment. The integrity of experimental data must be considered [10]. Necessarily, aspects of the model depend on scalar values so group average values are taken in some places. Outliers are removed before averaging [58]. Because we’re using relatively small sample sizes, typically 10 to 100 items per group, experimental data is preprocessed and outliers are rejected.

The Rule of Huge Error for Small Data Sets [49] is used for N items x_i with standard deviation s and mean x̄:

$$M = \frac{|x_i - \bar{x}|}{s}$$

If 5 < N ≤ 8 and M > 6, x_i is rejected.

If 8 < N ≤ 14 and M > 5, x_i is rejected.

If 14 < N and M > 4, x_i is rejected.

After cleanup is performed, statistics are collected on the remaining data.
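The screening procedure can be sketched as below. One caveat: the cited rule does not specify whether the suspect point contributes to x̄ and s; this sketch assumes it is left out, since with the point included a lone outlier in a 10-item group cannot mathematically reach M > 5 (the maximum possible deviation is (N−1)/√N standard deviations):

```python
import statistics

def reject_outliers(xs):
    # Rule-of-huge-error style screening: reject x_i whose deviation
    # M = |x_i - mean| / s exceeds a threshold chosen by sample size N.
    # Assumption: mean and s are computed with the suspect point left out.
    n = len(xs)
    if n <= 5:
        return list(xs)  # rule not defined for very small groups
    limit = 6 if n <= 8 else (5 if n <= 14 else 4)
    kept = []
    for i, x in enumerate(xs):
        rest = xs[:i] + xs[i + 1:]
        s = statistics.stdev(rest)
        if s == 0 or abs(x - statistics.mean(rest)) / s <= limit:
            kept.append(x)
    return kept
```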

11.3 Error Metrics and Characterization

Error between experiment and prediction can be represented several ways. Across similar performance experiments in the literature a variety of representations are used, each with a complicating or simplifying effect on the optimization process. Here, we use c_n as the calculated value and x_n as the experimental value; m is the number of data points.

11.3.1 Total Squared Error, Mean Squared Error

$$TSE = \sum_{n=1}^{m} (c_n - x_n)^2$$

$$MSE = \frac{\sum_{n=1}^{m} (c_n - x_n)^2}{m}$$

Mean Squared Error (MSE) and Total Squared Error (TSE) are appropriate for emphasizing the error of outliers. MSE and TSE facilitate direct solution on linear algebraic systems (vector-matrix expressions). Compared to TSE, MSE is normalized and can be judged independent of the number of data points so is better suited to making comparisons. The units of measure on MSE and TSE are the square of the units on the measurements made.

Clement and Quinn (also Rosas, Gimenez, and Labarta) use a linear model (parametric on system coefficients) and capitalize on the least-squares method to minimize SSE (MSE) [22] [87]. Barnes also uses a linear model and RMS error [9]. Clement and Quinn also rely on a relative error and minimize the sum of squared errors (SSE) [21].

11.3.2 Total Absolute Error, Mean Absolute Error

$$TAE = \sum_{n=1}^{m} |c_n - x_n|$$

$$MAE = \frac{\sum_{n=1}^{m} |c_n - x_n|}{m}$$

Total Absolute Error (TAE) and Mean Absolute Error (MAE) are more appropriate to situations where equal weighting should be applied to the error on all samples. The incorporation of an absolute value precludes direct solution through linear algebraic means, which is unsuitable for us anyhow. As with MSE, MAE is normalized and can be judged independent of the number of data points. The units of measure on MAE and TAE are the same as on the measurements made.

11.3.3 Mean Absolute Relative Error

Closest to MAE is Mean Absolute Relative Error (MARE):

$$MARE = \frac{1}{m} \sum_{n=1}^{m} \frac{|c_n - x_n|}{x_n}$$

Again, c_n are calculated values while x_n are the real observed values. Relative errors are non-dimensionalized with respect to the observed values and are therefore better for comparing between different datasets (such as runtimes on different machines) where magnitudes of value differ. Barnes et al. use a linear log-log model and rely on relative error because, in their experiments, “Variability increases with T [runtime]...” As with MAE, equal weighting is provided to each data point and normalization over set size is built in. Non-dimensional error values allow for comparison when sources of data are different (different computers) and also of different sizes. MARE is equivalent to Mean Absolute Percentage Deviation (MAPD) but without the percentage.

In favor of non-dimensionalization, we choose to minimize relative absolute error rather than absolute error. This affords a meaningful comparative result across all machines/applications. MARE is also known as Mean Relative Absolute Deviation (MRAD).
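The metrics above reduce to a few lines of code. A minimal sketch (the function name is illustrative):

```python
def error_metrics(calc, obs):
    """TSE/MSE/TAE/MAE/MARE for calculated values c_n against observed x_n."""
    m = len(obs)
    sq = [(c - x) ** 2 for c, x in zip(calc, obs)]    # squared errors
    ab = [abs(c - x) for c, x in zip(calc, obs)]      # absolute errors
    rel = [abs(c - x) / x for c, x in zip(calc, obs)] # relative absolute errors
    return {
        "TSE": sum(sq), "MSE": sum(sq) / m,
        "TAE": sum(ab), "MAE": sum(ab) / m,
        "MARE": sum(rel) / m,
    }
```

For calc = [2.0, 4.0] against obs = [1.0, 5.0], both squared and absolute errors are 1 each, while the relative errors are 1.0 and 0.2, illustrating how MARE discounts the same absolute miss on a larger observation.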

11.3.4 Mean Weighted Absolute Relative Error

The particular optimization we perform is not a general-purpose circumstance. Rather than fitting a single continuous function to a cloud of continuous-valued data, we simultaneously fit two (or three, if available) continuous-valued functions derived from a common model onto related but distinct experimental results. The experimental results corresponding to each function/part of the model are discrete-valued on the independent parameter N and result in a set of prediction groups.

As described in the benchmarking protocol, the size of each group is contingent first on the expected statistical variation of experimental results (more samples are taken for expectedly noisy results) and second on the actual benchmarking tool (Ponos) sometimes performing more extensive experiments when statistical anomalies are observed. In a perfectly executed situation the most represented group will have 80 data points while the least represented groups will have only 10. To compensate for this, we factor in a proportionate weighting scheme to more equally represent the smaller data groups (figure 11.1). For each group on n, its error values are weighted relative to the population of the largest group:

$$W_n = \frac{\max_n(|G_n|)}{|G_n|}$$

where W_n is the weight ultimately applied to each value in group n and |G_n| is the number of data points in each data group.
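Computing these group weights is straightforward; a brief sketch with hypothetical group sizes:

```python
def group_weights(groups):
    """W_n = max_n |G_n| / |G_n| per prediction group, so every group
    contributes equally to the total error regardless of its size."""
    largest = max(len(g) for g in groups.values())
    return {n: largest / len(g) for n, g in groups.items()}

# Hypothetical H1 groups: population grows with concurrency n.
groups = {1: [0.0] * 10, 2: [0.0] * 20, 8: [0.0] * 80}
w = group_weights(groups)  # {1: 8.0, 2: 4.0, 8: 1.0}
```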

Figure 11.1: Here, the weights for each discrete data group on the H1 and H3 data sets are suggested by the ratio of the diameters of the red circles to the yellow. On the H1 curve (upper) the population size grows proportionately from n = 1 to the maximum population typically expected at n = nmax. On the H3 curve (lower) populations are typically the same or show very little variation. Weighting normalizes the contribution of each group's error to the total error. Otherwise, the errors derived from high n values on the H1 curve would overwhelm the contributions of the entire H3 curve, leading to unintended fitting bias.

11.3.5 Prediction Methodology

Our predictive model computes the equivalent sequential time T_s ≈ T_{p,1} from T_{p,n}, n = [2 : n]: T'_{p,1}(T_{p,n}, n). MWARE is established between T'_{p,1}(T_{p,n}, n) and the mean of the sequential computation times T_{p,1}. If T'_{p,1} were computed on n = [1 : n] instead of n = [2 : n], the n = 1 results would artificially push the average error towards zero during model fitting, as all models degenerate at n = 1.

Chapter 12

Optimization

Optimization plays a key and central role in this system. Once experimental benchmark data is available, the various predictive models can be applied to each data set and the free variables (model properties) optimized to fit the data according to the structure of the actual problem (figure 12.3). Free variables of the predictive model parts (section 8) map directly to model properties (figure 12.1). Following optimization through curve-fitting these models, cross-predictions can be performed using the results.

12.1 Types of Optimization

Our first optimizer was a discrete combinatorial optimizer which simply divided each parameter range into a uniform number of divisions and then examined each of the possible combinations. While that brute-force approach is fine for optimizing low-dimensional problems of any level of complexity, higher-dimensional problems rapidly become infeasible.

Therefore, we moved to a Particle Swarm Optimization heuristic (PSO) [90] [63] for its capacity to work with high-dimensional problems. For general operation, 1000 particles (candidate solutions) were computed for 500 iterations each. Early termination was considered if no global (population-specific) updates were encountered for five consecutive iterations.

Based on the same infrastructure as the PSO, particle hill-climbers (PHC) were also implemented. Our hill-climber implementation mimics a single-parameter gradient descent method. Taking each parameter randomly in turn and cyclically, the parameter is optimized to minimize error with incrementally decreasing step size until the solution has converged to some minimum. All particles are considered and the overall minimum taken.
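A minimal PSO minimizer mirroring the setup described (many particles, fixed iteration budget, early exit after five stale iterations) can be sketched as below. The inertia/acceleration coefficients w, c1, c2 are conventional defaults, not values taken from this work:

```python
import random

def pso(objective, bounds, particles=1000, iters=500, patience=5,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    # Standard (gbest) particle swarm minimizer over box-bounded parameters.
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pcost = [objective(p) for p in pos]
    gi = min(range(particles), key=pcost.__getitem__)
    gbest, gcost = pbest[gi][:], pcost[gi]
    stale = 0
    for _ in range(iters):
        improved = False
        for i in range(particles):
            for d in range(dim):
                # Velocity: inertia + pull toward personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]),
                                bounds[d][1])
            cost = objective(pos[i])
            if cost < pcost[i]:
                pbest[i], pcost[i] = pos[i][:], cost
                if cost < gcost:
                    gbest, gcost, improved = pos[i][:], cost, True
        # Early termination: no global-best update for `patience` iterations.
        stale = 0 if improved else stale + 1
        if stale >= patience:
            break
    return gbest, gcost
```

Because the update uses only objective evaluations, it places no smoothness or differentiability requirements on the fitness function, the property that motivates PSO here.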

12.2 Optimization Strategy

A range of performance models is expressed from the various parts available. All combinations are examined and only the valid models (composed of compatible parts) are considered (figure 12.2). Model parts are valid contingent on the types of benchmark data being supplied (H1, H2, H3) to the optimization. Model parts lead to activation of model properties (optimizable parameters) (figure 12.1).

The performance models described are composed of discrete-valued (integer or enumeration) or continuous-valued (floating-point) properties (parameters).

Two strategies were considered for comprehensively evaluating the models:

1: For each model, the discrete valued properties are evaluated combinatorially resulting in a set of continuous-valued optimization problems. These problems are then submitted for optimization by the PSO and PHC systems which may be operated alternately for a collaborative result.

2: For each model, all properties are evaluated simultaneously through the PSO system. Integer and enumeration-type properties are configured for truncation and value-based mapping without the PSO actually knowing anything about their nature.

Development of the second method proved necessary when the management of data for the first proved unnecessarily convoluted, and calculation of the more advanced model parts overly time-intensive. The second method was used for our actual implementation.

Unfortunately, there are aspects of our models which are non-differentiable. Any property involving conditional value truncation or limiting, and anything involving the Heaviside step function, causes problems. Consequently, the standard PSO was ultimately implemented because it places no requirements on the form of the objective/fitness function explored.

12.3 Solution Methodology

Where Amdahl's Law and its derivations are simply phrased

$$T_{p,n} = T_S \left( s + \frac{p}{n} \right)$$

T_P represents each benchmark and n is a runtime parameter. We solve for T_S, leaving all other model parameters as variables or constants as described previously:

$$T_S = \frac{T_{p,n}}{s + \frac{p}{n}}$$

In a practical sense, with n = 1 in all cases, the true T_s is known a priori, with some distribution, as T_{p,n}. A general curve-fitting model will be able to predict T_s for all T_{p,n} where n ≠ 1 local to that machine using arbitrary fitting parameters. A predictive curve-fitting model will predict T_s for all T_{p,n} where n ≠ 1 using fitting parameters relating to system constants. Some of our simpler models are simply general curve-fitting schemes independent of system constants, but the vast majority realized are predictive models.
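The general (non-predictive) fitting scheme can be illustrated on the Amdahl form above. This sketch assumes the usual constraint p = 1 − s and substitutes a simple grid search for the PSO/hill-climb optimizers; the function names and synthetic data are illustrative only:

```python
def predicted_sequential(t_pn, n, s):
    # Amdahl-style inversion: T_s = T_{p,n} / (s + (1 - s)/n),
    # assuming serial fraction s and parallel fraction p = 1 - s.
    return t_pn / (s + (1.0 - s) / n)

def fit_serial_fraction(samples, t_seq, steps=1000):
    # Grid-search s in [0, 1] minimizing the mean absolute relative error
    # of the predicted sequential times against the measured mean T_{p,1}.
    best_s, best_err = 0.0, float("inf")
    for k in range(steps + 1):
        s = k / steps
        err = sum(abs(predicted_sequential(t, n, s) - t_seq) / t_seq
                  for t, n in samples) / len(samples)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

# Synthetic parallel runtimes generated from a known serial fraction s = 0.2:
true_s, t_seq = 0.2, 100.0
samples = [(t_seq * (true_s + (1 - true_s) / n), n) for n in (2, 3, 4)]
s_hat, err = fit_serial_fraction(samples, t_seq)  # recovers s_hat = 0.2
```

As the text notes, only n ≥ 2 samples enter the fit; the measured sequential time serves purely as the reference against which the error is evaluated.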

For every model, for each application on each machine, we optimize the free variables. Each group of predictions is judged more or less optimal based on the Mean Weighted Absolute Relative Error (MWARE). MWARE(M = machine) is the evaluated curve-fitting error for an application and a model on a particular machine. Here, f(j, H, M) generically represents each of the individual performance models (section 8.5.5) evaluated against experimental data on j processors, for a particular HYDRA type H, and a particular machine M. f(1, H3, M) represents the reference benchmark data, the mean single-processor H3 data for an application on a particular machine:

$$\mathrm{MWARE}(M = \text{machine}) = \min\left[ \frac{\sum_{H \in \mathrm{HYDRA}} \sum_{j=2}^{n_M} \frac{\left| f(j,H,M) - f(1,H3,M) \right|}{f(1,H3,M)} \, W_{j,H,M}}{\sum_{H \in \mathrm{HYDRA}} \sum_{j=2}^{n_M} W_{j,H,M}} \right]$$

As described, we optimize all real-valued/floating-point properties. The best total solution is retained.
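Pooling the weighted relative errors over benchmark types and processor counts reduces to a weighted mean. A brief sketch, with hypothetical error and weight values keyed by (H, j):

```python
def mware(rel_errors, weights):
    """Weighted mean of absolute relative errors, pooled over benchmark
    types H and processor counts j (keys shared by both dicts):
    MWARE = sum(rel_err * W) / sum(W)."""
    num = sum(rel_errors[k] * weights[k] for k in rel_errors)
    den = sum(weights[k] for k in rel_errors)
    return num / den

# Hypothetical per-group relative errors with their group weights.
errors = {("H1", 2): 0.10, ("H1", 3): 0.20, ("H3", 2): 0.05}
weights = {("H1", 2): 4.0, ("H1", 3): 2.0, ("H3", 2): 8.0}
val = mware(errors, weights)  # (0.4 + 0.4 + 0.4) / 14
```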

For the models, as described, through combinatoric evaluation and assembly of all the available model parts, we realize 3456 total model variations. Because we rely on only H1 and H3 data, only 1728 models were unique and useful candidates.

We solve (optimize) these models on [1 : m] machines and predict on up to [1 : m − 1] others as corresponding results are available.

Each part of the model corresponds to a set of model properties (optimizable parameters) and different degrees of coupling exist between the different parts (figure 12.1). Individual model parts are conceptually exclusive or compulsory (figure 12.2). The scheme for incorporation of the predictive model and experimental data is described in figure 12.3.

Figure 12.1: The discrete parts of the predictive model (left) are shown with their correspondence to the parametric properties they activate (top). ‘X’ indicates a universal mapping for all types of data while numbers indicate which HYDRA data types are valid for that part to apply.

Figure 12.2: Model parts (both axes) are shown with their mutual exclusion from other parts. ‘N’ indicates exclusion when a part is present, ‘Y’ compatibility. ‘R’ is a required part.

Figure 12.3: The relationships between different parts of the solver are illustrated here. Model parts are generated iteratively and combinatorically. These are translated into properties specific to the H1, H3, and H2 (not pictured) models. The optimizer generates model values for each property common to all models, and this is passed with experimental data to each model (magenta). All models are evaluated and used collectively to arrive at a set of predictions (blue) which together yield a single fitting error on the data. Only H1 and H3 are illustrated here for simplicity.

Chapter 13

Cross-Prediction

Cross-prediction involves operating and characterizing an application on a well-known machine and predicting its performance (runtime) on another machine, given minimal information about that machine and the application on it.

13.1 Methods and Error Measures

As previously described, for fitting the model parameters to each application on each machine (independently), all the parallel runtimes T_{p_n} with p_n ∈ [2, n] for H1, (H2), and H3 were used to minimize the error in predicting sequential operation, T_{p_n} for p_n = 1:

\[
\mathrm{MWARE}(M=\text{machine}) \;=\; \min\left[\,\frac{\displaystyle\sum_{H=\mathrm{HYDRA}}\;\sum_{j=2}^{n_M}\bigl|f(j,H,M)-f(1,H3,M)\bigr|\,W_{j,H,M}}{\displaystyle\sum_{H=\mathrm{HYDRA}}\;\sum_{j=2}^{n_M}W_{j,H,M}}\,\right]
\]

For cross-prediction, parameters for each application/model are transferred to each other machine and no further optimization is performed. Sequential operation is predicted using the local machine benchmarks and the foreign machine model parameters, and MWARE is evaluated. Relative Error (RE) is determined from the error metrics for these combinations to assess cross-machine predictability for each model.
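The transfer procedure can be summarized in a few lines. This is a hypothetical sketch: `fit` stands in for the PSO-based optimization and `evaluate_mware` for the error metric; neither name reflects the actual solver interfaces.

```python
def cross_predict(model, machines, fit, evaluate_mware):
    """Fit once per machine, then evaluate every ordered (exe, tgt) pair
    without re-optimizing on the target."""
    fitted = {exe: fit(model, exe) for exe in machines}
    return {(exe, tgt): evaluate_mware(model, fitted[exe], tgt)
            for exe in machines for tgt in machines}

# Toy stand-ins: the "parameters" are just the machine name; the error is
# zero when the parameters match the target and 0.05 otherwise.
errors = cross_predict("m", ["exe", "tgt"],
                       fit=lambda model, mach: mach,
                       evaluate_mware=lambda model, p, t: 0.0 if p == t else 0.05)
```

The diagonal entries errors[(exe, exe)] are local best fits; the off-diagonal entries are the cross-predictions examined below.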

Unfortunately, the devil is in the details. Through the use of hardware performance benchmarks and model parameters, application implementation constants are in some places inferred and are parametrically dependent on the machine. Without the parameterization, the estimated implementation constants would lack any particular grounding in reality. Algorithm bandwidth is a representative example:

\[
M_{T,n} = V_{I,n}\,\theta + V_{O,n}\,(1-\theta)
\]

Other parameters, like migration frequency, are rightly machine or environmental parameters, but we treat them arbitrarily here with no protocol for their independent evaluation. These inferences become corrupted when model parameters are blindly transferred between machines. Cross-prediction onto a target machine ‘tgt’ therefore demands carrying the original execution machine parameters ‘exe’ into the prediction.
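The bandwidth relation above reduces to a one-line helper. Interpreting θ as the input-side mixing fraction is our reading of the formula; the variable names are otherwise hypothetical.

```python
def algorithm_bandwidth(v_in, v_out, theta):
    """M_{T,n} = V_{I,n} * theta + V_{O,n} * (1 - theta)."""
    return v_in * theta + v_out * (1.0 - theta)

# Equal mixing of a 100-unit input volume and a 50-unit output volume:
m = algorithm_bandwidth(100.0, 50.0, 0.5)   # 75.0
```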

MWARE(M = machine) is necessarily extended to MWARE(E = exe, T = tgt) and f(j, H, M) to f(j, H, E, T). Therefore:

\[
\mathrm{MWARE}(E=\text{exe},\,T=\text{tgt}) \;=\; \frac{\displaystyle\sum_{H=\mathrm{HYDRA}}\;\sum_{j=2}^{n_M}\bigl|f(j,H,E,T)-f(1,H3,E,T)\bigr|\,W_{j,H,T}}{\displaystyle\sum_{H=\mathrm{HYDRA}}\;\sum_{j=2}^{n_M}W_{j,H,T}}
\]

So, for two machines ‘exe’ and ‘tgt’:

MWARE(exe, exe) = best fit ∈ ∼(1%, 10%)

MWARE(exe, tgt) = cross-prediction from exe to tgt

In a perfect world we could judge the quality of predictions by the actual MWARE value alone. Unfortunately, we cannot say whether any particular value for MWARE is good or not (smaller always being better). Quality of prediction is strongly dependent on both the local and remote fit.

The quality of the prediction can be judged on MWARE(exe, tgt) alone, which is already a direct measure of predictive quality. Attempts to relate cross-prediction error to fitting error with an expression like:

\[
\mathrm{RE}(\text{exe},\text{tgt}) = \frac{\mathrm{MWARE}(\text{exe},\text{tgt}) - \mathrm{MWARE}(\text{tgt},\text{tgt})}{\mathrm{MWARE}(\text{tgt},\text{tgt})}
\]

lead to misrepresentation. While it may be an interesting metric in its own right, the ratio allows a model with both a poor fit and a poor prediction to be mistakenly classified as a good model, which is true under no circumstances. Judgement must rest on the actual prediction error alone.
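A pair of illustrative (entirely made-up) numbers shows the failure mode directly:

```python
def relative_error(mware_cross, mware_local):
    # RE(exe, tgt) per the ratio expression above
    return (mware_cross - mware_local) / mware_local

good = relative_error(0.05, 0.02)   # tight fit, good prediction: RE = 1.5
bad = relative_error(0.60, 0.50)    # poor fit, poor prediction:  RE = 0.2
# The poor model "wins" on RE despite a 12x worse absolute prediction error,
# which is why judgement must rest on MWARE(exe, tgt) directly.
```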

Through symmetry:

MWARE(tgt, tgt) = best fit ∈ ∼(1%, 10%)

MWARE(tgt, exe) = cross-prediction from tgt to exe

The quality of the model can then be judged based on the average prediction error for MWARE (tgt, exe) and MWARE (exe, tgt) collectively over all evaluated combinations of applications and machines.
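Collecting that judgement over all machine pairs is a straightforward average. This sketch (hypothetical data layout) averages only the off-diagonal, i.e. genuinely cross-machine, entries:

```python
def mean_prediction_error(errors):
    """Average MWARE(E, T) over all ordered machine pairs with E != T."""
    cross = [e for (exe, tgt), e in errors.items() if exe != tgt]
    return sum(cross) / len(cross)

# Two machines A and B: diagonal entries are local fits, off-diagonal
# entries are cross-predictions (illustrative values).
errors = {("A", "A"): 0.02, ("A", "B"): 0.05,
          ("B", "A"): 0.07, ("B", "B"): 0.03}
score = mean_prediction_error(errors)   # (0.05 + 0.07) / 2 = 0.06
```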

13.2 Complications, Caveats, and Limitations

We know that not all mathematical models are created equal in their predictive capability, but machines and applications vary as well. Simple machines, absent substantial luck, can support only limited predictability.

For example, one machine in our survey, ‘ROMEOBLUE’, contains only two cores in its processor and no ‘virtual’ cores. It is the simplest parallel multi-core machine possible. Two data points (technically four, using H1 and H3 data) make it difficult to reach a meaningful outcome when fitting a sophisticated model to data which is at best linear in nature. Correspondingly, because such a machine lacks the more advanced processor features, predicting their behavior on other systems that implement them is entirely speculative and generally without grounding in reality. And since the memory wall becomes critical at higher processor counts, there is no particular assurance that smaller machines will exhibit its symptoms.

Simple machines have the advantage that little data is available to collect, making the process quite expedient; fitting any model on simple or sparse data is fast. The simplest machines used in this study, ROMEOBLUE and COYOTETANGO, produced the most data in the shortest time despite being the oldest (and slowest) machines. The nonlinearity in data requirements established for H1 data is a major contributing factor here. XERXES, the most complicated machine, was consequently one of the smallest producers of data.

Machines of intermediate complexity, in general anything with 4 cores or 8 hardware threads, were highly variable in their capability to generate results in a timely manner. Aside from individual complications relating to the availability of those machines for work, these machines spanned a greater range of performance and age, but all with the same requirements for benchmarks and computation. ARES was the oldest, slowest, and least productive of these machines.

It is known that performance varies within a spectrum. Any prediction reliant on minimal information, conceivably as little as a single data point, is subject to the representative quality of that particular data point.

As described, it is also expected that the models presented are not necessarily free of internal dependencies, linear or nonlinear, so the fitting operation is not deterministic. The particle swarm optimizer (PSO) itself carries no guarantee of converging to any particular optimum. This is largely why we rely on substantial statistics in order to pass judgement on predictive quality.
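To make the convergence caveat concrete, a minimal global-best PSO is sketched below. This is a textbook formulation, not the dissertation's actual optimizer; all parameter defaults are conventional choices, and nothing in the update rule guarantees convergence to a global (or even local) optimum.

```python
import random

def pso(objective, dim, bounds, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5):
    """Minimize objective over [lo, hi]^dim with a global-best swarm."""
    lo, hi = bounds
    pos = [[random.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # per-particle best
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # swarm best so far
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Repeated runs on the same objective will generally return different optima, which is precisely why judgement here rests on statistics over many fits rather than on any single result.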

Regardless of the quality of fit, the predictive result compounds all errors and uncertainty accumulated during application benchmarking, machine characterization (×2), model fitting, and then representative performance testing.

Chapter 14

Predictive Outcomes

Results from model fitting and cross-prediction are presented here. Essential statistics collected for each named quantity are: minimum, mean, median, maximum, standard deviation, and number of data points. These results pertain, of course, to the H1/H3 data sets, as no H2 results were available.

14.1 Architecture Representation

Due to the availability of both machines and execution results, representation is non-uniform. Only two machines represent the Core2 architecture while five represent i7. While machines can all be unique in their individual composition, many more different i7 processors have been produced over more generations and years than Core2.

For the aggregate, roughly ∼16000-18000 crosses per model are typically observed. Of these, ∼2700 are between Core2 machines and ∼9700 between i7 machines; ∼5300 are from Core2 to i7 and ∼5300 from i7 to Core2.

14.2 Model Decomposition

All models presented are composed from a collection of 14 different parts. Some parts are compulsory and some mutually exclusive (Figure 12.2), which simplifies analysis. Take the pattern ‘11101101101100’ as an example, with the 14 parts referred to from left to right by the indices 0-13. The first part, 0, relating to the partitioning of applications into parallel and sequential portions, is compulsory, while the tenth, 9, relating to mutexes in sequential code and relevant to H2 only, is absent from our analysis: X11011011X1100. Treated as constants, twelve parts remain.

The third and fourth and also the seventh and eighth are mutually exclusive pairs, relating to the efficiency of hardware threads and cores for parallel and sequential operation, where only one of each pair may be active: X1(10)11(01)1X1100. Ten parts remain as variables. The maximum number of parts active in our study is eleven.

The eleventh through fourteenth parts are closely related through the probabilistic modeling of the memory cache structure and are referred to as the MCS group. Five of the twenty-one model properties are common amongst them all, so they will be examined as a special group: X1(10)11(01)1X[1100].
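Pattern strings of this form (and the model names of Table 14.1, e.g. ‘11(10)11(01)10[1100]’) can be decoded mechanically. The small utility below is our own hypothetical sketch for working with them; it strips the grouping punctuation and checks the exclusion pairs.

```python
def decode(pattern):
    """Strip grouping punctuation -- '()' for exclusive pairs, '[]' for the
    MCS group -- to recover the 14 activation flags.  Positions marked 'X'
    (parts treated as constants) are returned as None."""
    flags = [None if c == "X" else int(c) for c in pattern if c in "01X"]
    assert len(flags) == 14, "expected 14 parts"
    return flags

def exclusive_ok(flags):
    """Check the mutual-exclusion pairs (2,3) and (6,7)."""
    return not (flags[2] and flags[3]) and not (flags[6] and flags[7])

bits = decode("11(10)11(01)10[1100]")
# bits[0] == 1 (compulsory parallel part) and exclusive_ok(bits) holds
```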

14.3 Curve-Fitting Experimental Data

14.3.1 Best Fit on Model Parts

For each model part, the fitting errors are aggregated both collectively and on a per-architecture basis. The data is presented on a per-application basis (FDI, SRA) and overall. This is a first-order examination to see if any singular parts lead to better and tighter fits.

On a per-application basis, both SRA A.2.3 and FDI A.2.2 summarize to similar fitting errors, so they can be considered either separately or as a whole A.2.1. Mean error rates (MWARE) are roughly double to triple on i7 architectures (5-6%) compared to Core2 (2-3%). Minimums and maximums are more extreme, however. Core2 has the lowest error rates, of course, as those architectures are both smaller and simpler. For the minimum error, Core2 typically yields 0.0035 versus 0.0125, a factor of 3.5 difference. For the maximum error, the figures are 0.12 and 2.1 respectively, a factor of 17.5 difference.

On a per-part basis, no individual part distinguishes itself, statistically, as being more or less relevant to a tighter fit than any other. Through observing the full range of models the statistics for each part are substantially similar.

14.3.2 Best Fit on Model Properties

Each model part activates one or more properties of the model (free optimizable variables) and one or more parts may simultaneously activate common properties. Here, the actual values assigned to each property are considered both aggregated and on a per-architecture basis.

On a per-property basis, as was the case with individual parts, each property takes on a full range of values. This should be expected as each part is a composition of one or more properties. Again, SRA A.3.3 can be considered separately from FDI A.3.2 or else as an aggregate A.3.1.

No clear and focused predictive conclusions can be derived from this sort of data.

14.3.3 Best Fit on Model

Each discrete model (the total assembly of all its parts) is looked at in terms of fitting the experimental data. Relative to the ‘Best Fit on Parts’, this is the highest-order evaluation of the system.

For brevity's sake, the top 25 models are presented in order of best fit performance. Results for SRA A.1.3 are considered alone and separate from FDI A.1.2 and also as an aggregate A.1.1.

There is a high degree of commonality amongst these top results on a per-application basis and so they can sufficiently be considered aggregated.

The active parts of the top 25 aggregated models are collected and represented in this histogram (figure 14.1). The top 50, 75, and 100 are also collected for comparison (figures 14.2, 14.3 , and 14.4). With m = 2 applications, where an item is observed 25m times, it is present in all models represented and is a critical component.

Among the top 25 models, the critical parts for high quality fits are:
#0: Parallel Part (compulsory)
#(2,3): Parallel Efficiency Parts, 96%
#5: Parallel Mutex Parameterized Part
#(6,7): Sequential Efficiency Parts
#8: Sequential Main Memory Bandwidth Part

The remaining parts are, overall, ambiguous, but there is significant preference towards having at least the most fundamental of the MCS parts included: #[10] theta (≥ 60%). Fitting is only an intermediate calculation and not an end-point, with no assured correlation between fitting and prediction, so we move on.
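The histograms that follow are simple tallies over the top-N model patterns. A sketch of the counting (pattern strings as in Table 14.1; the per-part names shown on the figure axes are omitted here):

```python
from collections import Counter

def part_histogram(model_patterns):
    """Count how often each part index is active across a set of models.
    A count equal to the number of models means the part is present in
    every one -- a 'critical component' in the text's terms."""
    counts = Counter()
    for pattern in model_patterns:
        bits = [int(c) for c in pattern if c in "01"]
        counts.update(i for i, b in enumerate(bits) if b)
    return counts

top = ["11(10)11(01)10[1100]", "11(01)11(10)10[1000]"]
hist = part_histogram(top)
# hist[0] == 2: the compulsory part appears in both models
```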

Figure 14.1: Model Part Representation, Top 25 Best Fit

Figure 14.2: Model Part Representation, Top 50 Best Fit

Figure 14.3: Model Part Representation, Top 75 Best Fit

Figure 14.4: Model Part Representation, Top 100 Best Fit

14.4 Cross-Prediction

14.4.1 Cross-Prediction on Model Parts

For each model part, the cross-prediction errors are aggregated both collectively and on a per-architecture basis. The data is presented on a per-application basis (FDI, SRA) and overall. This is a first-order examination to see if any singular parts lead to better and tighter predictions.

On a per-application basis, both SRA B.1.3 and FDI B.1.2 summarize to similar errors, so they can be considered either separately or as a whole B.1.1.

No apparent conclusions can be made from this view of the unfiltered data. Through the heavy usage of all parts under a wide variety of circumstances, all relevant statistics look quite similar and are indistinguishable. Filtering and decomposition of ranked models is presented in the following section.

14.4.2 Cross-Prediction on Model

Each discrete model (the total assembly of all its parts) is examined in terms of cross-predicting the experimental data on a separate machine. Relative to the ‘Best Fit on Parts’, this is the highest-order evaluation of the system.

For brevity's sake, the top 25 models are presented in order of best mean prediction performance. Results for SRA B.2.3 are considered alone and separate from FDI B.2.2 and also as an aggregate B.2.1.

There is a high degree of commonality amongst these top results on a per-application basis, so the aggregate data B.2.1 is considered both overall and on an architecture-to-architecture basis.

The simplest and most complex models are presented for reference against the best performers. Our best models show a ∼25% improvement in mean error during prediction versus the most complicated models, with a ∼81% improvement in worst-case (maximum) error. Standard deviations are roughly 80% narrower. Compared to Amdahl's Law, we realize a ∼50% reduction in mean error and a ∼91.4% reduction in maximum error.

14.4.2.1 Model Complexity

The complexity level of the top 25 models is considered from the aggregate B.2.1. The average complexity of the models is assessed both in terms of their overall composition and of the MCS group. Inter- and intra-architectural predictions and also overall performance are evaluated (figure 14.5).

The simplest model, Amdahl’s Law, is composed of a single part (and no MCS parts) while the most complex models are composed of 11 with all 4 MCS parts present. These are not highly competitive models and often show up ranked between 1500 and 2000.

Where cross-prediction is between Core2 architectures, the simplest studied, an average of 6.32 parts and 1.04 MCS parts are present. These are the simplest of the predictive groups. For prediction from Core2 to i7, only slightly more complicated models are required, with 6.40 and 1.04 parts.

Where cross-prediction is between i7 architectures, 7.28 parts and 1.28 MCS parts are required, while prediction from i7 to Core2 demands 7.16 and 1.32.

For the aggregate, the average number of parts active is 7.56, with 1.76 MCS parts. So, for good predictive quality under all circumstances, the average complexity of models is higher than for all other inter- and intra-architectural combinations. That said, the models are significantly simpler than the most complicated proposed (11 parts), with only 68% of the complexity. Noteworthy is that the MCS parts are active but not overwhelmingly so, with 44% active versus just 26% present for prediction between Core2 machines. This shouldn't be particularly surprising, as the MCS group theoretically corresponds to the more sophisticated architectures (i7).

Figure 14.5: Inter- and intra-architectural model complexity for the top predictors in each category is summarized here, along with the overall case. The simplest predictors arise when making predictions from the simpler architectures (Core2) regardless of the target; more sophisticated models lack sufficient data points to properly resolve. The most complicated predictors are required when originating from a more complicated architecture (i7-i7, i7-Core2, or overall, inclusive prediction).

14.4.2.2 Top Model Composition

Where ‘n’ inter- and intra-architecture predictions are made on m = 2 applications, the active parts of the top 25, 50, 75, and 100 aggregated models are collected and represented in histograms (see figures 14.6, 14.7, 14.8, and 14.9). Where an item is observed 25mn times (200, 400, 600, or 800 times correspondingly), it is present in all models represented and is a critical component.

As expected, we find the composition of the best models becomes somewhat less consistent with the larger sets, but results are all quite similar. Except for the compulsory first part, even among the smallest group of 25 models, only strong preferences are apparent:
#0: Parallel Part (compulsory)
#(2,3): Parallel Efficiency Parts, 82%
#4: Parallel Main Memory Bandwidth Part, 90.5%
#5: Parallel Mutex Parameterized Part, 88%
#(6,7): Sequential Efficiency Parts, 83%
#8: Sequential Main Memory Bandwidth Part, 80.5%
#[10]: OS Scheduler Sharing Theta Part, 71%

Figure 14.6: Model Part Representation, Top 25*Archs Cross Prediction

Figure 14.7: Model Part Representation, Top 50*Archs Cross Prediction

Figure 14.8: Model Part Representation, Top 75*Archs Cross Prediction

Figure 14.9: Model Part Representation, Top 100*Archs Cross Prediction

14.4.2.3 BEST Model Composition

The best overall models are extracted from Section B.2.1 into Table 14.1. The top 25 models presented in Section B.2.1, ranked by lowest mean error, are reduced to 12 items based on the least maximum errors: those with the best nominal values of ∼0.75 are retained, and all others with nominal values of > 1.0 are rejected (bad predictions remain possible, but we are minimizing the apparent propensity).
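The two-stage selection just described can be expressed directly. The illustrative rows echo Table 14.1, but the function itself is our own sketch, not the analysis code.

```python
def select_best(models, top_n=25, max_err_cutoff=1.0):
    """Rank by mean error, keep the top_n, then drop any model whose
    worst-case (maximum) error exceeds the cutoff."""
    ranked = sorted(models, key=lambda m: m[1])[:top_n]
    return [m for m in ranked if m[2] <= max_err_cutoff]

# (name, mean error, max error) -- a few rows echoing Table 14.1:
rows = [("429", 0.0358, 0.7586),
        ("[1701]", 0.2451, 126.6917),
        ("285", 0.0359, 0.7681)]
best = select_best(rows)
# best retains '429' and '285'; '[1701]' is rejected on its maximum error
```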

Derived from the aggregate best, which exhibited complexities of 7.28 parts and 1.2 MCS parts, these change in an interesting way: the total complexity rises to 7.5 parts on average while the MCS complexity is reduced to only 1.0.

The models of Table 14.1 are decomposed into Figure 14.10 to identify their composition.

Critical parts include:
#0: Parallel Part (compulsory)
#1: Sequential Boost Part
#(2,3): Parallel Efficiency Part
#5: Parallel Mutex Parameterized Part
#(6,7): Sequential Efficiency Part
#8: Sequential Main Memory Bandwidth Part

Sub-critical parts include:
#4: Parallel Main Memory Bandwidth Part, 50% represented
#[10]: OS Scheduler Sharing Theta Part, 66% represented
#[11]: OS Scheduler Sharing Omega Part, 33% represented

Of the 12 models, composition of non-MCS parts is comprehensive with the exception of #4 (Parallel Main Memory Bandwidth), which is present as often as it is absent. Due to mutual exclusivity, either #2 or #3 and either #6 or #7 are present in all cases.

With regard to MCS composition, #[10] (OS Scheduler Sharing Theta Part) is present in 2/3 of the models. #[11] (OS Scheduler Sharing Omega Part) is fully contingent on #[10] and is present in 1/2 of all possible models.

Aside from the specificity of the top models just described, we can make some more general statements. From the observed outcomes, we can reasonably arrive at several conclusions/observations regarding the complexity and content of high quality predictive models:
1- Simpler architectures (Core2), where structure is simpler, behavior more linear, and experimental data points fewer, relate poorly to more complicated models (few extra terms can improve on predicting from fundamentally linear behavior).
2- Prediction within the simpler architectures (Core2) was less demanding than within the more complex architectures (i7).
3- Prediction from the more complicated architectures (i7), regardless of the target, was more demanding than all others. Within i7 exist some of the largest variations in machine performance and structure.
4- An overall cross-predictor is more complicated than any of the internal groupings, yet altogether simpler than the most complicated models available.
5- Some degree of modeling of the memory hierarchy, and an application's interaction with it, is required and plays a critical role. Nearly all top models had at least some MCS part active.
6- The memory wall concept is reflected in that a model which considers memory bandwidth for the whole application, not just for the parallel portion, is critical.
7- Some feature that can capture differential performance due to contention from software threads cohabitating inside individual cores is important.

Table 14.1: Cross Prediction BEST Models, *denotes complete non-MCS groups

Cross Prediction comprehensive relative error

Index    Min     (Mean)  Median  Max        Stddev  Name                  Count     Mean Rank  Mean Fit
*        0.0005  0.1563  0.0196  4765.2090  7.3902  ALL DATA              39816974  *          *
Amdahl   0.0008  0.0726  0.0601  8.8358     0.1704  10(00)00(00)00[0000]  23054     863        0.0779
[1701]   0.0007  0.2451  0.0388  126.6917   1.2272  11(10)11(10)10[1111]  23028     1309       0.0400
[1703]   0.0005  0.2458  0.0432  267.2698   2.5111  11(01)11(10)10[1111]  23028     1313       0.0389
[1725]   0.0009  0.2759  0.0383  173.9906   1.9954  11(10)11(01)10[1111]  23028     1410       0.0409
[1727]   0.0005  0.2319  0.0373  63.5861    0.6793  11(01)11(01)10[1111]  23028     1189       0.0413
429*     0.0005  0.0358  0.0337  0.7586     0.0285  11(10)11(01)10[1100]  23054     0          0.0351
285*     0.0005  0.0359  0.0337  0.7681     0.0286  11(10)11(01)10[1000]  23054     1          0.0352
141*     0.0005  0.0359  0.0337  0.7586     0.0286  11(10)11(01)10[0000]  23054     5          0.0352
135      0.0005  0.0360  0.0337  0.7586     0.0285  11(10)01(01)10[0000]  23054     7          0.0353
279      0.0005  0.0361  0.0337  0.7586     0.0286  11(10)01(01)10[1000]  23054     8          0.0354
119*     0.0005  0.0361  0.0340  0.7690     0.0286  11(01)11(10)10[0000]  23054     9          0.0356
423      0.0005  0.0361  0.0338  0.7586     0.0288  11(10)01(01)10[1100]  23054     11         0.0354
263*     0.0005  0.0362  0.0341  0.7718     0.0286  11(01)11(10)10[1000]  23054     13         0.0356
407*     0.0005  0.0362  0.0340  0.7682     0.0285  11(01)11(10)10[1100]  23054     15         0.0356
113      0.0005  0.0364  0.0342  0.7718     0.0288  11(01)01(10)10[0000]  23054     20         0.0359
401      0.0005  0.0364  0.0342  0.7718     0.0286  11(01)01(10)10[1100]  23054     21         0.0359
257      0.0005  0.0364  0.0342  0.7690     0.0287  11(01)01(10)10[1000]  23054     23         0.0359
End of Table 14.1

Figure 14.10: Model Part Representation, Top 12 BEST Cross Prediction

Chapter 15

Conclusions

Conclusions from several prior sections are collected here for easy reference.

Referencing section 14.3.3 for best fit results:

The active parts of the top 25 aggregated models are collected and represented in this histogram (figure 14.1). The top 50, 75, and 100 are also collected for comparison (figures 14.2, 14.3 , and 14.4). With m = 2 applications, where an item is observed 25m times, it is present in all models represented and is a critical component.

Among the top 25 models, the critical parts for high quality fits are:
#0: Parallel Part (compulsory)
#(2,3): Parallel Efficiency Parts, 96%
#5: Parallel Mutex Parameterized Part
#(6,7): Sequential Efficiency Parts
#8: Sequential Main Memory Bandwidth Part

The remaining parts are, overall, ambiguous, but there is significant preference towards having at least the most fundamental of the MCS parts included: #[10] theta (≥ 60%). Fitting is only an intermediate calculation and not an end-point, with no assured correlation between fitting and prediction, so we move on.

Referencing section 14.4.2.1 for model complexity results:

The simplest model, Amdahl’s Law, is composed of a single part (and no MCS parts) while the most complex models are composed of 11 with all 4 MCS parts present. These are not highly competitive models and often show up ranked between 1500 and 2000.

Where cross-prediction is between Core2 architectures, the simplest studied, an average of 6.32 parts and 1.04 MCS parts are present. These are the simplest of the predictive groups. For prediction from Core2 to i7, only slightly more complicated models are required, with 6.40 and 1.04 parts.

Where cross-prediction is between i7 architectures, 7.28 parts and 1.28 MCS parts are required, while prediction from i7 to Core2 demands 7.16 and 1.32.

For the aggregate, the average number of parts active is 7.56, with 1.76 MCS parts. So, for good predictive quality under all circumstances, the average complexity of models is higher than for all other inter- and intra-architectural combinations. That said, the models are significantly simpler than the most complicated proposed (11 parts), with only 68% of the complexity. Noteworthy is that the MCS parts are active but not overwhelmingly so, with 44% active versus just 26% present for prediction between Core2 machines. This shouldn't be particularly surprising, as the MCS group theoretically corresponds to the more sophisticated architectures (i7).

Referencing sections 14.4.2.2 and 14.4.2.3 for model composition results:

As expected, we find the composition of the best models becomes somewhat less consistent with the larger sets, but results are all quite similar. Except for the compulsory first part, even among the smallest group of 25 models, only strong preferences are apparent:
#0: Parallel Part (compulsory)
#(2,3): Parallel Efficiency Parts, 82%
#4: Parallel Main Memory Bandwidth Part, 90.5%
#5: Parallel Mutex Parameterized Part, 88%
#(6,7): Sequential Efficiency Parts, 83%
#8: Sequential Main Memory Bandwidth Part, 80.5%
#[10]: OS Scheduler Sharing Theta Part, 71%

The best overall models are extracted from Section B.2.1 into Table 14.1. The top 25 models presented in Section B.2.1, ranked by lowest mean error, are reduced to 12 items based on the least maximum errors: those with the best nominal values of ∼0.75 are retained, and all others with nominal values of > 1.0 are rejected (bad predictions remain possible, but we are minimizing the apparent propensity).

Derived from the aggregate best, which exhibited complexities of 7.28 parts and 1.2 MCS parts, these change in an interesting way: the total complexity rises to 7.5 parts on average while the MCS complexity is reduced to only 1.0.

The models of Table 14.1 are decomposed into Figure 14.10 to identify their composition.

Critical parts include:
#0: Parallel Part (compulsory)
#1: Sequential Boost Part
#(2,3): Parallel Efficiency Part
#5: Parallel Mutex Parameterized Part
#(6,7): Sequential Efficiency Part
#8: Sequential Main Memory Bandwidth Part

Sub-critical parts include:

#4: Parallel Main Memory Bandwidth Part, 50% represented
#[10]: OS Scheduler Sharing Theta Part, 66% represented
#[11]: OS Scheduler Sharing Omega Part, 33% represented

Of the 12 models, coverage of the non-MCS parts is comprehensive, with the exception of #4 (Parallel Main Memory Bandwidth), which is present as often as it is absent. Due to mutual exclusivity, either #2 or #3, and either #6 or #7, are present in all cases.

Aside from the specifics of the top models just described, we can make some more general statements. From the observed outcomes, we can reasonably arrive at several conclusions regarding the complexity and content of high-quality predictive models:

1. Simpler architectures (Core2), where structure is simpler, behavior is more linear, and experimental data points are fewer, relate more poorly to the more complicated models: few extra terms can improve on predicting from fundamentally linear behavior.

2. Prediction within the simpler architectures (Core2) was less demanding than within the more complex architectures (i7).

3. Prediction from the more complicated architectures (i7), regardless of the target, was more demanding than all others; internally, the i7 group contains some of the largest variations in machine performance and structure.

4. An overall cross-predictor is more complicated than any of the internal groupings, but altogether simpler than the most complicated models available.

5. Some degree of modeling of the memory hierarchy, and of an application's interaction with it, is required and plays a critical role; nearly all top models had at least some MCS part active.

6. The memory-wall concept is reflected in that a model which considers memory bandwidth for the whole application, not just for the parallel portion, is critical.

7. Some feature that can capture differential performance due to contention among software threads cohabiting individual cores is important.

Chapter 16

Opportunities for Future Work

Many opportunities exist to extend and expand on this work, with possibilities for both increasing and reducing the overall complexity.

Alternate models of parallelism may be considered. While this work centers on Amdahl's Law, it could be re-framed to use other parallel models (for example, Gustafson's Law or a hybrid scaled-speedup model).

Parameter reduction: some parameters are seemingly redundant. The sequential and random read and write benchmarks may be reducible, without loss of generality, to the absolute minimum and maximum performance across the benchmark set rather than being considered strictly pairwise.
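For example (a sketch with made-up bandwidth numbers; the benchmark names and values are hypothetical, not measurements from this work), the four benchmark figures could collapse to two bounds:

```python
# Hypothetical measured bandwidths (GB/s) for the four benchmark variants.
bench = {"seq_read": 12.0, "seq_write": 9.0, "rand_read": 1.5, "rand_write": 1.2}

# Reduced parameterization: keep only the extremes of the benchmark set
# instead of carrying all four figures pairwise through the model.
mem_min = min(bench.values())  # worst-case memory performance
mem_max = max(bench.values())  # best-case memory performance
```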

Allow the number of parallel tasks to become variable. The number of tasks is often a knowable quantity, and here we treat it as a known constant. In the absence of any application knowledge, it could instead be treated as entirely variable.

Individual model parts may be rephrased as appropriate.

Curve fits were performed on single-machine data. Collections of machines may be considered during a single fitting in order to reduce the effects of architecture on the models.

Utilize H2: validate the portions of the model involving H2 benchmarks and related parameters.

Other architectures can be considered. Here, single-processor Intel architectures were explored. AMD processor systems, which have different memory hierarchy designs, could be examined, as could NUMA and SMP architectures and other Intel chips such as Xeon.

Experiment with other applications (similar types) and other types of applications.

Applications which are explicitly affinity-enabled and architecture-aware could be examined.

With comprehensive data already available, the burden of data collection could be reduced when examining variations.

The model can be expanded to a fully probabilistic one, operating on and outputting runtime probability distributions (Gaussian, etc.)
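As a sketch of what such a probabilistic extension might look like, benchmark inputs can be treated as Gaussians rather than constants and propagated by Monte Carlo sampling to yield a runtime distribution instead of a point estimate. The model function and parameter distributions below are hypothetical placeholders, not the fitted models from this work:

```python
import random
import statistics

def predicted_runtime(seq_time, par_time, cores):
    # Placeholder Amdahl-style model; the real model parts would go here.
    return seq_time + par_time / cores

random.seed(0)
# Treat measured inputs as Gaussian (mean, stddev) instead of constants.
samples = [
    predicted_runtime(random.gauss(2.0, 0.1), random.gauss(8.0, 0.5), 4)
    for _ in range(10000)
]
mean_rt = statistics.mean(samples)   # central runtime estimate
std_rt = statistics.pstdev(samples)  # uncertainty of the prediction
```

The output is then a distribution (mean and spread) rather than a single number, which directly addresses the runtime variability discussed earlier in the dissertation.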

From the top N models M_m(...), we can reasonably propose a composite model averaging the top individuals, M_c(...) = (1/N) * Σ_{m=1}^{N} M_m(...), to yield a more stable predictive result.
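A minimal sketch of such a composite predictor, assuming each fitted model is available as a callable mapping inputs to a predicted runtime (the three lambdas below are hypothetical stand-ins for fitted models, not the actual top models):

```python
def composite(models, *args):
    # Average the predictions of the top-N individual models.
    return sum(m(*args) for m in models) / len(models)

# Hypothetical fitted models, each mapping core count -> predicted runtime.
tops = [lambda p: 10.0 / p + 1.0,
        lambda p: 10.0 / p + 1.2,
        lambda p: 10.0 / p + 0.8]

runtime = composite(tops, 4)  # (3.5 + 3.7 + 3.3) / 3 = 3.5
```

Averaging damps the influence of any one model's worst-case prediction, which is the stability argument made above.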

Bibliography

[1] V. S. Adve. Analyzing the behavior and performance of parallel programs. PhD thesis, Citeseer, 1993.

[2] T. J. Akai and T. J. Akai. Applied numerical methods for engineers. J. Wiley, 1994.

[3] B. M. Al-Babtain, F. J. Al-Kanderi, M. F. Al-Fahad, and I. Ahmad. A survey on amdahl's law extension in multicore architectures. International Journal of New Computer Architectures and their Applications (IJNCAA), 3(3):30–46, 2013.

[4] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.

[5] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: a language and compiler for algorithmic choice, volume 44. ACM, 2009.

[6] D. A. Bader and G. Cong. Swarm: A parallel programming framework for multicore processors. In Encyclopedia of Parallel Computing, pages 1966–1971. Springer, 2011.

[7] D. H. Bailey. 12 ways to fool the masses when giving performance results on parallel computers, 1991.

[8] A. H. Baker, R. D. Falgout, T. Gamblin, T. V. Kolev, M. Schulz, and U. M. Yang. Scaling algebraic multigrid solvers: On the road to exascale. In Competence in High Performance Computing 2010, pages 215–226. Springer, 2012.

[9] B. J. Barnes. A Regression-based System for Accurate Scalability Prediction on Large- scale Machines. PhD thesis, University of Georgia, 2011.

[10] S. M. Blackburn, A. Diwan, M. Hauswirth, P. F. Sweeney, J. N. Amaral, V. Babka, W. Binder, T. Brecht, L. Bulej, L. Eeckhout, et al. Can you trust your experimental results? Technical report.

[11] S. Brown et al. Performance comparison of finite-difference modeling on cell, fpga and multi-core computers. In SEG/San Antonio Annual Meeting, 2007.

[12] J. Buisson, O. Sonmez, H. Mohamed, W. Lammers, and D. Epema. Scheduling malleable applications in multicluster systems. In Cluster Computing, 2007 IEEE International Conference on, pages 372–381. IEEE, 2007.

[13] E. A. Carmona and M. D. Rice. Modeling the serial and parallel fractions of a parallel algorithm. Journal of Parallel and Distributed Computing, 13(3):286–298, 1991.

[14] J. Carr. Surface reconstruction in 3d medical imaging. 1996.

[15] J. Casazza. Intel core i7-800 processor series and the intel core i5-700 processor series based on intel microarchitecture (nehalem). White paper, Intel Corp, 2009.

[16] J. R. Cebral and R. Löhner. From medical images to anatomically accurate finite element grids. International Journal for Numerical Methods in Engineering, 51(8):985–1008, 2001.

[17] J. Charles, P. Jassi, N. S. Ananth, A. Sadat, and A. Fedorova. Evaluation of the Intel Core i7 turbo boost feature. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 188–197. IEEE, 2009.

[18] G. Chatzopoulos, K. Kourtis, N. Koziris, and G. Goumas. Towards a compiler/runtime synergy to predict the scalability of parallel loops. In Multi-/Many-core Computing Systems (MuCoCoS), 2013 IEEE 6th International Workshop on, pages 1–10. IEEE, 2013.

[19] S. Chellappa, F. Franchetti, and M. Püschel. How to write fast numerical code: A small introduction. In Generative and Transformational Techniques in Software Engineering II, pages 196–259. Springer, 2008.

[20] J. U. Chuck Palahniuk. Fight Club. Twentieth Century Fox Film Corporation, 1999.

[21] M. J. Clement and M. J. Quinn. Dynamic performance prediction for scalable parallel computing. Technical report, Citeseer, 1994.

[22] M. J. Clement and M. J. Quinn. Automated performance prediction for scalable parallel computing. Parallel Computing, 23(10):1405–1420, 1997.

[23] W. Colburn and K. Haines. Volume hologram formation in photopolymer materials. Applied Optics, 10(7):1636–1641, 1971.

[24] A. Corporation. AMD FX-8350 Processor, 2016 (accessed January 19, 2016).

[25] I. Corporation. Intel E5335 Processor, 2016 (accessed January 19, 2016).

[26] I. Corporation. Intel E7400 Processor, 2016 (accessed January 19, 2016).

[27] I. Corporation. Intel i7-2700K Processor, 2016 (accessed January 19, 2016).

[28] I. Corporation. Intel i7-2820QM Processor, 2016 (accessed January 19, 2016).

[29] I. Corporation. Intel i7-3930K Processor, 2016 (accessed January 19, 2016).

[30] I. Corporation. Intel i7-4700MQ Processor, 2016 (accessed January 19, 2016).

[31] I. Corporation. Intel i7-4720HQ Processor, 2016 (accessed January 19, 2016).

[32] I. Corporation. Intel i7-4820K Processor, 2016 (accessed January 19, 2016).

[33] I. Corporation. Intel i7-860 Processor, 2016 (accessed January 19, 2016).

[34] I. Corporation. Intel Q8200 Processor, 2016 (accessed January 19, 2016).

[35] M. Corporation. Windows Processes and Threads, 2016 (accessed January 19, 2016).

[36] M. Corporation. Windows Threads, 2016 (accessed January 19, 2016).

[37] M. E. Crovella. Performance prediction and tuning of parallel programs. Technical report, DTIC Document, 1994.

[38] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. Logp: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP ’93, pages 1–12, New York, NY, USA, 1993. ACM.

[39] L. Dagum and R. Menon. Openmp: an industry standard for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.

[40] J. Diamond, M. Burtscher, J. D. McCalpin, B.-D. Kim, S. W. Keckler, and J. C. Browne. Evaluation and optimization of multicore performance bottlenecks in supercomputing applications. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, pages 32–43. IEEE, 2011.

[41] U. W. Eisenecker. Generative programming (gp) with c++. In Modular Programming Languages, pages 351–365. Springer, 1997.

[42] C. C. Elgot and A. Robinson. Random-access stored-program machines, an approach to programming languages. J. ACM, 11(4):365–399, Oct. 1964.

[43] D. Eppstein and Z. Galil. Parallel algorithmic techniques for combinatorial computation. In Automata, Languages and Programming, pages 304–318. Springer, 1989.

[44] N. Faria, R. Silva, and J. L. Sobral. Impact of data structure layout on performance. In Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, pages 116–120. IEEE, 2013.

[45] M. A. Fienup. Scalability study in parallel computing. 1995.

[46] M. Forum. MPI, 2016 (accessed January 12, 2016).

[47] M. Frigo and S. G. Johnson. The design and implementation of fftw3. Proceedings of the IEEE, 93(2):216–231, 2005.

[48] P. Grimal and A. Maxwell-Hyslop. The Dictionary of Classical Mythology. Blackwell reference. Wiley, 1996.

[49] M. E. Group. Outlier Handout, 2016 (accessed February 24, 2016).

[50] E. Günther, F. G. König, and N. Megow. Scheduling and packing malleable and parallel tasks with precedence constraints of bounded width. Journal of Combinatorial Optimization, 27(1):164–181, 2014.

[51] N. J. Gunther. A general theory of computational scalability based on rational functions. arXiv preprint arXiv:0808.1431, 2008.

[52] V. Gupta, H. Kim, and K. Schwan. Evaluating scalability of multi-threaded applications on a many-core platform. 2012.

[53] J. L. Gustafson. Reevaluating amdahl’s law. Communications of the ACM, 31:532–533, 1988.

[54] J. L. Gustafson. The consequences of fixed time performance measurement. In System Sciences, 1992. Proceedings of the Twenty-Fifth Hawaii International Conference on, volume 2, pages 113–124. IEEE, 1992.

[55] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.

[56] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. Computer, (7):33–38, 2008.

[57] K. H. Hohne and R. Bernstein. Shading 3d-images from ct using gray-level gradients. Medical Imaging, IEEE Transactions on, 5(1):45–47, 1986.

[58] L. Hu and I. Gorton. Performance evaluation for parallel systems: A survey. Citeseer, 1997.

[59] R. Jain. The art of computer systems performance analysis. John Wiley & Sons, 2008.

[60] B. H. Juurlink and C. Meenderinck. Amdahl's law for predicting the future of multicores considered harmful. ACM SIGARCH Computer Architecture News, 40(2):1–9, 2012.

[61] A. Kayi, T. El-Ghazawi, and G. B. Newby. Performance issues in emerging homogeneous multi-core architectures. Simulation Modelling Practice and Theory, 17(9):1485–1499, 2009.

[62] R. Koch. 3-d surface reconstruction from stereoscopic image sequences. In Computer Vision, 1995. Proceedings., Fifth International Conference on, pages 109–114. IEEE, 1995.

[63] B.-I. Koh, A. D. George, R. T. Haftka, and B. J. Fregly. Parallel asynchronous particle swarm optimization. International Journal for Numerical Methods in Engineering, 67(4):578–595, 2006.

[64] W. T. Kramer and C. Ryan. Performance variability of highly parallel architectures. Springer, 2003.

[65] S. Krishnamurthi, M. Felleisen, and B. F. Duba. From macros to reusable generative programming. In Generative and Component-Based Software Engineering, pages 105–120. Springer, 2000.

[66] J. Langer, M. Bar-On, and H. D. Miller. New computational method in the theory of spinodal decomposition. Physical Review A, 11(4):1417, 1975.

[67] C. E. Leiserson. Cilk. In Encyclopedia of Parallel Computing, pages 273–288. Springer, 2011.

[68] M. Levoy. Display of surfaces from volume data. Computer Graphics and Applications, IEEE, 8(3):29–37, 1988.

[69] H.-X. Lin, A. J. Van Gemund, and J. Meijdam. Scalability analysis and parallel execution of unstructured problems. In EUROSIM, pages 151–160. Citeseer, 1996.

[70] E. A. Luke, I. Banicescu, and J. Li. The optimal effectiveness metric for parallel application analysis. Information processing letters, 66(5):223–229, 1998.

[71] A. D. Malony. Tools for parallel computing: A performance evaluation perspective. In Handbook on Parallel and Distributed Processing, pages 342–363. Springer, 2000.

[72] J. R. March. Dictionary of classical mythology. Oxbow Books, 2014.

[73] J. W. Meira. Modeling performance of parallel programs. TR859. Computer Science Department, University of Rochester, 1995.

[74] C. L. Mendes. Performance scalability prediction on multicomputers. Citeseer, 1997.

[75] C. L. Mendes, D. Reed, et al. Integrated compilation and scalability analysis for parallel systems. In Parallel Architectures and Compilation Techniques, 1998. Proceedings. 1998 International Conference on, pages 385–392. IEEE, 1998.

[76] E. C. Mike Judge. Idiocracy. Twentieth Century Fox Film Corporation, 2006.

[77] G. Mounie, C. Rapine, and D. Trystram. Efficient approximation algorithms for scheduling malleable tasks. In Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, pages 23–32. ACM, 1999.

[78] A. Nataraj, A. Morris, A. D. Malony, M. Sottile, and P. Beckman. The ghost in the machine: observing the effects of kernel operation on parallel application performance. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, page 29. ACM, 2007.

[79] C. Nugteren and H. Corporaal. The boat hull model: adapting the roofline model to enable performance prediction for parallel computing. In ACM Sigplan Notices, volume 47, pages 291–292. ACM, 2012.

[80] J. M. Perez, R. M. Badia, and J. Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Cluster Computing, 2008 IEEE International Conference on, pages 142–151. IEEE, 2008.

[81] K. S. Perumalla. Parallel and distributed simulation: traditional techniques and recent advances. In Proceedings of the 38th conference on Winter simulation, pages 84–95. Winter Simulation Conference, 2006.

[82] C. Pheatt. Intel threading building blocks. Journal of Computing Sciences in Colleges, 23(4):298–298, 2008.

[83] G. Prinslow. Overview of performance measurement and analytical modeling techniques for multi-core processors. URL: http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore.pdf, 2011.

[84] S. Puri and H. L. Frisch. Surface-directed spinodal decomposition: modelling and numerical simulations. Journal of Physics: Condensed Matter, 9(10):2109–2133, 1997.

[85] J. Rashed. Coarsening Dynamics for the Cahn-Hilliard Equation. PhD thesis, Technion-Israel Institute of Technology, Faculty of Mathematics, 2009.

[86] T. Rogers, K. Elder, and R. C. Desai. Numerical study of the late stages of spinodal decomposition. Physical Review B, 37(16):9638, 1988.

[87] C. Rosas, J. Giménez, and J. Labarta. Scalability prediction for fundamental performance factors. Supercomputing frontiers and innovations, 1(2):4–19, 2014.

[88] C. Rosas, J. Jiménez, and J. J. Labarta Mancho. Methodology to predict scalability of parallel applications. In BSC Doctoral Symposium (2nd: 2015: Barcelona). Barcelona Supercomputing Center, 2015.

[89] J. C. Schatzman. Writing high-performance java code that runs as fast as fortran, c, or c++. In ITCom 2001: International Symposium on the Convergence of IT and Communications, pages 106–114. International Society for Optics and Photonics, 2001.

[90] J. F. Schutte, J. A. Reinbolt, B. J. Fregly, R. T. Haftka, and A. D. George. Parallel global optimization with the particle swarm algorithm. International Journal for Numerical Methods in Engineering, 61(13):2296, 2004.

[91] T. Scogland, P. Balaji, W.-c. Feng, and G. Narayanaswamy. Asymmetric interactions in symmetric multi-core systems: analysis, enhancements and evaluation. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, page 17. IEEE Press, 2008.

[92] Y. Shi. Reevaluating amdahl's law and gustafson's law. Computer Sciences Department, Temple University (MS: 38-24), 1996.

[93] T. A. Simon and J. McGalliard. Observation and analysis of the multicore performance impact on scientific applications. Concurrency and Computation: Practice and Experience, 21(17):2213–2231, 2009.

[94] I. Singh. Review on parallel and distributed computing.

[95] X.-H. Sun. Scalability versus execution time in scalable systems. Journal of Parallel and Distributed Computing, 62(2):173–192, 2002.

[96] X.-H. Sun and Y. Chen. Reevaluating amdahl's law in the multicore era. Journal of Parallel and Distributed Computing, 70(2):183–188, 2010.

[97] X.-H. Sun and L. M. Ni. Another view on parallel speedup. In Supercomputing’90., Proceedings of, pages 324–333. IEEE, 1990.

[98] X.-Y. Sun, G.-K. Xu, X. Li, X.-Q. Feng, and H. Gao. Mechanical properties and scaling laws of nanoporous gold. Journal of Applied Physics, 113(2):023505, 2013.

[99] L. Tools. Laguna Tools Catalog 2015, 2015 (accessed January 3, 2015).

[100] K. Trocki. Performance aspects of using various techniques of programming extensions of modern general-purpose processors. In Information Technology, 2008. IT 2008. 1st International Conference on, pages 1–4. IEEE, 2008.

[101] A. Vajda. Programming many-core chips. Springer Science & Business Media, 2011.

[102] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103– 111, Aug. 1990.

[103] L. G. Valiant. A bridging model for multi-core computing. In Algorithms-ESA 2008, pages 13–28. Springer, 2008.

[104] A. Wallin. Constructing isosurfaces from ct data. IEEE Computer Graphics and Applications, 11(6):28–33, 1991.

[105] Wikipedia. Illithyia, 2016 (accessed January 19, 2016).

[106] Wikipedia. Iris, 2016 (accessed January 19, 2016).

[107] Wikipedia. Prometheus, 2016 (accessed January 19, 2016).

[108] Wikpedia. Pandora, 2016 (accessed January 19, 2016).

[109] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual perfor- mance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

[110] Woodchippers. Bandit 1890d Woodchippers, 2015 (accessed January 3, 2015).

[111] J. Worlton. Toward a taxonomy of performance metrics. Parallel Computing, 17(10):1073–1092, 1991.

Appendix A

Data Fitting Results

A.1 Fitting Errors Per Model

A.1.1 Fitting Errors Per Model, All Data

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model basis (Name, Index), along with the total number of data points for the model (Count). The models are ranked (Mean Rank) according to the mean MWARE. ‘ALL DATA’ indicates the statistics for the overall data. The top 25 models are listed. The simplest model, Amdahl’s Law, is included at the top for reference, as are the four most complicated models proposed.

Table A.1: Comprehensive Model Fitting Errors (MWARE)


Index Min (Mean) Median Max Stddev Name Count Mean Rank

* 0.000541 0.028969 0.011248 8.835823 0.116190 ALL DATA 78765036 *

Amdahl 0.0008 0.0807 0.0566 8.8358 0.2143 10000000000000 6523 2237

[2277] 0.0005 0.0429 0.0389 0.7496 0.0370 11101110101111 4524 799

[2279] 0.0005 0.0427 0.0368 0.7719 0.0365 11011110101111 4524 793

[2301] 0.0005 0.0442 0.0378 0.7582 0.0408 11101101101111 4523 869

[2303] 0.0005 0.0451 0.0376 0.7924 0.0409 11011101101111 4523 933

1724 0.0005 0.0358 0.0324 1.7598 0.0403 10101101101101 4530 0

1580 0.0005 0.0358 0.0323 1.7598 0.0404 10101101100101 4530 1

1436 0.0005 0.0359 0.0323 1.7598 0.0404 10101101101001 4531 2

1292 0.0005 0.0359 0.0322 1.7598 0.0404 10101101100001 4531 3

1148 0.0005 0.0360 0.0328 0.8923 0.0327 10101101101110 4531 4

1004 0.0005 0.0360 0.0328 0.9601 0.0332 10101101100110 4532 5

1142 0.0005 0.0360 0.0328 0.8923 0.0325 10100101101110 4531 6

998 0.0005 0.0361 0.0327 0.8934 0.0327 10100101100110 4532 7

1718 0.0005 0.0361 0.0324 1.7598 0.0406 10100101101101 4530 8

1574 0.0005 0.0361 0.0325 1.7598 0.0406 10100101100101 4530 9

1430 0.0005 0.0361 0.0325 1.7598 0.0406 10100101101001 4531 10

573 0.0005 0.0361 0.0324 0.7586 0.0322 11101101101100 5063 11

1286 0.0005 0.0362 0.0324 1.7598 0.0407 10100101100001 4531 12

860 0.0005 0.0362 0.0330 0.8887 0.0332 10101101101010 4595 13

1702 0.0005 0.0363 0.0327 1.1260 0.0338 10011110101101 4530 14

572 0.0005 0.0363 0.0318 1.7598 0.0415 10101101101100 5063 15

1558 0.0005 0.0363 0.0327 1.1260 0.0338 10011110100101 4530 16

1581 0.0005 0.0364 0.0333 0.7673 0.0314 11101101100101 4530 17

567 0.0005 0.0364 0.0326 0.7586 0.0322 11100101101100 5063 18

854 0.0005 0.0365 0.0328 0.9595 0.0340 10100101101010 4595 19

1725 0.0005 0.0365 0.0333 0.7586 0.0310 11101101101101 4530 20

1293 0.0005 0.0365 0.0333 0.7655 0.0313 11101101100001 4531 21

1696 0.0005 0.0365 0.0328 1.1260 0.0338 10010110101101 4530 22

1414 0.0005 0.0365 0.0332 1.1260 0.0339 10011110101001 4531 23

1270 0.0005 0.0365 0.0329 1.1260 0.0340 10011110100001 4531 24

End of Table A.1

A.1.2 Fitting Errors Per Model, FDI

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model basis (Name, Index), along with the total number of data points for the model (Count). The models are ranked (Mean Rank) according to the mean MWARE. ‘ALL DATA’ indicates the statistics for the overall categorical data. The top 25 models are listed. The simplest model, Amdahl’s Law, is included at the top for reference, as are the four most complicated models proposed.

Table A.2: Comprehensive Model Fitting Errors (MWARE)


Index Min (Mean) Median Max Stddev Name Count Mean Rank

* 0.000541 0.059351 0.045535 8.835823 0.126056 ALL DATA 6316071 *

Amdahl 0.0008 0.1053 0.0690 8.8358 0.2901 10000000000000 3408 2229

[2277] 0.0005 0.0476 0.0440 0.6416 0.0369 11101110101111 2567 822

[2279] 0.0005 0.0470 0.0423 0.6133 0.0351 11011110101111 2567 790

[2301] 0.0005 0.0502 0.0446 0.6229 0.0434 11101101101111 2566 889

[2303] 0.0005 0.0509 0.0454 0.6291 0.0420 11011101101111 2566 932

1724 0.0005 0.0388 0.0351 0.3862 0.0314 10101101101101 2569 0

1436 0.0005 0.0389 0.0353 0.3659 0.0314 10101101101001 2570 1

1292 0.0005 0.0389 0.0352 0.3702 0.0316 10101101100001 2570 2

1580 0.0005 0.0390 0.0352 0.3896 0.0314 10101101100101 2569 3

573 0.0005 0.0391 0.0357 0.2947 0.0284 11101101101100 2789 4

1148 0.0005 0.0391 0.0355 0.3763 0.0295 10101101101110 2570 5

1004 0.0005 0.0391 0.0354 0.4224 0.0297 10101101100110 2570 6

572 0.0005 0.0391 0.0348 0.4683 0.0329 10101101101100 2789 7

1142 0.0005 0.0392 0.0357 0.3064 0.0292 10100101101110 2570 8

1718 0.0005 0.0392 0.0352 0.3974 0.0319 10100101101101 2569 9

1574 0.0005 0.0392 0.0355 0.4050 0.0320 10100101100101 2569 10

998 0.0005 0.0393 0.0356 0.4062 0.0296 10100101100110 2570 11

1430 0.0005 0.0393 0.0354 0.3974 0.0319 10100101101001 2570 12

860 0.0005 0.0393 0.0360 0.4642 0.0307 10101101101010 2633 13

1286 0.0005 0.0393 0.0353 0.3974 0.0320 10100101100001 2570 14

567 0.0005 0.0394 0.0356 0.2934 0.0284 11100101101100 2789 15

566 0.0005 0.0395 0.0350 0.4692 0.0331 10100101101100 2789 16

285 0.0005 0.0395 0.0368 0.2958 0.0260 11101101101000 3408 17

1702 0.0005 0.0396 0.0356 0.3238 0.0300 10011110101101 2569 18

429 0.0005 0.0397 0.0371 0.2965 0.0273 11101101100100 3036 19

1558 0.0005 0.0397 0.0356 0.3267 0.0301 10011110100101 2569 20

279 0.0005 0.0397 0.0368 0.2959 0.0262 11100101101000 3408 21

854 0.0005 0.0398 0.0358 0.4155 0.0314 10100101101010 2633 22

284 0.0005 0.0398 0.0361 0.4705 0.0305 10101101101000 3408 23

428 0.0005 0.0399 0.0363 0.4916 0.0319 10101101100100 3036 24

End of Table A.2

A.1.3 Fitting Errors Per Model, SRA

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model basis (Name, Index), along with the total number of data points for the model (Count). The models are ranked (Mean Rank) according to the mean MWARE. ‘ALL DATA’ indicates the statistics for the overall categorical data. The top 25 models are listed. The simplest model, Amdahl’s Law, is included at the top for reference, as are the four most complicated models proposed.

Table A.3: Comprehensive Model Fitting Errors (MWARE)


Index Min (Mean) Median Max Stddev Name Count Mean Rank

* 0.003456 0.038623 0.034261 2.107354 0.041664 ALL DATA 4931019 *

Amdahl 0.0045 0.0537 0.0476 2.1074 0.0526 10000000000000 3111 2241

[2277] 0.0043 0.0366 0.0310 0.7496 0.0363 11101110101111 1955 1207

[2279] 0.0035 0.0370 0.0306 0.7719 0.0376 11011110101111 1955 1351

[2301] 0.0035 0.0364 0.0307 0.7582 0.0358 11101101101111 1955 1120

[2303] 0.0035 0.0374 0.0303 0.7924 0.0382 11011101101111 1955 1436

1725 0.0035 0.0318 0.0290 0.7586 0.0330 11101101101101 1959 0

1581 0.0035 0.0318 0.0289 0.7673 0.0337 11101101100101 1959 1

1437 0.0035 0.0318 0.0290 0.7686 0.0335 11101101101001 1959 2

1580 0.0035 0.0318 0.0286 1.7598 0.0495 10101101100101 1959 3

1724 0.0035 0.0318 0.0287 1.7598 0.0495 10101101101101 1959 4

1293 0.0035 0.0318 0.0291 0.7655 0.0336 11101101100001 1959 5

1287 0.0035 0.0319 0.0293 0.7586 0.0333 11100101100001 1959 6

1436 0.0035 0.0319 0.0287 1.7598 0.0495 10101101101001 1959 7

1702 0.0035 0.0319 0.0289 1.1260 0.0378 10011110101101 1959 8

1558 0.0035 0.0319 0.0290 1.1260 0.0377 10011110100101 1959 9

1142 0.0035 0.0319 0.0290 0.8923 0.0361 10100101101110 1959 10

1292 0.0035 0.0319 0.0287 1.7598 0.0495 10101101100001 1959 11

1719 0.0035 0.0320 0.0291 0.7586 0.0332 11100101101101 1959 12

998 0.0035 0.0320 0.0291 0.8934 0.0360 10100101100110 1960 13

1148 0.0035 0.0320 0.0291 0.8923 0.0361 10101101101110 1959 14

1574 0.0035 0.0320 0.0287 1.7598 0.0495 10100101100101 1959 15

1718 0.0035 0.0320 0.0287 1.7598 0.0495 10100101101101 1959 16

1575 0.0035 0.0320 0.0291 0.7797 0.0334 11100101100101 1959 17

1004 0.0035 0.0320 0.0292 0.9601 0.0369 10101101100110 1960 18

1723 0.0035 0.0320 0.0288 0.7651 0.0341 11001101101101 1959 19

854 0.0035 0.0320 0.0290 0.9595 0.0368 10100101101010 1960 20

1579 0.0035 0.0320 0.0288 0.7696 0.0341 11001101100101 1959 21

1696 0.0035 0.0320 0.0289 1.1260 0.0377 10010110101101 1959 22

1431 0.0035 0.0320 0.0292 0.7586 0.0331 11100101101001 1959 23

1414 0.0035 0.0320 0.0292 1.1260 0.0378 10011110101001 1959 24

End of Table A.3

A.2 Fitting Errors, Per Part

A.2.1 Fitting Errors, Per Part, Aggregate, Per Architecture

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model-part basis (Name, Index), along with the total number of data points for the model part (Count). ‘ALL DATA’ indicates the statistics for the overall categorical data.

A.2.1.1 All Data

Index Min Mean Median Max Stddev Name Count
* 0.0005 0.0277 0.0105 8.8358 0.1162 ALL DATA 79925958
0 0.0005 0.0467 0.0352 8.8358 0.0996 PARALLEL-PART 11417994
1 0.0005 0.0460 0.0363 1.3203 0.0665 SEQUENTIAL-BOOST-PART 5708971
2 0.0005 0.0462 0.0357 8.8358 0.1016 PARALLEL-EFFICIENCY-GENERIC-PART 3806000
3 0.0005 0.0461 0.0356 8.8358 0.0979 PARALLEL-EFFICIENCY-LATENCY-PART 3805992
4 0.0005 0.0470 0.0363 8.8358 0.1003 PARALLEL-MAINMEMORY-BANDWIDTH-PART 5708996
5 0.0005 0.0393 0.0341 2.1074 0.0451 PARALLEL-MUTEX-PARAMETERIZED-PART 5708998
6 0.0005 0.0477 0.0378 8.8358 0.1024 SEQUENTIAL-EFFICIENCY-GENERIC-PART 3805986
7 0.0005 0.0441 0.0340 8.4062 0.0944 SEQUENTIAL-EFFICIENCY-LATENCY-PART 3805963
8 0.0005 0.0465 0.0359 8.8358 0.1000 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 5708846
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0

10 0.0005 0.0469 0.0356 8.8358 0.0998 OS-SCHEDULER-PART-SHARING-THETA 7611646
11 0.0005 0.0471 0.0366 8.8358 0.1000 OS-SCHEDULER-PART-SHARING-OMEGA 3805678
12 0.0005 0.0469 0.0369 8.8358 0.0857 OS-SCHEDULER-PART-MIGRATION 5708315
13 0.0005 0.0452 0.0352 8.4936 0.0940 OS-SCHEDULER-PART-LX 5707735

Table A.4: Part Subset=all Name=ALL

A.2.1.2 Intel i7 Architecture

Index Min Mean Median Max Stddev Name Count
* 0.0120 0.0637 0.0451 8.8358 0.1292 ALL DATA 6412324
0 0.0120 0.0573 0.0454 2.1074 0.0789 PARALLEL-PART 2399353
1 0.0120 0.0571 0.0456 1.3203 0.0710 SEQUENTIAL-BOOST-PART 1199676
2 0.0120 0.0559 0.0447 2.1074 0.0786 PARALLEL-EFFICIENCY-GENERIC-PART 799783
3 0.0139 0.0569 0.0437 2.1074 0.0789 PARALLEL-EFFICIENCY-LATENCY-PART 799780
4 0.0120 0.0572 0.0452 2.1074 0.0792 PARALLEL-MAINMEMORY-BANDWIDTH-PART 1199673
5 0.0135 0.0547 0.0417 2.1074 0.0789 PARALLEL-MUTEX-PARAMETERIZED-PART 1199661
6 0.0146 0.0582 0.0480 1.3645 0.0722 SEQUENTIAL-EFFICIENCY-GENERIC-PART 799786
7 0.0120 0.0530 0.0395 1.9808 0.0788 SEQUENTIAL-EFFICIENCY-LATENCY-PART 799756
8 0.0129 0.0571 0.0457 1.8710 0.0775 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 1199565
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0

10 0.0120 0.0573 0.0457 2.1074 0.0789 OS-SCHEDULER-PART-SHARING-THETA 1599362
11 0.0120 0.0573 0.0461 2.1074 0.0788 OS-SCHEDULER-PART-SHARING-OMEGA 799544
12 0.0129 0.0577 0.0465 2.1074 0.0747 OS-SCHEDULER-PART-MIGRATION 1199274
13 0.0120 0.0556 0.0425 2.1074 0.0776 OS-SCHEDULER-PART-LX 1198913

Table A.5: Part Subset=arch Name=i7-Arch

A.2.1.3 Intel Core2 Architecture

Index Min Mean Median Max Stddev Name Count
* 0.0005 0.0255 0.0196 0.2529 0.0247 ALL DATA 5005670
0 0.0035 0.0227 0.0207 0.1285 0.0115 PARALLEL-PART 3619814
1 0.0035 0.0224 0.0204 0.1285 0.0113 SEQUENTIAL-BOOST-PART 1809906
2 0.0035 0.0227 0.0207 0.1155 0.0116 PARALLEL-EFFICIENCY-GENERIC-PART 1206605
3 0.0035 0.0227 0.0207 0.1260 0.0116 PARALLEL-EFFICIENCY-LATENCY-PART 1206604
4 0.0035 0.0225 0.0205 0.1248 0.0115 PARALLEL-MAINMEMORY-BANDWIDTH-PART 1809906
5 0.0035 0.0208 0.0192 0.1285 0.0100 PARALLEL-MUTEX-PARAMETERIZED-PART 1809902
6 0.0035 0.0227 0.0207 0.1285 0.0116 SEQUENTIAL-EFFICIENCY-GENERIC-PART 1206600
7 0.0035 0.0227 0.0207 0.1248 0.0116 SEQUENTIAL-EFFICIENCY-LATENCY-PART 1206600
8 0.0035 0.0211 0.0191 0.1159 0.0110 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 1809878
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0

163 10 0.0035 0.0227 0.0207 0.1285 0.0116 OS-SCHEDULER-PART-SHARING-THETA 2413152 11 0.0035 0.0226 0.0206 0.1285 0.0116 OS-SCHEDULER-PART-SHARING-OMEGA 1206576 12 0.0035 0.0227 0.0207 0.1285 0.0116 OS-SCHEDULER-PART-MIGRATION 1809734 13 0.0035 0.0214 0.0198 0.1285 0.0104 OS-SCHEDULER-PART-LX 1809734

Table A.6: Part Subset=arch Name=Core2-Arch A.2.2 Fitting Errors, Per Part, FDI, Per Architecture

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model-part basis (Name, Index), together with the total number of data points for each part (Count). ‘ALL DATA’ indicates the statistics for the overall categorical data.

A.2.2.1 All Data
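As a reference for how such summary rows are produced, the sketch below aggregates per-sample relative fitting errors into the (Min, Mean, Median, Max, Stddev, Count) columns used throughout these tables. The part names and error values here are hypothetical, and the use of the population standard deviation is an assumption of this sketch, not a statement about the actual tooling.

```python
import statistics
from collections import defaultdict

# Hypothetical per-sample relative fitting errors, keyed by model part name.
errors_by_part = defaultdict(list)
samples = [
    ("PARALLEL-PART", 0.0451),
    ("PARALLEL-PART", 0.0789),
    ("SEQUENTIAL-BOOST-PART", 0.0456),
    ("SEQUENTIAL-BOOST-PART", 0.0710),
]
for part, err in samples:
    errors_by_part[part].append(err)

def summarize(values):
    """Return (Min, Mean, Median, Max, Stddev, Count) for one model part.

    Stddev is computed as the population standard deviation here
    (an assumption); a single sample yields a deviation of 0.0.
    """
    stddev = statistics.pstdev(values) if len(values) > 1 else 0.0
    return (min(values), statistics.fmean(values), statistics.median(values),
            max(values), stddev, len(values))

for part, values in sorted(errors_by_part.items()):
    print(part, summarize(values))
```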

Index Min Mean Median Max Stddev Name Count
* 0.0005 0.0589 0.0432 8.8358 0.1338 ALL DATA 5398827
0 0.0005 0.0589 0.0432 8.8358 0.1338 PARALLEL-PART 5398827
1 0.0005 0.0568 0.0432 1.0399 0.0809 SEQUENTIAL-BOOST-PART 2699389
2 0.0005 0.0577 0.0409 8.8358 0.1361 PARALLEL-EFFICIENCY-GENERIC-PART 1799612
3 0.0005 0.0569 0.0417 8.8358 0.1305 PARALLEL-EFFICIENCY-LATENCY-PART 1799608
4 0.0005 0.0589 0.0431 8.8358 0.1347 PARALLEL-MAINMEMORY-BANDWIDTH-PART 2699417
5 0.0005 0.0449 0.0396 0.8921 0.0336 PARALLEL-MUTEX-PARAMETERIZED-PART 2699435
6 0.0005 0.0598 0.0443 8.8358 0.1387 SEQUENTIAL-EFFICIENCY-GENERIC-PART 1799600
7 0.0005 0.0544 0.0389 8.4062 0.1253 SEQUENTIAL-EFFICIENCY-LATENCY-PART 1799607
8 0.0005 0.0586 0.0431 8.8358 0.1345 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 2699403
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0
10 0.0005 0.0591 0.0433 8.8358 0.1342 OS-SCHEDULER-PART-SHARING-THETA 3599132
11 0.0005 0.0590 0.0435 8.8358 0.1334 OS-SCHEDULER-PART-SHARING-OMEGA 1799558
12 0.0005 0.0584 0.0442 8.8358 0.1119 OS-SCHEDULER-PART-MIGRATION 2699307
13 0.0005 0.0565 0.0427 8.4936 0.1249 OS-SCHEDULER-PART-LX 2699088

Table A.7: Part Subset=all Name=ALL

A.2.2.2 Intel i7 Architecture

Index Min Mean Median Max Stddev Name Count
* 0.0121 0.0683 0.0464 8.8358 0.1523 ALL DATA 4012971
0 0.0121 0.0683 0.0464 8.8358 0.1523 PARALLEL-PART 4012971
1 0.0121 0.0653 0.0463 1.0399 0.0891 SEQUENTIAL-BOOST-PART 2006461
2 0.0151 0.0663 0.0433 8.8358 0.1554 PARALLEL-EFFICIENCY-GENERIC-PART 1337660
3 0.0121 0.0653 0.0446 8.8358 0.1488 PARALLEL-EFFICIENCY-LATENCY-PART 1337656
4 0.0130 0.0681 0.0463 8.8358 0.1533 PARALLEL-MAINMEMORY-BANDWIDTH-PART 2006489
5 0.0131 0.0530 0.0433 0.8921 0.0329 PARALLEL-MUTEX-PARAMETERIZED-PART 2006507
6 0.0121 0.0692 0.0469 8.8358 0.1582 SEQUENTIAL-EFFICIENCY-GENERIC-PART 1337648
7 0.0128 0.0619 0.0414 8.4062 0.1427 SEQUENTIAL-EFFICIENCY-LATENCY-PART 1337655
8 0.0121 0.0679 0.0463 8.8358 0.1531 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 2006475
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0
10 0.0121 0.0683 0.0465 8.8358 0.1527 OS-SCHEDULER-PART-SHARING-THETA 2675228
11 0.0121 0.0681 0.0467 8.8358 0.1521 OS-SCHEDULER-PART-SHARING-OMEGA 1337606
12 0.0135 0.0673 0.0474 8.8358 0.1260 OS-SCHEDULER-PART-MIGRATION 2006379
13 0.0121 0.0662 0.0460 8.4936 0.1421 OS-SCHEDULER-PART-LX 2006160

Table A.8: Part Subset=arch Name=i7-Arch

A.2.2.3 Intel Core2 Architecture

Index Min Mean Median Max Stddev Name Count
* 0.0005 0.0328 0.0116 0.2529 0.0423 ALL DATA 1385856
0 0.0005 0.0328 0.0116 0.2529 0.0423 PARALLEL-PART 1385856
1 0.0005 0.0327 0.0108 0.2529 0.0424 SEQUENTIAL-BOOST-PART 692928
2 0.0005 0.0328 0.0116 0.2529 0.0423 PARALLEL-EFFICIENCY-GENERIC-PART 461952
3 0.0005 0.0328 0.0116 0.2529 0.0423 PARALLEL-EFFICIENCY-LATENCY-PART 461952
4 0.0005 0.0323 0.0115 0.2529 0.0421 PARALLEL-MAINMEMORY-BANDWIDTH-PART 692928
5 0.0005 0.0216 0.0102 0.2083 0.0236 PARALLEL-MUTEX-PARAMETERIZED-PART 692928
6 0.0005 0.0328 0.0116 0.2529 0.0423 SEQUENTIAL-EFFICIENCY-GENERIC-PART 461952
7 0.0005 0.0328 0.0116 0.2529 0.0423 SEQUENTIAL-EFFICIENCY-LATENCY-PART 461952
8 0.0005 0.0319 0.0104 0.2529 0.0421 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 692928
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0
10 0.0005 0.0328 0.0116 0.2529 0.0423 OS-SCHEDULER-PART-SHARING-THETA 923904
11 0.0005 0.0328 0.0115 0.2529 0.0423 OS-SCHEDULER-PART-SHARING-OMEGA 461952
12 0.0005 0.0328 0.0116 0.2529 0.0424 OS-SCHEDULER-PART-MIGRATION 692928
13 0.0005 0.0286 0.0107 0.2024 0.0341 OS-SCHEDULER-PART-LX 692928

Table A.9: Part Subset=arch Name=Core2-Arch

A.2.3 Fitting Errors, Per Part, SRA, Per Architecture

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model-part basis (Name, Index), together with the total number of data points for each part (Count). ‘ALL DATA’ indicates the statistics for the overall categorical data.

A.2.3.1 All Data

Index Min Mean Median Max Stddev Name Count
* 0.0035 0.0366 0.0306 2.1074 0.0532 ALL DATA 6019167
0 0.0035 0.0366 0.0306 2.1074 0.0532 PARALLEL-PART 6019167
1 0.0035 0.0362 0.0306 1.3203 0.0487 SEQUENTIAL-BOOST-PART 3009582
2 0.0035 0.0360 0.0306 2.1074 0.0529 PARALLEL-EFFICIENCY-GENERIC-PART 2006388
3 0.0035 0.0364 0.0306 2.1074 0.0532 PARALLEL-EFFICIENCY-LATENCY-PART 2006384
4 0.0035 0.0363 0.0306 2.1074 0.0535 PARALLEL-MAINMEMORY-BANDWIDTH-PART 3009579
5 0.0035 0.0343 0.0285 2.1074 0.0530 PARALLEL-MUTEX-PARAMETERIZED-PART 3009563
6 0.0035 0.0369 0.0316 1.3645 0.0496 SEQUENTIAL-EFFICIENCY-GENERIC-PART 2006386
7 0.0035 0.0348 0.0298 1.9808 0.0527 SEQUENTIAL-EFFICIENCY-LATENCY-PART 2006356
8 0.0035 0.0355 0.0298 1.8710 0.0527 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 3009443
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0
10 0.0035 0.0365 0.0308 2.1074 0.0533 OS-SCHEDULER-PART-SHARING-THETA 4012514
11 0.0035 0.0364 0.0308 2.1074 0.0533 OS-SCHEDULER-PART-SHARING-OMEGA 2006120
12 0.0035 0.0366 0.0309 2.1074 0.0509 OS-SCHEDULER-PART-MIGRATION 3009008
13 0.0035 0.0350 0.0294 2.1074 0.0523 OS-SCHEDULER-PART-LX 3008647

Table A.10: Part Subset=all Name=ALL

A.2.3.2 Intel i7 Architecture

Index Min Mean Median Max Stddev Name Count
* 0.0120 0.0573 0.0454 2.1074 0.0789 ALL DATA 2399353
0 0.0120 0.0573 0.0454 2.1074 0.0789 PARALLEL-PART 2399353
1 0.0120 0.0571 0.0456 1.3203 0.0710 SEQUENTIAL-BOOST-PART 1199676
2 0.0120 0.0559 0.0447 2.1074 0.0786 PARALLEL-EFFICIENCY-GENERIC-PART 799783
3 0.0139 0.0569 0.0437 2.1074 0.0789 PARALLEL-EFFICIENCY-LATENCY-PART 799780
4 0.0120 0.0572 0.0452 2.1074 0.0792 PARALLEL-MAINMEMORY-BANDWIDTH-PART 1199673
5 0.0135 0.0547 0.0417 2.1074 0.0789 PARALLEL-MUTEX-PARAMETERIZED-PART 1199661
6 0.0146 0.0582 0.0480 1.3645 0.0722 SEQUENTIAL-EFFICIENCY-GENERIC-PART 799786
7 0.0120 0.0530 0.0395 1.9808 0.0788 SEQUENTIAL-EFFICIENCY-LATENCY-PART 799756
8 0.0129 0.0571 0.0457 1.8710 0.0775 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 1199565
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0
10 0.0120 0.0573 0.0457 2.1074 0.0789 OS-SCHEDULER-PART-SHARING-THETA 1599362
11 0.0120 0.0573 0.0461 2.1074 0.0788 OS-SCHEDULER-PART-SHARING-OMEGA 799544
12 0.0129 0.0577 0.0465 2.1074 0.0747 OS-SCHEDULER-PART-MIGRATION 1199274
13 0.0120 0.0556 0.0425 2.1074 0.0776 OS-SCHEDULER-PART-LX 1198913

Table A.11: Part Subset=arch Name=i7-Arch

A.2.3.3 Intel Core2 Architecture

Index Min Mean Median Max Stddev Name Count
* 0.0035 0.0227 0.0207 0.1285 0.0115 ALL DATA 3619814
0 0.0035 0.0227 0.0207 0.1285 0.0115 PARALLEL-PART 3619814
1 0.0035 0.0224 0.0204 0.1285 0.0113 SEQUENTIAL-BOOST-PART 1809906
2 0.0035 0.0227 0.0207 0.1155 0.0116 PARALLEL-EFFICIENCY-GENERIC-PART 1206605
3 0.0035 0.0227 0.0207 0.1260 0.0116 PARALLEL-EFFICIENCY-LATENCY-PART 1206604
4 0.0035 0.0225 0.0205 0.1248 0.0115 PARALLEL-MAINMEMORY-BANDWIDTH-PART 1809906
5 0.0035 0.0208 0.0192 0.1285 0.0100 PARALLEL-MUTEX-PARAMETERIZED-PART 1809902
6 0.0035 0.0227 0.0207 0.1285 0.0116 SEQUENTIAL-EFFICIENCY-GENERIC-PART 1206600
7 0.0035 0.0227 0.0207 0.1248 0.0116 SEQUENTIAL-EFFICIENCY-LATENCY-PART 1206600
8 0.0035 0.0211 0.0191 0.1159 0.0110 SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART 1809878
9 0.0000 0.0000 0.0000 0.0000 0.0000 SEQUENTIAL-MUTEX-PARAMETERIZED-PART 0
10 0.0035 0.0227 0.0207 0.1285 0.0116 OS-SCHEDULER-PART-SHARING-THETA 2413152
11 0.0035 0.0226 0.0206 0.1285 0.0116 OS-SCHEDULER-PART-SHARING-OMEGA 1206576
12 0.0035 0.0227 0.0207 0.1285 0.0116 OS-SCHEDULER-PART-MIGRATION 1809734
13 0.0035 0.0214 0.0198 0.1285 0.0104 OS-SCHEDULER-PART-LX 1809734

Table A.12: Part Subset=arch Name=Core2-Arch

A.3 Fitting Errors, Per Property

A.3.1 Fitting Errors, Per Property, Aggregate, Per Architecture

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model-property basis (Name, Index), together with the total number of data points for each property (Count).

Histograms represent the probability distribution of relative errors experienced on a per-property basis.

A.3.1.1 All Data
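The probability histograms in this section are normalized bin counts, so the bar heights over all bins sum to one. A minimal sketch of that normalization follows; the sample values and bin count are hypothetical.

```python
from collections import Counter

def probability_histogram(values, nbins, lo, hi):
    """Normalized histogram: per-bin counts divided by the total count,
    so bar heights are probabilities (as on the y-axes in this section).
    Values outside [lo, hi] are discarded; v == hi falls in the last bin."""
    width = (hi - lo) / nbins
    counts = Counter(
        min(int((v - lo) / width), nbins - 1) for v in values if lo <= v <= hi
    )
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(nbins)]

# Hypothetical relative-error samples on [0, 1], binned into 4 bars.
probs = probability_histogram([0.1, 0.15, 0.2, 0.8, 0.85], nbins=4, lo=0.0, hi=1.0)
```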

Index Min Mean Median Max Stddev Name Count
0 0.0000 47845.0586 11.2521 1000000.0000 159156.7500 ’Freq’ 10764428
1 0.0000 0.7374 0.8858 1.0000 0.1599 ’P’ 22752314
2 0.0000 0.9212 1.0000 1.0000 0.1844 ’seq-boost’ 11376142
3 0.0000 0.3607 0.3219 1.0000 0.2719 ’P-B’ 7584080
4 0.0000 0.5122 0.5065 1.0000 0.2940 ’P-HTefficiency’ 7584100
5 0.0000 0.0792 0.0291 1.0000 0.1544 ’P-Mut’ 11359258
6 0.0000 0.4978 0.4945 1.0000 0.2594 ’P-Mut-Nrm’ 11359258
7 0.0000 0.4883 0.4706 1.0000 0.2669 ’S-B’ 7562268
8 0.0000 0.5602 0.5670 1.0000 0.2936 ’S-HTefficiency’ 7569580
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5179 0.5143 1.0000 0.2564 ’P-mem-Blend’ 15163482
12 0.0000 0.5184 0.5182 1.0000 0.2563 ’S-mem-Blend’ 15124492
13 0.0000 0.9803 0.8021 1.9000 0.4728 ’P-main-mem-select-read INT’ 22121114
14 0.0000 0.9517 0.7751 1.9000 0.4671 ’P-main-mem-select-write INT’ 22121114
15 0.0000 0.9941 0.8053 1.9000 0.4746 ’S-main-mem-select-read INT’ 22121114
16 0.0000 0.9487 0.7791 1.9000 0.4671 ’S-main-mem-select-write INT’ 22121114
17 0.0000 1.3937 1.2372 2.9000 0.7513 ’Scheduler INT’ 20858714
18 0.0000 0.4990 0.4990 1.0000 0.2554 ’Theta’ 16744614
19 0.0000 0.5035 0.5049 1.0000 0.2611 ’Omega’ 10969942
20 0.0000 0.9531 0.9914 1.0000 0.1369 ’LX-PENALTY’ 10559278

Table A.13: Property Subset=all Name=ALL

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.13.]

A.3.1.2 Intel i7 Architecture

Index Min Mean Median Max Stddev Name Count
0 0.0000 765.6600 0.0000 1000000.0000 23703.2148 ’Freq’ 2096618
1 0.0000 0.8979 0.8979 1.0000 0.0735 ’P’ 4811670
2 0.0000 0.9125 0.9990 1.0000 0.1924 ’seq-boost’ 2405828
3 0.0000 0.2246 0.1481 1.0000 0.2291 ’P-B’ 1603878
4 0.0000 0.6098 0.5973 1.0000 0.2831 ’P-HTefficiency’ 1603890
5 0.0000 0.0462 0.0412 1.0000 0.0703 ’P-Mut’ 2396604
6 0.0000 0.4806 0.4773 1.0000 0.2517 ’P-Mut-Nrm’ 2396604
7 0.0000 0.4316 0.4006 1.0000 0.2558 ’S-B’ 1592688
8 0.0000 0.6498 0.6819 1.0000 0.2958 ’S-HTefficiency’ 1596536
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5351 0.5369 1.0000 0.2594 ’P-mem-Blend’ 3204680
12 0.0000 0.4938 0.4872 1.0000 0.2558 ’S-mem-Blend’ 3185342
13 0.0000 1.0041 1.0360 1.9000 0.4712 ’P-main-mem-select-read INT’ 4618230
14 0.0000 0.9517 0.9533 1.9000 0.4677 ’P-main-mem-select-write INT’ 4618230
15 0.0000 0.9653 0.9773 1.9000 0.4731 ’S-main-mem-select-read INT’ 4618230
16 0.0000 0.9502 0.9498 1.9000 0.4674 ’S-main-mem-select-write INT’ 4618230
17 0.0000 1.4710 1.5042 2.9000 0.8005 ’Scheduler INT’ 4231350
18 0.0000 0.5000 0.5005 1.0000 0.2494 ’Theta’ 3395574
19 0.0000 0.5027 0.5069 1.0000 0.2568 ’Omega’ 2176924
20 0.0100 0.9697 0.9771 1.0000 0.0475 ’LX-PENALTY’ 1987028

Table A.14: Property Subset=arch Name=i7-Arch

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.14.]

A.3.1.3 Intel Core2 Architecture

Index Min Mean Median Max Stddev Name Count
0 0.0000 101625.9844 0.5618 1000000.0000 223638.0938 ’Freq’ 2571264
1 0.0000 0.9067 0.9075 1.0000 0.1086 ’P’ 5142528
2 0.0000 0.8840 0.9910 1.0000 0.1906 ’seq-boost’ 2571264
3 0.0000 0.4997 0.4997 1.0000 0.2517 ’P-B’ 1714176
4 0.0000 0.4997 0.4992 1.0000 0.2517 ’P-HTefficiency’ 1714176
5 0.0000 0.0295 0.0154 1.0000 0.0768 ’P-Mut’ 2571264
6 0.0000 0.4810 0.4765 1.0000 0.2532 ’P-Mut-Nrm’ 2571264
7 0.0000 0.4998 0.5000 1.0000 0.2518 ’S-B’ 1714176
8 0.0000 0.4995 0.4997 1.0000 0.2518 ’S-HTefficiency’ 1714176
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.4840 0.4858 1.0000 0.2414 ’P-mem-Blend’ 3428352
12 0.0000 0.5261 0.5398 1.0000 0.2546 ’S-mem-Blend’ 3428352
13 0.0000 0.9442 0.9552 1.9000 0.4833 ’P-main-mem-select-read INT’ 5035392
14 0.0000 0.9638 0.9846 1.9000 0.4778 ’P-main-mem-select-write INT’ 5035392
15 0.0000 0.9871 1.0267 1.9000 0.4831 ’S-main-mem-select-read INT’ 5035392
16 0.0000 0.9523 0.9602 1.9000 0.4777 ’S-main-mem-select-write INT’ 5035392
17 0.0000 1.4500 1.4506 2.9000 0.7277 ’Scheduler INT’ 4821120
18 0.0000 0.4961 0.4956 1.0000 0.2550 ’Theta’ 3856896
19 0.0000 0.5031 0.5021 1.0000 0.2680 ’Omega’ 2571264
20 0.0000 0.8690 0.9501 1.0000 0.1900 ’LX-PENALTY’ 2571264

Table A.15: Property Subset=arch Name=Core2-Arch

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.15.]

A.3.2 Fitting Errors, Per Property, FDI, Per Architecture

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model-property basis (Name, Index), together with the total number of data points for each property (Count).

Histograms represent the probability distribution of relative errors experienced on a per-property basis.

A.3.2.1 All Data

Index Min Mean Median Max Stddev Name Count
0 0.0000 41365.5938 0.0003 1000000.0000 150939.7656 ’Freq’ 6096546
1 0.0000 0.8858 0.8858 1.0000 0.1837 ’P’ 12798116
2 0.0000 0.9185 1.0000 1.0000 0.1875 ’seq-boost’ 6399050
3 0.0000 0.3574 0.3156 1.0000 0.2694 ’P-B’ 4266026
4 0.0000 0.4794 0.4646 1.0000 0.3057 ’P-HTefficiency’ 4266034
5 0.0000 0.1125 0.0507 1.0000 0.1974 ’P-Mut’ 6391390
6 0.0000 0.5117 0.5115 1.0000 0.2713 ’P-Mut-Nrm’ 6391390
7 0.0000 0.5056 0.4938 1.0000 0.2765 ’S-B’ 4255404
8 0.0000 0.5504 0.5574 1.0000 0.2999 ’S-HTefficiency’ 4258868
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5259 0.5229 1.0000 0.2686 ’P-mem-Blend’ 8530450
12 0.0000 0.5243 0.5180 1.0000 0.2652 ’S-mem-Blend’ 8510798
13 0.0000 0.9848 1.0190 1.9000 0.4783 ’P-main-mem-select-read INT’ 12467492
14 0.0000 0.9472 0.9482 1.9000 0.4730 ’P-main-mem-select-write INT’ 12467492
15 0.0000 0.9967 1.0381 1.9000 0.4813 ’S-main-mem-select-read INT’ 12467492
16 0.0000 0.9516 0.9573 1.9000 0.4733 ’S-main-mem-select-write INT’ 12467492
17 0.0000 1.4201 1.4421 2.9000 0.7762 ’Scheduler INT’ 11806244
18 0.0000 0.4996 0.5001 1.0000 0.2559 ’Theta’ 9492144
19 0.0000 0.5039 0.5051 1.0000 0.2647 ’Omega’ 6221754
20 0.0000 0.9558 0.9914 1.0000 0.1202 ’LX-PENALTY’ 6000986

Table A.16: Property Subset=all Name=ALL

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.16.]

A.3.2.2 Intel i7 Architecture

Index Min Mean Median Max Stddev Name Count
0 0.0000 5963.5166 0.0000 1000000.0000 62438.3750 ’Freq’ 4165794
1 0.0000 0.8851 0.8682 1.0000 0.1461 ’P’ 8936612
2 0.0000 0.9242 1.0000 1.0000 0.1899 ’seq-boost’ 4468298
3 0.0000 0.2958 0.2396 1.0000 0.2515 ’P-B’ 2978858
4 0.0000 0.4704 0.4412 1.0000 0.3244 ’P-HTefficiency’ 2978866
5 0.0000 0.0923 0.0497 1.0000 0.1780 ’P-Mut’ 4460638
6 0.0000 0.5027 0.4990 1.0000 0.2656 ’P-Mut-Nrm’ 4460638
7 0.0000 0.5079 0.4905 1.0000 0.2847 ’S-B’ 2968236
8 0.0000 0.5721 0.5880 1.0000 0.3145 ’S-HTefficiency’ 2971700
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5335 0.5324 1.0000 0.2709 ’P-mem-Blend’ 5956114
12 0.0000 0.5252 0.5150 1.0000 0.2638 ’S-mem-Blend’ 5936462
13 0.0000 1.0003 1.0334 1.9000 0.4724 ’P-main-mem-select-read INT’ 8686436
14 0.0000 0.9466 0.9434 1.9000 0.4685 ’P-main-mem-select-write INT’ 8686436
15 0.0000 1.0131 1.0585 1.9000 0.4777 ’S-main-mem-select-read INT’ 8686436
16 0.0000 0.9502 0.9501 1.9000 0.4686 ’S-main-mem-select-write INT’ 8686436
17 0.0000 1.4063 1.4370 2.9000 0.7987 ’Scheduler INT’ 8186084
18 0.0000 0.5029 0.5027 1.0000 0.2511 ’Theta’ 6596016
19 0.0000 0.4960 0.4999 1.0000 0.2586 ’Omega’ 4291002
20 0.0003 0.9815 0.9956 1.0000 0.0766 ’LX-PENALTY’ 4070234

Table A.17: Property Subset=arch Name=i7-Arch

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.17.]

A.3.2.3 Intel Core2 Architecture

Index Min Mean Median Max Stddev Name Count
0 0.0000 117754.4922 0.0169 1000000.0000 240701.5938 ’Freq’ 1930752
1 0.0000 0.8176 0.9328 1.0000 0.2529 ’P’ 3861504
2 0.0000 0.8961 1.0000 1.0000 0.1921 ’seq-boost’ 1930752
3 0.0000 0.5002 0.5005 1.0000 0.2566 ’P-B’ 1287168
4 0.0000 0.5000 0.5003 1.0000 0.2567 ’P-HTefficiency’ 1287168
5 0.0000 0.1590 0.0661 1.0000 0.2278 ’P-Mut’ 1930752
6 0.0000 0.5330 0.5375 1.0000 0.2848 ’P-Mut-Nrm’ 1930752
7 0.0000 0.5001 0.5000 1.0000 0.2564 ’S-B’ 1287168
8 0.0000 0.4999 0.5000 1.0000 0.2569 ’S-HTefficiency’ 1287168
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5088 0.5013 1.0000 0.2642 ’P-mem-Blend’ 2574336
12 0.0000 0.5230 0.5244 1.0000 0.2675 ’S-mem-Blend’ 2574336
13 0.0000 0.9592 0.9846 1.9000 0.4926 ’P-main-mem-select-read INT’ 3781056
14 0.0000 0.9516 0.9615 1.9000 0.4873 ’P-main-mem-select-write INT’ 3781056
15 0.0000 0.9692 1.0012 1.9000 0.4895 ’S-main-mem-select-read INT’ 3781056
16 0.0000 0.9591 0.9772 1.9000 0.4877 ’S-main-mem-select-write INT’ 3781056
17 0.0000 1.4505 1.4506 2.9000 0.7428 ’Scheduler INT’ 3620160
18 0.0000 0.4918 0.4940 1.0000 0.2650 ’Theta’ 2896128
19 0.0000 0.5214 0.5186 1.0000 0.2799 ’Omega’ 1930752
20 0.0000 0.8807 0.9492 1.0000 0.1681 ’LX-PENALTY’ 1930752

Table A.18: Property Subset=arch Name=Core2-Arch

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.18.]

A.3.3 Fitting Errors, Per Property, SRA, Per Architecture

The MWARE of model fitting (Min, Mean, Median, Max, Stddev) is listed on a per-model-property basis (Name, Index), together with the total number of data points for each property (Count).

Histograms represent the probability distribution of relative errors experienced on a per-property basis.

A.3.3.1 All Data

Index Min Mean Median Max Stddev Name Count
0 0.0000 56323.1953 0.1205 1000000.0000 171292.0625 ’Freq’ 4667882
1 0.0000 0.9380 0.8938 1.0000 0.0933 ’P’ 9954198
2 0.0000 0.9011 0.9957 1.0000 0.1941 ’seq-boost’ 4977092
3 0.0000 0.3667 0.3322 1.0000 0.2776 ’P-B’ 3318054
4 0.0000 0.5530 0.5409 1.0000 0.2730 ’P-HTefficiency’ 3318066
5 0.0000 0.0376 0.0261 1.0000 0.0742 ’P-Mut’ 4967868
6 0.0000 0.4807 0.4768 1.0000 0.2524 ’P-Mut-Nrm’ 4967868
7 0.0000 0.4670 0.4482 1.0000 0.2553 ’S-B’ 3306864
8 0.0000 0.5720 0.5736 1.0000 0.2837 ’S-HTefficiency’ 3310712
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5083 0.5049 1.0000 0.2515 ’P-mem-Blend’ 6633032
12 0.0000 0.5100 0.5169 1.0000 0.2555 ’S-mem-Blend’ 6613694
13 0.0000 0.9713 0.9955 1.9000 0.4777 ’P-main-mem-select-read INT’ 9653622
14 0.0000 0.9568 0.9688 1.9000 0.4721 ’P-main-mem-select-write INT’ 9653622
15 0.0000 0.9747 1.0071 1.9000 0.4778 ’S-main-mem-select-read INT’ 9653622
16 0.0000 0.9503 0.9549 1.9000 0.4718 ’S-main-mem-select-write INT’ 9653622
17 0.0000 1.4596 1.4714 2.9000 0.7587 ’Scheduler INT’ 9052470
18 0.0000 0.4980 0.4978 1.0000 0.2538 ’Theta’ 7252470
19 0.0000 0.5029 0.5044 1.0000 0.2622 ’Omega’ 4748188
20 0.0000 0.9226 0.9659 1.0000 0.1562 ’LX-PENALTY’ 4558292

Table A.19: Property Subset=all Name=ALL

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.19.]

A.3.3.2 Intel i7 Architecture

Index Min Mean Median Max Stddev Name Count
0 0.0000 765.6600 0.0000 1000000.0000 23703.2148 ’Freq’ 2096618
1 0.0000 0.8979 0.8979 1.0000 0.0735 ’P’ 4811670
2 0.0000 0.9125 0.9990 1.0000 0.1924 ’seq-boost’ 2405828
3 0.0000 0.2246 0.1481 1.0000 0.2291 ’P-B’ 1603878
4 0.0000 0.6098 0.5973 1.0000 0.2831 ’P-HTefficiency’ 1603890
5 0.0000 0.0462 0.0412 1.0000 0.0703 ’P-Mut’ 2396604
6 0.0000 0.4806 0.4773 1.0000 0.2517 ’P-Mut-Nrm’ 2396604
7 0.0000 0.4316 0.4006 1.0000 0.2558 ’S-B’ 1592688
8 0.0000 0.6498 0.6819 1.0000 0.2958 ’S-HTefficiency’ 1596536
9 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex’ 0
10 0.0000 0.0000 0.0000 0.0000 0.0000 ’S-Mutex-Norm’ 0
11 0.0000 0.5351 0.5369 1.0000 0.2594 ’P-mem-Blend’ 3204680
12 0.0000 0.4938 0.4872 1.0000 0.2558 ’S-mem-Blend’ 3185342
13 0.0000 1.0041 1.0360 1.9000 0.4712 ’P-main-mem-select-read INT’ 4618230
14 0.0000 0.9517 0.9533 1.9000 0.4677 ’P-main-mem-select-write INT’ 4618230
15 0.0000 0.9653 0.9773 1.9000 0.4731 ’S-main-mem-select-read INT’ 4618230
16 0.0000 0.9502 0.9498 1.9000 0.4674 ’S-main-mem-select-write INT’ 4618230
17 0.0000 1.4710 1.5042 2.9000 0.8005 ’Scheduler INT’ 4231350
18 0.0000 0.5000 0.5005 1.0000 0.2494 ’Theta’ 3395574
19 0.0000 0.5027 0.5069 1.0000 0.2568 ’Omega’ 2176924
20 0.0100 0.9697 0.9771 1.0000 0.0475 ’LX-PENALTY’ 1987028

Table A.20: Property Subset=arch Name=i7-Arch

[Histograms: probability distributions of the relative errors (x-axis: Value, y-axis: Probability), one panel per property listed in Table A.20.]

A.3.3.3 Intel Core2 Architecture

Index  Min     Mean         Median  Max           Stddev       Name                           Count
0      0.0000  101625.9844  0.5618  1000000.0000  223638.0938  'Freq'                         2571264
1      0.0000  0.9067       0.9075  1.0000        0.1086       'P'                            5142528
2      0.0000  0.8840       0.9910  1.0000        0.1906       'seq-boost'                    2571264
3      0.0000  0.4997       0.4997  1.0000        0.2517       'P-B'                          1714176
4      0.0000  0.4997       0.4992  1.0000        0.2517       'P-HTefficiency'               1714176
5      0.0000  0.0295       0.0154  1.0000        0.0768       'P-Mut'                        2571264
6      0.0000  0.4810       0.4765  1.0000        0.2532       'P-Mut-Nrm'                    2571264
7      0.0000  0.4998       0.5000  1.0000        0.2518       'S-B'                          1714176
8      0.0000  0.4995       0.4997  1.0000        0.2518       'S-HTefficiency'               1714176
9      0.0000  0.0000       0.0000  0.0000        0.0000       'S-Mutex'                      0
10     0.0000  0.0000       0.0000  0.0000        0.0000       'S-Mutex-Norm'                 0
11     0.0000  0.4840       0.4858  1.0000        0.2414       'P-mem-Blend'                  3428352
12     0.0000  0.5261       0.5398  1.0000        0.2546       'S-mem-Blend'                  3428352
13     0.0000  0.9442       0.9552  1.9000        0.4833       'P-main-mem-select-read INT'   5035392
14     0.0000  0.9638       0.9846  1.9000        0.4778       'P-main-mem-select-write INT'  5035392
15     0.0000  0.9871       1.0267  1.9000        0.4831       'S-main-mem-select-read INT'   5035392
16     0.0000  0.9523       0.9602  1.9000        0.4777       'S-main-mem-select-write INT'  5035392
17     0.0000  1.4500       1.4506  2.9000        0.7277       'Scheduler INT'                4821120
18     0.0000  0.4961       0.4956  1.0000        0.2550       'Theta'                        3856896
19     0.0000  0.5031       0.5021  1.0000        0.2680       'Omega'                        2571264
20     0.0000  0.8690       0.9501  1.0000        0.1900       'LX-PENALTY'                   2571264

Table A.21: Property Subset=arch Name=Core2-Arch

[Figures: probability histograms (x-axis: Value, y-axis: Probability) for the Core2-Arch subset, one panel per property: 'Freq', 'P', 'seq-boost', 'P-B', 'P-HTefficiency', 'P-Mut', 'P-Mut-Nrm', 'S-B', 'S-HTefficiency', 'P-mem-Blend', 'S-mem-Blend', 'P-main-mem-select-read INT', 'P-main-mem-select-write INT', 'S-main-mem-select-read INT', 'S-main-mem-select-write INT', 'Scheduler INT', 'Theta', 'Omega', and 'LX-PENALTY'.]

Appendix B

Cross-Prediction Results

B.1 Cross Prediction Relative Errors, Per Part

B.1.1 Cross Prediction Relative Errors, All Data, Per Part

Summary statistics of the cross-prediction MWARE (Min, Mean, Median, Max, Stddev) are listed for each model part (Name, Index), together with the total number of data points for that part (Count). 'ALL DATA' gives the statistics over the combined data.
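The per-part summary rows in these tables can be reproduced with a short sketch along the following lines. The data layout, part names, and error values below are illustrative placeholders, not the dissertation's actual pipeline; only the reported statistic names (Min, Mean, Median, Max, Stddev, Count) come from the text, and the standard deviation is computed here in its population form, which may differ from the form used in the thesis.

```python
import statistics

def summarize(errors):
    """Return (min, mean, median, max, stddev, count) for a list of relative errors."""
    return (min(errors), statistics.fmean(errors), statistics.median(errors),
            max(errors), statistics.pstdev(errors), len(errors))

# Hypothetical per-part relative errors keyed by model-part name.
errors_by_part = {
    "PARALLEL-PART": [0.0431, 0.0138, 0.0305],
    "SEQUENTIAL-BOOST-PART": [0.0527, 0.0253, 0.0441],
}

# One table row per model part.
for name, errs in errors_by_part.items():
    mn, mean, med, mx, sd, n = summarize(errs)
    print(f"{name}: min={mn:.4f} mean={mean:.4f} median={med:.4f} "
          f"max={mx:.4f} stddev={sd:.4f} count={n}")

# The 'ALL DATA' row: statistics over the pooled errors of every part.
all_errs = [e for errs in errors_by_part.values() for e in errs]
print("ALL DATA:", summarize(all_errs))
```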

Index  Min     Mean    Median  Max      Stddev  Name                                  Count
*      0.0005  0.0300  0.0095  20.9542  0.1286  ALL DATA                              83496024
0      0.0005  0.0431  0.0138  20.9542  0.1303  PARALLEL-PART                         41748012
1      0.0005  0.0527  0.0253  19.4404  0.1256  SEQUENTIAL-BOOST-PART                 20873878
2      0.0005  0.0520  0.0441  19.4404  0.1283  PARALLEL-EFFICIENCY-GENERIC-PART      13916010
3      0.0005  0.0551  0.0443  20.9542  0.1339  PARALLEL-EFFICIENCY-LATENCY-PART      13915882
4      0.0005  0.0534  0.0249  19.3377  0.1320  PARALLEL-MAINMEMORY-BANDWIDTH-PART    20852954
5      0.0005  0.0475  0.0270  19.7289  0.0800  PARALLEL-MUTEX-PARAMETERIZED-PART     20814438
6      0.0005  0.0536  0.0458  19.2875  0.1291  SEQUENTIAL-EFFICIENCY-GENERIC-PART    13867094
7      0.0005  0.0527  0.0414  20.9542  0.1318  SEQUENTIAL-EFFICIENCY-LATENCY-PART    13853038
8      0.0005  0.0532  0.0246  19.7289  0.1342  SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART  20778982
9      0.0000  0.0000  0.0000  0.0000   0.0000  SEQUENTIAL-MUTEX-PARAMETERIZED-PART   0
10     0.0005  0.0527  0.0277  19.4654  0.1296  OS-SCHEDULER-PART-SHARING-THETA       20376458
11     0.0005  0.0539  0.0295  20.9542  0.1318  OS-SCHEDULER-PART-SHARING-OMEGA       19472926
12     0.0005  0.0580  0.0307  20.9542  0.1715  OS-SCHEDULER-PART-MIGRATION           18904172
13     0.0005  0.0535  0.0304  20.9542  0.1726  OS-SCHEDULER-PART-LX                  18518274

Table B.1: Comprehensive cross prediction per-part relative errors

B.1.2 Cross Prediction Relative Errors, Per Part, FDI

Summary statistics of the cross-prediction MWARE (Min, Mean, Median, Max, Stddev) are listed for each model part (Name, Index), together with the total number of data points for that part (Count). 'ALL DATA' gives the statistics over the combined categorical data.

Index  Min     Mean    Median  Max      Stddev  Name                                  Count
*      0.0005  0.0433  0.0104  20.9542  0.1524  ALL DATA                              54674640
0      0.0005  0.0546  0.0245  20.9542  0.1546  PARALLEL-PART                         27337320
1      0.0005  0.0602  0.0503  19.4404  0.1448  SEQUENTIAL-BOOST-PART                 13668546
2      0.0005  0.0550  0.0464  19.4404  0.1511  PARALLEL-EFFICIENCY-GENERIC-PART      9112446
3      0.0005  0.0602  0.0481  20.9542  0.1607  PARALLEL-EFFICIENCY-LATENCY-PART      9112362
4      0.0005  0.0623  0.0490  19.3377  0.1566  PARALLEL-MAINMEMORY-BANDWIDTH-PART    13660936
5      0.0005  0.0496  0.0477  19.7289  0.0862  PARALLEL-MUTEX-PARAMETERIZED-PART     13634552
6      0.0005  0.0576  0.0488  19.2875  0.1549  SEQUENTIAL-EFFICIENCY-GENERIC-PART    9082218
7      0.0005  0.0570  0.0463  20.9542  0.1574  SEQUENTIAL-EFFICIENCY-LATENCY-PART    9073920
8      0.0005  0.0621  0.0489  19.7289  0.1589  SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART  13610626
9      0.0000  0.0000  0.0000  0.0000   0.0000  SEQUENTIAL-MUTEX-PARAMETERIZED-PART   0
10     0.0005  0.0588  0.0490  19.4654  0.1536  OS-SCHEDULER-PART-SHARING-THETA       13478956
11     0.0005  0.0589  0.0489  20.9542  0.1557  OS-SCHEDULER-PART-SHARING-OMEGA       12890538
12     0.0005  0.0668  0.0501  20.9542  0.1964  OS-SCHEDULER-PART-MIGRATION           12450148
13     0.0005  0.0586  0.0499  20.9542  0.1927  OS-SCHEDULER-PART-LX                  12224300

Table B.2: Comprehensive cross prediction per-part relative errors

B.1.3 Cross Prediction Relative Errors, Per Part, SRA

Summary statistics of the cross-prediction relative error (Min, Mean, Median, Max, Stddev) are listed for each model part (Name, Index), together with the total number of data points for that part (Count). 'ALL DATA' gives the statistics over the combined categorical data.

Index  Min     Mean    Median  Max      Stddev  Name                                  Count
*      0.0035  0.0406  0.0222  14.9496  0.0671  ALL DATA                              28821384
0      0.0035  0.0442  0.0457  14.9496  0.0678  PARALLEL-PART                         14410692
1      0.0035  0.0431  0.0447  14.9496  0.0647  SEQUENTIAL-BOOST-PART                 7205332
2      0.0035  0.0426  0.0423  13.2137  0.0672  PARALLEL-EFFICIENCY-GENERIC-PART      4803564
3      0.0035  0.0438  0.0426  14.9496  0.0709  PARALLEL-EFFICIENCY-LATENCY-PART      4803520
4      0.0035  0.0433  0.0439  14.5191  0.0690  PARALLEL-MAINMEMORY-BANDWIDTH-PART    7192018
5      0.0035  0.0414  0.0432  14.9496  0.0691  PARALLEL-MUTEX-PARAMETERIZED-PART     7179886
6      0.0035  0.0436  0.0442  14.9496  0.0634  SEQUENTIAL-EFFICIENCY-GENERIC-PART    4784876
7      0.0035  0.0421  0.0401  11.9506  0.0724  SEQUENTIAL-EFFICIENCY-LATENCY-PART    4779118
8      0.0035  0.0427  0.0438  14.9496  0.0703  SEQUENTIAL-MAINMEMORY-BANDWIDTH-PART  7168356
9      0.0000  0.0000  0.0000  0.0000   0.0000  SEQUENTIAL-MUTEX-PARAMETERIZED-PART   0
10     0.0035  0.0425  0.0433  14.9496  0.0655  OS-SCHEDULER-PART-SHARING-THETA       6897502
11     0.0035  0.0423  0.0428  14.5191  0.0666  OS-SCHEDULER-PART-SHARING-OMEGA       6582388
12     0.0035  0.0448  0.0441  14.9496  0.0857  OS-SCHEDULER-PART-MIGRATION           6454024
13     0.0035  0.0417  0.0424  14.9496  0.0860  OS-SCHEDULER-PART-LX                  6293974

Table B.3: Comprehensive cross prediction per-part relative errors

B.2 Cross Prediction Relative Errors, Per Model

B.2.1 Cross Prediction Relative Errors, All Data

Summary statistics of the cross-prediction MWARE (Min, Mean, Median, Max, Stddev) are listed for each model (Name, Index), together with the total number of data points for that model (Count). The models are ranked (Mean Rank) by mean relative error, and the corresponding mean error achieved during curve-fitting is shown (Mean Fit). 'ALL DATA' gives the statistics over the combined data. The top 25 models are listed. The simplest model, Amdahl's Law, is included at the top for reference, as are the four most complicated models proposed. Evaluations are performed comprehensively over all available predictions and also over all combinations of inter- and intra-architectural predictions.

Table B.4: Cross Prediction comprehensive relative error
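The per-model ranking used in the tables that follow (models sorted by mean cross-prediction relative error, with the mean curve-fit error carried along for comparison) can be sketched as below. The bit-string model names echo the tables, but the error values are illustrative placeholders, not data from the results:

```python
import statistics

# (model_name, cross_prediction_errors, mean_curve_fit_error) -- illustrative values.
models = [
    ("10000000000000", [0.0707, 0.0627], 0.0807),  # Amdahl's Law, reference model
    ("11101101101100", [0.0359, 0.0342], 0.0362),
    ("10101101101100", [0.0365, 0.0355], 0.0363),
]

# Rank models by mean cross-prediction relative error (the 'Mean Rank' column).
ranked = sorted(models, key=lambda m: statistics.fmean(m[1]))
for rank, (name, errs, fit) in enumerate(ranked):
    print(f"rank={rank}  model={name}  mean={statistics.fmean(errs):.4f}  fit={fit:.4f}")
```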


Index Min (Mean) Median Max Stddev Name Count Mean Rank Mean Fit

* 0.0005 0.0423 0.0180 20.9542 0.1628 ALL DATA 41748012 * *

Amdahl 0.0008 0.0707 0.0627 8.8358 0.1475 10000000000000 29492 2129 0.0807

[2277] 0.0005 0.0613 0.0445 11.8919 0.1698 11101110101111 16040 1680 0.0429

[2279] 0.0005 0.0585 0.0429 3.4713 0.0940 11011110101111 16040 1513 0.0427

[2301] 0.0005 0.0609 0.0455 3.8547 0.1107 11101101101111 16036 1655 0.0442

[2303] 0.0008 0.0621 0.0471 3.5338 0.1037 11011101101111 16036 1731 0.0451

573 0.0005 0.0359 0.0342 0.7586 0.0287 11101101101100 17948 0 0.0362

572 0.0005 0.0360 0.0338 1.7598 0.0405 10101101101100 17948 1 0.0363

1724 0.0005 0.0362 0.0347 1.7598 0.0401 10101101101101 16082 2 0.0358

567 0.0005 0.0362 0.0344 0.7586 0.0288 11100101101100 17948 3 0.0364

566 0.0005 0.0363 0.0339 1.7598 0.0406 10100101101100 17948 4 0.0366

1436 0.0005 0.0363 0.0347 1.7598 0.0401 10101101101001 16092 5 0.0359

1292 0.0005 0.0364 0.0348 1.7598 0.0402 10101101100001 16092 6 0.0359

1580 0.0005 0.0364 0.0347 1.7598 0.0401 10101101100101 16082 7 0.0359

1574 0.0005 0.0365 0.0347 1.7598 0.0403 10100101100101 16082 8 0.0361


429 0.0005 0.0365 0.0349 0.7586 0.0280 11101101100100 19966 9 0.0367

1718 0.0005 0.0366 0.0349 1.7598 0.0403 10100101101101 16082 10 0.0361

428 0.0005 0.0366 0.0346 1.7598 0.0392 10101101100100 19966 11 0.0369

1430 0.0005 0.0366 0.0346 1.7598 0.0403 10100101101001 16092 12 0.0361

423 0.0005 0.0366 0.0352 0.7685 0.0280 11100101100100 19966 13 0.0368

1286 0.0005 0.0366 0.0347 1.7598 0.0404 10100101100001 16092 14 0.0362

1702 0.0005 0.0368 0.0343 1.1260 0.0322 10011110101101 16082 15 0.0363
1558 0.0005 0.0368 0.0341 1.1260 0.0321 10011110100101 16082 16 0.0364

285 0.0005 0.0369 0.0347 0.7674 0.0267 11101101101000 25976 17 0.0381

422 0.0005 0.0369 0.0349 1.7598 0.0394 10100101100100 19966 18 0.0372

1696 0.0005 0.0369 0.0343 1.1260 0.0322 10010110101101 16082 19 0.0365

1270 0.0005 0.0370 0.0349 1.1260 0.0322 10011110100001 16092 20 0.0366

1414 0.0005 0.0370 0.0347 1.1260 0.0322 10011110101001 16092 21 0.0365

284 0.0005 0.0370 0.0343 1.7598 0.0360 10101101101000 25976 22 0.0383

279 0.0005 0.0370 0.0347 0.7586 0.0265 11100101101000 25976 23 0.0382

1552 0.0005 0.0371 0.0345 1.1260 0.0323 10010110100101 16082 24 0.0366

End of Table B.4

Table B.5: Cross Prediction per model relative error arch Core2-Arch to Core2-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1560 0.0125 4765.2090 7.4276 ALL DATA 79633948 * *

Amdahl 0.0008 0.0331 0.0270 0.2483 0.0286 10000000000000 2696 857 0.0779

[1701] 0.0007 0.2154 0.0251 0.8598 0.2917 11101110101111 2694 1248 0.0400

[1703] 0.0005 0.2193 0.0259 0.8574 0.2950 11011110101111 2694 1388 0.0389

[1725] 0.0009 0.2283 0.0260 0.8594 0.2962 11101101101111 2694 1647 0.0409

[1727] 0.0005 0.2184 0.0254 0.8565 0.2965 11011101101111 2694 1356 0.0413

238 0.0005 0.0185 0.0164 0.1758 0.0128 10011100101000 2696 0 0.0381

90 0.0005 0.0186 0.0164 0.1758 0.0128 10001100100000 2696 1 0.0433

116 0.0005 0.0186 0.0162 0.1758 0.0128 10101110100000 2696 2 0.0369

402 0.0005 0.0186 0.0165 0.1758 0.0128 10001110101100 2696 3 0.0387

258 0.0005 0.0186 0.0163 0.1758 0.0129 10001110101000 2696 4 0.0387

236 0.0005 0.0186 0.0164 0.1756 0.0128 10101100101000 2696 5 0.0377

378 0.0005 0.0186 0.0165 0.1758 0.0128 10001100101100 2696 6 0.0432

286 0.0005 0.0186 0.0164 0.1758 0.0128 10011101101000 2696 7 0.0368

428 0.0007 0.0186 0.0163 0.1758 0.0128 10101101101100 2696 8 0.0355


262 0.0005 0.0186 0.0164 0.1758 0.0128 10011110101000 2696 9 0.0360

404 0.0005 0.0186 0.0164 0.1758 0.0128 10101110101100 2696 10 0.0370

382 0.0005 0.0186 0.0164 0.1758 0.0128 10011100101100 2696 11 0.0381

430 0.0005 0.0186 0.0164 0.1758 0.0128 10011101101100 2696 12 0.0368

260 0.0005 0.0186 0.0164 0.1758 0.0128 10101110101000 2696 13 0.0370

380 0.0005 0.0186 0.0164 0.1758 0.0129 10101100101100 2696 14 0.0377

92 0.0005 0.0186 0.0163 0.1756 0.0128 10101100100000 2696 15 0.0377
426 0.0005 0.0186 0.0165 0.1756 0.0128 10001101101100 2696 16 0.0372

142 0.0005 0.0186 0.0162 0.1758 0.0129 10011101100000 2696 17 0.0368

406 0.0005 0.0186 0.0163 0.1758 0.0128 10011110101100 2696 18 0.0360

234 0.0005 0.0186 0.0164 0.1758 0.0128 10001100101000 2696 19 0.0433

114 0.0005 0.0186 0.0163 0.1758 0.0128 10001110100000 2696 20 0.0387

282 0.0005 0.0186 0.0165 0.1756 0.0128 10001101101000 2696 21 0.0372

94 0.0005 0.0186 0.0164 0.1758 0.0129 10011100100000 2696 22 0.0381

138 0.0005 0.0186 0.0164 0.1758 0.0129 10001101100000 2696 23 0.0372

118 0.0005 0.0186 0.0164 0.1756 0.0129 10011110100000 2696 24 0.0360

End of Table B.5

Table B.6: Cross Prediction per model relative error arch Core2-Arch to i7-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1560 0.0125 4765.2090 7.4276 ALL DATA 79633948 * *

Amdahl 0.0008 0.0450 0.0282 0.2529 0.0493 10000000000000 5342 852 0.0779

[1701] 0.0007 0.2290 0.0480 0.8598 0.2742 11101110101111 5336 1266 0.0400

[1703] 0.0005 0.2212 0.0427 0.8574 0.2760 11011110101111 5336 1030 0.0389

[1725] 0.0009 0.2517 0.1187 0.8594 0.2806 11101101101111 5336 1710 0.0409

[1727] 0.0005 0.2200 0.0402 0.8565 0.2775 11011101101111 5336 1000 0.0413

382 0.0005 0.0200 0.0148 0.2083 0.0179 10011100101100 5342 0 0.0381

378 0.0005 0.0200 0.0147 0.2083 0.0180 10001100101100 5342 1 0.0432

430 0.0005 0.0200 0.0149 0.2083 0.0180 10011101101100 5342 2 0.0368

238 0.0005 0.0200 0.0149 0.2083 0.0179 10011100101000 5342 3 0.0381

286 0.0005 0.0200 0.0147 0.2083 0.0180 10011101101000 5342 4 0.0368

116 0.0005 0.0200 0.0148 0.2083 0.0180 10101110100000 5342 5 0.0369

426 0.0005 0.0200 0.0150 0.2083 0.0179 10001101101100 5342 6 0.0372

402 0.0005 0.0200 0.0147 0.2083 0.0180 10001110101100 5342 7 0.0387

90 0.0005 0.0200 0.0148 0.2083 0.0180 10001100100000 5342 8 0.0433


236 0.0005 0.0200 0.0149 0.2083 0.0180 10101100101000 5342 9 0.0377

404 0.0005 0.0200 0.0149 0.2083 0.0180 10101110101100 5342 10 0.0370

428 0.0007 0.0200 0.0149 0.2083 0.0180 10101101101100 5342 11 0.0355

258 0.0005 0.0200 0.0148 0.2083 0.0181 10001110101000 5342 12 0.0387

118 0.0005 0.0200 0.0146 0.2083 0.0181 10011110100000 5342 13 0.0360

380 0.0005 0.0200 0.0150 0.2083 0.0180 10101100101100 5342 14 0.0377

406 0.0005 0.0201 0.0147 0.2083 0.0180 10011110101100 5342 15 0.0360
262 0.0005 0.0201 0.0147 0.2083 0.0180 10011110101000 5342 16 0.0360

140 0.0005 0.0201 0.0150 0.2083 0.0180 10101101100000 5342 17 0.0355

284 0.0005 0.0201 0.0148 0.2083 0.0180 10101101101000 5342 18 0.0355

260 0.0005 0.0201 0.0147 0.2083 0.0181 10101110101000 5342 19 0.0370

92 0.0005 0.0201 0.0147 0.2083 0.0180 10101100100000 5342 20 0.0377

234 0.0005 0.0201 0.0149 0.2083 0.0180 10001100101000 5342 21 0.0433

138 0.0005 0.0201 0.0145 0.2083 0.0180 10001101100000 5342 22 0.0372

142 0.0005 0.0201 0.0147 0.2083 0.0181 10011101100000 5342 23 0.0368

114 0.0005 0.0201 0.0148 0.2083 0.0180 10001110100000 5342 24 0.0387

End of Table B.6

Table B.7: Cross Prediction per model relative error arch i7-Arch to Core2-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1560 0.0125 4765.2090 7.4276 ALL DATA 79633948 * *

Amdahl 0.0309 0.0722 0.0618 2.1074 0.0607 10000000000000 5342 862 0.0779

[1701] 0.0209 0.2469 0.0510 1.1544 0.3786 11101110101111 5336 1384 0.0400

[1703] 0.0179 0.2289 0.0482 1.1427 0.3660 11011110101111 5336 1137 0.0389

[1725] 0.0220 0.3305 0.0563 173.9906 3.3852 11101101101111 5336 1674 0.0409

[1727] 0.0201 0.2350 0.0542 1.1479 0.3665 11011101101111 5336 1218 0.0413

1270 0.0159 0.0437 0.0367 1.1260 0.0355 10011110101101 5338 0 0.0355

135 0.0204 0.0438 0.0371 0.7206 0.0312 11100101100000 5342 1 0.0353

1264 0.0136 0.0438 0.0367 1.1260 0.0355 10010110101101 5338 2 0.0358

140 0.0194 0.0438 0.0364 1.7598 0.0448 10101101100000 5342 3 0.0355

428 0.0181 0.0438 0.0365 1.7598 0.0448 10101101101100 5342 4 0.0355

284 0.0194 0.0438 0.0364 1.7598 0.0448 10101101101000 5342 5 0.0355

1198 0.0160 0.0438 0.0367 1.2438 0.0371 10011110001101 5338 6 0.0371

285 0.0199 0.0438 0.0372 0.7206 0.0314 11101101101000 5342 7 0.0352

141 0.0193 0.0438 0.0372 0.7206 0.0312 11101101100000 5342 8 0.0352


429 0.0198 0.0438 0.0371 0.7206 0.0313 11101101101100 5342 9 0.0351

422 0.0196 0.0439 0.0365 1.7598 0.0449 10100101101100 5342 10 0.0357

279 0.0196 0.0439 0.0372 0.7206 0.0314 11100101101000 5342 11 0.0354

134 0.0198 0.0440 0.0365 1.7598 0.0450 10100101100000 5342 12 0.0358

1192 0.0162 0.0440 0.0371 1.2438 0.0372 10010110001101 5338 13 0.0376

335 0.0155 0.0440 0.0373 0.7718 0.0312 11011110001100 5342 14 0.0366

278 0.0197 0.0440 0.0365 1.7598 0.0449 10100101101000 5342 15 0.0358
423 0.0192 0.0440 0.0373 0.7388 0.0316 11100101101100 5342 16 0.0354

982 0.0154 0.0440 0.0370 1.1260 0.0355 10011110100001 5338 17 0.0358

119 0.0156 0.0440 0.0372 0.7690 0.0310 11011110100000 5342 18 0.0356

47 0.0157 0.0441 0.0373 0.7718 0.0310 11011110000000 5342 19 0.0366

191 0.0155 0.0441 0.0372 0.7718 0.0313 11011110001000 5342 20 0.0367

1126 0.0163 0.0441 0.0373 1.1260 0.0357 10011110101001 5338 21 0.0358

910 0.0157 0.0441 0.0372 1.2438 0.0370 10011110000001 5340 22 0.0373

263 0.0163 0.0441 0.0373 0.7718 0.0310 11011110101000 5342 23 0.0356

976 0.0162 0.0441 0.0374 1.1260 0.0353 10010110100001 5338 24 0.0360

End of Table B.7

Table B.8: Cross Prediction per model relative error arch i7-Arch to i7-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1560 0.0125 4765.2090 7.4276 ALL DATA 79633948 * *

Amdahl 0.0309 0.0992 0.0644 8.8358 0.2530 10000000000000 9674 861 0.0779

[1701] 0.0181 0.2614 0.0513 126.6917 1.8560 11101110101111 9662 1331 0.0400

[1703] 0.0170 0.2762 0.0495 267.2698 3.8586 11011110101111 9662 1369 0.0389

[1725] 0.0164 0.2724 0.0571 119.6859 1.7584 11101101101111 9662 1359 0.0409

[1727] 0.0201 0.2404 0.0557 63.5861 0.9790 11011101101111 9662 1222 0.0413

135 0.0192 0.0447 0.0369 0.7586 0.0289 11100101100000 9674 0 0.0353

429 0.0188 0.0448 0.0370 0.7586 0.0290 11101101101100 9674 1 0.0351

279 0.0191 0.0448 0.0370 0.7586 0.0289 11100101101000 9674 2 0.0354

285 0.0195 0.0448 0.0372 0.7681 0.0291 11101101101000 9674 3 0.0352

141 0.0193 0.0448 0.0370 0.7586 0.0291 11101101100000 9674 4 0.0352

423 0.0192 0.0449 0.0371 0.7586 0.0294 11100101101100 9674 5 0.0354

140 0.0188 0.0450 0.0362 1.7598 0.0459 10101101100000 9674 6 0.0355

428 0.0181 0.0450 0.0363 1.7598 0.0459 10101101101100 9674 7 0.0355

284 0.0189 0.0451 0.0363 1.7598 0.0459 10101101101000 9674 8 0.0355


1270 0.0159 0.0451 0.0368 1.1260 0.0351 10011110101101 9664 9 0.0355

422 0.0187 0.0452 0.0364 1.7598 0.0461 10100101101100 9674 10 0.0357

1264 0.0136 0.0452 0.0371 1.1260 0.0351 10010110101101 9664 11 0.0358

213 0.0195 0.0452 0.0372 0.7717 0.0292 11101101001000 9674 12 0.0364

1198 0.0160 0.0452 0.0370 1.2438 0.0369 10011110001101 9664 13 0.0371

335 0.0155 0.0452 0.0375 0.7718 0.0292 11011110001100 9674 14 0.0366

134 0.0188 0.0452 0.0363 1.7598 0.0461 10100101100000 9674 15 0.0358
357 0.0202 0.0452 0.0373 0.7723 0.0297 11101101001100 9674 16 0.0364

47 0.0157 0.0453 0.0375 0.7718 0.0292 11011110000000 9674 17 0.0366

69 0.0200 0.0453 0.0372 0.7686 0.0298 11101101000000 9674 18 0.0364

278 0.0189 0.0453 0.0363 1.7598 0.0460 10100101101000 9674 19 0.0358

351 0.0198 0.0453 0.0373 0.7682 0.0292 11100101001100 9674 20 0.0368

119 0.0156 0.0453 0.0374 0.7690 0.0293 11011110100000 9674 21 0.0356

191 0.0155 0.0453 0.0375 0.7718 0.0295 11011110001000 9674 22 0.0367

407 0.0160 0.0453 0.0376 0.7682 0.0291 11011110101100 9674 23 0.0356

263 0.0163 0.0454 0.0376 0.7718 0.0292 11011110101000 9674 24 0.0356

End of Table B.8

B.2.2 Cross Prediction Relative Errors, FDI, per Architecture

Summary statistics of the cross-prediction MWARE (Min, Mean, Median, Max, Stddev) are listed for each model (Name, Index), together with the total number of data points for that model (Count). The models are ranked (Mean Rank) by mean relative error, and the corresponding mean error achieved during curve-fitting is shown (Mean Fit). 'ALL DATA' gives the statistics over the combined categorical data. The top 25 models are listed. The simplest model, Amdahl's Law, is included at the top for reference, as are the four most complicated models proposed.

Table B.9: Cross Prediction per model relative error arch Core2-Arch to Core2-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1565 0.0194 4765.2090 7.5921 ALL DATA 47872776 * *

Amdahl 0.0008 0.0424 0.0164 0.2483 0.0511 10000000000000 668 851 0.0779

[1701] 0.0007 0.2339 0.2737 0.8285 0.2375 11101110101111 668 1491 0.0400

[1703] 0.0005 0.2264 0.2165 0.8168 0.2388 11011110101111 668 1329 0.0389

[1725] 0.0009 0.2588 0.3065 0.8299 0.2354 11101101101111 668 1720 0.0409

[1727] 0.0005 0.2308 0.2279 0.8273 0.2480 11011101101111 668 1425 0.0413

238 0.0005 0.0167 0.0077 0.1758 0.0203 10011100101000 668 0 0.0381

236 0.0005 0.0167 0.0076 0.1756 0.0204 10101100101000 668 1 0.0377

382 0.0005 0.0167 0.0076 0.1758 0.0204 10011100101100 668 2 0.0381

286 0.0005 0.0167 0.0077 0.1758 0.0204 10011101101000 668 3 0.0368

402 0.0005 0.0167 0.0074 0.1758 0.0203 10001110101100 668 4 0.0387

258 0.0005 0.0167 0.0076 0.1758 0.0205 10001110101000 668 5 0.0387

404 0.0005 0.0167 0.0077 0.1758 0.0205 10101110101100 668 6 0.0370

92 0.0005 0.0167 0.0077 0.1756 0.0204 10101100100000 668 7 0.0377

430 0.0005 0.0167 0.0076 0.1758 0.0204 10011101101100 668 8 0.0368


234 0.0005 0.0167 0.0077 0.1758 0.0204 10001100101000 668 9 0.0433

90 0.0005 0.0167 0.0077 0.1758 0.0203 10001100100000 668 10 0.0433

282 0.0005 0.0167 0.0074 0.1756 0.0205 10001101101000 668 11 0.0372

140 0.0005 0.0167 0.0076 0.1758 0.0203 10101101100000 668 12 0.0355

116 0.0005 0.0167 0.0076 0.1758 0.0205 10101110100000 668 13 0.0369

378 0.0005 0.0167 0.0077 0.1758 0.0205 10001100101100 668 14 0.0432

428 0.0007 0.0167 0.0076 0.1758 0.0204 10101101101100 668 15 0.0355
114 0.0005 0.0167 0.0079 0.1758 0.0204 10001110100000 668 16 0.0387

426 0.0005 0.0167 0.0077 0.1756 0.0204 10001101101100 668 17 0.0372

260 0.0005 0.0168 0.0077 0.1758 0.0205 10101110101000 668 18 0.0370

262 0.0005 0.0168 0.0076 0.1758 0.0205 10011110101000 668 19 0.0360

406 0.0005 0.0168 0.0077 0.1758 0.0204 10011110101100 668 20 0.0360

142 0.0005 0.0168 0.0077 0.1758 0.0205 10011101100000 668 21 0.0368

284 0.0005 0.0168 0.0076 0.1759 0.0205 10101101101000 668 22 0.0355

380 0.0005 0.0168 0.0077 0.1758 0.0205 10101100101100 668 23 0.0377

118 0.0005 0.0168 0.0075 0.1756 0.0207 10011110100000 668 24 0.0360

End of Table B.9

Table B.10: Cross Prediction per model relative error arch Core2-Arch to i7-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1565 0.0194 4765.2090 7.5921 ALL DATA 47872776 * *

Amdahl 0.0008 0.0537 0.0231 0.2529 0.0606 10000000000000 3263 846 0.0779

[1701] 0.0007 0.2369 0.2735 0.8285 0.2407 11101110101111 3261 1375 0.0400

[1703] 0.0005 0.2296 0.1788 0.8168 0.2459 11011110101111 3261 1191 0.0389

[1725] 0.0009 0.2579 0.3016 0.8299 0.2396 11101101101111 3261 1687 0.0409

[1727] 0.0005 0.2339 0.1871 0.8273 0.2523 11011101101111 3261 1293 0.0413

382 0.0005 0.0192 0.0094 0.2083 0.0216 10011100101100 3263 0 0.0381

378 0.0005 0.0192 0.0093 0.2083 0.0217 10001100101100 3263 1 0.0432

238 0.0005 0.0193 0.0096 0.2083 0.0216 10011100101000 3263 2 0.0381

286 0.0005 0.0193 0.0093 0.2083 0.0216 10011101101000 3263 3 0.0368

430 0.0005 0.0193 0.0094 0.2083 0.0216 10011101101100 3263 4 0.0368

426 0.0005 0.0193 0.0096 0.2083 0.0216 10001101101100 3263 5 0.0372

404 0.0005 0.0193 0.0093 0.2083 0.0217 10101110101100 3263 6 0.0370

116 0.0005 0.0193 0.0095 0.2083 0.0217 10101110100000 3263 7 0.0369

236 0.0005 0.0193 0.0096 0.2083 0.0217 10101100101000 3263 8 0.0377


402 0.0005 0.0193 0.0094 0.2083 0.0216 10001110101100 3263 9 0.0387

428 0.0007 0.0193 0.0094 0.2083 0.0217 10101101101100 3263 10 0.0355

258 0.0005 0.0193 0.0092 0.2083 0.0217 10001110101000 3263 11 0.0387

90 0.0005 0.0193 0.0094 0.2083 0.0216 10001100100000 3263 12 0.0433

118 0.0005 0.0193 0.0094 0.2083 0.0218 10011110100000 3263 13 0.0360

284 0.0005 0.0193 0.0094 0.2083 0.0217 10101101101000 3263 14 0.0355

140 0.0005 0.0193 0.0096 0.2083 0.0216 10101101100000 3263 15 0.0355
406 0.0005 0.0193 0.0096 0.2083 0.0217 10011110101100 3263 16 0.0360

92 0.0005 0.0193 0.0095 0.2083 0.0217 10101100100000 3263 17 0.0377

234 0.0005 0.0193 0.0094 0.2083 0.0217 10001100101000 3263 18 0.0433

380 0.0005 0.0194 0.0094 0.2083 0.0217 10101100101100 3263 19 0.0377

262 0.0005 0.0194 0.0094 0.2083 0.0217 10011110101000 3263 20 0.0360

114 0.0005 0.0194 0.0093 0.2083 0.0217 10001110100000 3263 21 0.0387

260 0.0005 0.0194 0.0095 0.2083 0.0217 10101110101000 3263 22 0.0370

138 0.0005 0.0194 0.0095 0.2083 0.0217 10001101100000 3263 23 0.0372

94 0.0005 0.0194 0.0094 0.2083 0.0217 10011100100000 3263 24 0.0381

End of Table B.10

Table B.11: Cross Prediction per model relative error arch i7-Arch to Core2-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1565 0.0194 4765.2090 7.5921 ALL DATA 47872776 * *

Amdahl 0.0309 0.0736 0.0635 0.8497 0.0478 10000000000000 3263 862 0.0779

[1701] 0.0209 0.2125 0.0481 1.1061 0.3492 11101110101111 3261 1511 0.0400

[1703] 0.0179 0.2001 0.0451 1.1131 0.3418 11011110101111 3261 1310 0.0389

[1725] 0.0220 0.2241 0.0537 1.0830 0.3568 11101101101111 3261 1652 0.0409

[1727] 0.0201 0.2059 0.0513 1.0861 0.3417 11011101101111 3261 1397 0.0413

1270 0.0159 0.0399 0.0345 0.2995 0.0205 10011110101101 3261 0 0.0355

1264 0.0136 0.0401 0.0346 0.3244 0.0205 10010110101101 3261 1 0.0358

1198 0.0160 0.0401 0.0347 0.3202 0.0206 10011110001101 3261 2 0.0371

1192 0.0162 0.0402 0.0349 0.3237 0.0206 10010110001101 3261 3 0.0376

335 0.0155 0.0402 0.0354 0.2941 0.0183 11011110001100 3263 4 0.0366

263 0.0163 0.0403 0.0356 0.2875 0.0183 11011110101000 3263 5 0.0356

119 0.0156 0.0403 0.0356 0.2884 0.0184 11011110100000 3263 6 0.0356

47 0.0157 0.0403 0.0355 0.2877 0.0183 11011110000000 3263 7 0.0366

191 0.0155 0.0403 0.0356 0.2874 0.0182 11011110001000 3263 8 0.0367


428 0.0181 0.0404 0.0351 0.3638 0.0217 10101101101100 3263 9 0.0355

140 0.0194 0.0404 0.0350 0.3704 0.0216 10101101100000 3263 10 0.0355

284 0.0194 0.0404 0.0351 0.3875 0.0217 10101101101000 3263 11 0.0355

41 0.0166 0.0404 0.0353 0.2865 0.0184 11010110000000 3263 12 0.0371

113 0.0162 0.0404 0.0352 0.2872 0.0184 11010110100000 3263 13 0.0359

118 0.0160 0.0405 0.0352 0.3224 0.0206 10011110100000 3263 14 0.0360

910 0.0157 0.0405 0.0355 0.3233 0.0205 10011110000001 3261 15 0.0373
407 0.0164 0.0405 0.0355 0.2862 0.0183 11011110101100 3263 16 0.0356

422 0.0196 0.0405 0.0351 0.3882 0.0221 10100101101100 3263 17 0.0357

401 0.0156 0.0405 0.0357 0.2868 0.0183 11010110101100 3263 18 0.0359

262 0.0154 0.0405 0.0354 0.3004 0.0205 10011110101000 3263 19 0.0360

1126 0.0163 0.0405 0.0353 0.3244 0.0206 10011110101001 3261 20 0.0358

982 0.0154 0.0405 0.0354 0.3237 0.0206 10011110100001 3261 21 0.0358

406 0.0154 0.0405 0.0353 0.2983 0.0204 10011110101100 3263 22 0.0360

329 0.0162 0.0405 0.0356 0.2941 0.0185 11010110001100 3263 23 0.0371

190 0.0157 0.0405 0.0354 0.3202 0.0206 10011110001000 3263 24 0.0375

End of Table B.11

Table B.12: Cross Prediction per model relative error arch i7-Arch to i7-Arch


Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0005 0.1565 0.0194 4765.2090 7.5921 ALL DATA 47872776 * *

Amdahl 0.0309 0.1133 0.0678 8.8358 0.2987 10000000000000 6666 861 0.0779

[1701] 0.0209 0.2401 0.0502 126.6917 2.2178 11101110101111 6658 1366 0.0400

[1703] 0.0170 0.2709 0.0473 267.2698 4.6403 11011110101111 6658 1410 0.0389

[1725] 0.0220 0.2414 0.0567 119.6859 2.0979 11101101101111 6658 1369 0.0409

[1727] 0.0201 0.2144 0.0549 63.5861 1.1461 11011101101111 6658 1294 0.0413

1270 0.0159 0.0445 0.0362 0.2995 0.0276 10011110101101 6658 0 0.0355

1264 0.0136 0.0446 0.0365 0.3244 0.0276 10010110101101 6658 1 0.0358

335 0.0155 0.0446 0.0371 0.2941 0.0248 11011110001100 6666 2 0.0366

428 0.0181 0.0446 0.0360 0.4916 0.0303 10101101101100 6666 3 0.0355

1198 0.0160 0.0446 0.0365 0.3202 0.0276 10011110001101 6658 4 0.0371

135 0.0192 0.0446 0.0368 0.2938 0.0248 11100101100000 6666 5 0.0353

47 0.0157 0.0446 0.0371 0.2877 0.0247 11011110000000 6666 6 0.0366

140 0.0188 0.0446 0.0358 0.4916 0.0301 10101101100000 6666 7 0.0355

263 0.0163 0.0446 0.0372 0.2875 0.0248 11011110101000 6666 8 0.0356


279 0.0191 0.0446 0.0367 0.2963 0.0248 11100101101000 6666 9 0.0354

191 0.0155 0.0446 0.0371 0.2874 0.0247 11011110001000 6666 10 0.0367

284 0.0189 0.0447 0.0359 0.4738 0.0303 10101101101000 6666 11 0.0355

429 0.0188 0.0447 0.0369 0.2969 0.0249 11101101101100 6666 12 0.0351

119 0.0156 0.0447 0.0371 0.2884 0.0249 11011110100000 6666 13 0.0356

141 0.0193 0.0447 0.0368 0.2946 0.0250 11101101100000 6666 14 0.0352

285 0.0195 0.0447 0.0369 0.2959 0.0248 11101101101000 6666 15 0.0352

1192 0.0162 0.0447 0.0367 0.3237 0.0276 10010110001101 6658 16 0.0376

41 0.0166 0.0447 0.0369 0.2865 0.0247 11010110000000 6666 17 0.0371

422 0.0187 0.0448 0.0360 0.4916 0.0305 10100101101100 6666 18 0.0357

407 0.0160 0.0448 0.0372 0.2862 0.0248 11011110101100 6666 19 0.0356

423 0.0192 0.0448 0.0368 0.2938 0.0249 11100101101100 6666 20 0.0354

113 0.0162 0.0448 0.0371 0.2872 0.0248 11010110100000 6666 21 0.0359

134 0.0188 0.0448 0.0360 0.4659 0.0306 10100101100000 6666 22 0.0358

329 0.0162 0.0449 0.0370 0.2941 0.0248 11010110001100 6666 23 0.0371

401 0.0156 0.0449 0.0372 0.2868 0.0248 11010110101100 6666 24 0.0359

End of Table B.12

B.2.3 Cross Prediction Relative Errors, SRA, per Architecture

The MWARE of cross-prediction (Min, Mean, Median, Max, Stddev) is listed on a per-model basis (Name, Index), together with the total number of data points for each model (Count). The models are ranked (Rank) according to their mean relative error, and the corresponding mean error achieved during curve fitting is presented (Mean Fit). ‘ALL DATA’ indicates the statistics for the overall categorical data. The top 25 models are listed. The simplest model, Amdahl’s Law, is included at the top for reference, as are the four most complicated models proposed.

Table B.13: Cross Prediction per model relative error arch Core2-Arch to Core2-Arch

Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0035 0.1580 0.0222 3757.7144 7.1728 ALL DATA 31761172 * *

Amdahl 0.0045 0.0300 0.0284 0.0976 0.0138 10000000000000 2028 862 0.0779

[1701] 0.0035 0.2093 0.0226 0.8598 0.3073 11101110101111 2026 1130 0.0400

[1703] 0.0039 0.2169 0.0231 0.8574 0.3113 11011110101111 2026 1381 0.0389

[1725] 0.0035 0.2183 0.0225 0.8594 0.3131 11101101101111 2026 1426 0.0409

[1727] 0.0035 0.2143 0.0225 0.8565 0.3108 11011101101111 2026 1290 0.0413

238 0.0035 0.0192 0.0179 0.0856 0.0089 10011100101000 2028 0 0.0381

262 0.0035 0.0192 0.0180 0.0833 0.0089 10011110101000 2028 1 0.0360

1294 0.0035 0.0192 0.0178 0.0865 0.0089 10011101101101 2028 2 0.0371

116 0.0035 0.0192 0.0179 0.0853 0.0089 10101110100000 2028 3 0.0369

427 0.0035 0.0192 0.0178 0.0865 0.0089 11001101101100 2028 4 0.0364

259 0.0035 0.0192 0.0179 0.0865 0.0089 11001110101000 2028 5 0.0386

380 0.0035 0.0192 0.0179 0.0878 0.0089 10101100101100 2028 6 0.0377

428 0.0035 0.0192 0.0179 0.0865 0.0090 10101101101100 2028 7 0.0355

119 0.0035 0.0192 0.0179 0.0839 0.0089 11011110100000 2028 8 0.0356

378 0.0035 0.0192 0.0180 0.0865 0.0089 10001100101100 2028 9 0.0432

90 0.0035 0.0192 0.0179 0.0855 0.0089 10001100100000 2028 10 0.0433

1270 0.0035 0.0192 0.0179 0.0865 0.0090 10011110101101 2028 11 0.0355

1266 0.0035 0.0192 0.0178 0.0853 0.0089 10001110101101 2028 12 0.0384

403 0.0035 0.0192 0.0180 0.0865 0.0090 11001110101100 2028 13 0.0386

1242 0.0035 0.0192 0.0179 0.0856 0.0090 10001100101101 2028 14 0.0433

260 0.0035 0.0192 0.0180 0.0842 0.0089 10101110101000 2028 15 0.0370

402 0.0035 0.0192 0.0181 0.0843 0.0089 10001110101100 2028 16 0.0387

139 0.0035 0.0192 0.0178 0.0842 0.0089 11001101100000 2028 17 0.0364

258 0.0035 0.0192 0.0179 0.0842 0.0090 10001110101000 2028 18 0.0387

379 0.0035 0.0192 0.0179 0.0865 0.0090 11001100101100 2028 19 0.0429

236 0.0038 0.0192 0.0179 0.0865 0.0089 10101100101000 2028 20 0.0377

286 0.0035 0.0192 0.0180 0.0842 0.0089 10011101101000 2028 21 0.0368

1244 0.0035 0.0192 0.0179 0.0842 0.0090 10101100101101 2028 22 0.0376

381 0.0035 0.0192 0.0178 0.0865 0.0089 11101100101100 2028 23 0.0370

405 0.0035 0.0192 0.0179 0.0839 0.0089 11101110101100 2028 24 0.0371

End of Table B.13

Table B.14: Cross Prediction per model relative error arch Core2-Arch to i7-Arch

Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0035 0.1580 0.0222 3757.7144 7.1728 ALL DATA 31761172 * *

Amdahl 0.0085 0.0315 0.0312 0.0976 0.0134 10000000000000 2079 857 0.0779

[1701] 0.0046 0.2165 0.0256 0.8598 0.3196 11101110101111 2075 1112 0.0400

[1703] 0.0045 0.2080 0.0248 0.8574 0.3171 11011110101111 2075 963 0.0389

[1725] 0.0045 0.2420 0.0249 0.8594 0.3349 11101101101111 2075 1626 0.0409

[1727] 0.0045 0.1982 0.0236 0.8565 0.3119 11011101101111 2075 900 0.0413

430 0.0045 0.0211 0.0196 0.0865 0.0097 10011101101100 2079 0 0.0368

378 0.0045 0.0211 0.0196 0.0865 0.0097 10001100101100 2079 1 0.0432

380 0.0051 0.0211 0.0197 0.0878 0.0097 10101100101100 2079 2 0.0377

90 0.0045 0.0211 0.0197 0.0855 0.0097 10001100100000 2079 3 0.0433

1268 0.0045 0.0211 0.0196 0.0865 0.0098 10101110101101 2077 4 0.0370

262 0.0045 0.0212 0.0197 0.0833 0.0097 10011110101000 2079 5 0.0360

116 0.0045 0.0212 0.0197 0.0853 0.0097 10101110100000 2079 6 0.0369

382 0.0045 0.0212 0.0197 0.0839 0.0097 10011100101100 2079 7 0.0381

427 0.0045 0.0212 0.0196 0.0865 0.0097 11001101101100 2079 8 0.0364

260 0.0045 0.0212 0.0196 0.0842 0.0097 10101110101000 2079 9 0.0370

142 0.0045 0.0212 0.0198 0.0832 0.0097 10011101100000 2079 10 0.0368

1242 0.0045 0.0212 0.0198 0.0856 0.0098 10001100101101 2077 11 0.0433

259 0.0045 0.0212 0.0196 0.0865 0.0097 11001110101000 2079 12 0.0386

402 0.0045 0.0212 0.0197 0.0843 0.0098 10001110101100 2079 13 0.0387

286 0.0046 0.0212 0.0197 0.0842 0.0097 10011101101000 2079 14 0.0368

428 0.0045 0.0212 0.0196 0.0865 0.0097 10101101101100 2079 15 0.0355

138 0.0045 0.0212 0.0197 0.0866 0.0098 10001101100000 2079 16 0.0372

1294 0.0045 0.0212 0.0195 0.0865 0.0098 10011101101101 2077 17 0.0371

381 0.0045 0.0212 0.0196 0.0865 0.0098 11101100101100 2079 18 0.0370

238 0.0045 0.0212 0.0197 0.0856 0.0098 10011100101000 2079 19 0.0381

118 0.0045 0.0212 0.0197 0.0842 0.0098 10011110100000 2079 20 0.0360

426 0.0045 0.0212 0.0197 0.0839 0.0097 10001101101100 2079 21 0.0372

258 0.0045 0.0212 0.0196 0.0842 0.0099 10001110101000 2079 22 0.0387

236 0.0045 0.0212 0.0197 0.0865 0.0097 10101100101000 2079 23 0.0377

285 0.0045 0.0212 0.0196 0.0839 0.0097 11101101101000 2079 24 0.0352

End of Table B.14

Table B.15: Cross Prediction per model relative error arch i7-Arch to Core2-Arch

Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0035 0.1580 0.0222 3757.7144 7.1728 ALL DATA 31761172 * *

Amdahl 0.0328 0.0699 0.0602 2.1074 0.0768 10000000000000 2079 861 0.0779

[1701] 0.0250 0.3010 0.0557 1.1544 0.4150 11101110101111 2075 1271 0.0400

[1703] 0.0232 0.2742 0.0568 1.1427 0.3970 11011110101111 2075 963 0.0389

[1725] 0.0258 0.4977 0.0591 173.9906 5.4067 11101101101111 2075 1672 0.0409

[1727] 0.0231 0.2809 0.0610 1.1479 0.3981 11011101101111 2075 1039 0.0413

135 0.0237 0.0486 0.0397 0.7206 0.0442 11100101100000 2079 0 0.0353

429 0.0237 0.0486 0.0396 0.7206 0.0443 11101101101100 2079 1 0.0351

285 0.0237 0.0486 0.0399 0.7206 0.0445 11101101101000 2079 2 0.0352

141 0.0237 0.0486 0.0400 0.7206 0.0442 11101101100000 2079 3 0.0352

69 0.0237 0.0487 0.0397 0.7677 0.0457 11101101000000 2079 4 0.0364

213 0.0236 0.0487 0.0396 0.7455 0.0447 11101101001000 2079 5 0.0364

351 0.0241 0.0487 0.0398 0.7459 0.0449 11100101001100 2079 6 0.0368

273 0.0237 0.0487 0.0394 0.7206 0.0443 11101001101000 2079 7 0.0477

63 0.0237 0.0488 0.0402 0.7429 0.0448 11100101000000 2079 8 0.0369

357 0.0236 0.0488 0.0397 0.7678 0.0456 11101101001100 2079 9 0.0364

279 0.0249 0.0488 0.0400 0.7206 0.0444 11100101101000 2079 10 0.0354

129 0.0246 0.0488 0.0395 0.7206 0.0442 11101001100000 2079 11 0.0478

423 0.0238 0.0488 0.0399 0.7388 0.0448 11100101101100 2079 12 0.0354

417 0.0247 0.0489 0.0396 0.7410 0.0447 11101001101100 2079 13 0.0478

201 0.0248 0.0489 0.0393 0.7432 0.0451 11101001001000 2079 14 0.0490

57 0.0234 0.0489 0.0394 0.7445 0.0447 11101001000000 2079 15 0.0489

345 0.0242 0.0489 0.0395 0.7428 0.0448 11101001001100 2079 16 0.0489

267 0.0248 0.0489 0.0396 0.7206 0.0444 11100001101000 2079 17 0.0480

411 0.0249 0.0489 0.0396 0.7206 0.0444 11100001101100 2079 18 0.0480

339 0.0238 0.0490 0.0397 0.7461 0.0450 11100001001100 2079 19 0.0494

207 0.0236 0.0490 0.0401 0.7437 0.0448 11100101001000 2079 20 0.0369

123 0.0249 0.0490 0.0396 0.7206 0.0445 11100001100000 2079 21 0.0480

51 0.0251 0.0490 0.0399 0.7453 0.0449 11100001000000 2079 22 0.0493

140 0.0236 0.0491 0.0392 1.7598 0.0661 10101101100000 2079 23 0.0355

284 0.0238 0.0491 0.0390 1.7598 0.0660 10101101101000 2079 24 0.0355

End of Table B.15

Table B.16: Cross Prediction per model relative error arch i7-Arch to i7-Arch

Index Min (Mean) Median Max Stddev Name Count Rank Mean Fit

* 0.0035 0.1580 0.0222 3757.7144 7.1728 ALL DATA 31761172 * *

Amdahl 0.0330 0.0680 0.0596 2.1074 0.0818 10000000000000 3008 862 0.0779

[1701] 0.0181 0.3086 0.0545 1.1544 0.4195 11101110101111 3004 1184 0.0400

[1703] 0.0193 0.2879 0.0557 1.1427 0.4052 11011110101111 3004 971 0.0389

[1725] 0.0164 0.3412 0.0578 1.1354 0.4293 11101101101111 3004 1560 0.0409

[1727] 0.0210 0.2981 0.0595 1.1479 0.4084 11011101101111 3004 1054 0.0413

135 0.0214 0.0450 0.0375 0.7586 0.0363 11100101100000 3008 0 0.0353

213 0.0203 0.0450 0.0377 0.7717 0.0367 11101101001000 3008 1 0.0364

429 0.0212 0.0450 0.0377 0.7586 0.0365 11101101101100 3008 2 0.0351

285 0.0214 0.0451 0.0379 0.7681 0.0370 11101101101000 3008 3 0.0352

351 0.0217 0.0451 0.0378 0.7682 0.0368 11100101001100 3008 4 0.0368

63 0.0216 0.0451 0.0380 0.7685 0.0366 11100101000000 3008 5 0.0369

141 0.0215 0.0451 0.0376 0.7586 0.0366 11101101100000 3008 6 0.0352

279 0.0215 0.0451 0.0379 0.7586 0.0364 11100101101000 3008 7 0.0354

69 0.0216 0.0451 0.0376 0.7686 0.0382 11101101000000 3008 8 0.0364

423 0.0212 0.0452 0.0377 0.7586 0.0373 11100101101100 3008 9 0.0354

357 0.0218 0.0452 0.0379 0.7723 0.0383 11101101001100 3008 10 0.0364

207 0.0215 0.0453 0.0383 0.7690 0.0366 11100101001000 3008 11 0.0369

273 0.0215 0.0453 0.0377 0.7586 0.0367 11101001101000 3008 12 0.0477

57 0.0209 0.0454 0.0376 0.7681 0.0368 11101001000000 3008 13 0.0489

345 0.0209 0.0454 0.0373 0.7710 0.0370 11101001001100 3008 14 0.0489

267 0.0215 0.0454 0.0374 0.7681 0.0369 11100001101000 3008 15 0.0480

129 0.0216 0.0454 0.0380 0.7586 0.0367 11101001100000 3008 16 0.0478

201 0.0209 0.0454 0.0373 0.7692 0.0371 11101001001000 3008 17 0.0490

411 0.0214 0.0454 0.0374 0.7688 0.0368 11100001101100 3008 18 0.0480

339 0.0218 0.0454 0.0379 0.7684 0.0368 11100001001100 3008 19 0.0494

417 0.0216 0.0454 0.0378 0.7709 0.0376 11101001101100 3008 20 0.0478

51 0.0216 0.0454 0.0376 0.7719 0.0369 11100001000000 3008 21 0.0493

123 0.0214 0.0455 0.0380 0.7586 0.0369 11100001100000 3008 22 0.0480

195 0.0215 0.0455 0.0382 0.7702 0.0383 11100001001000 3008 23 0.0493

993 0.0201 0.0458 0.0379 0.7698 0.0371 11101001100001 3006 24 0.0442

End of Table B.16
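For reference, the aggregation that produces each table row, per-model error statistics followed by a rank ordering on the mean relative error, can be sketched as below. This is an illustrative reconstruction, not the actual analysis code used for the dissertation; the function name and data layout are assumptions.

```python
import statistics

def summarize_models(errors_by_model):
    """Aggregate per-model relative errors into the statistics reported in
    Tables B.13-B.16 (Min, Mean, Median, Max, Stddev, Count), then rank
    models by mean relative error (Rank 0 = best, matching the tables).

    errors_by_model maps a model name (its bit-string identifier) to a
    list of absolute relative errors, |predicted - actual| / actual.
    """
    rows = []
    for name, errs in errors_by_model.items():
        rows.append({
            "Name": name,
            "Min": min(errs),
            "Mean": statistics.mean(errs),
            "Median": statistics.median(errs),
            "Max": max(errs),
            # Stddev is undefined for a single sample; report 0.0 instead.
            "Stddev": statistics.stdev(errs) if len(errs) > 1 else 0.0,
            "Count": len(errs),
        })
    # Rank models by ascending mean relative error.
    rows.sort(key=lambda row: row["Mean"])
    for rank, row in enumerate(rows):
        row["Rank"] = rank
    return rows
```

Under this scheme a reference model such as Amdahl's Law simply appears wherever its mean relative error places it in the ranking; the tables then list it separately at the top for comparison.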