Practical Approximation Algorithms for Clustering and Covering

a dissertation presented by Jessica May McClintock to The School of Engineering

in total fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computing and Information Systems

Student Number: 268135
The University of Melbourne
Melbourne, Australia
June 2017

Thesis advisor: Associate Professor Tony Wirth
Jessica May McClintock

Practical Approximation Algorithms for Clustering and Covering

Abstract

The aim of this thesis is to provide solutions to computationally difficult network optimisation problems on modern architectures. There are three types of problem that we will be studying – clustering, covering and ordering – all of which are computationally hard to solve, and in practice often involve very large data sets. These models are used in a range of real-world applications, and therefore we will investigate both the practical and theoretical aspects of solving these problems in big data contexts. The main approach we consider for solving such instances is to obtain polynomial-time approximation algorithms, which efficiently solve these problems to within some constant factor of optimality. We also introduce several heuristics for a type of scheduling problem with graph-based constraints and demonstrate their performance in a practical setting, before providing an approximation algorithm and hardness results for a formalised version of the problem. For instances on big data, where the computational bottleneck is the available RAM, we consider models for algorithm design that would allow such instances to be solved. For this purpose, we also design clustering algorithms using the MapReduce paradigm for parallelisation, giving experimental evaluations of their performance in comparison to existing approaches.

Contents

0 Introduction
0.1 Clustering and Parallel Algorithms
0.2 Graph Covering Problems
0.3 Test-case Prioritisation with Precedences
0.4 Min-Sum Set Cover

1 Related Work
1.1 Optimisation and Approximation
1.2 Graph Theory
1.3 Parallelism and Clustering Problems
1.4 Graph Covering Problems
1.5 The Min-Sum Set Cover Problem
1.6 Scheduling and Prioritisation with Precedences
1.7 Summary

2 Efficient Clustering with MapReduce
2.1 Clustering with MapReduce
2.2 Parallel k-centre
2.3 Analysis of EIM sampling
2.4 Runtime
2.5 Experiments
2.6 Results
2.7 Conclusions

3 Parallel Coverings of Graphs
3.1 MapReduce for Covering Problems
3.2 Tree Covering Problems
3.3 The Minimum Path Cover Problem
3.4 The k-star Covering Problem
3.5 Hardness of k-star Covering

4 Scheduling with Precedences
4.1 Test-case Prioritisation with Precedences
4.2 Our Contributions
4.3 Algorithms and Analysis
4.4 Special Cases
4.5 Experiments
4.6 Results
4.7 Conclusions

5 Min-Sum Set-Cover with Precedences
5.1 Our Contributions
5.2 Max-Density Precedence-Closed Subgraphs
5.3 An Algorithm for precMSSC
5.4 Hardness Results
5.5 Conclusions

6 Summary

References

List of Figures

1.1 Gonzalez k-centre solution for k = 3
1.2 An example rooted k-tree covering
1.3 An example rooted k-path covering
1.4 An example clustering solution
1.5 An example k-star covering
1.6 A bad assignment for k-star
1.7 An improved assignment for k-star

2.1 Flowchart for MRG
2.2 A bad assignment and seeding for MRG
2.3 A bad sample for MRG
2.4 An example tight solution for MRG
2.5 A point being satisfied by the sample
2.6 Satisfied and unsatisfied points
2.7 Comparison of performance over k on real datasets
2.8 Runtimes over k
2.9 Average solution value over k on synthetic data
2.10 Runtime for k = 25 over a range of n
2.11 Runtime over k on synthetic data

3.1 Star covering on a line metric

4.1 Fault detection histogram for default ordering
4.2 Fault detection histogram for fault-coverage based ordering
4.3 Fault detection histogram for code-coverage based ordering
4.4 An example precedence graph
4.5 Fault detection histogram for precedence-constrained ordering
4.6 Combined APFD and AF results for real and synthetic dependencies
4.7 APFD scores for real data sets
4.8 AF scores for real data sets
4.9 APFD and AF for data sets with synthetic dependencies
4.10 APFD for Siemens with synthetic dependencies
4.11 AF for Siemens with synthetic dependencies
4.12 Lookahead comparison for gsm2
4.13 Lookahead comparison for tot info
4.14 Lookahead comparison for replace
4.15 Lookahead comparison for print tokens2

5.1 A bad precedence graph when δ < 1
5.2 An example in-tree
5.3 An example out-tree
5.4 Histograms for greedy and OPT

List of Tables

1.1 Current best-known solutions
1.2 Our contributions

2.1 Comparison of algorithms for k-centre
2.2 Average solution value over k on Gau (n = 1,000,000, k₀ = 25)
2.3 Average solution value over k on Unif (n = 100,000)
2.4 Average solution value over k on UnB (n = 200,000, k₀ = 25)
2.5 Average solution over φ
2.6 Average runtime over φ

4.1 An example defect-coverage matrix
4.2 An example code-coverage matrix
4.3 Metrics for real systems
4.4 Metrics for systems with synthetic precedences

List of Algorithms

1 MapReduce-MST(V, E)
2 GON(V, k)
3 k-means(V, k)
4 Greedy(S, U, w)
5 MRG(V, k, m)
6 EIM-MapReduce-Sample(V, E, k, ε)
7 Select(H, S)
8 k-Tree-Cover(G, k, B)
9 Min-Path-Cover(G, B)
10 Minimum Star Cover
11 Topological-sort based greedy
12 Coverage-first greedy
13 greedy-subgraph(P = (S, E), U)
14 greedy-subgraph+(P = (S, E), U)
15 subgraph-outtree(T = (S, E), v, Rv)
16 mssc-greedy(S, P, U)

Acknowledgments

Many thanks to Tony Wirth for these years of advice and assistance as my supervisor – without his support and encouragement I certainly would not be where I am today. His attention to detail and discerning advice have provided many lessons that will be long remembered and appreciated.

Thanks to Tim Miller for introducing me to the fascinating problem of scheduling with precedences, and providing datasets for the experiments. Thanks to Julian Mestre for providing direction and insight on the min-sum set cover problem. And thanks to Andrew Turpin for providing diversions along the way.

Many thanks to my family and friends for their support and tolerance along this journey, particularly to Robert Marshall for proof reading and encouragement, and to Helen and Diana for providing reassurance and distraction.

And finally a special thank you to my mother, Karen McClintock, who has been endlessly supportive even in times of uncertainty.


Declaration

This is to certify that:

1. The thesis comprises only my original work towards the PhD except where indicated in the preface;

2. Due acknowledgement has been made in the text to all other material used;

3. The thesis is fewer than the maximum word limit in length, exclusive of tables, maps, bibliographies and appendices.

Signed,

Jessica McClintock Date

Preface

The original work in this thesis is related to the following papers, one of which has been peer reviewed and appeared in conference proceedings.

• McClintock, J., and Wirth, A., Efficient Parallel Algorithms for k-Center Clus- tering, International Conference on Parallel Processing (ICPP), 2016.

– The content of this paper forms the original work of Chapter 2.
– The algorithm presented in section 2.2 appeared first in this paper.
– The experimental results presented in section 2.5 appeared first in this paper.

• McClintock, J., Miller, T., and Wirth, A., Prioritisation of Test Suites Contain- ing Precedence Constraints, submitted to Journal of Systems and Software (JSS).

– The content of this paper forms the original work of Chapter 4.
– The algorithms presented in section 4.3 appeared first in this paper.
– The experimental results presented in section 4.5 appeared first in this paper.

• McClintock, J., Mestre, J., and Wirth, A., Approximations for Precedence Con- strained Min-Sum Set Cover, International Symposium on Algorithms and Compu- tation (ISAAC), 2017.

– The content of this paper forms the original work of Chapter 5.
– The algorithm presented in section 5.3 appeared first in this paper.



0 Introduction

In this thesis, we consider a range of algorithms for solving combinatorial optimisation problems, with a focus on problems that can, to some degree, be modelled by graphs. There are three types of problem that we will study – clustering, covering and ordering – although the distinction between these types will often be blurred. Each of these models can be used to represent a range of real-world applications, and therefore we investigate both the practical and theoretical aspects of these problems.

When solving optimisation problems, the objective is typically to optimise a property of the data – either to maximise or minimise some function. As the problems covered in this thesis are all NP-hard – that is, fundamentally difficult and unlikely to be solvable exactly in polynomial time – we instead look for approximation algorithms, which provide provably good solutions in a more feasible timeframe. In some cases, we utilise heuristics based on assumptions about the underlying data to identify solutions that are likely to be good, but which lack theoretical guarantees. For such instances, we instead rely on experimental verification to support the performance of our procedures. We also consider alternative models of computation; in particular, methods of processing data in parallel that help us overcome restrictions on the size of the datasets that are feasible to analyse.

0.1 Clustering and Parallel Algorithms

Clustering is defined as the process of dividing a set of points into partitions (clusters); elements that share a cluster should be more closely related than elements in different clusters. There are many different types of clustering objectives, representing the wide range of applications for such models. When clustering data, we require knowledge of the “dissimilarity”, or distance, between element pairs; this can be represented as a function mapping pairs of points to real numbers, or as weighted edges in a complete graph.

Classical Clustering Problems

We primarily consider two classical clustering problems in which clusters are defined by specific points referred to as centres, with the remaining points being assigned to clusters based on their distance from the centres. Given a complete graph with metric-weighted edges and an integer k, we can define the following clustering problems:

• The k-median problem aims to select a set of k centres from the dataset to minimise the sum of the distances between each point and its assigned centre. The key task is to choose the optimum set of k centres, as the remaining vertices would always be assigned to their nearest centre.

• The k-centre problem aims to find a set of at most k vertices such that the maximum distance from a vertex to its assigned centre is minimised. As with the k-median problem, it is only necessary to find an optimal set of centres, as each vertex will be assigned to its nearest centre.

Informally, k-centre focuses on the worst connection, whereas k-median considers the average connection. Both of these problems are NP-hard – it is not possible to obtain exact solutions in polynomial time unless P = NP. Research on these problems therefore focuses on fast approximation algorithms, ensuring that the results are within some constant factor of the optimal solution. For k-centre there are several 2-approximation algorithms, and this approximation is known to be tight; contemporary research for k-median still makes incremental improvements to the approximation ratio.

Parallel Algorithms

While approximation algorithms are useful in providing near-optimal results with sub-exponential running time, often the datasets involved are large enough to require prohibitively large amounts of RAM. There are several possible approaches for dealing with large datasets, such as splitting the data across multiple machines to be processed in parallel, using multiple processors to read from a shared memory, or using a data stream model to restrict how the data is read. We focus on parallel architectures for analysing clustering problems on large data structures, with particular emphasis on the MapReduce model.

MapReduce

A widely used method for parallel computing is the MapReduce paradigm [63]. MapReduce is a scheme for parallel processing of partitions of the data, overcoming both the time and memory restrictions that arise with large data sets. For a theoretical overview of MapReduce, refer to the work of Karloff et al. [118], and to Chapter 2 of this thesis. Briefly, the MapReduce paradigm involves processing data in several rounds of parallel and sequential computations to efficiently solve big data problems. A sequential round is similar to a data stream – individual points are processed independently, and the results sent to one or more of the parallel machines. During parallel rounds, each machine processes the data it was sent, in isolation from the other machines. By allowing several iterations of these two types of procedure, the results of a parallel processing stage can be aggregated before being processed again. Note that the functions performed during each of the parallel and sequential rounds may differ between each iteration. A primary metric for evaluating the performance of MapReduce algorithms is the number of iterations of parallel and sequential procedures. Sending data back and forth across parallel processors can be costly, so algorithms requiring a large number of rounds can become impractical even with the benefits of parallelism. Consideration therefore needs to be given to the trade-off between computation and communication costs.

Contributions

We present both theoretical and experimental results regarding our algorithm for adapting the k-centre problem to the MapReduce paradigm. Our experiments utilise generated datasets following a schema introduced in prior work [72] to ensure consistency with the existing literature, as well as real datasets. The contributions of this chapter are as follows:

• Adapted a MapReduce implementation method for the k-centre problem, with a 4-approximation guarantee under certain size constraints.

• Extended a sampling-based algorithm to a family of algorithms with trade-offs between accuracy and speed.

• Wrote Python and C implementations of both the 4-approximation algorithm and the sampling-based MapReduce schemes.

• Ran tests to measure the performance of the sampling procedure under different values of the sampling parameters.

• Generated artificial data sets to compare performance of algorithms over various properties, such as the number of inherent clusters and the number of dimensions.

• Ran experiments evaluating the efficiency and accuracy of the above algorithms on both real and synthetic data.

0.2 Graph Covering Problems

The k-star problem is an example of a multi-objective clustering problem, in which the goal is to cover a given weighted graph G with stars – each consisting of a cluster centre and a set of neighbouring edges – such that the maximum sum of the included edge weights across the stars is minimised. This problem has been referred to as min-max star cover [8], rooted k-star cover [74] and minimum load k-facility location [2]. A similar problem is k-tree cover, in which the k selected nodes are roots of trees that cover the graph. Here we aim to find a set of k trees such that all vertices in G are covered by a tree, and the maximum cost of a tree is minimised. Both of these models have been motivated by the problem of nurse station location, in which the goal is to minimise the last completion time amongst a group of nurses visiting patients [74]. For both k-star and k-tree covering of graphs, it is necessary to find an assignment of points to the centres. There are cases where nearest-centre assignment performs arbitrarily poorly for this objective. While we typically model k-star as a clustering problem on a metric weighted graph, this problem is also closely related to that of scheduling k parallel asymmetric machines to minimise the maximum processing time (makespan). Once the set of k centres has been chosen, the task of assigning the remaining points to the centres can be modelled as assigning jobs to k machines. The edge weights between points and centres represent the processing times on the different machines, and the sum of edges in a star represents the processing time of the jobs assigned to that machine. As with many clustering problems, solving these assignment-based covering problems is often constrained by the size of the datasets. Therefore we adapt some existing k-star and k-tree algorithms to parallel frameworks, and identify special cases where solutions can be found more easily.

Contributions

This is a purely theoretical chapter in which we introduce a series of small results demonstrating how several graph covering algorithms can be parallelised. The contributions are as follows:

• Adapted an algorithm for the k-tree problem to the MapReduce framework.

• Gave bounds on the performance of k-median solutions as approximations for the k-star problem.

• Described how the k-path, k-tree and k-star problems can be solved in parallel.

• Identified a range of special cases in which better performance bounds can be achieved for the k-star problem.

0.3 Test-case Prioritisation with Precedences

Machine Scheduling

The clustering problems covered in the previous section each involved assignment-based sub-problems which can be modelled in terms of parallel machine scheduling. As with clustering, scheduling problems involve assigning points to some fixed number of groups – in this case, we are assigning jobs to machines, and generally there is no concept of distance between points. Here weights are assigned to the jobs, representing concepts such as processing time and profit. The objective is usually to optimise some function measuring the amount of work achieved or time taken. Machine scheduling with precedence constraints is a classic problem that naturally models many real-world applications. Often performing some task will be dependent on having completed some set of other tasks, or we might split a larger task into a dependent series of smaller tasks. The precedence constraints between jobs in a scheduling problem can be considered as a partial ordering of the jobs, and can be modelled as a directed acyclic graph. We consider the problem of single machine scheduling with precedence constraints, with the goal of optimising a specific objective function designed for a software testing application. The standard machine scheduling setup involves a set of jobs with individual processing times that need to be assigned to some number of machines. These sorts of models are useful for analysing scenarios in which there are a fixed number of resources for doing discrete amounts of work – for example, in parallel computing when we assign data to multiple machines. We consider models of machine scheduling in which profits are not fixed, but instead depend on the schedule.

Software Testing

Software testing often involves executing a large number of arduous tests, and there has been significant research on how to most efficiently execute these tests. The standard setup is that some number of tests need to be run in order to locate faults in a software system. Each test has the potential to identify some subset of the faults, and faults need to be identified as early in the testing as possible. Test-case prioritisation is the process of ordering the tests so that faults can be identified “as quickly as possible” in the testing process. Recent work has found that many test suites have underlying relationships – precedences – that must be met to ensure that the tests will accurately identify faults. A common objective in test-case prioritisation is the Average Percentage of Faults Detected (APFD) [167] measure, which indicates the rate at which faults are identified. The primary distinction from machine scheduling is that in software testing we can consider the jobs (or tests) as sets containing the faults that the test can identify. The APFD measure does not focus on how long the jobs are waiting, but on the time taken to identify each of the faults. The value of a test is therefore not constant, but instead depends on which tests are performed before it. We can consider the profit of each test as a schedule-dependent cost function, indicating the number of faults it can identify that were not covered earlier in the test sequence. As a further complication, for many applications the fault-coverage of the tests is unknown. Other features of the test suite are therefore used in its place to estimate the coverage of each of the tests. We analyse the accuracy of using code-coverage and dependency structures as estimators for fault-coverage of tests, using a range of greedy heuristics for increasing the APFD for the purpose of software testing.
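To make the APFD objective concrete, the sketch below computes it for a given ordering using one standard formulation of the measure, APFD = 1 − (TF1 + ... + TFm)/(nm) + 1/(2n), where n is the number of tests, m the number of faults and TFi the position of the first test revealing fault i; the fault matrix, test names and the handling of undetected faults are illustrative assumptions rather than details taken from this chapter.

    def apfd(ordering, fault_matrix):
        # Average Percentage of Faults Detected for a test ordering.
        # ordering:      list of test identifiers in execution order
        # fault_matrix:  dict mapping each test identifier to the set of faults it detects
        n = len(ordering)
        faults = set().union(*fault_matrix.values())
        m = len(faults)
        first_detection = {}
        for position, test in enumerate(ordering, start=1):
            for fault in fault_matrix.get(test, set()):
                first_detection.setdefault(fault, position)
        # Faults never detected are simply charged the worst position n + 1 here
        # (a simplifying assumption for this illustration).
        total = sum(first_detection.get(f, n + 1) for f in faults)
        return 1.0 - total / (n * m) + 1.0 / (2 * n)

    # Illustrative data: three tests, four faults.
    coverage = {"t1": {"f1", "f2"}, "t2": {"f3"}, "t3": {"f2", "f4"}}
    print(apfd(["t1", "t3", "t2"], coverage))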

Contributions

Motivated by applications in software testing, we introduce a novel variant on machine scheduling with precedence constraints. We provide several heuristics for finding solutions to this problem and an experimental evaluation of the performance of these algorithms in a range of real contexts.

1. Introduced a model of machine scheduling with precedences for the purpose of test-case prioritisation, representing jobs as (possibly overlapping) sets.

2. Developed several greedy heuristics for identifying feasible solutions to this problem, for which we give time complexity results.

3. Tested the heuristics on several datasets with real application in the software testing field, in cases with both complete and partial information.

4. Compared the greedy algorithms to current standards in this field to determine which achieve higher rates of fault detection.

5. Evaluated the performance of three alternative lookahead schemes with the greedy algorithms.

6. Tested the performance of the algorithms on instances where complete fault information is not available, and analysed how performance is impacted by deviations between code-coverage and defect-coverage data.

0.4 Min-Sum Set Cover

In software testing, we model test cases as sets of identifiable faults, which differentiates the test-case prioritisation problem significantly from more standard machine scheduling models. We can instead model this problem in terms of set cover, which is a classic NP-hard optimisation problem. We look into variants of the set cover objective that model coverage of linear orderings, and use them as a framework for analysing test-case prioritisation.

The Set Cover Problem

The standard model of the set cover problem involves selecting a minimal number of sets from a larger collection such that every element occurring in the collection is contained in one of the selected sets. This problem – one of the fundamental problems in theoretical computer science – is NP-hard, with the best polynomial approximations achieving a logarithmic approximation factor. Common variations on this problem include adding weights to the sets, and restricting the number of elements per set. Some set cover objectives require solutions to provide a linear ordering of the sets – of particular interest for our purposes is the Min-Sum objective, which focuses on minimising the total of the elements' coverage times. A greedy algorithm for min-sum set cover gives a 4-approximation, which is best-possible unless P = NP [77]. While ostensibly a variant on the set cover problem, the solutions obtained can easily be interpreted in a machine scheduling framework. The Min-Sum Set Cover problem (MSSC) can be used as a theoretical model for test-case prioritisation, as the min-sum objective closely relates to the APFD of a schedule. Bridging the gap between set cover and scheduling models, we consider the problem of solving MSSC with precedence constraints. The related problem of single machine scheduling to minimise the sum of completion times allows a greedy 2-approximation [46] – we give greedy approximations for the min-sum objective under this model, along with corresponding hardness results.
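To illustrate the greedy rule behind this 4-approximation, the following Python sketch repeatedly selects the set covering the most still-uncovered elements and reports the min-sum objective value; the input collection is illustrative.

    def greedy_min_sum_set_cover(sets):
        # Greedy rule for min-sum set cover: at each position pick the set that
        # covers the largest number of still-uncovered elements.
        uncovered = set().union(*sets.values())
        remaining = dict(sets)
        ordering, cost, position = [], 0, 0
        while uncovered and remaining:
            position += 1
            best = max(remaining, key=lambda s: len(set(remaining[s]) & uncovered))
            newly = set(remaining.pop(best)) & uncovered
            ordering.append(best)
            cost += position * len(newly)   # each newly covered element is covered at this position
            uncovered -= newly
        return ordering, cost

    # Toy instance: elements 1-3 are covered at position 1 and 4-6 at position 2.
    collection = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
    print(greedy_min_sum_set_cover(collection))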

Contributions

The results in this chapter are purely theoretical, and include new algorithms for solving the precedence-constrained Min-Sum Set Cover problem (precMSSC), worst-case bounds on the performance of our algorithms, and an analysis of the complexity of this problem.

• Introduced precedence constraints to the well-known Min-Sum Set Cover problem, and demonstrated how this problem can be modelled in terms of a scheduling problem with AND/OR-precedences.

• Identified special cases for which better approximations can be achieved, and developed procedures for solving such instances.

• Developed an approximation algorithm for solving precMSSC.

• Proved bounds on both the approximation ratio and the runtime complexity for our new algorithm.

• Proved that an algorithm providing better-than-polynomial approximations for precMSSC would contradict the planted dense subgraph conjecture.

Glossary of Problems

Dominating Set Given a graph G = (V, E) and an integer k, find a set S ⊆ V of size at most k such that all vertices in V are either in S, or adjacent to a vertex in S.

Edge Cover Given a graph G = (V, E), find a minimal subset S ⊆ E such that every vertex in V is incident on an edge in S.

k-Centre Clustering Given a complete graph G = (V, E) with metric edge weights and an integer k, select a set S of k centres from V such that the maximum distance from a data point to the nearest centre in S is minimised.

k-Median Clustering Given a complete graph G = (V, E) with metric edge weights and an integer k, select a set S of k points from V such that the sum of distances from each vertex to its nearest point in S is minimised.

k-Star Cover Given a complete graph G = (V, E) with metric edge weights and an integer k, find a set S of k “stars” on G (where a star is a single vertex and a subset of the incident edges) such that every vertex in V is in or incident to a star in S, and the maximum sum of edge weights of a star in S is minimised.

Minimum Star Cover Given a complete graph G = (V, E) with metric edge weights and an integer B, find a minimal set S of stars covering V such that the sum of edge weights in S is at most B.

Min-Sum Set Cover Given a collection of sets S covering a universe U, find an ordering S1, S2, ..., Sm of the sets in S such that ∑_{i∈U} C(i) is minimised (where C(i) is the minimal index j such that i ∈ Sj).

Parallel Machine Scheduling Given a set J of jobs, a set of m machines and an objective function f, find the schedule S assigning jobs in J to the m machines such that f(S) is optimal.

Set Cover Given a collection of sets S covering a universe U, find a minimal sub-collection S̄ ⊆ S such that ∪_{A∈S̄} A = U.

Single Machine Scheduling Given a set J of jobs and an objective function f, find an ordering S of the jobs in J such that f(S) is optimal.

Test Case Prioritisation Given a set T of tests, a fault matrix F indicating the faults each t ∈ T identifies, and a function f indicating the rate of fault detection for a given permutation of T, find a permutation π of T such that f(π) is optimal.

Vertex Cover Given a graph G = (V, E), find a minimal subset S ⊆ V such that for all (u, v) ∈ E either u ∈ S or v ∈ S.

1 Related Work

In this chapter we introduce the main subjects of our research, as well as a range of related topics to provide context and motivation for our original contributions. This includes a range of fundamental theorems that we build upon or were inspired by, and several related models that can be used as alternative means of representing the problems we consider. We will introduce the key data structures that form the basis of our analysis such as graphs, schedules and sets. We focus on computationally difficult optimisation problems on these data structures, and the variety of techniques that have been used to find solutions to these problems.

1.1 Optimisation and Approximation

The primary focus of this work is on providing efficient and effective solutions to computationally difficult graph-based optimisation problems. Graphs are a fundamental mathematical structure for modelling pairwise relations, and are used to represent numerous real-world problems. There are a wide range of optimisation problems modelled on graphs, where the goal is to identify the “best” among a large number of possible solutions. We investigate a range of efficient methods for implementing algorithms for graph based optimisation problems.


Optimisation Problems

We will primarily consider optimisation problems, a class of problems in which the goal is to identify the best (or optimal) solution amongst a large number of feasible solutions. This generally involves finding a solution that maximises or minimises some objective function – a mathematical formulation that somehow captures the value of a solution. Typically, the number of feasible solutions is exponential in the input size – otherwise an exhaustive search of the solution space might be sufficient to identify the optimal solution in polynomial time. Often it is NP-hard to obtain exact solutions to an optimisation problem. However, for many applications, it is sufficient to provide solutions that are near optimal, particularly if we can also give a guarantee of how far from optimal the solution might be.

Definition 1. An α-approximation algorithm (for a constant α) is an algorithm which runs in polynomial time and produces a solution with value no worse than α times the value of the optimal solution.

Constant-factor approximations ensure that the solution is no worse than α times the optimal solution, for some constant α. APX is the complexity class of NP-hard optimisation problems that allow polynomial-time constant-factor approximation [13]. For many cases, it is even NP-hard to obtain good approximations to optimisation problems – there are many problems (such as Set Cover [76]) that only allow logarithmic-factor approximations, or worse. There are many strategies for designing approximation algorithms. For such computationally difficult problems, it is often preferable to find an approximation algorithm rather than solving for the optimal solution.

Definition 2. A Polynomial-time Approximation Scheme (PTAS) is an algorithm which, for a given optimisation problem and a value ε > 0, produces (in polynomial time) a solution with value no more than (1 + ε) times optimal, for a minimisation problem, or no less than (1 − ε) times optimal, for a maximisation problem.

Not all NP-hard problems admit a PTAS – often the best-possible polynomial-time algorithms can only achieve a constant-factor approximation, or worse. If P ≠ NP, then there exist problems in APX that do not admit a PTAS [59]. A PTAS reduction is a reduction that preserves the property that the problem admits a PTAS. A problem is APX-hard if there is a PTAS-reduction from all problems in APX to the given problem. There are many techniques for identifying solutions to optimisation problems. Linear programming is a fundamental topic in operations research, which provides a polynomial-time approach for solving optimisation problems. However, the solutions given by linear programming are often fractional, which is not immediately applicable for discrete optimisation problems. It is possible to force the solutions to be integral, as with integer programming; however this might increase the complexity, as integer programming is NP-hard [119], even though linear programming is in P. The linear relaxation of an integer program is a linear program equivalent to the initial integer model but with the integer constraints changed to allow a range of real values. A common way of obtaining a polynomial-time approximation algorithm for integer programming problems is LP-rounding, which takes the fractional solution obtained by linear programming and uses it to obtain an integral solution. There is no fixed approach for LP-rounding, as the techniques used in rounding, and in bounding a solution, vary according to the problem [183].

Definition 3. The integrality gap of an integer program is the worst-case ratio of the value of the optimal integral solution to the value of the optimal solution of its linear relaxation.

Since the linear relaxation of an integer program yields a (possibly fractional, and hence infeasible) solution whose value bounds the optimum, the integrality gap gives a bound on the approximation ratio that rounding a given linear program can provide.
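As a small worked illustration (a standard textbook example, not drawn from this thesis), consider the usual linear program for vertex cover on a triangle: minimise x_u + x_v + x_w subject to x_a + x_b ≥ 1 for each edge (a, b), with each variable in [0, 1]. The fractional assignment x_u = x_v = x_w = 1/2 is feasible and has value 3/2, while any integral cover must contain two vertices and so has value 2; the integrality gap on this instance is therefore 2/(3/2) = 4/3, and no rounding scheme based purely on this relaxation can guarantee a factor better than the gap.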

1.2 Graph Theory

Graphs are a data structure for representing (typically) binary relations between items. (Undirected) graphs are often modelled as a pair G = (V, E), consisting of a vertex set V, and an edge set E of two-element subsets of the vertex set V. A graph G′ = (V′, E′) is considered to be a subgraph of graph G = (V, E) if V′ ⊆ V and E′ ⊆ E. For a subset V′ ⊆ V of vertices, the subgraph induced by V′ consists of V′ together with the edges in E that have both endpoints in V′.

A path in a graph G from vertex v to vertex u is a non-repeating alternating sequence of incident vertices and edges in G that starts at v and ends at u. A graph is connected if there is a path between every pair of vertices, and it is complete if there is an edge between every pair of vertices. There are many variants building on the standard graph model that allow us to represent more complex relations between elements. For example, multigraphs allow multiple (parallel) edges between each vertex pair, and edges that relate a node to itself (self-loops). Hypergraphs allow edges to be sets of more than two nodes (hyperedges). Weighted graphs assign weights to the edges of the graph, while directed graphs (or digraphs) add direction to the edges. For many applications of weighted graphs, it is common to assume that the weight function is metric.

Definition 4. A weight function d : A × A → R is a metric if the following properties hold:

1. Symmetry: d(x, y) = d(y, x) ∀x, y ∈ A.

2. Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z) ∀x, y, z ∈ A.

3. Positive definiteness: d(x, y) = 0 =⇒ x = y ∀x, y ∈ A.

4. Non-negativity: d(x, y) ≥ 0 ∀x, y ∈ A.

A common metric is the Euclidean metric, which is common in routing and planning as it represents distances between points in R^k. Another example is a line metric, which represents distances between points on a line, R^1. Often, problems on metric weighted graphs are easier to solve than the general case, as the triangle inequality is useful for bounding approximations. For example, both the facility location problem and the k-median problem allow constant-factor approximations in the metric case, but only allow O(log n)-approximations in the non-metric case [141; 114]. Many problems in graph theory involve graphs that are bipartite. In a bipartite graph G = (V, E), the vertex set V can be partitioned into two disjoint sets A and B, and E ⊆ {(u, v) | u ∈ A, v ∈ B}: that is, there are only edges between A and B. Bipartite graphs are often used when modelling problems in which there are two different categories of items being related. For example, we could use a bipartite graph to model a timetabling problem, in which set A represents classes, set B represents available time-slots, and edges in E indicate the time-slots in which a class can be scheduled.

Definition 5. A Directed Acyclic Graph (DAG) is a directed graph in which there is no directed path from a node back to itself.

Directed acyclic graphs are useful for modelling flow problems, precedence relations, and other structures in which it is necessary to avoid cyclic dependencies. DAGs can be related to partial orderings of data, as they indicate a hierarchy over the elements. A topological ordering of a DAG is a linear ordering of the vertices such that for every directed edge (u, v) in the graph, u is before v in the ordering. Topological sorting of a DAG is the process of ordering the vertices such that they are topologically ordered, and can be done in polynomial time. Kahn [115] gave a topological sort algorithm using a breadth-first search strategy, while Tarjan [181] introduced an alternative approach based on a depth-first search of the graph. The problem of topologically sorting a DAG is in the complexity class P [56; 115] – that is, it can be solved in polynomial time using a deterministic Turing Machine. Problems that can be solved in polynomial time on a non-deterministic Turing Machine are in NP – it is not known whether P = NP or if there are problems in NP that are not in P, however it is generally believed that NP represents a class of problems harder than those in P. Many of the problems we will consider are not known to be in P; moreover, they are NP-hard. Such problems are at least as hard as even the hardest problems in NP, as all problems in NP can be reduced to an NP-hard problem in polynomial time. Problems which are NP-hard and also in NP are called NP-complete, and are the hardest problems within NP. All NP-complete problems can be mapped to each other in polynomial time, so we can consider them as being equally difficult, and closely related. Ordering and scheduling a set of elements is often an objective in NP-hard optimisation problems that can be modelled with DAGs. In such problems, there may be some precedence relations between the items where it is required that some set of items occur before a given item. These precedence relations define a partial ordering of the items, and the precedence constraints can be represented as a directed acyclic graph (DAG), with an edge from node i to node j if item i is required to occur before item j. We can add precedence constraints to many of the ordering problems considered above.

Definition 6. A precedence graph is a directed acyclic graph (DAG) P = (V, E), in which V is a set of vertices and E is a set of edges (i, j) for each dependency i ≺ j, where j can not precede i.
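A minimal Python sketch of the breadth-first (Kahn-style) topological sort discussed above, applied to a precedence graph of this form; the function name and the small example graph are illustrative.

    from collections import deque

    def topological_sort(vertices, edges):
        # Kahn's algorithm: repeatedly output a vertex with no unprocessed predecessors.
        successors = {v: [] for v in vertices}
        indegree = {v: 0 for v in vertices}
        for i, j in edges:                  # edge (i, j) means i must occur before j
            successors[i].append(j)
            indegree[j] += 1
        queue = deque(v for v in vertices if indegree[v] == 0)
        order = []
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in successors[v]:
                indegree[w] -= 1
                if indegree[w] == 0:
                    queue.append(w)
        if len(order) != len(indegree):
            raise ValueError("graph contains a directed cycle")
        return order

    # a must precede b and c; b must precede d.
    print(topological_sort("abcd", [("a", "b"), ("a", "c"), ("b", "d")]))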

1.3 Parallelism and Clustering Problems

Clustering involves identifying groupings of similar objects, and is a fundamental task in interpreting data sets from a range of applications, such as social networking, event recognition, data mining and bioinformatics. Generally, clustering problems involve optimising some function that indicates how well the clusters represent underlying structures in the data. Many clustering problems are NP-hard to solve exactly, and therefore it is sensible to seek a polynomial-time approximation algorithm. For many applications, the data sets we wish to run these algorithms on can be prohibitively large; there may be insufficient memory to perform the necessary calculations even when seeking approximate solutions. There are schemes such as MapReduce [63] that offer the potential to overcome these limitations.

Models of Parallelism

As the ever-increasing size of data sets rapidly overtakes the size of available memory, there is a growing trend towards developing algorithms for efficiently processing datasets too large to fit on one machine. For such instances, we can instead design algorithms that split the data across multiple machines, and process each part in parallel before aggregating the results. We will consider parallel approaches for handling such instances – such as the MapReduce framework proposed by Dean and Ghemawat [63].

The MapReduce Paradigm

MapReduce is a parallel framework which organises computations over a large number of computers to allow us to find solutions more efficiently, thereby overcoming the obstacles that arise with large problems [63]. The MapReduce paradigm involves several rounds of parallel and sequential computations to significantly speed up the processing of large problems. This allows the problem to be split across multiple machines, making such large problems more viable.

Much work has been done on developing fast approximation algorithms for optimisation problems such as k-means and k-median using a MapReduce implementation of existing approximation algorithms [30; 50; 72]. We therefore consider MapReduce as the primary framework under which we can analyse our parallel implementation of related graph optimisation problems.

Informally, a MapReduce algorithm consists of a series of interleaving rounds of sequential mappers and parallel reducers. A map round assigns each data point independently to some reducer(s); the reducers run in parallel, each performing some procedure on the subset of points it has been assigned. A program in MapReduce may consist of several iterations of mappers and reducers, each involving potentially different map and reduce functions.

Karloff et al. [118] introduced a theoretical model of computation for the MapReduce paradigm that is widely used to analyse MapReduce algorithms [17; 130]. This work gave a comprehensive method for theoretically structuring algorithms for MapReduce, and defined a family of complexity classes for MapReduce algorithms. Under this model of MapReduce, data is represented as ⟨key; value⟩ pairs, in which the key can be used to guide which of the parallel machines the pair is being sent to, and the value contains the data being sent to the reducer. There are three stages in a MapReduce operation: the map stage, the shuffle stage and the reduce stage.

In the map stage, a mapper µ takes a single ⟨k; v⟩ pair, and outputs a new set of {⟨k1; v1⟩, ⟨k2; v2⟩, ..., ⟨ki; vi⟩} pairs. Each original ⟨k; v⟩ is treated independently of other information, so the mapper is inherently stateless. The shuffle stage sends the ⟨k; v⟩ pairs to the reducers, with all pairs with the same key sent to the same machine. This is an automatic process: as it cannot be altered, it is not a factor in designing algorithms for MapReduce. In the reduce stage, each reducer ρ takes all of the values associated with the key that referred to that reducer, and outputs a new multiset of ⟨k; v′⟩ pairs, with the key unchanged from the input. That is, the keys can only be altered in the map stage. This is the parallel component of the process, as the reducers all work concurrently during the reduce stage. Since the order in which the mapper processes the input is arbitrary, the reducers cannot start processing before the map and shuffle rounds are complete. A program in MapReduce may consist of several rounds of the above processes, each involving potentially different map and reduce functions. Borrowing from Karloff et al. [118], we can formally define a MapReduce procedure as follows.

MRC: For input length n and a fixed value ε, the MapReduce Class MRC^i is defined as an interleaving sequence ⟨µ1, ρ1, µ2, ρ2, ..., µR, ρR⟩ of R rounds of mappers µj and reducers ρj such that the correct solution is output with probability at least 3/4. Each µj is a mapper and each ρj is a reducer, all of which are implemented in RAM with words of length O(log n), using O(n^{1−ε}) space and time polynomial in n. The total space used by the ⟨key; value⟩ pairs output by mapper µj is O(n^{2−2ε}), and the total number of rounds is R = O(log^i n). The equivalent deterministic version of this class is called DMRC. The number of machines is restricted to Θ(n^{1−ε}), however there may be as many as O(n^{2−2ε}) keys, and each key is mapped to a unique reducer. Therefore it is possible for more than one instance of a reducer to be run on the same machine. The total space available to any map or reduce computation is O(n^{1−ε}), and therefore the size of every ⟨key; value⟩ pair must also be O(n^{1−ε}).
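To make the ⟨key; value⟩ mechanics above concrete, here is a toy sequential simulation of a single map/shuffle/reduce round in Python; it is only an illustration of the model (the label-counting data is invented), not a distributed implementation.

    from collections import defaultdict

    def run_round(pairs, mapper, reducer):
        # Map stage: each input pair is processed independently (the mapper is stateless).
        mapped = [out for pair in pairs for out in mapper(pair)]
        # Shuffle stage: group values by key; all pairs sharing a key go to the same reducer.
        groups = defaultdict(list)
        for key, value in mapped:
            groups[key].append(value)
        # Reduce stage: each reducer sees only the values associated with its key.
        return [out for key, values in groups.items() for out in reducer(key, values)]

    # Toy example: count point labels.  Each mapper emits (label, 1); each reducer sums.
    points = list(enumerate(["a", "b", "a", "c", "a"]))
    print(run_round(points,
                    mapper=lambda kv: [(kv[1], 1)],
                    reducer=lambda key, values: [(key, sum(values))]))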

Other Parallel Programming Models

MapReduce is one of many models of parallel programming. A common model for parallel algorithms is the Parallel Random-Access Machine (PRAM) model. Under this model, there are multiple processors using a shared memory – the processors operate in parallel, and communicate with each other via the shared memory. Care needs to be taken to handle situations in which multiple processors might read or write to the same memory location in unison. There are several alternative sets of constraints that can be placed on the general PRAM model to avoid or otherwise work around read/write conflicts. Karloff et al. [118] demonstrated how PRAM algorithms can be efficiently simulated in the MapReduce framework. As an example of where MapReduce can be more efficient, they showed that computing a minimum spanning tree in a MapReduce model only requires two rounds – in a PRAM model this problem requires Ω(log n) rounds. Other common frameworks for parallel computing include BSP [182] and LogP [60], both of which are inspired by the PRAM model, and attempt to account for some of the practical issues in parallel computation. In particular, the PRAM model assumes synchronous processing and ignores interprocessor communication costs; as these assumptions are often unrealistic and ignore potential bottlenecks, alternative models have been introduced to allow more accurate analysis of parallel computation. The Bulk-Synchronous Parallel (BSP) model developed by Valiant [182] attempts to bridge the gap between the abstractions of the PRAM model and the technicalities of parallel implementations. Iterating on the ideas behind BSP is the LogP model, named for the parameters required to describe these parallel machines: latency, overhead, gap and processing units.

The data stream model

Another framework for designing efficient algorithms is the data stream model, under which there is an input stream of data points which are processed in the order of their arrival, and the amount of memory used is restricted [16]. While not parallel, this is a common model for designing algorithms in a RAM-restricted setting. Instead of processing data in parallel, data points are processed individually in an arbitrarily ordered “stream”, with some limited amount of storage used as memory of the data seen previously. For multipass streaming algorithms, there may be several passes over the data stream, although this number should be small – for many practical purposes more than one pass is not feasible. As the available memory is constrained, streaming algorithms often store some compressed summary of the actual dataset, and the emphasis is on achieving good approximation ratios using a practical number of passes [175].
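As one concrete example of the kind of compressed summary a single-pass algorithm can maintain in bounded memory, the sketch below keeps a uniform random sample of the stream (classical reservoir sampling); it is included purely to illustrate the streaming style of computation and is not an algorithm used in this thesis.

    import random

    def reservoir_sample(stream, size):
        # One pass over the stream; memory is proportional to the sample size only.
        reservoir = []
        for count, item in enumerate(stream, start=1):
            if count <= size:
                reservoir.append(item)
            else:
                # Replace a stored item with probability size/count, which keeps
                # every element seen so far equally likely to be in the sample.
                j = random.randrange(count)
                if j < size:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(10**6), 5))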

MST algorithms in MapReduce

As an example of an easily parallelisable graph optimisation problem, we can consider the minimum spanning tree problem, which involves finding a spanning tree of minimum total edge weight in a given graph and admits several parallel algorithms.

Definition 7. A Minimum Spanning Tree (MST) of a graph is an acyclic subgraph connecting all vertices such that the sum of the edge weights is minimised.

The minimum spanning tree problem can be solved exactly in polynomial time – for a graph with n vertices and m edges, both Prim's algorithm [165] and Kruskal's algorithm [128] solve this problem in O(m log n) time. This makes it useful for finding approximate solutions to many of the NP-hard graph covering problems. There are several alternative approaches for parallelising MST algorithms – we will briefly overview two MapReduce-specific implementations. Karloff et al. [118] gave a MapReduce algorithm for finding an MST of dense graphs, in which the vertex set is partitioned, and the induced subgraphs are sent to different machines. This algorithm takes as input a dense graph G = (V, E), with |V| = N and |E| = m ≥ N^{1+δ} for some constant δ > 0. For some fixed number r, the vertices of the graph are partitioned into r disjoint subsets V1 ∪ V2 ∪ · · · ∪ Vr of size N/r. For each pair of sets Vi, Vj, construct a subgraph of G by taking the vertices Vi ∪ Vj and the edges of G induced by these vertices – each subgraph is sent to a separate reducer, which returns a minimum spanning forest. The union of the forests is then sent to a single reducer, which returns the MST. Lattanzi et al. [130] gave an alternative MapReduce implementation of an MST algorithm, which considers partitioning the set of edges rather than the vertices. This can be seen in Algorithm 1. The foundation of their algorithm is similar to that of Karloff et al. [118], with the primary distinction being the method used to split the problem across the reducers.

Algorithm 1 MapReduce-MST(V, E)
1: if |E| < c then
2:     Compute T*, the minimum spanning tree on E.
3:     return T*
4: Partition E into E1, E2, ..., Er, such that |Ei| < c and r ∈ Θ(|E|/c).
5: Send partition Ei to reducer ρi.
6: ρi sends back Ti, the minimum spanning tree on G(V, Ei).
7: return MST(V, ∪i Ti)

If the number of edges is small enough to fit on a single machine, they send the graph to one reducer, and return the MST. Otherwise, the edge set is partitioned arbitrarily across the reducers, using a universal hash function. Each reducer computes the MST for the set of edges it received – if the sample is still too large, then this procedure is run recursively on the union of these MSTs. Assume G has n vertices and m edges: if m = n^{1+δ}, we refer to G as δ-dense. Let each of the machines have memory c, and assume that the number of machines available is O(m/c), and c = n^{1+ε} for ε > 0 (that is, the memory is super-linear in n). For constant ε, Algorithm 1 will require a constant number of rounds, and is therefore in MRC^0 (a class of problems that can be solved by a MapReduce algorithm in a constant number of rounds, with a sublinear number of machines each with sublinear memory).
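To make the edge-partitioning idea concrete, the following is a small sequential Python sketch in the spirit of Algorithm 1: the "reducers" are simulated one after another, a random shuffle stands in for the universal hash function, and the guard against a non-shrinking recursion is an added safeguard rather than part of the original description.

    import random

    def kruskal_mst(num_vertices, edges):
        # Kruskal's algorithm with a simple union-find structure.
        # Edges are (weight, u, v) tuples over vertices 0 .. num_vertices - 1;
        # the result is a minimum spanning forest of the given edge set.
        parent = list(range(num_vertices))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        forest = []
        for weight, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((weight, u, v))
        return forest

    def parallel_mst(num_vertices, edges, memory_limit):
        # If the edge set fits on one "machine", solve it directly.
        if len(edges) <= memory_limit:
            return kruskal_mst(num_vertices, edges)
        edges = list(edges)
        random.shuffle(edges)                 # stands in for the universal hash function
        parts = [edges[i:i + memory_limit] for i in range(0, len(edges), memory_limit)]
        # Each simulated reducer keeps only a spanning forest of its part.
        forests = [kruskal_mst(num_vertices, part) for part in parts]
        surviving = [e for forest in forests for e in forest]
        if len(surviving) >= len(edges):      # safeguard: no progress, so finish sequentially
            return kruskal_mst(num_vertices, surviving)
        return parallel_mst(num_vertices, surviving, memory_limit)

    # Toy usage on a 4-vertex graph; the unique MST has the edges of weight 1, 2 and 3.
    toy_edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
    print(parallel_mst(4, toy_edges, memory_limit=4))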

Models of Clustering

Clustering partitions data points such that points in the same partition are somehow “similar” under some measure of dissimilarity such as Euclidean distance or cosine distance. Identifying such partitions allows us to find groups of like items without prior information about what sort of categories we might be looking for – other than the number of clusters required, these approaches are generally unsupervised.

Classical Clustering Problems

Often in clustering problems, we are given some predefined number of groups into which we aim to partition the data. When used with location-based data, these models can be considered to represent deployment problems, where we aim to determine locations for some sort of facility that needs to provide utility to a set of predefined locations. As such, many of these formulations can be generalised to allow there to be two types of vertices – often described as "facilities" and "clients". Under this formulation, the centres for the clusters need to be selected from the set of facilities, and the quality of a solution is measured by the utility provided to clients in the location set. We consider versions of these problems in which there is no such distinction – centres are selected from the initial set of vertices and utility is based on the distance from these centres to the remaining vertices. In a metric clustering problem, the weights representing the similarity between objects are required to observe the triangle inequality. The best-known example is of course the Euclidean metric on R^n, which represents distances between points in a vector space. In the context of clustering, points in a metric space can be modelled as vertices in a complete graph. Each vertex represents a data point, and each edge is weighted to indicate the distance (or dissimilarity) between the two adjacent points.

k-median: The k-median clustering problem involves selecting a set of k points in a data set to be medians such that the total distance from each client to its nearest median is minimised. In general, we assume the input graph G = (V, E) is complete, with a metric function d : E → R+ assigning weights to the edges of G. For a set S ⊆ V, let d(i, S) = min_{j∈S} d(i, j) be the distance from a vertex i to the nearest vertex in S.

Given a value k, we aim to select a set S of medians such that |S| ≤ k and ∑_{i∈V} d(i, S) is minimised. We can interpret the k-median objective as minimising the average travelling time from the locations in V to the medians/facilities selected in S. For each of the medians i ∈ S, we can consider the cluster Si associated with that median to be the set of points in V such that i is the closest median in S. Every local minimum for the k-median problem is a 5-approximation to the global minimum [12]. Via a reduction from the set cover problem, Jain et al. [112] proved that the k-median problem does not allow an α-approximation for α < 1 + 2/e − ε ≈ 1.736 unless all problems in NP can be deterministically solved in O(n^{O(log log n)}) time. The current best-known algorithm for this problem by Byrka et al. [35] achieves a (2.611 + ε)-approximation.

k-centre: The k-centre clustering problem involves selecting a set of at most k centres from the input data, such that the maximum distance from a data point to the nearest centre is minimised. Formally, given a complete graph G = (V, E) and a metric d : E → R+, a k-centre solution is a set S of k points (or centres) in V such that max_{i∈V} d(i, S) is minimised. This can be interpreted as minimising the worst-case delay in travelling to each location i ∈ V from the centre nearest to i. This objective has many applications, from vehicle routing to document clustering, in which it relates to concepts such as the furthest travelling time, service guarantee, or the least "similar" document. It can alternatively be considered to be minimising the (maximum) covering radius of the clusters. If the edge-weights are not metric, then k-centre is not in APX – so there is no constant-factor approximation unless P = NP [102]. It is NP-hard to approximate the metric k-centre problem to within 2 − ε times optimal for any constant ε > 0 – this can be proved via a reduction from the Dominating Set problem [106]. Exploiting this association between k-centre and the dominating set problem, Hochbaum and Shmoys gave a 2-approximation algorithm for the problem [103], which is best-possible unless P = NP. However, the more common procedure for approximating solutions to the k-centre problem is the farthest-point algorithm introduced by Gonzalez, which also achieves a 2-approximation guarantee [90].
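To illustrate the farthest-point procedure just mentioned, here is a minimal Python sketch of Gonzalez's heuristic; the line-metric test data at the end is purely illustrative.

    def gonzalez_k_centre(points, k, dist):
        # Gonzalez's farthest-point heuristic for metric k-centre (a 2-approximation):
        # start from an arbitrary point, then repeatedly add the point farthest
        # from the centres chosen so far.
        centres = [points[0]]
        nearest = [dist(p, centres[0]) for p in points]   # distance to the closest centre so far
        while len(centres) < k:
            farthest = max(range(len(points)), key=lambda i: nearest[i])
            centres.append(points[farthest])
            nearest = [min(nearest[i], dist(points[i], points[farthest]))
                       for i in range(len(points))]
        return centres, max(nearest)   # the second value is the covering radius achieved

    # Toy usage with the line metric |x - y|.
    pts = [0.0, 1.0, 1.5, 7.0, 8.0, 20.0]
    print(gonzalez_k_centre(pts, 3, dist=lambda x, y: abs(x - y)))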

k-means: k-means clustering is similar to k-median, but without the requirement that the centre be one of the input points. For the k-means clustering problem, the objective function used is to minimise the within-cluster sum of squares.

The k-means clustering problem considers a set of points {x1, x2, ..., xn}, with the aim of clustering these points into k clusters S = {C1, ..., Ck} to minimise the function

\[ \arg\min_{S} \sum_{i \le k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 , \]

where µi is the centroid for cluster Ci – unlike k-median and k-centre, these centroids are not chosen from the input data, but instead represent the mean location of the points in the given cluster. This problem is NP-hard [3], even for binary clustering (when k = 2) [68] and for the case where all points are embedded in the plane [145].
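For concreteness, the following small Python function evaluates this within-cluster sum of squares for one-dimensional data; the candidate clustering passed in is illustrative.

    def k_means_cost(clusters):
        # Within-cluster sum of squares for a clustering of one-dimensional points.
        # Each cluster's centroid is the mean of its points, so (unlike k-median and
        # k-centre) the centroid need not be one of the input points.
        cost = 0.0
        for points in clusters:
            centroid = sum(points) / len(points)
            cost += sum((x - centroid) ** 2 for x in points)
        return cost

    # The k-means objective evaluated for one candidate clustering: 2.0 + 2.0 = 4.0.
    print(k_means_cost([[0.0, 1.0, 2.0], [10.0, 12.0]]))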

Facility Location Problems

Closely related to these types of problems is the uncapacitated facility location problem, a fundamental problem in operations research. The vertices of the graph are partitioned into sets of “facilities” and “clients”, the centres must be selected from the set of facilities, and each client must be assigned to an open facility (centre). The weight of an edge between a client and a facility is the assignment cost, and there is an opening cost associated with setting a facility to be a centre. The aim of the facility location problem is to minimise the total cost of opening facilities and assigning each client to its nearest opened facility.

Definition 8. Given sets F of facilities and C of clients, a metric d : C × F → R+, and a cost function c on the facilities, the Uncapacitated Facility Location (UFL) problem involves finding a subset S ⊆ F of facilities such that ∑_{i∈S} c(i) + ∑_{j∈C} d(j, S) is minimised.

The objective function ∑_{i∈S} c(i) + ∑_{j∈C} min_{i∈S} d(i, j) represents the total cost associated with the solution S, given by the sum of opening costs for the facilities in S and the cost of travelling to the nearest facility for each of the clients. Among the wide variety of approaches for finding approximations for this problem are LP-rounding, primal-dual techniques, local-search algorithms and greedy algorithms. Chudak and Shmoys used a deterministic LP-rounding technique to obtain a 4-approximation algorithm [53]. Jain and Vazirani gave a 3-approximation algorithm using primal-dual techniques [114]. Charikar and Guha [41] proved that all locally optimal solutions for the uncapacitated facility location problem are at most three times optimal. Jain et al. gave a greedy 2-approximation algorithm using dual fitting [112]. It is not possible to approximate the uncapacitated metric facility location problem to within a factor of 1.463 unless all problems in NP can be solved deterministically in O(n^{O(log log n)}) time [94]. The current best result for this problem is a 1.488-approximation by Li [137], improving on a 1.5-approximation by Byrka and Aardal [33] and a 1.52-approximation by Mahdian et al. [146]. We can consider k-centre and k-median as special cases of the standard facility location model in which there is no cost assigned to opening facilities, but instead an upper bound on the number of facilities. Note that both of these problems are strictly harder to approximate than the facility location problem, as the hardness bounds for both problems are above the current performance of facility location algorithms.

Correlation Clustering

While many clustering problems assume that data points can be represented in a metric space, there are applications in which these assumptions do not hold, and where clustering remains of interest. Correlation clustering assumes dyadic relations between pairs of elements, in which the edge between two points has either a positive weight (indicating “agreement”) or a negative weight (“disagreement”), and is motivated by document clustering problems [18]. The aim is to cluster the data in order to maximise the sum of within-cluster agreements and disagreements between clusters. As with the above clustering problems, correlation clustering is NP-hard [18]. Using sampling techniques, Giotis and Guruswami gave a polynomial-time approximation scheme for this problem [87]. For cases where the graph is complete, there are constant-factor approximation algorithms [43; 179]. Overlapping correlation clustering (where elements can be “represented” by multiple clusters) was first considered by Bonchi et al. [31] for the application of label assignment, who used a local search procedure to obtain solutions.

Approximation Algorithms for Clustering

k-median

Charikar et al. [42] gave the first constant-factor approximation algorithm for the k-median problem, with a performance guarantee of 6 2/3. This approach used linear programming to obtain an optimal fractional solution in polynomial time, and then rounded this solution to an integral solution. Arya et al. [12] gave a 5-approximation algorithm for the k-median problem using a local search technique, which swaps which vertices are centres. By extending the basic local search procedure to allow for multi-swaps, they also gave a (3 + 2/k)(1 + ε)-approximation algorithm. Gupta and Tangwongsan [97] give an alternative analysis of this procedure to prove its effectiveness for a more general objective function, implying that this approach gives constant-factor approximations for both the k-median and k-means objectives. Until recently, this was the best known result for this problem; the current best is by Byrka et al. [35], whose (2.611 + ε)-approximation algorithm builds upon the (2.732 + ε)-approximation by Li and Svensson [138]. Lin and Vitter [141] gave a (1 + ε)-approximation algorithm for the k-median problem that allows as many as (1 + 1/ε)(ln n + 1) · k centres to be selected – in general, such solutions are infeasible. Their approach modifies LP solutions using a technique they refer to as filtering. We can model the metric k-median problem with the following integer linear program:

Minimise    ∑_{i,j∈V} d(i, j) · x_ij

s.t.        ∑_{i∈V} x_ij = 1                    ∀j ∈ V

            ∑_{i∈V} y_i ≤ k

            x_ij ≤ y_i                          ∀i, j ∈ V

            x_ij ∈ {0, 1},  y_i ∈ {0, 1}        ∀i, j ∈ V

Each term x_ij indicates that the edge (i, j) is used in the solution – specifically, that the vertex i has been selected as a centre, and vertex j is assigned to i. The y_i terms indicate which vertices are selected as centres. This integral model was adapted by Lin and Vitter [141] to provide what they refer to as an ε-approximation to the k-median – an approximate solution that opens some factor more than k centres. The linear relaxation of this model was also utilised by Charikar et al. [42] to give their 6 2/3-approximation algorithm.

k-centre

Gonzalez gave a factor-2 approximation to the k-centre problem [90]. This approach chooses an arbitrary vertex from the graph, and marks it as a centre. At each following step, the vertex farthest from the existing centres is marked as a new centre, until k centres have been chosen. As the edge weights comprise a metric, the triangle inequality ensures that the resulting set of centres comprises a 2-factor approximation.

Algorithm 2 GON(V, k)
1: Let i be an arbitrary point in V
2: S ← {i}
3: while |S| < k do
4:     Let i = arg max_{i∈V} min_{s∈S} d(i, s) be the point in V farthest from points in S.
5:     S ← S ∪ {i}.
6: return S.
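A direct Python rendering of this farthest-point procedure might look as follows (a sketch only; dist is assumed to be a symmetric distance matrix given as a list of lists, the first centre is chosen as vertex 0, and ties are broken arbitrarily):

    def gonzalez(dist, k):
        """Farthest-point heuristic: returns the indices of k centres."""
        n = len(dist)
        centres = [0]                  # an arbitrary first centre
        # d_to_S[v] = distance from v to the current centre set S
        d_to_S = list(dist[0])
        while len(centres) < k:
            farthest = max(range(n), key=lambda v: d_to_S[v])
            centres.append(farthest)
            # Update distances to reflect the newly added centre.
            d_to_S = [min(d_to_S[v], dist[farthest][v]) for v in range(n)]
        return centres

Maintaining the vector of distances to the current centre set makes each iteration linear in n, so the whole procedure runs in O(nk) time given the distance matrix.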

Theorem 9. Gonzalez’s farthest-point method (Algorithm 2) returns a 2-approximation to the k-centre problem [90].

Proof. Let S denote the set of k centres returned by Algorithm 2, and let S* be an optimal solution with radius OPT. Consider the partitioning of V into sets {V_1, ..., V_k} given by assigning each point in V to the nearest centre in S* (with ties broken arbitrarily). For a pair of points (u, v) both in the partition defined by a centre c_i, we have d(u, v) ≤ d(u, c_i) + d(c_i, v) ≤ 2OPT – by the triangle inequality, and noting that all points in the cluster of c_i are at most OPT from c_i. If there is some centre from solution S in each of the partitions, then the result follows – as every point in V is within 2OPT of every item it shares a partition with. Alternatively, if there is a partition that is not represented by the centres in S, it follows that there is some partition that contains more than one point from S. Consider the time at which the algorithm selected the second point in that partition. The distance between the two centres in the partition is no more than 2OPT. Since the algorithm selects the new centre as the point farthest from all current centres, it follows that all points in V were at most 2OPT from the centres in S at this time, and hence remain within 2OPT of S when the algorithm terminates.

This result is the best possible for k-centre unless P = NP, as Hsu and Nemhauser proved that it is NP-hard to get a performance guarantee ≤ 2 − ε for this problem [106] (via a reduction from the Dominating Set problem).

Definition 10. Given a graph G = (V, E) and an integer k, the Dominating Set problem is the problem of determining whether there exists a set of at most k vertices such that every vertex in V is either in the set, or adjacent to a vertex in the set.

Theorem 11. There is no α-approximation algorithm for k-centre with α < 2 unless P = NP [106].

Figure 1.1: Solution from Gonzalez’s k-centre algorithm with k = 3. Other than the randomly selected seed (labelled 1), the chosen centres tend to be peripheral to their cluster.

Proof. The dominating set problem is NP-hard [84] – we can prove the hardness of k-centre by a reduction from dominating set. Consider an instance G = (V, E) of the dominating set problem; to map this to an instance of the k-centre problem, we need to construct a complete graph with metric weights on the edges. Let the distance between vertices adjacent in G be 1, and the distance between non-adjacent vertices 2 – these weights are always metric. There is a dominating set of size k in G iff the k-centre solution on the new graph has radius 1. If we have an α-approximation algorithm for k-centre with α < 2, then it would produce a solution of radius 1 for instances where one exists (any solution of radius strictly less than 2 must have radius exactly 1, as all distances are either 1 or 2) – allowing us to solve dominating set in polynomial time. Therefore, such an algorithm for k-centre cannot exist, unless P = NP.
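The reduction itself is easy to express in code; the sketch below (a hypothetical helper, with the graph given as a vertex count and a collection of undirected edges) builds the metric k-centre instance used in the proof:

    def dominating_set_to_k_centre(n, edges):
        """Map a graph on vertices 0..n-1 to a complete metric instance:
        adjacent pairs get distance 1, non-adjacent pairs distance 2."""
        adjacent = {frozenset(e) for e in edges}
        dist = [[0] * n for _ in range(n)]
        for u in range(n):
            for v in range(u + 1, n):
                d = 1 if frozenset((u, v)) in adjacent else 2
                dist[u][v] = dist[v][u] = d
        # G has a dominating set of size k iff this instance admits
        # a k-centre solution of radius 1.
        return dist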

The weighted k-centre problem is mostly the same as the k-centre problem, but with weights associated with each vertex (in addition to the edges). A heuristic for the weighted k-centre problem chooses the point with the largest weight as the first centre, and then greedily selects subsequent centres in the same manner as Gonzalez’s k-centre algorithm [69].

k-means

One of the most ubiquitous clustering algorithms is the k-means algorithm, also referred to as Lloyd’s algorithm [142], which has been used across various applications from text clustering to network optimisation [26]. Lloyd’s algorithm is commonly referred to as “the k-means algorithm” due to its frequent use as an approximation heuristic for the k-means clustering problem, for which it obtains a locally optimal solution. As the cluster centres are not required to be points in the input, we refer to them as centroids. The algorithm converges to a local minimum for the k-means clustering objective function [171]. The algorithm works by iteratively improving on the set of centroids, starting with the initial seeding and ending when the solution is stable. This process involves assigning all points to their nearest centroid, and reassigning the centroid for each cluster to the centre of mass of the points in that cluster. The k-means algorithm is used for many general clustering problems, and is a special case of the EM (expectation maximisation) algorithm [144]. The highlight of the algorithm is not the quality of clusters found, but the speed with which it performs in practice. Altering this algorithm to instead use the data point closest to the cluster’s centroid gives what is referred to as the k-medoids algorithm [121].

Algorithm 3 k-means(V, k)
1: Select an arbitrary set S^1 = {s^1_1, s^1_2, ..., s^1_k} of initial centroids.
2: repeat
3:     Assign clusters {C_1, C_2, ..., C_k} such that C_i contains the points in V closer to s^t_i than to S^t \ {s^t_i}.
4:     Reassign each centroid: s^{t+1}_i = |C_i|^{-1} ∑_{v∈C_i} v.
5: until the centroids are unchanged, i.e. {s^t_1, ..., s^t_k} = {s^{t−1}_1, ..., s^{t−1}_k}
6: return {C_1, C_2, ..., C_k}
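A minimal NumPy sketch of this iteration (an illustration only, not the implementation used in this thesis; it assumes the points are the rows of an array, the initial centroids are given, and empty clusters are not handled) is:

    import numpy as np

    def lloyd(points, centroids, max_iter=100):
        """points: (n, d) array; centroids: (k, d) array of initial seeds."""
        labels = None
        for _ in range(max_iter):
            # Assign each point to its nearest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its assigned points.
            new_centroids = np.array([points[labels == i].mean(axis=0)
                                      for i in range(len(centroids))])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels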

The term |C_i|^{−1} ∑_{v∈C_i} v defines the centroid for cluster C_i. In document clustering, for example, this defines the mean document vector of the cluster. The popularity of the k-means algorithm is primarily due to its simplicity and speed, as in practice it runs quite fast. The complexity is O(n · l · k), where n is the number of points, k is the number of clusters required, and l is the number of iterations until the algorithm obtains a solution. For instances where l is constant and k much smaller than n, this tends to be linear in n. The worst-case complexity of the k-means algorithm is superpolynomial, and some constructed examples require up to 2^{Ω(n)} iterations [10]. The seedings for the k-means algorithm can be improved by using solutions to other clustering problems. For example, the application of Gonzalez’s k-centre algorithm [90] to seeding the k-means algorithm has been called the KKZ method [120]. The main issue with this approach for k-means seeding is sensitivity to outliers, as Gonzalez’s algorithm always chooses the point farthest from the existing set of centres. When seeded with the solution of a local search algorithm, Lloyd’s algorithm has a (9 + ε)-approximation guarantee [116]. This particular approach was adapted from the (3 + ε)-approximation algorithm for the k-median problem by Arya et al. [12]. A k-means seeding method called k-means++ [11] gives a clustering with expected value within a factor of O(log n) of optimality. This algorithm finds a balance between uniformly random seeding and farthest-point seeding, in that it selects random seeds with a bias towards choosing those further from previously selected points.
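The bias towards distant points in k-means++ can be illustrated with the following simplified D²-sampling sketch (a hypothetical helper that omits the details of [11]; dist is an assumed distance function on pairs of points):

    import random

    def kmeans_pp_seed(points, k, dist):
        """Pick k seed indices: the first uniformly at random, each later one
        with probability proportional to the squared distance to the nearest
        seed chosen so far."""
        seeds = [random.randrange(len(points))]
        while len(seeds) < k:
            d2 = [min(dist(p, points[s]) for s in seeds) ** 2 for p in points]
            seeds.append(random.choices(range(len(points)), weights=d2, k=1)[0])
        return seeds

Setting all weights equal would recover uniform seeding, while always taking the maximum-weight point would recover Gonzalez-style farthest-point seeding; the weighted draw sits between the two.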

Clustering with MapReduce

Ene et al. [72] gave a MapReduce algorithm for the k-median problem, which uses a sequential weighted k-median algorithm as a subprocedure. Their algorithm selects a random subset of points S, and adds points to S until most vertices of the graph are within a bounded distance of the sample, and finally adds remaining unrepresented vertices to S. The k-median algorithm uses this procedure to find a sample of the vertices in the input graph. This set of vertices is arbitrarily partitioned, and each reducer receives one of the partitions along with the set of sample points. The reducer assigns a weight to each of the sample points indicating the number of vertices for which this point is the closest (in the sample). The sample points and weights are sent back to the mapper, which sends them all to a single reducer. This reducer calculates the sum of the weights for each of the sample points, giving the number of vertices in the complete graph which are closest to each of the sample points.
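A schematic, non-distributed Python sketch of this weighting step (names and structure are illustrative rather than Ene et al.'s implementation; dist is an assumed distance function) is:

    from collections import Counter

    def weight_partition(partition, sample, dist):
        """'Reducer' for one partition of the vertices: count how many vertices
        have each sample point as their nearest sample point."""
        counts = Counter()
        for v in partition:
            nearest = min(sample, key=lambda s: dist(v, s))
            counts[nearest] += 1
        return counts

    def aggregate_weights(partitions, sample, dist):
        """Final 'reducer': sum the per-partition counts, giving the weight of
        each sample point for the weighted k-median subproblem."""
        total = Counter()
        for part in partitions:
            total.update(weight_partition(part, sample, dist))
        return total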

The sample points and aggregate weights are then sent to a final reducer, which runs a weighted k-median algorithm and returns the approximate solution. If the approximation ratio of the sequential k-median algorithm is α, then this algorithm achieves a (10α + 3)-approximation for the k-median problem. Ene et al.’s experimental results were based on the 3-approximation algorithm from Arya et al. [12] – the resulting MapReduce implementation would have a 33-approximation guarantee. There have been more recent advances in k-median approximation algorithms, with Byrka et al. [35] giving a (2.611 + ε)-approximation algorithm; this would improve the approximation guarantee of the MapReduce implementation to 29.11 + ε, and it would be of interest to determine how the resulting procedure would perform in practice. Ene et al. [72] also use this sampling procedure for adapting k-centre to MapReduce. A sequential k-centre algorithm with approximation factor α is run on the same sample S used in their k-median procedure. With high probability, the k resulting centres constitute a (4α + 2)-approximation for the k-centre instance. When implemented using a sequential 2-approximation algorithm, with high probability, this results in a 10-approximation overall. Ceccarello et al. [36] gave a MapReduce diameter-approximation procedure with low parallel depth. From this, they derive a k-centre solution for graphs with unit-weight edges: for k ∈ Ω(log^2 n), with high probability, this is an O(log^3 n)-approximation. Recently, there has been increased interest in adapting k-centre to MapReduce. Im and Moseley [107] have described a randomised 3-round 2-approximation algorithm that requires prior knowledge of the value of the optimal solution. Although they have announced that this leads to a 4-round 2-approximation without the requirement, the details have yet to be outlined. Recently Malkomes et al. [147] gave a two-round approach that utilises Gonzalez’s algorithm as a subprocedure. The k-means problem has also been parallelised using the MapReduce framework. Lloyd’s k-means algorithm has been implemented in MapReduce [187], as has the EM (expectation maximisation) algorithm [62] (a generalisation of k-means). Ene et al. analyse the performance of a sampling-based MapReduce implementation of Lloyd’s algorithm for the purpose of approximating k-median solutions [72]. Bahmani et al. [17] developed a MapReduce implementation of the inherently sequential k-means++ algorithm – in practice, their approach requires only a constant number of passes. Outside of the MapReduce paradigm, Blelloch and Tangwongsan [30] gave the first parallel algorithms for both the k-median and k-centre problems using a PRAM model – their implementations require Ω(N^2) machines for an input of size N. The first parallel k-means algorithm was based on a Single Program Multiple Data (SPMD) framework with message-passing [65] – this technique has also been adapted for k-median [143]. Guha et al. [95] gave a single-pass constant-factor approximation algorithm for solving the k-median problem under the data stream model – using O(n^ε) memory, this algorithm runs in O(n^{1+ε}) time. A constant-factor single-pass streaming algorithm for k-centre was introduced by Charikar et al. [40] that runs in O(nk log k) time and uses O(k) memory.

1.4 Graph Covering Problems

Covering problems are a category of combinatorial optimisation problems that involve identifying a sub-collection of elements that cover some other component of the general collection. We will primarily consider minimisation problems, where the objectives require that we find some sub-collection of minimal size, or that the largest element in the sub-collection is of minimal cost. As an example of such problems, consider types of covering problems on a collection of sets whose union covers some universe – problems can include set cover (find a minimal-size collection of sets to cover all elements in the universe) and max-cover (choose at most k sets so that the maximum number of elements are covered by the union). An example of a maximisation problem that involves covering would be finding a perfect matching in a bipartite graph – all vertices need to be covered by the selected edges, but there are restrictions on how edges can be added to the solution. With clustering problems, we were only interested in the groupings of the data. However, for some applications we might be concerned about the structure of the groups. For example, we can consider the k-centre problem as modelling the maximum time it takes to reach a client from one of the k facilities. But if we allow travelling directly between client locations, we might instead be interested in minimising the maximum length path. For such instances, the k-path covering problem offers a useful extension of the k-centre objective, by allowing us to account for the ordering in which locations are visited. We can model such problems as a covering of the vertices of the graph, using graph models such as paths and trees to structure where previously we selected clusters. These can be considered as a sub-class of routing problems, which include the famous Travelling Salesman Problem. The structures used to cover the graph therefore represent the routes taken to traverse the graph – for different applications we need to consider different structures for the covering. Many approximation algorithms for covering problems are inherently sequential, as identifying which elements should be added to a solution often depends on previous selections – there are often elements with overlapping coverage that need to be accounted for. Some parallel algorithms for covering consider approaches for dealing with such cases – for example, Blelloch et al. [29] gave a parallel approximation algorithm for set cover that involves finding a sub-collection of “nearly independent” sets.

Covering Vertices and Edges

The proof of NP-hardness for the k-centre problem relied on a mapping from the dominating set problem – we will now consider this problem, and related models, in more depth. As discussed earlier, the dominating set problem involves selecting a minimal subset of the vertices such that all other vertices of the graph are dominated – that is, all vertices are in the selected set or neighbour a selected vertex. We can alternatively consider the dominating set problem as covering the vertex set with the minimal number of stars – where a star is defined by a vertex and the edges adjacent to it. This is closely related to the classic NP-complete vertex cover problem [119], in which the aim is instead to select a minimal set of vertices such that all edges are incident on one of the selected vertices. Both vertex cover and dominating set involve selecting a set of vertices to cover some component of the graph – dominating set seeks to cover the vertices; vertex cover aims to cover the edges.

Definition 12. Given a graph G = (V, E), the vertex cover problem involves selecting a minimal subset S ⊆ V such that for all (u, v) ∈ E either u ∈ S or v ∈ S.

The vertex cover problem is one of the classic NP-complete problems [119] – taking the endpoints of a maximal matching (which can be found in polynomial time) gives a 2-approximation algorithm [85]. Vertex cover has also been shown to be APX-complete [158]. The hypergraph vertex cover problem is a generalisation of vertex cover to hypergraphs, where edges can connect more than two vertices – hypergraph vertex cover is equivalent to set cover. Vertex cover cannot be approximated to within a factor of 1.3606 in polynomial time unless P = NP [66]. Assuming the Unique Games conjecture, it is not possible to approximate vertex cover to within a factor of 2 − ε, or vertex cover on a k-uniform hypergraph to within a factor of k − ε [123].

Definition 13. Given a graph G = (V, E), the edge cover problem involves selecting a minimal subset S ⊆ E such that every vertex in V is incident to an edge in S.

The edge cover problem involves selecting edges of a graph to cover the vertices – in contrast to the vertex cover problem, edge cover is in P and has a polynomial-time algorithm based on mapping instances to a maximum matching problem [84]. For any bipartite graph with no isolated vertices, a maximum matching can be extended to a minimum edge cover [127]. In general, the size of a maximum matching provides a lower bound on the size of a minimum edge cover. In 1965, Edmonds gave a maximum matching algorithm that runs in O(|V|^2 |E|) time [70]. Micali and Vazirani gave an algorithm that finds a maximum matching in O(|V|^{1/2} |E|) time [152]. There are several parallel implementations of algorithms for graph covering problems. Khuller et al. [124] gave a parallel primal-dual approximation algorithm for the vertex cover problem. Lattanzi et al. [130] demonstrated how both vertex cover and edge cover can be solved under the MapReduce paradigm. Their vertex cover solution exploits the connection with the maximum matching problem, and maintains the 2-approximation guarantee. The max-cover problem (closely related to set cover) was implemented in MapReduce by Chierichetti et al. [50]. We can generalise edge cover to path cover, in which the vertices are covered by paths instead of disjoint edges. Path covering problems can either minimise the number of paths, or their length.
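The matching-based 2-approximation for vertex cover is simple enough to state directly; a sketch in Python (edges given as pairs of vertex identifiers) might be:

    def vertex_cover_2approx(edges):
        """Take both endpoints of a greedily built maximal matching.
        The cover has size at most twice the optimum, since any cover must
        contain at least one endpoint of every matched edge."""
        cover = set()
        for u, v in edges:
            if u not in cover and v not in cover:   # edge not yet covered
                cover.update((u, v))                # add both endpoints
        return cover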

Bounded Coverings

For many applications, there might be a constraint on the size allowed for each of the graph structures used in the cover – this can represent bounds on travelling time or distance. For example, if each path used to cover the graph represents the route taken by some robot agent, we might need to account for battery life, or the distance travelled from the hub. To model such constraints, we consider bounded covering problems, where we are given some limit B on the cost that each component is allowed to use. For routing problems, this bound can be considered as a distance constraint, as it gives an upper bound on the distance that can be travelled. The objective is then to minimise the total number of components required so that all vertices of the graph are covered. The bounded path cover problem is a particularly useful example of such a model.

Figure 1.2: A tree cover is more flexible than a clustering or star cover, as vertices need not be assigned to the centre.

Definition 14. Given a complete graph G with positive edge weights and a value B, the minimum path cover problem is to find a path cover of G with the minimal number of components, where the sum of the edge weights in each component is no more than B.

Li et al. [136] considered a variant on the minimum path cover problem in which each path must start from a given “depot” vertex (although the paths can end at any vertex). They show that a lower bound for solutions to this problem can be given by using a bound on solutions for the multiple travelling salesman with time windows (mTSPTW) problem considered by Desrosiers et al. [64].

Figure 1.3: A path cover is a special case of a tree cover that does not allow for branching.

Arkin et al. [8] gave a 3-approximation to the minimum path cover problem that involves finding a series of minimum-weight forests – each of the trees in the forest is converted into a tour, and split up into paths. They further proved that the best-possible approximation ratio is 2 unless P = NP, by a reduction from the travelling salesman problem. Nagarajan and Ravi refer to path cover as the unrooted distance-constrained vehicle routing problem, and gave an alternative 3-approximation algorithm with better runtime complexity [155]. Their approach simplifies the techniques used in the algorithm of Arkin et al. [8]. This problem has also been solved using LP-rounding techniques with a constant integrality gap [156]. We can also consider problems where the goal is to cover the vertices of the graph with a collection of subtrees. If we are given a bound B on the size of a tree and aim to minimise the total number of trees used, this is the minimum tree cover problem.

Arkin et al. [8] prove the hardness of this problem via a reduction from the minimum star cover problem. The minimum path cover algorithm of Arkin et al. [8] is also a 3-approximation for the minimum tree cover problem. Even et al. [74] give an algorithm that finds a rooted tree covering of a graph using at most k trees where the maximum cost of a tree is 4B for some bound B – if no solution of size at most B exists, then the algorithm fails. They iterate upon this procedure to develop an algorithm that minimises the maximum cost of a tree in a rooted covering.

Figure 1.4: Classical clustering problems such as k-centre and k-median assign points to the nearest centre, creating stars.

The minimum star cover problem (also referred to as bounded star cover) involves finding a minimal set of stars, each of cost at most B, that cover the vertices of the graph, where a star is defined as a vertex and a subset of the edges adjacent to that vertex. Star coverings of graphs are closely related to more typical clustering models, as a standard k-centre or k-median solution can be interpreted as a set of stars. Figure 1.5 shows an example of a star covering that is distinct from solutions to these classical clustering models. The primary distinction is that standard clustering solutions only involve selecting the set of centres – a star covering might not always assign points to the nearest centre, e.g., when the cost of the nearest star is already close to the bound B. Arkin et al. [8] give a (2α + 1)-approximation to the minimum star cover problem by finding a series of α-approximate k-median solutions and splitting up the clusters – the current best k-median algorithm has an approximation ratio of α = 2.611 [35], giving a minimum star cover solution with at most 6.222 times as many stars as the optimal solution.

Min-Max Covering Problems

If we are given a bound k on the number of components (instead of a bound B on the size of each component), then we can consider the min-max version of the covering problems. For these instances we aim to select at most k components, such that the maximum cost of a component is minimised. Arkin et al. [8] give a series of approximation algorithms for min-max covering problems by utilising solutions to the bounded version of these problems. The general model for these algorithms is to perform a binary search over plausible values of the bound B to determine the minimum value of B that returns a solution with at most k components, as sketched below. A variation of this approach gives 4-approximations to both the k-path cover problem and the k-tree cover problem.
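The reduction from the min-max version to the bounded version can be sketched as a binary search over candidate bounds; in the sketch below, bounded_cover is a hypothetical subroutine returning a cover whose components each cost at most the given bound, and candidate_bounds is assumed to be sorted in increasing order:

    def min_max_cover(candidate_bounds, k, bounded_cover):
        """Return the cover for the smallest bound B whose bounded cover
        uses at most k components (None if no candidate is feasible)."""
        best = None
        lo, hi = 0, len(candidate_bounds) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            cover = bounded_cover(candidate_bounds[mid])
            if len(cover) <= k:
                best, hi = cover, mid - 1    # feasible: try a smaller bound
            else:
                lo = mid + 1                 # infeasible: need a larger bound
        return best

The approximation guarantee of the resulting min-max solution is then inherited from the guarantee of the bounded-cover subroutine at the smallest feasible bound.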

Figure 1.5: In a star covering, points are not necessarily assigned to the nearest centre.

The k-tree covering problem (also referred to as min-max k-tree cover) is similar to the k-star cover problem, but requires the graph to be covered by trees. As trees are a generalisation of stars, this appears to be an easier problem to solve than the k-star problem. Existing results for k-star are bicriteria approximations, possibly returning infeasible solutions with more than k stars – but for the k-tree problem, Even et al. [74] have given a 4-approximation. Motivated by a multi-agent robot routing problem, Zheng et al. [188] adapted the LP-rounding algorithm of Even et al. for instances where vertices are also weighted, giving a 16-approximation for this problem. Applegate et al. [6] apply branch-and-cut search techniques to exactly solve a latency version of the min-max path cover problem with a fixed set of depots – the objective has a secondary goal of minimising the average wait-time for each of the vertices to reflect a customer-service model. They refer to this problem as the newspaper routing problem, as the initial example of this problem involved delivering newspapers to customers.

Metric k-Star Coverings

While many covering problems can be modelled on general graphs, we can consider models closer to those used in metric clustering – in particular, coverings on metric-weighted complete graphs. In the metric k-star cover problem, we are given a complete graph G = (V, E) with metric edge weights d : E → R+, and an integer k bounding the number of stars. The aim of this problem is to find a set of k stars on the graph such that every vertex is in a star, and the maximum cost of a star is minimised.

Definition 15. Given a complete graph G = (V, E), a metric l : E → R+, and k > 0, the k-star cover problem involves identifying a set of stars S_1, ..., S_k such that every vertex of G is in some star and max_i l(S_i) is minimal.

The k-star problem can be interpreted in terms of a clustering problem, but it differs from classical clustering objectives such as k-median and k-centre in that it can be suboptimal to assign points to the nearest cluster centre. This is because we might assign disproportionately many points to a single centre, resulting in unnecessarily large solutions. In the worst case, all points can be assigned to the same centre – depending on the distances to other cluster centres, this can be as bad as O(k) times optimal (because splitting the points across the k centres can reduce the maximum cluster size by as much as a factor of k). Figures 1.6 and 1.7 give an illustration of how the k-star solution size can be reduced by assigning points to further centres if this splits up a large cluster.

Figure 1.6: Assigning points to the nearest centre can result in solutions with biased stars – it is often better to balance assignment across the centres.

Figure 1.7: It is often better to assign points to further centres to avoid over-weighted stars.

This problem is somewhat similar to the k-median problem, with the added complication of having to optimally assign vertices to centres. The best known algorithm for the k-star problem utilises a solution for the k-median problem. Most current results for this problem are bicriteria approximations – they obtain solutions within a constant factor of optimality that use a constant factor more than k stars. We can consider the k-star cover problem on a metric in terms of the following integer program. The variable y_i indicates whether a vertex i is a root, and variables x_ij indicate whether vertex j is in a star rooted at i. The objective is to minimise the size M of the maximal star in the solution. An LP-relaxation of this integer program would allow both fractional roots and fractionally assigned vertices, so rounding to an integral solution could be difficult.

Minimise    M

s.t.        ∑_{j∈V} d_ij · x_ij ≤ M             ∀i ∈ V

            ∑_{i∈V} x_ij = 1                    ∀j ∈ V

            ∑_{i∈V} y_i ≤ k

            x_ij ≤ y_i                          ∀i, j ∈ V

            x_ij ∈ {0, 1},  y_i ∈ N             ∀i, j ∈ V

Even et al. [74] presented a bicriteria approximation algorithm for the k-star problem using an LP-rounding approach. They used the LP-relaxation of the following integer program for the minimum star cover problem, which minimises the number of stars instead of the maximum cost of a star; here the cost of each star is bounded by B.

Minimise    ∑_{i∈V} y_i

s.t.        ∑_{i∈V : d_ij ≤ B} x_ij ≥ 1         ∀j ∈ V

            ∑_{j∈V} d_ij · x_ij ≤ B · y_i       ∀i ∈ V

            x_ij ≤ y_i                          ∀i, j ∈ V

            x_ij ∈ {0, 1},  y_i ∈ N             ∀i, j ∈ V

They apply binary search to find the minimal value of B such that the LP-relaxation gives a solution with no more than k stars; hence B becomes a lower bound on the optimal cost of a k-star cover. When rounded to an integral solution, the resulting star cover has cost at most four times the optimal lower bound B; however, their algorithm also returns up to 4k stars. Another bicriteria result for the k-star problem is from Arkin et al. [8]. They use the (3 + 2/k)(1 + ε)-approximation algorithm for the k-median problem by Arya et al. [12] to obtain a feasible k-star cover of the graph. They bound the size of the largest star in the k-median solution relative to the optimal k-star size and then use the metric properties of the graph to split each large star into smaller stars such that their cost is bounded. The resulting k-star solution has at most three times the optimal cost and 3k stars. As the approximation ratio of k-median has since been improved to 2.611 + ε by Byrka et al. [35], the approximation for k-star is also improved – both the approximation ratio for the solution and the maximum number of stars. There is no known constant-factor approximation algorithm for the k-star problem that is guaranteed to return a feasible number of stars – both of the known algorithms can return more than k stars. The k-star problem is hard even on simple metrics – Ahmadian et al. [2] proved that this problem is APX-hard on the Euclidean metric, and NP-hard on a line metric. They also give a polynomial-time approximation for the line-metric case.

Related Models

Problems such as facility location and machine scheduling are widely studied, and are central topics in this field. Models such as the k-star problem are of interest due to their relation to broader questions in the fields of combinatorial optimisation and clustering. Arkin et al. [8] motivated their interest in these graph covering problems with applications in vehicle routing, such as nurse station location. For this application, a star in a graph represents a nurse station and the patients to be visited; in between visits, a nurse must return to the station to pick up supplies. The closely related questions of covering graphs with trees and paths have been applied to the problem of multi-robot routing [188].

Rooted and Unrooted Coverings

In general, we consider the unrooted version of k-star and k-tree – where the centres (or roots) are not predefined, but must be selected as part of the solution. The rooted variant of this model provides the centres as part of the problem definition – the solution is therefore a collection of edges based around the given roots. In general, given a graph G = (V, E) and a subset R ⊆ V such that |R| = k, a rooted graph covering involves finding a collection of minimal-size subgraphs of G rooted at the vertices in R – these subgraphs can be required to be trees, paths, tours, or some similar graph object. We can consider these roots as a set of pre-defined depots or facilities – the purpose of the cover is to identify an optimal set of routes based on the location of these facilities/roots. In the case of rooted star covering, these roots give a predefined set of centres for the clusters – unlike classical clustering models, this problem remains NP-hard even when the set of cluster centres is already defined. We can interpret the rooted star cover problem as a bin packing problem – the distances from points to the depots relate to the cost of assigning the items to the bins. Even et al. [74] used the relation between rooted star cover and bin packing to prove the hardness of many of the star and tree covering problems that we consider. As with rooted k-star, the rooted tree cover problem is NP-complete, and has been given a (4 + ε)-approximation algorithm by Even et al. [74]. This algorithm involves an algorithm for bounded tree covers, where each tree has cost no more than a bound B – performing binary search over feasible values of B gives an approximation to the general problem. The algorithm for the bounded problem involves splitting up components of a minimum spanning tree and matching the components to the roots using paths of bounded size.

Bin Packing and Parallel Machine Scheduling

The rooted variants of graph covering problems can be related to other types of assignment problems such as bin packing and machine scheduling. Even et al. [74] prove the hardness of tree and star covering problems via a reduction from the bin packing problem, in which the objective is to minimise the number of bins of uniform capacity B required to fit some collection of items. This model relates nicely to bounded covering problems, and can be mapped to the rooted bounded star cover problem. If we fix the number of bins and do not assume a packing limit B, we can relate bin packing to the more general field of parallel machine scheduling. Machine scheduling problems typically involve a set of jobs and a set of machines, with a fixed processing time for each job on each machine. The aim is to assign jobs to machines such that some objective function is optimised. For example, a common aim for machine scheduling problems is to minimise the makespan, which is the maximum completion time of a machine. The completion time of a given machine is the sum of the processing times of the jobs assigned to it, and when machines run in parallel, we can consider the makespan as the maximum of these sums across all the machines. For unrelated parallel machines, we consider all machines to start processing at time zero, with jobs possibly having different processing times on different machines. Machine scheduling problems are commonly expressed using three-field scheduling notation [93], in which three types of problem characteristics are delimited by vertical bars. The first field specifies the machine environment – for example, 1 indicates a single-machine problem, P indicates that there are m identical parallel machines, and R indicates that there are m unrelated machines in which each job–machine pair (i, j) is assigned a processing time p_{i,j}. The second field specifies the job characteristics and constraints. Examples include prec to indicate the presence of precedence constraints, and d_j to specify deadlines for each job j. The final field expresses the objective function of the problem. As an example, 1|prec|C_max indicates a single-machine scheduling problem with precedence constraints with the objective of minimising the makespan. The best known approximation algorithm for the problem of scheduling unrelated parallel machines to minimise the makespan (denoted R||C_max in scheduling notation) gives a 2-factor approximation [134]. Lenstra, Shmoys and Tardos used a linear programming approach to obtain a fractional solution, which they used to find a lower bound on the optimal cost. By bounding the number of fractionally assigned jobs, they could reassign jobs to machines such that the resulting solution was no more than twice their lower bound. They also proved that obtaining an approximation ratio better than 3/2 is NP-hard [134]. Gairing et al. [82] presented a fully combinatorial algorithm for the problem of scheduling unrelated parallel machines to minimise the makespan which also gives a factor-2 approximation. Their algorithm involves modelling the machine scheduling problem in terms of flows on a bipartite graph and using an algorithm for the minimum cost flow problem which is modified to handle unsplittable flows.
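As a small illustration of the makespan objective on unrelated machines, the following sketch (hypothetical names; p[i][j] is the assumed processing time of job j on machine i) computes the makespan implied by an assignment of jobs to machines:

    def makespan(p, assignment):
        """assignment[j] = machine chosen for job j.
        A machine's load is the sum of the processing times of its jobs;
        the makespan is the largest load over all machines."""
        loads = [0.0] * len(p)
        for j, machine in enumerate(assignment):
            loads[machine] += p[machine][j]
        return max(loads)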

Capacitated Facility Location Problems

Ahmadian et al. [2] refer to the k-star covering problem as Minimum Load k-Facility Location, indicating its close relation to bounded facility location models. The facility location problem has been widely studied, and hence there are many known approximation algorithms and variants on the problem which may be of interest with regards to the k-star and other covering problems. A variant on the standard facility location model is capacitated facility location, in which each facility has a bound on production, which can be considered in terms of a bound on the cost of a star rooted at that facility. For this problem, designating a vertex as a root is equivalent to opening the relevant facility, and stars are representative of the clients assigned to the facility at which the star is rooted.

Definition 16. Given sets F of facilities and C of clients, a metric d : C × F → R+, a cost function c(·) on the facilities, and a bound B, the capacitated facility location problem involves finding a subset S ⊆ F of facilities such that ∑_{i∈S} c(i) + ∑_{j∈C} min_{i∈S} d(i, j) is minimised and the cost associated with each facility in S is at most B.

This is similar to the model used for the Uncapacitated Facility Location problem (UFL). However, with the addition of the capacity constraint B, the optimal solution for UFL can be infeasible for the capacitated problem if the production costs assigned to a particular facility exceed B. This bound is similar to a limit on the cost of a star – equivalently, in order to meet this constraint, we might need to assign clients to facilities other than the nearest available. Levi et al. [135] used linear programming and rounding to obtain a 5-approximation algorithm for the special case where all facility opening costs are equal. It remains an open problem to find an approximation algorithm using linear programming for the general case [135].

The Steiner Tree Problem

Definition 17. Given a graph G = (V, E) and a set T ⊆ V of terminal vertices, a Steiner tree is a tree in G that spans the vertices of T.

The Steiner tree problem (or minimum Steiner tree problem) is the problem of finding a minimal-cost Steiner tree for a given graph G and subset T of the vertices of G. This is a generalisation of the minimum spanning tree (MST) problem – however, while the MST problem is solvable in polynomial time, the minimum Steiner tree problem is a classic NP-complete problem [119]. The Steiner tree problem on a Euclidean metric admits a PTAS [51]; however, on general metrics this problem is APX-complete [27].

In general, it is NP-hard to approximate the Steiner tree problem to within a factor of 96/95 [51] and the best current result is a (ln(4) + ε)-approximation by Byrka et al. [34] using an iterative randomised rounding technique.

1.5 The Min-Sum Set Cover Problem

We consider the Min-Sum Set Cover problem (MSSC) – a problem closely related to a wide spectrum of models in optimisation theory. We assume the input is a collection S = {S_1, S_2, ..., S_m} of sets covering a universe U. For an ordering A = {A_1, A_2, ..., A_m} of the sets in S, let C(i) denote the cover time of element i ∈ U – where the cover time is the smallest index j of a set A_j ∈ A that includes the element i.

Definition 18. Given a collection of sets S covering a universe U, the min-sum set cover problem involves finding an ordering of the sets in S such that ∑_{i∈U} C(i) is minimised.

The min-sum objective is equivalent to minimising the average cover time of the elements in U. Drawing from scheduling theory, we adapt the standard MSSC model to allow for precedence constraints. A precedence constraint is a binary relation i ≺ j indicating that i must occur before j in an ordering. We introduce the problem of precedence-constrained min-sum set cover, which we will refer to as precMSSC. If the sets in precedence-constrained MSSC are substituted with weighted jobs, the model simplifies to the well-known problem of precedence-constrained scheduling to minimise the sum of weighted completion times. This was given an efficient 2-approximation by Chekuri and Motwani [46] using a combinatorial approach that reduces the problem of finding a good precedence-closed collection of jobs to the min-cut problem, which can be solved optimally in polynomial time [83].
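For a concrete reading of the objective, the sketch below (illustrative only; sets are Python sets over a universe of hashable elements, and cover times are 1-indexed as in the definition) computes the min-sum cost of a given ordering:

    def min_sum_cost(ordering, universe):
        """ordering: list of sets A_1, ..., A_m in the chosen order.
        Assumes the ordering covers the whole universe."""
        cover_time = {}
        for index, subset in enumerate(ordering, start=1):
            for element in subset:
                cover_time.setdefault(element, index)   # keep the first index
        return sum(cover_time[e] for e in universe)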

Classic Set Cover Problems

Unlike the min-sum set cover problem, classic set cover models do not involve orderings or cover times – the objective is only to select sets such that all elements in the universe are covered.

Definition 19. Given a collection of sets S covering a universe U, the set cover problem involves finding a minimal sub-collection S̄ ⊆ S such that ∪_{A∈S̄} A = U.

Set cover is NP-complete [119] even in the special case where each element occurs in exactly two sets – this can be shown via a reduction from the vertex cover problem [76]. Selecting a set and covering elements is therefore equivalent to choosing a vertex and covering the adjacent edges. Extending to the general set cover problem requires that edges represent a set of vertices – this is the hypergraph vertex cover problem. The dimension of the hypergraph is the maximum number of sets that an element occurs in.

Algorithm 4 Greedy(S, U, w)
1: R ← U
2: RESULT ← ∅
3: while R ≠ ∅ do
4:     A ← arg max_{s∈S} |s ∩ R| / w(s)
5:     RESULT ← RESULT ∪ {A}
6:     R ← R \ A,  S ← S \ {A}
7: return RESULT
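In Python, the same greedy rule might be sketched as follows (a simplified illustration; sets is a dictionary from set names to element sets, w is a function giving each set's positive weight, and the union of the sets is assumed to cover the universe):

    def greedy_set_cover(sets, universe, w):
        """Repeatedly pick the set with the best ratio of newly covered
        elements to weight, until every element is covered."""
        remaining, chosen = set(universe), []
        available = dict(sets)
        while remaining:
            name = max(available,
                       key=lambda s: len(available[s] & remaining) / w(s))
            chosen.append(name)
            remaining -= available.pop(name)
        return chosen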

The greedy set-cover algorithm was analysed by Chvatal [54], and chooses sets in order of the number of new elements they cover. This algorithm is actually for the weighted version of the set cover problem, in which each set s ∈ S is assigned a (positive) weight w(s), and the objective is to minimise the sum of weights of sets in the solution S*, given as ∑_{s∈S*} w(s). This algorithm (given in Algorithm 4) has an approximation ratio of ln n − ln ln n + Θ(1) for the set cover problem [177]. The approximability threshold for set cover is (1 − ε) ln n unless all problems in NP have deterministic n^{O(log log n)}-time algorithms [76]. We can model weighted set cover in terms of the following linear program. Here, we use a binary variable x_j to indicate whether set S_j is in the solution.

Minimise    ∑_{j∈S} w_j x_j

s.t.        ∑_{j∈S : i∈S_j} x_j ≥ 1             ∀i ∈ U

            x_j ∈ {0, 1}                        ∀j ∈ S

It was noted earlier that covering problems have two standard models – we either aim to minimise the number of sets required to cover all elements, or are given an upper bound on the number of sets and aim to maximise the coverage. The max cover (or k-Max-Cover) problem is a bounded version of the set cover problem – we are given a bound k, and need to select at most k sets such that the maximum number of elements is covered. This problem is NP-hard, and cannot be approximated to within a factor better than 1 − 1/e + o(1) of optimal unless P = NP [76] – the greedy algorithm has a (1 − 1/e)-approximation guarantee [76; 102]. Many useful set functions are submodular – that is, the marginal value gained from adding a single element does not increase as the set grows. Intuitively, this can be considered in terms of diminishing returns. Formally, we have the following equivalent definitions:

Definition 20. A function r(·) on a finite set U is submodular if for every Y ⊆ U, X ⊆ Y and x ∈ U\Y we have r(X ∪ {x}) − r(X) ≥ r(Y ∪ {x}) − r(Y).

Definition 21. A function r(·) on a finite set U is submodular if for every X, Y ⊆ U we have r(X ∩ Y) + r(X ∪ Y) ≤ r(X) + r(Y).

Minimisation of submodular functions admits polynomial-time algorithms [110] – the maximisation problem is NP-hard [157]. Many optimisation problems, such as max cover, can be modelled in terms of maximising a submodular function subject to cardinality constraints – a simple greedy algorithm gives a (1 − 1/e)-approximation for all such problems [157].

Covering with Min-Sum Criteria

Definition 22. Given an ordered collection of sets A = A1,..., An, the cover time C(i) of an element i is the minimal index j such that i ∈ Aj.

Feige et al. [77] model the min-sum set cover problem in terms of a linear arrangement of the vertices of a hypergraph, with the objective of minimising the average cover time for the hyperedges. They give a 4-approximation for this problem by simplifying a result of Bar-Noy et al. [22], and prove that no (4 − ε)-approximation is possible (unless P = NP) as a consequence of a result for the related min-sum colouring problem [23]. We can define the min-sum colouring problem as follows: given a graph G = (V, E), a min-sum colouring is a linear ordering of the vertices in V that minimises the sum of cover times for the edges. Feige et al. [77] consider min-sum colouring as a special case of MSSC.

Min-sum set cover has been studied in a wide variety of contexts, with several application-specific generalisations. Munagala et al. [154] refer to the problem as pipelined set cover for the application of query optimisation on data streams, and use linear programming techniques to prove that both local-search and greedy algorithms give a 4-approximation for this problem. There is a significant amount of literature on similar models of linear orderings in the context of pipelined filter ordering – the goal is to order a collection of query operators, each associated with a cost and a probability that indicates the expected proportion of the data elements that it “filters”. That is, instead of a fixed set of elements being filtered out by each query, each query filters out elements with some probability – the behaviour of different queries is assumed to be independent. As with the min-sum model, the sum cost objective for query optimisation aims to minimise the sum of the costs for each of the data elements – here the total cost for an element j is the time taken before j is filtered, weighted by the cost factor of the query that filtered j. That is, the sum cost metric measures the weighted sum of filtering (or processing) times for each of the input elements, and the objective is to order the queries such that the expected costs are minimised. Burge et al. [32] considered precedence constraints in this context and found that not only is the sum cost metric NP-hard under this model, but an approximation with ratio better than n^θ (for an arbitrary positive θ) would imply an improved approximation algorithm for the densest k-subgraph problem (which has a current best approximation ratio of n^{1/4+ε} [28]). If the current best-known approximation ratio for densest k-subgraph is the best possible, then the best possible approximation for the sum cost metric is n^{0.064} unless P = NP [32].

Min-sum set cover can also be modelled in terms of a facility location problem. Chakrabarty and Swamy [39] defined the minimum latency uncapacitated facility location problem (MLUFL) and the minimum group latency problem (MGL), for which they identify MSSC as a special case. As MSSC is a special case of MLUFL on a uniform metric, it is also a special case of the matroid median problem, which allows an 8-approximation [180]. In the context of machine learning, Kaplan et al. [117] refer to an on-line variant of MSSC as learning with attribute costs – they give a 4-approximation for this problem based on the work of Feige et al. [77]. Shayman and Fernandez-Gaucherand [172] consider the problem of precedence-constrained fault detection in telecommunications networks – their objective function utilises a risk-sensitive optimality criterion that aims to minimise variance in fault detection. Blelloch et al. [29] give a parallel algorithm for set cover and facility location problems that can also be used to give a 4-approximation to MSSC. Cohen et al. [55] give a constant-factor approximation to the problem of sequential trial optimisation, which extends MSSC by only charging for the first f tests – Feige et al. [77] prove that a greedy algorithm gives a 4-approximation to this variant. MSSC is also closely related to a range of problems in decision tree analysis. Chakaravarthy et al. [37] proved the hardness of the minimal decision tree problem via a reduction from MSSC – the connection between this problem and MSSC was emphasised by Gupta et al. [96]. Ghasemzadeh and Jafari [86] define the min-cost identification problem for applications in constructing wireless sensor networks – they prove the hardness of this problem by a reduction from MSSC. The techniques used by Feige et al. [77] for MSSC have been useful in providing approximations for decision tree problems [38]. The min-sum vertex cover problem (MSVC) is a special case of min-sum set cover in which each set contains exactly two elements. Under the hypergraph cover model of MSSC, min-sum vertex cover can be interpreted as the special case where the hypergraph is a standard graph – each edge pairs exactly two vertices. This relates the problem back to the usual vertex cover problem. The distinction is in the objective function, as the min-sum variant of the vertex cover problem minimises the sum of cover times for the edges, rather than the final cover time – as with the difference between set cover and min-sum set cover. Using LP-rounding, Feige et al. [77] gave a 2-approximation to the min-sum vertex cover problem. Berenholz et al. [25] made an incremental improvement on this result, providing a 1.99995-approximation to MSVC. Using multi-stage randomised rounding on the same LP as Feige et al. [77], Iwata et al. [111] improved this to give a 1.79-approximation. They also introduce a generalisation of GenMSSC, and prove that the greedy MSSC algorithm of Feige et al. [77] gives a 4-approximation even when the cost function is supermodular.

Generalisations of Min-Sum Set-Cover

Consider the following equivalent formulation of MSSC: we have a collection S of sets from a universe U, and select elements from the universe to cover the sets. A set is considered covered the first time an element it contains is selected – the objective is to minimise the sum of cover times. A related problem is the min-latency set cover problem (MLSC) – in which the cover time of a set is the time at which the last of its elements is selected. This problem was proven to be strongly NP-hard by Hassin and Levin [101], who also gave an e-approximation algorithm. They also identified the problem as a variant of single-machine scheduling with precedences (denoted 1|prec| ∑_j w_j C_j in three-field scheduling notation [93]), for which there are several 2-approximations [46; 99; 83]. This construction involves modelling both the sets and their elements as jobs, with each element being precedent to the sets that contain it. MLSC is a variant on the minimum latency problem – given n points in a metric space, find a Hamiltonian path from a given point s that minimises the sum of visit times for the points. The minimum latency problem has also been referred to as the deliveryman problem and the travelling repairman problem [7], and several algorithms for this problem rely on solutions to the prize-collecting Steiner tree problem [7; 88]. The k-travelling repairman problem – using k tours to cover the vertices given a single depot – was given a constant-factor approximation by Fakcharoenphol et al. [75]. Lagoudakis et al. [129] refer to this problem as MiniAve (for minimising the average), and motivate their studies with applications to multi-robot routing. The multi-depot version of this problem has a 24-approximation by Chekuri and Kumar [45], who refer to the problem as the maximum coverage problem with group budget constraints and model it in terms of set covering. A generalisation of both MSSC and MLSC is the Generalised min-sum set-cover problem (GenMSSC), in which we are given a value k indicating the number of times a set needs to be hit before it is deemed to be covered. As with MSSC and MLSC, the objective is to minimise the sum of cover times. Azar et al. [15] introduce the generalised min-sum set cover problem in terms of a search ranking problem, which they refer to as multiple intents reranking. They give an O(log r)-approximation algorithm for the general problem, where r is the maximum size of a set. For the special cases of MSSC and MLSC, they give a 4-approximation and a 2-approximation respectively. Bansal et al. [19] gave a randomised constant-factor approximation for GenMSSC using a randomised LP-rounding technique – their approach has an approximation ratio of 485. This has been improved to a 28-approximation algorithm for GenMSSC by Skutella and Williamson [176]. Im et al. [109] gave a 2-approximation algorithm for GenMSSC with preemption, which can be rounded to give a 12.4-approximation for the non-preemptive problem – they conjecture that this can be improved to a 4-approximation.

The submodular ranking problem is an interesting generalisation of both the set cover problem and the min-sum set cover problem in which cover times are arbitrary submodular functions [14]. Submodular ranking extends the GenMSSC problem by allowing for arbitrary submodular functions. Im et al. [108] introduced the minimum latency submodular cover problem as a metric variant of the submodular ranking problem, which generalises both set cover and GenMSSC. The latency covering Steiner tree problem (LCST) is a generalisation of both the latency group Steiner tree problem and the generalised min-sum set cover problem – Im et al. gave an O(log² |V|)-approximation algorithm for LCST. LCST is a special case of the minimum latency submodular cover problem. This relates MSSC to Steiner tree problems, submodular ranking and minimum latency problems.

The association between MSSC and minimum latency problems is straightforward – the cost for each element covered in an MSSC solution is the wait-time or latency associated with covering the element. The connection to submodular ranking considers the cost function in terms of the sets instead of the elements – for each set in the solution, the profit from adding additional sets is submodular. Scheduling to minimise the weighted sum of a submodular function over the preceding elements has a 2-approximation algorithm that involves finding a maximum density subset [79].

1.6 Scheduling and Prioritisation with Precedences

Scheduling is a fundamental problem in operations research and computer science that appears in many variants. Scheduling problems generally involve ordering a set of jobs on a set of machines to optimise some objective – such as minimising the total processing time or some other cost function. We consider a specific type of scheduling problem, with applications in software testing, that is useful in modelling test-case prioritisation. In particular, given a set J of n jobs and some number M of machines, the goal is to create a schedule S, ordering the jobs in J and assigning them to the M machines, such that the objective function (usually related to maximum or total completion time) is optimised. For the case where there is only one machine, S is simply a permutation of the jobs in J. The completion time for each job j ∈ J under schedule S is denoted as C(j). A standard measure of the effectiveness of a schedule is the makespan, Cmax = max_{j∈J} C(j) – the time at which the final job is completed. Schedules are said to be preemptive when jobs are allowed to be stopped and restarted; we specifically consider non-preemptive schedules.

A common heuristic in machine scheduling is the Shortest Processing Time rule (SPT) – also referred to as Smith's rule – in which jobs are scheduled in increasing order of their processing times [178]. This approach produces an optimal solution for minimising the sum of completion times on a single machine in the absence of precedence constraints. For the case where jobs have different weights, the Weighted Shortest Processing Time (WSPT) heuristic [163] is commonly used. Rothkopf [168] initially considered the impact of time on the value of jobs, and introduced the Weighted Discounted SPT rule (WDSPT). Lawler and Sivazlian [131] generalised the SPT rule to account for delay costs and processing costs.

The more general problem of minimising the sum of some non-decreasing function of the completion times has been referred to as the min-sum scheduling problem. The first constant-factor approximation for this problem was given by Bansal and Pruhs [21], who give a 16-approximation. This was later improved to a pseudo-polynomial-time algorithm with a (4 + ε)-approximation ratio by Cheung et al. [48] using a primal-dual approach. (Cheung and Shmoys [49] initially reported that this algorithm gave a 2-approximation – this was later proven to be a 4-approximation by Mestre and Verschae [151].)

Scheduling relates to many of the other optimisation models we have considered previously. For example, rooted star covering problems are an alternative model for makespan-minimisation scheduling problems [74], hardness bounds for scheduling can be proven via relations with hypergraph vertex cover [20], and problems such as min-latency set cover generalise common scheduling models [101].

Test-case prioritisation is the problem of ordering software tests to optimise the rate at which defects in the code-base are identified. We consider a scheduling problem with applications to software testing. Empirical studies have suggested that in practice there can be underlying dependencies between tests, requiring that certain pairwise orderings between tests be met to accurately identify some defects [186]. To properly account for these relations, we consider the problem of test-case prioritisation with precedence constraints.

For a set S with some precedence constraints on its elements, a simple method for obtaining an ordering of the elements of S that meets the precedences is to use a topological sorting algorithm. As stated in Section 1.1, topological sorting of a DAG can be achieved in polynomial time. One standard topological sort algorithm (based on the O(|V| + |E|) breadth-first strategy of Kahn [115]) involves maintaining a set of "available" vertices, where a vertex is considered available if all of its precedences have been met by the current ordering.

Rothermel et al. [167] consider the application of greedy algorithms to the test-case prioritisation problem – they observe that greedy techniques are not optimal for this problem. Li et al. [139] consider the application of genetic algorithms and hill-climbing algorithms to the test-case prioritisation problem. In comparison to the greedy algorithms of Rothermel et al. [167], they find that their approaches are not as effective – while genetic algorithms perform relatively well, they are more computationally expensive than the more effective greedy alternatives.

Berend et al. [24] study a precedence-constrained test-ordering problem in which each test has an (independent) probability of uncovering faults – once a fault is detected, the test procedure halts. Under this model, the case without precedences is in P, and the addition of precedence constraints makes the problem NP-hard. They consider two objectives – minimising the expected runtime, and maximising the probability of halting before some predefined time T. Shayman and Fernandez-Gaucherand consider fault detection with precedences [172], where each component has an independent probability of failing – their objective involved minimising an exponential function that captures risk-aversion.
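Returning to the Kahn-style topological sort mentioned earlier, the following minimal Python sketch maintains the set of "available" jobs and repeatedly schedules one of them. The adjacency-dictionary representation of the precedence DAG is an assumption made for this example.

```python
from collections import deque

def topological_order(jobs, preds):
    """Kahn's algorithm: preds[j] is the set of jobs that must precede j."""
    remaining = {j: set(preds.get(j, ())) for j in jobs}
    succs = {j: [] for j in jobs}
    for j, ps in remaining.items():
        for p in ps:
            succs[p].append(j)
    # A job is "available" once all of its predecessors have been emitted.
    available = deque(j for j, ps in remaining.items() if not ps)
    order = []
    while available:
        j = available.popleft()
        order.append(j)
        for s in succs[j]:
            remaining[s].discard(j)
            if not remaining[s]:
                available.append(s)
    if len(order) != len(jobs):
        raise ValueError("precedence graph contains a cycle")
    return order

# Example: job 'c' depends on 'a' and 'b'.
print(topological_order(['a', 'b', 'c'], {'c': {'a', 'b'}}))
```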

Machine Scheduling and Precedence Constraints

Precedence relations define a partial ordering of the job set J, and the precedence constraints are often represented as a directed acyclic graph. The first work on machine scheduling with precedences to minimise the sum of completion times was by Sidney in 1975 [174]. A Sidney decomposition is a partitioning of the jobs into sets by using a generalisation of Smith's ratio rule [178], which orders jobs in decreasing order of w(j)/p(j), the ratio of weight to processing time. A Sidney decomposition involves identifying a set of jobs J* ⊆ J such that w(J*)/p(J*) is maximised and, for each job j ∈ J*, all predecessors of j are also in J*. The set of jobs in J* can be scheduled before those in J \ J* – these remaining jobs are further ordered by recursively applying the Sidney decomposition procedure.

Lawler demonstrated that finding a Sidney decomposition of the job set can be achieved in polynomial time [132]. Single-machine scheduling with precedence constraints to minimise the sum of weighted completion times is NP-hard for general precedence graphs even in the case of unit weights and processing times, but can be solved in O(n log n) time if the precedence relations are series-parallel [132]. There are other machine scheduling models which value completing jobs earlier, such as models with degrading machines (also referred to as deteriorating jobs [150]), or models which aim to maximise the revenue and in which task values decrease over time [185]. Deterioration models typically penalise later jobs by increasing the runtime or the rate of scheduled maintenance tasks, thereby increasing the length of the schedule.
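As a point of reference, the ratio rule that the Sidney decomposition generalises reduces, on a single machine without precedence constraints, to a one-line sort. The (weight, processing-time) pair representation below is an illustrative assumption; once precedences are present this ordering is only a heuristic.

```python
def wspt_order(jobs):
    """Order jobs by decreasing weight/processing-time ratio (Smith's rule).
    Optimal for 1|| sum w_j C_j; only a heuristic once precedences are added."""
    return sorted(range(len(jobs)),
                  key=lambda j: jobs[j][0] / jobs[j][1], reverse=True)

def weighted_completion_time(jobs, order):
    """Evaluate sum_j w_j C_j for a given permutation of the jobs."""
    t, total = 0, 0
    for j in order:
        w, p = jobs[j]
        t += p
        total += w * t
    return total

jobs = [(3, 2), (1, 4), (5, 1)]   # (weight, processing time)
print(weighted_completion_time(jobs, wspt_order(jobs)))   # 21
```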

AND/OR Scheduling and MSSC

Definition 23. Scheduling with AND/OR precedences is a scheduling model in which there are two types of jobs – AND-jobs are available only when all of their precedences are met, while OR-jobs only require that at least one of their precedences is met.

Goldwasser and Motwani [89] introduce the concept of AND/OR precedence-constrained scheduling as a model for a disk assembly problem. Note that when representing AND/OR precedences as a directed graph, cycles can be allowed – so the result is not necessarily a DAG. For the standard definition of a precedence graph (with only AND-jobs) there is no allowance for cyclic dependencies.

We can consider precMSSC in terms of scheduling with AND/OR precedences by representing the precedence-constrained sets as AND-jobs, and the elements as OR-jobs – an element can be covered once at least one of the sets covering it has been selected. To model precMSSC as an AND/OR schedule, we have a job for each set and for each element, with an OR-precedence from job i to job j iff i is a set containing j. This allows us to model the concept of a set covering an element – the job representing an element is available the first time we schedule a set that covers it. We include the precedences between sets in the initial precMSSC problem as AND-precedences between the respective jobs. The jobs representing sets have unit processing time p_i and zero weight w_i; those representing elements have zero processing time and unit weight. The weighted completion time for each element is the number of sets scheduled before it was covered, while the jobs representing sets have zero weight and therefore contribute no cost. So we can model this as 1|and/or| ∑ w_i C_i on jobs V1 and V2.

Scheduling to minimise the makespan is Label Cover-hard for general AND/OR precedences [89]. This means that the problem is as hard as Label Cover, a well-known problem in optimisation that admits no approximation achieving a ratio of 2^{log^{1−ε} n}, for any fixed value of ε > 0, unless all problems in NP can be solved by deterministic algorithms in O(n^{polylog(n)}) time [9]. Erlebach et al. [73] prove that minimising the average completion time with AND/OR precedences is Label Cover-hard, but that scheduling available jobs with Smith's shortest processing time (SPT) rule gives an O(n)-approximation. They further prove that this algorithm gives an O(√n)-approximation for the special case with a single processor where all jobs have equal weights. Note that this result does not carry over to precMSSC, where AND-jobs have zero weight and unit processing time, but OR-jobs have unit weight and zero processing time.

Coolen et al. [57] considered a subset of AND/OR precedence problems that is equivalent to MSSC – two groups of jobs V1 and V2 (V1 of sets, V2 of elements) and OR-precedences from V1 to V2 (indicating that a set covers an element). They refer to this problem as sequential search with disjunctive group activities, and consider the objective of minimising the weighted sum of completion times. When jobs in V1 have unit processing time and zero weight, and jobs in V2 have zero processing time and unit weight (as with MSSC), they prove this problem to be strongly NP-hard – even when each set covers exactly three elements. They show that a series of special cases can be solved in polynomial time, such as when the sets are disjoint, and when each set covers two elements.
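The translation from precMSSC to an AND/OR scheduling instance described above amounts to a small data transformation. The dictionary encoding below is an assumption made for illustration: it records, for each constructed job, its processing time, weight, job type and precedences, exactly as in the construction above.

```python
def precmssc_to_andor(sets, elements, set_precedences):
    """Build the AND/OR scheduling instance described above: a unit-time,
    zero-weight AND-job per set (keeping the original set-to-set precedences),
    and a zero-time, unit-weight OR-job per element, with an OR-precedence
    from every set that covers the element."""
    jobs = {}
    for s, covered in sets.items():
        jobs[('set', s)] = {
            'p': 1, 'w': 0, 'kind': 'AND',
            'preds': {('set', q) for q in set_precedences.get(s, ())},
        }
    for e in elements:
        jobs[('elem', e)] = {
            'p': 0, 'w': 1, 'kind': 'OR',
            'preds': {('set', s) for s, covered in sets.items() if e in covered},
        }
    return jobs

# Example: set B must follow set A; element 1 is covered by either set.
instance = precmssc_to_andor({'A': {1, 2}, 'B': {1}}, [1, 2], {'B': {'A'}})
```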

Single Machine Scheduling

Many machine scheduling problems can be simplified by considering the single machine variant. For example, single-machine scheduling to minimise the makespan is in P, and can be trivially solved with any schedule that has no breaks – the value is simply the sum of the processing times. The same objective with parallel machines (denoted P||Cmax) is NP-complete [133]. However, even for instances with parallel machines, fairly simple algorithms can achieve good approximations. For example, when the jobs are arbitrarily ordered, scheduling the next job in the sequence on the earliest available machine gives a 2-approximation (this is often referred to as list scheduling) [91]. If the jobs are first sorted in decreasing order of processing time, this approach gives a 4/3-approximation – this is known as the longest processing time (LPT) rule [92]. Hochbaum and Shmoys [104] gave a PTAS for P||Cmax. For jobs with unit-length processing times, scheduling with precedences to minimise makespan can be achieved optimally in polynomial time for instances with either one or two machines [81; 80].
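A minimal sketch of list scheduling on identical machines, together with the LPT variant obtained by pre-sorting, is given below; the use of a heap of machine loads is an implementation choice made for the example, not part of the original descriptions.

```python
import heapq

def list_schedule(processing_times, m, lpt=False):
    """Assign each job in sequence to the currently least-loaded of m machines.
    With lpt=True the jobs are first sorted by decreasing processing time."""
    jobs = sorted(processing_times, reverse=True) if lpt else list(processing_times)
    loads = [0.0] * m
    heapq.heapify(loads)
    for p in jobs:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + p)
    return max(loads)   # the makespan C_max

print(list_schedule([2, 3, 7, 1, 4, 6], m=2))            # plain list scheduling: 14
print(list_schedule([2, 3, 7, 1, 4, 6], m=2, lpt=True))  # LPT rule: 12
```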

Another common objective in machine scheduling is to minimise the average completion time, or the weighted sum of completion times. In the absence of further constraints, this problem can be optimally solved in polynomial time using Smith's rule [178]. Often each job j is assigned a release date r_j, indicating the earliest time that job will be available for processing. Using three-field scheduling notation [93], the single-machine scheduling problem would be represented as 1|| ∑ w(j) · C(j), or 1|r_j| ∑ w(j) · C(j) if we have release dates. This problem has been widely studied, and has been shown to be a special case of the vertex cover problem [4]. The unweighted version of this problem with release dates is NP-hard [133], but has a 2-approximation algorithm [160]. If we have precedence constraints instead of release dates, the problem becomes 1|prec| ∑ w(j) · C(j), which again is NP-hard [133], unless the precedence graph is a forest [105]. Hall et al. [99] give a 3-approximation for minimising ∑ w(j) · C(j) with release dates, a 2-approximation with precedences, and a 7-approximation for cases with both release dates and precedence constraints. Single machine scheduling with precedences, with the objective of minimising the weighted sum of completion times, can be denoted as 1|prec| ∑ w(j) · C(j) and modelled with the following LP:

Minimise   ∑_{j∈J} w(j) · C(j)

s.t.   C(j) = ∑_{i∈J} p(i) δij + p(j)      ∀ j ∈ J

       δij + δji = 1                        ∀ i, j ∈ J, i < j

       δij + δjk + δki ≤ 2                  ∀ i, j, k ∈ J, i < j < k or i > j > k

       δij = 1                              ∀ i, j ∈ J, i ≺ j

       δij ≥ 0                              ∀ i, j ∈ J, i ≠ j

Potts [164] gave the above linear ordering relaxation – the δij terms are linear ordering variables, where δij = 1 indicates that job i precedes j in the given schedule. This model was modified by Chudak and Hochbaum [52] to give a half-integral program and a 2-approximation. This problem also allows 2-approximations using techniques such as LP-rounding [99; 170]. Hall et al. obtained a 2-approximation algorithm using Potts's linear program [100]. Chekuri and Motwani [46] and Margot et al. [149] both independently proved that the linear ordering relaxation of Potts [164] has a factor-2 integrality gap for this problem.

As 1|prec| ∑ w(j) · C(j) is a special case of the more general submodular ordering problem, the earliest 2-approximation technique can be credited to Pisaruk in 1992 [161] – a result which was later refined to give a fully combinatorial algorithm [162]. A simple 2-approximation technique based on Sidney's decomposition theorem was independently discovered by Margot et al. [149] and Chekuri and Motwani [46].

Woeginger [184] considered the problem of single-machine scheduling with precedence constraints to minimise the average weighted completion time and proved the equivalence of the approximability ratio for the general problem and several special cases – including the case where all jobs have unit processing time. This work also introduces a 1.61-approximation algorithm for several of these special cases [184]. Correa and Schulz [58] and Ambühl and Mastrolilli [4] both prove that every instance of 1|prec| ∑ w(j) · C(j) can be converted into a vertex cover problem, and that an α-approximation for vertex cover implies an equivalent approximation for the machine scheduling problem. This problem does not admit a PTAS unless all problems in NP can be solved in randomised sub-exponential time [5].

Another interesting result is the ability to convert a single machine schedule into a parallel machine solution with a bounded loss in the approximation guarantee. Given a schedule with an α-approximation guarantee for some single machine scheduling problem, there is a procedure which can convert it into a solution for the multiple machine variant of the problem with a (2α + 2)-approximation guarantee [47]. This result applies even in the presence of precedence constraints, and is based on a modified list scheduling approach.

Test-Case Prioritisation

Testing is typically an important part of many software engineering projects, and when time and resources are constrained, test-case prioritisation can offer an effective way to organise tasks to achieve a given goal. We may wish to minimise the time taken to ensure complete code coverage for the tests that have been run, or to maximise the rate of code coverage so that defects are found as early as possible, thereby speeding up the rate at which defects can be fixed. For such purposes, it is valuable to be able to find a testing schedule with a high likelihood of identifying faults within the code as early as possible.

There are many possible constraints to consider when testing code. One aspect of this is how to effectively order the test cases in order to identify faults as quickly as possible. Test-case prioritisation is an approach for minimising the overheads of the software testing process by performing the tests in the test suite in an efficient order. An alternative method is test-case minimisation, where the objective is to reduce the overall size of the test suite while maintaining the same coverage [167].

While the majority of work on test prioritisation fundamentally assumes that tests are inherently independent, recent work suggests that there are implicit dependencies in a significant number of test suites that arise in practice [186]. We therefore consider a type of single-machine scheduling with precedences in which each job completes a set of tasks, and the goal is to achieve all of the given tasks, ordering the jobs so as to maximise the average rate at which tasks are achieved. For the purpose of fault detection, we can consider a task as a deterministic model for a defect. That is, where a test case has some probability of uncovering a defect, we model it with a job that deterministically covers a set of tasks. We focus on machine scheduling with applications in test prioritisation, where each test runs a set of the given subprocedures, and we aim to test these subprocedures for faults in order to find faults as quickly as possible.

We assume a fault matrix F, with each test t ∈ T associated with a set F[t] of faults it can identify. Given the set Π of permutations of test suite T, and some function f indicating the rate of fault detection for a given permutation, Rothermel et al. [167] define the test-case prioritisation problem as the problem of finding a permutation π ∈ Π such that f(π) is maximal. Using three-field scheduling notation [93], we can consider the precedence-constrained test-case prioritisation problem as 1|prec| f, where f is the (unknown) rate of fault detection. Since f depends on whether a fault is detected, and this cannot be determined before the tests are performed, one of the goals of test-case prioritisation is to identify an appropriate measure of f. We need to consider alternative means of estimating the fault-detection effectiveness of each of the tests using more accessible sources of information. One such measure is code coverage – the lines of code that each test executes. Given a code-coverage matrix L, we can attempt to model the actual rate of fault detection with the rate at which lines of code in L are covered. We can evaluate the effectiveness of a given schedule against the fault matrix F, indicating which faults are identified by which tests.

Rothermel et al. [167] and Elbaum et al. [71] consider the use of information acquired from prior test runs to improve estimation of fault locations. For instances where defect location information is not known in advance, we can consider code coverage as a proxy for fault coverage. The hypothesis here is that the more lines of code a given test covers, the higher the likelihood of identifying a fault. Rothermel et al. [167] consider a range of greedy algorithms that prioritise tests based on a range of criteria including code coverage, as well as prioritising tests that cover previously untested code sections and tests with a high probability of fault detection. Elbaum et al. [71] consider tests to have two degrees of coverage – total coverage, indicating the overall code coverage of a test, and additional coverage, measuring the previously untested code that a test will cover. Total coverage is a fixed value of a test, while additional coverage is ordering-dependent.

Haidry and Miller [98] have previously considered the addition of precedences to the test-case prioritisation problem. They conjecture that more complex interactions between tests in the precedence graph might imply a higher likelihood of faults being present in the relevant parts of the underlying code-base. Based on this assumption, they present a range of greedy-style heuristics that prioritise tests based on the structure of the precedence graph.
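As an illustration of the "additional coverage" criterion described above, the following hedged Python sketch greedily orders tests by the number of not-yet-covered lines each would add. The matrix representation (a mapping from test to the set of code lines it executes) is an assumption made for the example, and precedence constraints are ignored here.

```python
def additional_coverage_order(coverage):
    """Greedy test ordering: repeatedly pick the test that covers the most
    lines not yet covered by previously chosen tests."""
    remaining = set(coverage)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(coverage[t] - covered))
        order.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return order

# Example code-coverage matrix L: test -> lines of code executed.
L = {'t1': {1, 2, 3}, 't2': {3, 4}, 't3': {5}}
print(additional_coverage_order(L))   # t1 first; t2 and t3 tie for second
```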

Table 1.1: Current best-known solutions

Problem                  Best-known          Hardness
k-centre                 2 [90]              2 − ε
k-median                 2.611 + ε [35]      1 + 2/e [113]
1|prec| ∑_j w_j C_j      2 [46; 99; 83]      NP-hard [133]
P||Cmax                  1 + ε [104]         NP-hard [133]
R||Cmax                  2 [134]             3/2 [134]
set cover                1 − 1/e [76]        1 − 1/e + o(1) [76]
min-sum set cover        4 [77]              4 [77]

1.7 Summary

In this thesis, we build on ideas from a range of different models, including machine scheduling, clustering and graph traversal problems. As many of these problems are NP-hard, there often exist mappings allowing us to represent the problems we consider in many different contexts. Where we introduce new models, we provide hardness results for these problems.

As many clustering algorithms are fundamentally sequential and often involve large datasets, we consider parallel approaches for solving metric clustering problems. As metric graphs are usually used to represent distances, many of these models relate to graph traversal problems.

We find that min-sum objective functions appear in several of the models we consider. The k-star covering problem seeks to minimise a cost summed over the points assigned to each cluster centre. Many machine scheduling problems aim to minimise the sum of the running times of the jobs assigned to a given machine. Borrowing from these contexts, we add precedence constraints to the min-sum set cover problem.

Overall, we cover a wide range of topics connected under the theme of NP-hard graph-based optimisation problems. We consider several strategies for efficiently solving such problems – approximation algorithms, parallel solutions and heuristics – as well as analysing the complexity and approximation guarantees of our solutions.

Table 1.2: Our contributions

Problem                    Conditions     Approx.               Runtime       Hardness
k-centre                   Parallel       4                     O(k · n/m)    n/a
Min-sum set cover          Precedences    O(√(max{n, 2m}))      polynomial    O(max{20√m, 16√n})
Test-case prioritisation   Precedences    n/a                   O(n²)         NP-hard

2 Efficient Clustering with MapReduce

While approximation algorithms provide near-optimal results with polynomial time complexity, data sets are often large enough that running sequential algorithms requires prohibitively large amounts of RAM. Even at the high end, modern RAM capacities are measured in terabytes, while datasets such as the graph of connections in the human brain can reach petabyte scale [148]. For contemporary massive data sets, RAM-based solutions for clustering problems become impractical – and although there exist good sequential algorithms for approximating k-centre, they are not obviously parallelisable. We therefore consider the design and implementation of parallel approximation algorithms for the k-centre problem. MapReduce [63] is a framework for processing large datasets with which we can solve clustering problems such as k-centre. The schemes described here are based on an observation about the parallelisation of Gonzalez's 2-approximation algorithm [90], and on the initial work of Ene et al. [72] on clustering with MapReduce.

The O(1)-round factor-10 approximation of Ene et al. [72] is based on sampling; we analyse its runtime in depth. We also provide detailed experimental results to support our findings – the first for this problem, as the initial paper on this topic did not feature experimental results for k-centre. We observe that a pivotal subprocedure of the sampling algorithm only runs when k is sufficiently small relative to the number of data points, and that in some scenarios the procedure fails to terminate. As well as resolving each of these issues, we parameterise the sampling scheme to trade off runtime (based on the number of MapReduce rounds) against approximation guarantee (based on the probabilistic analysis). Experimentally, we find that this leads to better runtimes with only marginal impact on the quality of the average solution obtained.

Furthermore, we observe that Gonzalez's greedy algorithm can be efficiently parallelised in several MapReduce rounds; in practice, we find that two rounds are sufficient, leading to a 4-approximation. Via both empirical and theoretical analysis, we find that the parallel version of Gonzalez is about 100 times faster than both its sequential version and the parallel sampling algorithm [72]; MapReduce Gonzalez solution quality is comparable to that of the original sequential 2-approximation algorithm.

2.1 Clustering with MapReduce

The k-centre problem is one of the fundamental NP-hard clustering problems on a metric input. The key task is to choose the optimum set of k centres: each of the remaining vertices would be assigned to its nearest centre. As a reminder, we can define the problem as follows:

Definition 24 (k-Centre). Find a subset of the vertices of size at most k – we refer to these points as centres – such that the maximum distance from a vertex to its assigned centre is minimised. For a set of points V, a solution set S containing at most k vertices, and a distance function d, the objective is to choose S so as to minimise max_{v∈V} min_{s∈S} d(v, s).
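The objective value in Definition 24 can be computed directly; the following small helper (names and point representation assumed for illustration) evaluates the covering radius of a candidate centre set.

```python
def covering_radius(points, centres, d):
    """k-centre objective: max over points of the distance to the nearest centre."""
    return max(min(d(v, s) for s in centres) for v in points)

# Example with one-dimensional points and the absolute-difference metric.
pts = [0.0, 1.0, 4.0, 9.0]
print(covering_radius(pts, centres=[1.0, 9.0], d=lambda a, b: abs(a - b)))  # 3.0
```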

Previous work on adapting clustering algorithms to the MapReduce framework includes facility location [30], the k-means problem [17] and the k-median problem [72].

Contributions

We describe a multi-round parallel algorithm for k-centre, analyse several parallel algorithms in detail, and compare the performance of these approaches with the corresponding sequential method. Inspired by theoretical guarantees and evaluation, we run comprehensive experiments, including trading off approximation for running time. We analyse in detail the strengths of our MapReduce scheme for k-centre, and investigate more carefully an existing scheme.

We provide a very careful and detailed examination both of the best-known MapReduce approximation algorithm for k-centre [72], based on sampling, and of a parallel implementation of Gonzalez's algorithm (that typically gives a 4-approximation). The 2-round special case of the latter approach was recently considered by Malkomes et al. [147], although their analysis and experiments differ considerably from ours. We provide a supplementary result bounding the solution quality for greedy k-centre solutions on any subset of the initial dataset. Furthermore, Malkomes et al. [147] only compare against random algorithms – we provide the first comparison against existing parallel k-centre approximation algorithms. We describe in depth the performance and computational requirements of these approaches, and detail how this procedure can be adapted to allow for cases where RAM is insufficient even for the 2-round parallel solution. Based in part on a careful calculation of its running time, we generalise the sampling MapReduce scheme of Ene et al. [72] to trade off approximation guarantee for speed.

We compare the performance of our MapReduce Gonzalez (MRG) algorithm with the sequential algorithm of Gonzalez [90] (GON) and the parallel sampling algorithm of Ene et al. [72] (EIM). Our experiments (the first for EIM) show that MRG is significantly faster than the alternatives, while being almost as effective. Our results conform with the findings of Malkomes et al. [147] regarding the performance of parallel greedy algorithms for k-centre.

Ene et al. [72] presented only theoretical results for their k-centre MapReduce scheme; they reported that their algorithm performs poorly due to the sensitivity of k-centre to sampling. Unfortunately, there are neither experimental results nor implementation details to confirm this. To contrast with our simpler algorithm, we empirically investigate the performance of their k-centre scheme.

2.2 Parallel k-centre

We describe and analyse an approximation algorithm for the k-centre problem that, for most practical cases, achieves a 4-approximation in only two MapReduce rounds. The intuition is that, in the first round, each reducer runs a sequential k-centre algorithm to produce a sample such that the distance from every unsampled point to the sample is bounded. Running a standard factor-2 algorithm on the sample then yields a factor-4 solution to the whole instance. Additional rounds can be performed in cases where even the sample is too large for a single machine: this would usually occur for very large values of k. Experiments show that this approach often returns solutions of quality as good as that of the baseline sequential 2-approximation algorithm. The 2-round case of our algorithm is similar to the approach of Malkomes et al. [147]. Along with generalising to larger instances, we analyse the runtime of these algorithms in more detail, and provide an alternative, shorter proof of the two-round factor-four approximation.

Description

The standard k-centre approach that will be referenced in this thesis is the factor-2 approximation of Gonzalez [90], which we refer to as GON. This algorithm chooses an arbitrary vertex from the graph and marks it as a centre. At each following step, the vertex farthest from the existing centres is marked as a new centre, until k centres have been chosen. As the edge weights comprise a metric, the triangle inequality ensures that the resulting set of centres is a 2-approximation.
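A minimal Python sketch of this farthest-first procedure follows; the list-of-points input and the externally supplied metric d are assumptions made for the example rather than details of Algorithm 2.

```python
def gonzalez(points, k, d, first=0):
    """Farthest-first traversal (GON): start from an arbitrary point, then
    repeatedly add the point farthest from the centres chosen so far."""
    centres = [points[first]]
    # dist[i] = distance from points[i] to its nearest chosen centre.
    dist = [d(p, centres[0]) for p in points]
    while len(centres) < k:
        i = max(range(len(points)), key=lambda j: dist[j])
        centres.append(points[i])
        dist = [min(dist[j], d(points[j], points[i])) for j in range(len(points))]
    return centres

pts = [0.0, 1.0, 4.0, 9.0, 10.0]
print(gonzalez(pts, k=2, d=lambda a, b: abs(a - b)))   # [0.0, 10.0]
```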

Parallelised version. Given a point set V and a metric d, with OPT representing the optimal covering radius, Algorithm 5 obtains a set of centres {Si} for which all points in Vi – where {Vi} partitions V – are within radius 2 · OPT of Si. Running GON (see Algorithm 2) on the sample S = ∪i Si obtains k centres whose covering radius for S is 2 · OPT. Assume that we have m machines, each with capacity c. If n/m ≤ c and k · m ≤ c then, due to the triangle inequality, this results in a 4-approximation MapReduce algorithm for k-centre. If the sample is too large to fit onto the final machine, additional iterations of the first round can be run on the sample until there are few enough points – this procedure is visualised in Figure 2.1. We will prove that each additional iteration adds 2 to the overall approximation ratio. We refer to this multi-round scheme for k-centre as MRG, for "MapReduce Gonzalez"; the procedure is shown in Algorithm 5.

Algorithm 5 MRG(V, k, m)
1: S ← V
2: while |S| > c do
3:     Let Vi refer to the points from V mapped to reducer ρi.
4:     Each ρi runs GON on Vi, and returns the set of k centres Si.
5:     S ← ∪i Si
6: The mapper sends all points in S to a single reducer.
7: This reducer runs GON on S, and returns the set of k centres CG.
8: return CG
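The following sequential simulation of MRG reuses the gonzalez sketch given above; the chunking of V into m parts and the machine capacity c are parameters of the simulation, and the fragment is only an illustration of Algorithm 5, not the implementation used in our experiments.

```python
def mrg(points, k, m, c, d):
    """Simulate MRG sequentially: run GON on each of m chunks, keep sampling
    down while the sample exceeds capacity c, then solve on one machine."""
    sample = list(points)
    while len(sample) > c:
        chunk = (len(sample) + m - 1) // m
        parts = [sample[i:i + chunk] for i in range(0, len(sample), chunk)]
        sample = []
        for part in parts:                        # one simulated reducer per part
            sample.extend(gonzalez(part, min(k, len(part)), d))
    return gonzalez(sample, min(k, len(sample)), d)   # final round

pts = [float(x) for x in range(100)]
print(mrg(pts, k=3, m=4, c=20, d=lambda a, b: abs(a - b)))
```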

Analysis of MRG

Figure 2.1: A flowchart visualising the decision procedure for MRG: the current data is split across reducers V1, …, Vm, each returning a sample Si; if the combined sample satisfies |∪i Si| ≤ c, we solve on the sample and return the solution, otherwise the sampling round is repeated on the combined sample.

Algorithm MRG clearly runs in polynomial time; to prove the factor-four approximation of the two-round case, we prove the following intermediate result. RAM usage is O(max(n/m, k · m)), depending on which of the parallel rounds receives the most points. For an arbitrary subset S of the vertex set V, let S^G denote the set of points in the solution obtained by running GON on S, and let SOL_S denote the objective value – the covering radius – of this solution.

Lemma 25. For each S ⊆ V, SOL_S ≤ 2 · OPT.

Proof. Let V^* be an optimal set of centres. The vertex set V can be partitioned into k sets V^*_1, …, V^*_k such that all points in set V^*_j are within OPT of some centre j ∈ V^*.

First, consider the case where every set S^G ∩ V^*_j has exactly one point. This point, s_j, can serve as the centre for every point in V^*_j. Then every point x in V^*_j, and hence in V^*_j ∩ S, is within 2 · OPT of s_j, as both x and s_j are within OPT of j ∈ V^*.

However, if this is not true then there is some partition V^*_j with |S^G ∩ V^*_j| > 1; in this case we can show that all points in partition V^*_j are within 2 · OPT of each other. Algorithm GON adds a new centre to S^G only when it is the farthest from the points previously added to S^G. The presence of two centres within 2 · OPT of each other implies that all points in S are within 2 · OPT of S^G (if there were some point farther, it would be in S^G instead). Therefore, for every subset S of V, the value of the k-centre solution returned by GON on S is at most twice the optimal solution for V.

Lemma 26. Each iteration of Algorithm 5 adds two to the approximation factor.

Proof. We prove this for the base case; the case for further iterations can be proven inductively. Let Vi refer to the points mapped to reducer ρi. Since we run GON on Vi, every point in Vi is within 2 · OPT of a centre in Si and hence in S. From Lemma 25, running GON on S arrives at a set of centres CG whose covering radius on S is at most 2 · OPT. By the triangle inequality, it then follows that every vertex in the graph is within 2 · OPT + 2 · OPT = 4 · OPT of the k centres CG.

k-centre in Two Rounds

With sufficient space, the consequence of Lemmas 25 and 26 is a factor-four approximation. We now describe the properties of the setup and input for instances when the algorithm can run effectively in two rounds. We can assume that n > k; otherwise the solution to k-centre is trivial. We further assume that n/m > k: if this is not the case, then we can reduce the number of machines. For small k, we only require that there is sufficient space across the machines to store the dataset: that is, m · c ≥ n. We could also exploit external memory, for example by running multiple instances of our MapReduce algorithm and using a k-centre algorithm on the disjoint union of the solutions.

Theorem 27. If n/m ≤ c and k · m ≤ c, then the k-centre algorithm can be implemented in MapReduce in two rounds with a 4-approximation guarantee.

Multi-round analysis

For instances where k · m > c, we lack the required memory to store the sample on a single machine, and therefore run further iterations of the while loop. In such cases, MRG uses more MapReduce rounds, which in turn loosens the approximation guarantee. We now analyse the general performance of MRG, including the approximation guarantee for instances where two rounds are not sufficient. During each sampling iteration, the size of the sample is decreased – this procedure ends when the remaining points can fit on a single machine. We assume that k < c; without this condition, selecting k centres from a single machine seems to require incorporating external memory in some manner. Even relaxing the requirement that k · m ≤ c, it is still necessary that k ≤ c.

Figure 2.2: A bad assignment and seeding for MRG.

Assuming that n/m ≤ c, after the first round we have k · m centres, so we send them to m′ = ⌈(k · m)/c⌉ ≤ (k · m)/c + 1 machines. After the second round, we have k · m′ centres, which we can send to m″ ≤ ⌈(k · m′)/c⌉ ≤ m · k²/c² + m · k/c + 1 machines. In general, the number of machines required after i rounds observes the bound

    m(i) ≤ m · (k/c)^i + (1 − (k/c)^i)/(1 − k/c),      (2.1)

and we can run the final round when m(i) < 2. As i increases, the second term in inequality (2.1) approaches 1/[1 − (k/c)], which itself will be less than 2 only if 2k < c. Intuitively, during each round we select k centres from each of the machines, so if k is close to c then the reduction in the number of centres in each round will be small. However, as long as k is relatively small compared to c, this should only involve a small number of rounds. For each additional round, we increase the approximation bound additively by α = 2, so when using i iterations of the inner loop the resulting algorithm has a 2(i + 1)-approximation guarantee. Combining the arguments in these two paragraphs, we conclude with the following lemma.

Lemma 28. Choosing i so that inequality (2.1) gives m(i) < 2, if n/m ≤ c, then k-centre can be approximated in i rounds with a 2(i + 1) · OPT guarantee.
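Inequality (2.1) can be evaluated numerically to predict how many sampling rounds are needed before the sample fits on one machine; the helper below is purely illustrative, and the parameter values in the example are assumptions.

```python
def rounds_needed(k, c, m, max_rounds=100):
    """Smallest i with m(i) < 2, where m(i) bounds the machines needed after
    i rounds as in inequality (2.1); requires 2k < c, as discussed above."""
    if 2 * k >= c:
        return None
    ratio = k / c
    for i in range(1, max_rounds + 1):
        m_i = m * ratio ** i + (1 - ratio ** i) / (1 - ratio)
        if m_i < 2:
            return i
    return None

print(rounds_needed(k=100, c=10000, m=50))   # one sampling round suffices here
```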

Figure 2.3: Points can be as far as 2 · OPT from the nearest centre in the sample.

Tightness of Approximation

The worst-case approximation ratio is tight for this algorithm – we can construct cases where the approximation ratio is arbitrarily close to our bound. Intuitively, this algorithm can perform poorly when a point is assigned to a centre 2 · OPT away from it, which is in turn assigned to a centre 2 · OPT away in the second round. In the worst case, these distances are additive (bounded by the triangle inequality), which results in a 2(i + 1)-approximation.

Figure 2.4: Adversarial seedings can result in solutions as bad as 4 · OPT for MRG. The large points indicate the solution for k = 3, while the circled points are the other sample points. Colour is used to indicate points assigned to each of the three machines.

As an example, consider an instance which runs in two rounds – we will show how to construct cases for which Algorithm 5 can return a 4-approximation given a sufficiently poor assignment to machines and choice of initial points for the GON subprocedures. Figure 2.2 shows an example of such an assignment for k = 3 and m = 3. The radius of an optimal solution is given by the large red circles. Each colour indicates points sent to a different machine – for each machine, the initial centre has been circled. The initial choice of centre for the red machine is peripheral to two clusters – one of these clusters is not well represented on either of the other two machines. Figure 2.3 shows the sample chosen from this seeding, with some points as far as (2 − ε) · OPT from the nearest sample centre. With a poor choice of seed for the final k-centre instance on the sample, this can result in the solution indicated by the large points in Figure 2.4, where some points are close to 4 · OPT from the nearest centre returned by Algorithm 5.

2.3 Analysis of EIM sampling

We have shown how parallel instances of a k-centre algorithm can be used to obtain a sample of bounded distance from the data – we will compare against the probabilistic sampling procedure of Ene et al. [72]. In general, Ene et al.'s iterative-sampling procedure (which we refer to as EIM) has slower runtimes and slightly better solutions than both the sequential and MapReduce versions of Gonzalez's algorithm; hence we aim to explore trade-offs between these factors.

Algorithm 6 EIM-MapReduce-Sample(V, E, k, ε)
 1: S ← ∅, R ← V
 2: while |R| > (4/ε) k n^ε log n do
 3:     Map each partition R^i of R to a reducer ρi.
 4:     Reducer ρi adds points in R^i to S^i i.i.d. with probability 9kn^ε(log n)/|R|, and to H^i with probability 4n^ε(log n)/|R|.
 5:     Let H := ∪_{1 ≤ i ≤ ⌈n^ε⌉} H^i and S := S ∪ (∪_{1 ≤ i ≤ ⌈n^ε⌉} S^i). The mappers assign H and S to one machine.
 6:     The reducer sets v ← Select(H, S).
 7:     The mappers arbitrarily partition R, with R^i denoting these sets. Each reducer ρi receives v, R^i and S.
 8:     Reducer ρi returns Ri = {x ∈ R^i | d(x, S) > d(v, S)}.
 9:     Let R := ∪i Ri.
10: Output SOL_E := S ∪ R.

The general idea behind Algorithm 6 is that each iteration of the loop at step 2 will increase the size of the sample and decrease the size of the set R of "unrepresented points", exiting the loop when R is small enough to be added to the sample. There are three stages of map and reduce steps within each iteration of the sampling loop. The set R of "unrepresented points" is initially the entire data set. The first stage takes two random samples from the data – one to be added to the cumulative sample S, and the other to represent the remaining points. The second stage calls Algorithm 7 to find a "pivot" point based on the current samples – points closer to S than the pivot are considered "well represented" by the sample. The third stage involves removing points from R that are closer to the sample S than the pivot – as such points are now considered to be represented by the sample. These three steps are repeated until the size of R falls below the threshold size (4/ε)kn^ε log n – at which point the union of S and R is returned.

So that we can trade off runtime and approximation ratio, we add a new parameter φ to EIM that controls how aggressively points are removed from R. Furthermore, we make some alterations to the scheme to prevent eccentric behaviours that were sometimes observed in our implementation. These alterations sometimes decrease the size of the set R of points considered "unrepresented" by the sample after each iteration, thereby allowing us to improve the running times of the algorithm.

Termination

The first change we make to the algorithm adjusts the removal of points from R to ensure that the size of the set decreases in every iteration. In our implementation of step 8 of EIM-MapReduce-Sample, we remove vertices whose distance from S is equal to that from v to S. In the original presentation such a vertex would remain in R. Our change avoids rounds in which no vertices are removed from R, as this could lead to the procedure looping indefinitely.

In the initial version of the algorithm, the pivot point v (and all points in R equally distant from S) would not be removed from R. With relatively small graphs there can be significant overlap in the sets S and H, giving a non-trivial probability that the point v ∈ H will also be in S. In such cases, the vertex v will be at the same distance to S as the points prior to it in the ordering given in step 2 of Select(). As the threshold for removing points from R is strictly bounded above by the distance from v to S, if v ∈ S no points can be removed from R, as the metric property does not allow negative edge-weights. This would mean that even points added to the sample might not be removed from R. When this occurs in one iteration, it increases the relative size of R ∩ S, also increasing the probability of no vertices being removed from R in subsequent rounds, as H is sampled from R. If all points in R are eventually added to S then it follows that the sample H ⊂ S, and the algorithm cannot terminate, as the threshold for removal from R will always be zero. Therefore we assume that sampled points should always be removed from R, and have adapted the algorithm to reflect this.

Algorithm 7 Select(H, S), with our parameter φ.
1: For each point x ∈ H, find d(x, S).
2: Order the points in H according to their distance to S, from farthest to nearest.
3: Let v be the point in position φ log(n) in the ordering.
4: return v

Tradeoff

Ene et al. [72] prove that with high probability (w.h.p.) their MapReduce procedure will run in O(1/ε) rounds. An event is said to occur with high probability if the probability that it occurs is greater than 1 − O(1/n²). We consider how to adapt this algorithm to decrease the number of rounds, and hence improve runtimes. To achieve this, we introduce the parameter φ to Select() to trade off approximation and running time: the original algorithm fixed φ to be 8. In the original EIM scheme, the point v chosen from the sample H has the property that 8 log n of the points in H are farther from S than v. The expected number of points in R that are farther from S than v is therefore 8 log n · |R|/|H| = |R| · 2/n^ε. By choosing a different index for the point v, we can alter the number of points that remain in R: the sampling algorithm terminates when |R| falls below the threshold defined by v, so this potentially decreases the number of iterations. We introduce a variable φ, and choose v to be the φ log(n)-th farthest point in H from S. Setting φ = 8 is equivalent to the initial algorithm of Ene et al. [72], and as the difference is in the Select() subprocedure, the MapReduce implementation does not change.
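To make the parameterised pivot selection concrete, here is a hedged sequential sketch of one iteration of the sampling loop, incorporating both the φ-indexed Select and the modification that removes points whose distance equals the pivot's. Constants, data layout, the degenerate-case guard and the random seeding are illustrative assumptions only.

```python
import math, random

def d_to_sample(x, S, d):
    """Distance from x to the nearest point of the sample S (infinite if S is empty)."""
    return min((d(x, s) for s in S), default=math.inf)

def eim_iteration(R, S, k, eps, phi, d, n):
    """One sampling iteration: draw S- and H-samples from R, pick the pivot as the
    (phi * log n)-th farthest H-point from the sample, and drop from R every point
    whose distance to the sample is <= the pivot's (the removal fix described above)."""
    p_s = min(1.0, 9 * k * n**eps * math.log(n) / len(R))
    p_h = min(1.0, 4 * n**eps * math.log(n) / len(R))
    S = S | {x for x in R if random.random() < p_s}
    if not S:                                  # degenerate case: force one sample point
        S = {random.choice(list(R))}
    H = [x for x in R if random.random() < p_h] or list(R)
    H.sort(key=lambda x: d_to_sample(x, S, d), reverse=True)
    pivot = H[min(int(phi * math.log(n)), len(H) - 1)]
    threshold = d_to_sample(pivot, S, d)
    R = {x for x in R if d_to_sample(x, S, d) > threshold}
    return R, S
```

The outer loop of Algorithm 6 would call this until |R| falls below (4/ε)kn^ε log n and then return S ∪ R.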

Once we have the sample of the initial data given by EIM-MapReduce-Sample, a feasible k-centre solution can be obtained by running a sequential k-centre procedure on the given sample in an additional MapReduce round. Note that in line 3, R is partitioned into ⌈|R|/n^ε⌉ sets of size at most ⌈n^ε⌉, and in line 7 the mappers partition R into ⌈n^{1−ε}⌉ sets of size at most ⌈|R|/n^{1−ε}⌉.

Analysis

In this section, we prove that the probabilistic 10-approximation for EIM still holds for the resulting procedure given appropriate values of φ, albeit with lower probability bounds. This proof is based on the original analysis, with the necessary details altered for correctness and relevance. The main impact of the parameter φ is to vary the number of points we consider to be represented by the existing sample. To analyse this, we first require a formal description of what it means for points to be well represented (or satisfied) by a sample. Let Y denote some subset of the vertex set V (that we hope will 'represent' V). Each vertex x ∈ V is assigned to its closest point in Y, breaking ties arbitrarily but consistently. For each x ∈ V, let A(x, Y) denote the point in Y to which x is assigned; if x is in Y we assign x to itself. For y ∈ Y, let B(y, Y) denote the set of all points assigned to y when Y is the 'assignment subset'.

Definition 29. For a sample S, set Y ⊆ V and assignment A(x, Y) for each x ∈ V, a point x is satisfied by S with respect to Y if min_{y∈S} d(y, A(x, Y)) ≤ d(x, A(x, Y)). In words, there is a point y in the sample that is at least as close to A(x, Y) (the Y-assigned point of x) as x is.

Let SOL_E denote the set of points returned by EIM-MapReduce-Sample (Algorithm 6).

The set SOL_E might not include every centre in Y, but a point x can still be satisfied when A(x, Y) ∉ SOL_E, provided SOL_E includes a point that is at least as close to A(x, Y) as x is. If EIM-MapReduce-Sample returns a set that satisfies every point in V, then the sample is very representative of the initial dataset, and the clustering algorithms based on it should perform well. However, there is no guarantee that the sample satisfies all points; instead we can only guarantee that the number of unsatisfied points is small, and their contribution to the performance of the clustering algorithms is negligible compared to that of the satisfied points.

Figure 2.5: The blue points are in the set S^*. If a sample S includes a point inside the circle, then S satisfies x with respect to S^*.

The sets at the core of EIM-MapReduce-Sample change with every iteration. Denote the state of the sets R, S and H at the beginning of iteration ℓ by R_ℓ, S_ℓ and H_ℓ respectively, where R_0 = V and S_0 = ∅. The set of points that are removed from R during iteration ℓ is denoted by D_ℓ, so R_{ℓ+1} = R_ℓ \ D_ℓ. Let U_ℓ denote the set of points in R_ℓ that are not satisfied by S_{ℓ+1} with respect to Y. Let U denote the set of all points that are not satisfied by SOL_E (the sample returned by the algorithm) with respect to Y.

If a point x is satisfied by S_ℓ with respect to Y, then it is also satisfied by SOL_E with respect to Y, and therefore U ⊆ ∪_{ℓ≥1} U_ℓ.

Figure 2.6: Point x is satisfied by S with respect to S^*, but point y is not.

Lemma 30. Let Y be an arbitrary set with no more than k points. Consider iteration ℓ of EIM-MapReduce-Sample, where ℓ ≥ 0. Then P[|U_ℓ| ≥ |R_ℓ|/(3n^ε)] ≤ n^{−2}.

From the analysis of Ene et al. [72], we have the following proof of Lemma 30.

Proof. Consider some point y in Y. Recall that v(y, Y) denotes the set of points that are assigned to y, and that U_ℓ is a subset of R_ℓ. It suffices to show that

    P[ |U_ℓ ∩ v(y, Y)| ≥ |R_ℓ|/(3kn^ε) ] ≤ 1/n³.

The lemma follows by taking the union bound over all points in Y, noting that |Y| ≤ k ≤ n. Now we need to bound the probability that |U_ℓ ∩ v(y, Y)| ≥ |R_ℓ|/(3kn^ε) occurs (noting that v(y, S^*) ∩ R_ℓ is a superset of v(y, Y) ∩ U_ℓ). This event would imply that none of the |R_ℓ|/(3kn^ε) points of v(y, Y) ∩ R_ℓ closest to y were added to S_ℓ, because if some point x were added to S_ℓ, then all points in v(y, S^*) ∩ R_ℓ farther than x from y would be satisfied, and therefore not in U_ℓ. The probability that a point is not added to S_ℓ is 1 − 9kn^ε(log n)/|R_ℓ| by step 4 of Algorithm 6, with the event for each point being independent, and we need this to hold for |R_ℓ|/(3kn^ε) points. Hence we have

    P[ |U_ℓ ∩ v(y, Y)| ≥ |R_ℓ|/(3kn^ε) ] ≤ (1 − 9kn^ε(log n)/|R_ℓ|)^{|R_ℓ|/(3kn^ε)}
                                         ≤ (1 + log(n^{−3})/(|R_ℓ|/(3kn^ε)))^{|R_ℓ|/(3kn^ε)} ≤ 1/n³.

The last inequality follows by noting that lim_{x→∞} (1 + z/x)^x = exp(z), with x = |R_ℓ|/(3kn^ε) and z = log(n^{−3}). For positive x this is an increasing function, giving us an upper bound of exp(log(n^{−3})) = 1/n³.

We now show that our adaptation of EIM retains the same probabilistic (4α + 2)-approximation guarantee.

Lemma 31. Let Y be a set of no more than k points. Consider iteration ℓ of the while loop in Algorithm 6. Let v_ℓ denote the threshold in the current iteration: the point in H_ℓ that is the φ log(n)-th most distant from S_{ℓ+1}. Then there exist values a and b such that, for some γ > 0,

    P[ a|R_ℓ|/n^ε ≤ |R_{ℓ+1}| ≤ b|R_ℓ|/n^ε ] ≥ 1 − 2/n^{1+γ}.

Proof. Recall that we selected a pivot point v, and discarded the points that are well represented by the current sample, compared to v. Note that R_{ℓ+1} is the set of points in R_ℓ whose distance to the sample S_{ℓ+1} is greater than the distance between the pivot point and the sample. Ene et al. introduce these handy definitions: for a vertex t, we refer to the number of points in R farther from S_{ℓ+1} than the point t as the rank_R of t; for some value i and a set Y ⊆ R, define L(i, Y) = |{x ∈ Y : rank_R(x) ≤ i}|, the number of points in the set Y that have rank at most i. Let r = d|R_ℓ|/n^ε, and let |H_ℓ| = c · n^ε log n. By design,

E[L(b · r, H`)] = b · r · |H`|/|R`| = bcd log n .

If a · r ≤ |R`+1| ≤ b · r, then with high probability, the pivot (chosen to be the th φ log(n) point in H`) will be in the range [L(a · r, H`), L(b · r, H`)]. Choosing δ so that

(1 + δ)E[L(a · r, H`)] ≤ φ · log n gives δ ≤ −1 + φ/(acd). By the Chernoff inequality,

P[L(a · r, H`) ≥ φ · log(n)] (2.2)

= P[L(a · r, H`) ≥ (1 + δ)E[L(a · r, H`)]] (2.3)

= P[L(a · r, H`) ≥ (1 + δ) · acd log(n)] (2.4) −δ2 · acd log(n) ≤ exp (2.5) 2 + δ 2 = n−δ ·acd/(2+δ). (2.6)

Since the Chernoff bound requires that δ > 0, we insist that φ > acd. 2.3. ANALYSIS OF EIM SAMPLING 79

−(1+γ) The lemma statement can be satisfied by P[L(a · r, H`) ≥ φ · log n] ≤ n , which we can achieve by finding values of a, c, d, and φ that, by Inequality 2.6 for some γ > 0 satisfy (φ/(acd) − 1)2 · acd ≥ (1 + γ) . (2 + (φ/(acd) − 1)) Letting x = 1 + γ, this is equivalent to (acd)2 − (2φ + x)acd + (φ2 − xφ) ≥ 0, which p has real roots at acd = φ + x/2 ± 2xφ + x2/4. Similarly,

P[L(b · r, H`) ≤ φ · log n] (2.7)

= P[L(b · r, H`) ≤ (1 − δ)E[L(b · r, H`)]] (2.8)

= P[L(b · r, H`) ≤ (1 − δ) · bcd log n] (2.9) −δ2 · bcd log n ≤ exp (2.10) 2 2 = n−δ bcd/2 ≤ n−x. (2.11)

Choosing δ so that (1 − δ)E[L(b · r, H`)] ≤ φ · log n gives δ ≤ 1 − φ/(bcd). Since the Chernoff bound requires that δ > 0, this gives the constraint φ < bcd. For Inequality 2.11 to hold, we need to find values of b, c, d, and φ such that (bcd)2 − p (2φ + 2x)bcd + φ2 ≥ 0. This has real roots at bcd = φ + x ± 2xφ + x2. p This gives feasible solutions for acd ≤ φ + x/2 − 2xφ + x2/4 and bcd ≥ φ + x + p 2xφ + x2. For later results we require that a = 1 and b ≤ 5. So for there to exist feasible values of c and d, we have the following constraint,

p r φ + x + 2xφ + x2 x x2 ≤ φ + − 2xφ + , (2.12) b 2 4 where b ≤ 5 and x = 1 + γ. When this bound holds, we can find values of each of the parameters such that the probability of |R`+1| being outside of the defined bounds is less than 1/nx, and therefore with a = 1, b = 5,

|R | 5|R | P ` ≤ |R | ≤ ` ≥ 1 − 1/nx − 1/nx , nε `+1 nε

−(1+γ) which is 1 − 2n . With this probability, the number of points in R` that are 80 CHAPTER 2. EFFICIENT CLUSTERING WITH MAPREDUCE

ε ε farther from S than v` (and hence the size of the set R`+1) is in the range [|R`|/n , 5|R`|/n ].

Ene et al. [72] prove that with probability 1 − O(1/n), it is possible to map each unsatisfied point to a satisfied point such that no two unsatisfied points are mapped to the same satisfied point. Such a mapping allows them to bound the cost of the unsatisfied points with regards to the cost of the optimal solution. Their proof re- lies on the choice of b = 4, and the bound from Lemma 31 giving a probability greater than 1 − 2n−2. However, we use b ≤ 5, and only assure a probability of 1 − 2n−(1+γ). Therefore, we prove that the required mapping exists with probab- ility 1 − O(n−(1+γ)); by setting γ = (log log n)/ log n this gives a probability of 1 − O(1/ log n), which is sufficient for large values of n. The choice of b arises from the requirement that b/nε < 2: for ε = 0.1 and n ≥ 10,000, this holds for b ≤ 5. In the original analysis, Ene et al. proved that their results hold with high probability, which they define as having probability ≥ 1 − O(1/n2). We instead bound our confidence in these results with probability 1 − O(1/n1+γ) for a variable γ, which we will refer to as with sufficient probability, or w.s.p.. The following result follows from the above analysis and that given by Ene et al [72].

Lemma 32. Running an α-approximation algorithm on the sample returned by Algorithm6 gives a probabilistic 4α + 2-approximation for k-Centre w.s.p. for φ > 5.15.

When running a 2-approximation algorithm on the sample, this result gives a 10- approximation bound on the resulting procedure. To achieve a success probability higher than 1 − O(1/n), we need x ≥ 1: by the bound in Inequality (2.12), this implies that φ > 5.15.

2.4 Runtime

We now analyze the runtimes of the MapReduce algorithms for k-centre described above. Ene et al. [72] proved that their sampling procedure required O(1/ε) rounds with high probability, while MRG can run in two rounds given sufficient resources. In this analysis we consider also the computations required in each of the rounds to determine the expected overall runtime. 2.4. RUNTIME 81

MRG:

Assuming that n/m ≤ c and k · m ≤ c, it follows that MRG will run in two consecutive MapReduce iterations. The first iteration involves running m concurrent instances of Gonzalez’s k-centre algorithm [90], each on n/m vertices. The runtime of Gonzalez’s algorithm on N points is O(k · N): each time a new centre is selected, we need to find the distance of that centre to all of the other vertices. So the runtime for the first round of MRG is O(k · n/m). During the second round, MRG runs the k-centre algorithm on the k · m centres obtained from the first round; this gives a runtime of O(k2 · m). Therefore the total runtime of MRG is O(k · n/m + k2 · m) – for reasonable values of the parameters we would expect the dominant term to be kn/m (as k > n is a trivial case for k-centre).

EIM:

The sampling algorithm EIM has, with high probability, T ∈ Θ(1/ε) iterations of the primary "while" loop at step 2 of Algorithm 6 – each comprising three MapReduce rounds – which is followed by a final clean-up round at the end that solves a single k-centre instance on the sample. Let R_ℓ and S_ℓ denote the state of sets R and S, respectively, in iteration ℓ of the main loop of the algorithm. Counting from the first iteration, |R_0| = n and, with high probability, |R_ℓ| = O(n/n^{ℓε}). In each iteration, points in R are added to H with probability 4n^ε(log n)/|R|, so |H| is O(n^ε log n). And in step 5, |S_ℓ| becomes |S_{ℓ−1}| + O(kn^ε log n), so that, starting with |S_0| = 0, we have |S_ℓ| = O((ℓ + 1)kn^ε log n). We now analyse each of the MapReduce rounds.

Round 1 The first set of reducers assigns the points in R to the sets S and H, which happens independently on each of the m machines over which R is partitioned. This round involves O(|R_ℓ|/m) operations during iteration ℓ, so the total number of operations is

∑_ℓ |R_ℓ|/m ∈ O( (1/m) ∑_ℓ n/n^{ℓε} ) ⊆ O( (1/(1 − n^{−ε})) · (n/m) ).

Round 2 The second reduce round sends the points in S_ℓ, together with those in H, to a single reducer, and finds the distance between points in H and those in S. This uses O(|H| · |S_ℓ|) distance calculations per iteration, which is O((ℓ + 1)k(n^ε log n)²), for each of the T ∈ O(1/ε) (w.h.p.) iterations. The total is O(k(n^ε log n)² ∑_{ℓ<T}(ℓ + 1)), which is O(k(n^ε log n)²/ε²).

Round 3 The third reduce round finds the distance between points in R_ℓ and the points in the sample S_ℓ, in order to compare them to the pivot point v, though split across m machines. The third round requires O(|R_ℓ| · |S_ℓ|/m) distance calculations in iteration ℓ, so summing over all iterations this is

O( (kn^{1+ε} log n / m) ∑_ℓ (ℓ + 1)/n^{ℓε} ) ⊆ O( (kn^{1+ε} log n) / (m(1 − n^{−ε})²) ).

Final round This sends |S_T| points to a single machine, on which Gonzalez's k-centre algorithm is run. With high probability, the time this takes will be

O((k/ε)n^ε log n · k) = O((k²/ε)n^ε log n).

Based on our experimental evaluations, we find that the dominant procedure for EIM is Round 3, in which points are removed from R: this is not surprising, as in most cases k · n^{1+ε} is much larger than k²n^ε because k < n. Furthermore, MRG also has O(k · n/m) complexity for cases where k · m < n/m. Experiments confirm that the dominant round for each algorithm has a linear k term, rather than a quadratic k² term. Comparing the dominant round of each algorithm, we expect EIM to be slower by a factor of n^ε(1 − n^{−ε})^{−2} log n.

2.5 Experiments

We implemented MRG, the typically two-round algorithm, and EIM, our variant of the sampling algorithm from Ene et al. [72]. For consistency with previous literature, our implementation mimics that of Ene et al. [72]: simulating parallel machines sequentially on a single machine, and taking the longest of the machine processing times as the processing time for that MapReduce round. This does not account for moving data between machines, but as MRG involves fewer rounds, the cost of this would be lower than for EIM.

Table 2.1: Theoretical Performance Comparison

Algorithm    Approx.     Rounds    Runtime
GON [90]     2 · OPT     n/a       O(k · n)
MRG          4 · OPT     2         O(k · n/m)
EIM [72]     10 · OPT    O(1/ε)    O(kn^{1+ε} log n / (m(1 − n^{−ε})²))

We compare these parallel procedures with GON, the sequential algorithm of Gonzalez; his algorithm has a 2-approximation guarantee, giving a baseline with which we can compare the performance of the parallel algorithms. We evaluate the algorithms over a range of k and vary the number of inherent clusters. For their experiments, Ene et al. [72] generate synthetic data, designed to have a fixed number of similarly sized clusters, and test their algorithm for values of k equal to the number of clusters. The algorithms were implemented in C, and experiments were run on a "commodity" machine with 8 GB of memory and an Intel Core i7-2600 CPU running at 3.40 GHz.

Experimental Design

We test the three k-centre algorithms over a range of k and for varying numbers of inherent clusters – for many applications the number of clusters may not be known in advance, and the number of clusters required can be independent of the structure of the data. We extend these experiments to test on graphs with different underlying structures, to better determine how well these algorithms are likely to perform in practice. We use the Euclidean metric for all of the experiments given, and compute the distances between points as they are required, rather than storing them. The k-centre algorithm assumes a complete graph as input, and the procedure for sending vertices to reducers is independent for each vertex. In contrast, a matrix representation of a graph with all distances stored explicitly could result in a significant proportion of the data sent between machines being redundant in a given round. For all MapReduce implementations, we use GON as the subprocedure for selecting the final centres. Experiments use a fixed value of m = 50, and a range of values of n and k. As a preliminary step, we ran several instances of EIM over a range of values of ε, and determined that the optimal value (in terms of runtime and solution size) was ε = 0.1. This corresponds with the value Ene et al. use in their experiments. In Section 2.3, we introduced a parameter φ to the EIM sampling procedure; we test EIM over a range of values of φ. Our experiments consider the impact of lowering φ from the "original" value of 8, both in terms of runtime and effectiveness. Corresponding with results in Section 2.3, we test φ = 6 as a value that retains the initial approximation guarantee with sufficient probability; to determine the robustness of the algorithm, we test φ = 4 and φ = 1, which are below the bound of φ > 5.15 that was given in Lemma 32.

Data sets

We test against a combination of real and synthetic data sets, primarily in two and three dimensions, but with several real datasets of larger dimension. The datasets have a range of sizes, from 10,000 through to 1,000,000 points, with varying degrees of inherent clustering. By testing the performance of the algorithms on data sets with a known number of clusters and known cluster diameters, we can determine whether the algorithms can accurately identify the clusters. Our synthetic datasets have three different distributions:

Unif The n points are uniformly distributed in a two-dimensional square.

Gau The k′ cluster centres, where k′ might not equal k, are uniformly randomly generated in a unit cube. The n points are assigned to these clusters uniformly at random, resulting in clusters of roughly similar size. Points follow a Gaussian distribution around their centre with σ = 1/10, all mimicking Ene et al. [72] (see the sketch after this list).

UnB Similar to Gau, with the distribution biased such that around half of the points are in a single (inherent) cluster; the distribution between the remaining clusters remains uniform.
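The Gau distribution above is straightforward to reproduce; the following is a small Python/NumPy sketch under the stated parameters (the function name and seed handling are ours, not from the thesis implementation). The UnB variant can be obtained by skewing the cluster-assignment probabilities so that one centre receives roughly half of the points.

    import numpy as np

    def generate_gau(n, k_prime, dim=2, sigma=0.1, seed=None):
        """'Gau' synthetic data: k_prime centres drawn uniformly from the unit
        cube, each point assigned to a centre uniformly at random, then offset
        by an isotropic Gaussian with standard deviation sigma (1/10 here)."""
        rng = np.random.default_rng(seed)
        centres = rng.uniform(0.0, 1.0, size=(k_prime, dim))
        labels = rng.integers(0, k_prime, size=n)    # roughly equal-sized clusters
        points = centres[labels] + rng.normal(0.0, sigma, size=(n, dim))
        return points, labels

    # For example, a 100,000-point instance with 25 inherent clusters:
    X, y = generate_gau(100_000, 25, dim=2, sigma=0.1, seed=0)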

We generate three graphs of each size and type, and run the algorithms twice over each data set, taking the average. This gives a total of six results for each type of dataset, over three different graphs. We further test these procedures on real datasets from the UCI Machine Learning Repository [140], over a range of sizes, applications and dimensions. We run four tests over each of the real data sets, and take the average result. We include results for the 25,010-point training set for the Poker Hand data set, and the 10% sample from the 4,000,000-point KDD Cup 1999 data set.

Figure 2.7: Average solution value and runtimes over a range of k, for EIM, GON and MRG. Panels: average solution value for the KDD Cup dataset; average solution value for the Poker Hand dataset; runtime in seconds for the KDD Cup dataset; runtime in seconds for the Poker Hand dataset. The algorithms generally perform well, however EIM has a poor average on the KDD Cup 1999 10% sample from the UCI Machine Learning Repository [140].

Figure 2.8: Runtimes in seconds over a range of values of k. (a) Gau (n = 1,000,000, k′ = 25); (b) Unif (n = 100,000). Corresponding with our theoretical analysis, EIM runs slower than both MRG and the sequential alternative, with MRG being the fastest of the algorithms considered.

Figure 2.9: Average solution value over a range of values of k. (a) Gau (n = 1,000,000, k′ = 25); (b) Unif (n = 100,000). All algorithms behave comparably, and accurately identify fixed clusters.

2.6 Results

Overall MRG is faster than the alternative procedures, often by orders of magnitude, with EIM running slower than the sequential algorithm despite being parallelised, conforming with the analysis in Section 2.4.

Table 2.2: Average solution value over a range of k for Gau (n = 1,000,000, k′ = 25).

k      MRG     EIM     GON
2      96.04   93.11   95.86
5      61.90   61.58   63.31
10     41.31   39.43   39.72
25     0.961   0.854   0.961
50     0.762   0.683   0.719
100    0.607   0.556   0.573

Table 2.3: Average solution value over a range of k for Unif (n = 100,000).

k      MRG     EIM     GON
2      91.33   95.80   91.18
5      50.68   50.65   53.14
10     33.35   31.12   32.35
25     18.49   18.01   18.27
50     13.14   12.39   12.36
100    9.144   8.764   8.727

Table 2.4: Average solution value over k for UnB (n = 200,000, k′ = 25). EIM is notably better when k matches the number of clusters.

k      MRG     EIM     GON
2      97.96   93.69   93.37
5      64.61   64.28   61.72
10     40.17   40.05   40.39
25     0.932   0.828   0.939
50     0.668   0.643   0.655
100    0.515   0.530   0.500

Solution quality

While both MapReduce procedures have worse approximation guarantees than the sequential alternative, we find that in practice they perform comparably. In most cases, solutions for the parallelised algorithms are comparable to the 2-approximation baseline, GON, with EIM performing slightly better for synthetic data. Ene et al. [72] suggested that their algorithm did not perform well, citing sensitivity to outliers. Our results show otherwise: sampling fewer points can in fact occasionally provide better results, as there is a lower chance of sampling points that are peripheral to the cluster. The tendency for GON to favour outliers is mitigated, rather than amplified, by sampling. This effect is most evident for synthetic datasets such as Gau, particularly when k′ = k – as shown in Tables 2.2–2.4. In general, we find that EIM is marginally more effective than MRG. Excepting the poor performance of EIM on the KDD Cup 1999 10% sample, the same occurs on the real data sets, as seen in Figure 2.7.

Table 2.5: Average solution value versus φ, in EIM, for Gau (n = 200,000, k′ = 25). For each k, the lowest value is marked with an asterisk.

k \ φ    1       4       6       8
2        88.4    80.4*   85.5    86.5
5        59.9    60.9    56.5*   61.9
10       36.2    35.5    34.7*   35.3
25       0.796   0.780*  0.826   0.840
50       0.630   0.617   0.610*  0.666
100      0.478*  0.492   0.505   0.535

Table 2.6: Average running time versus φ, in EIM, for Gau (n = 200,000, k′ = 25). For each k, the lowest value is marked with an asterisk.

k \ φ    1       4       6       8
2        0.050*  0.059   0.165   0.135
5        0.080*  0.130   0.368   0.314
10       0.283*  0.480   0.549   0.552
25       0.588   0.505*  1.47    1.42
50       0.693*  0.816   2.84    2.24
100      0.726*  0.757   3.78    3.59

Running time

For the majority of the experiments, EIM ran using two iterations of the main loop, for a total of seven MapReduce rounds. Some of the graphs vary between one or two iterations — four or seven MapReduce rounds — as the number of points removed per round is probabilistic. From Figure 2.10, we can see how the ratio of n to k affects whether EIM uses the MapReduce sampling procedure or (simply) sends the entire dataset to a single machine. We can also note that in Figure 2.10b, MRG displays a different trend from Figure 2.10a. In Section 2.4, we showed that the runtime is O(k² · m + k · n/m). For larger values of k and small values of n, the k² · m term dominates. As n grows, the k · n/m term becomes more prominent, so the trend becomes similar to that in Figure 2.10a. From our runtime analysis in Section 2.4, both MRG and EIM have a round with an O(k²) term in the running time. When k is large relative to n, this term can dominate.

Runtime/Approximation Tradeoff

We monitor the performance of the EIM procedure as a function of the threshold for removing points (the parameter φ). As expected, the variability of results increases as the parameter φ is decreased, and the runtimes significantly decrease. Tables 2.5 and 2.6 respectively compare the average solution value and runtimes over different values of φ. The algorithm speeds up significantly for φ < 5.15 (see Lemma 32), yet still returns good solutions: in some cases solutions are better for smaller φ.

Figure 2.10: Runtimes in seconds for fixed k on Gau (n, k′ = 25), with n ranging from 10,000 to 1,000,000. Panels: (a) k = 10; (b) k = 25; (c) k = 50; (d) k = 100. For sufficiently small values of n relative to k, EIM behaves identically to GON. This is caused by the condition on the while loop: if k is large enough, the condition is never met and no sampling occurs, so the GON subprocedure is run on the entire dataset.

This seemingly counterintuitive behaviour is explained by the GON sub-procedure – it selects the farthest points as centres, which are likely peripheral to the cluster. By sampling fewer points, it is less likely for extremal points to be present in the subgraph on which GON is run. Lowering φ sometimes improves the average solution value and decreases runtimes. However this behaviour is likely to be more volatile: the guarantee on the performance holds with lower probability, giving a higher chance that a very poor solution is returned.

Figure 2.11: Runtimes in seconds for graphs of different sizes over a range of values of k. (a) Gau (n = 1,000,000, k′ = 50); (b) Gau (n = 50,000, k′ = 50). When k becomes too large, relative to n, EIM no longer performs sampling and defaults to GON.

2.7 Conclusions

In typical instances, our parallel procedure for k-Centre performs as a two-round 4-approximation. Experimentally, solutions are comparable to those of a sequential 2-approximation, while runtimes demonstrate a significant speedup. The existing sampling-based MapReduce procedure [72], while slightly more effective, can be very slow: we give the first runtime analysis for this method, supporting our empirical results. We also parametrised the sampling procedure to improve runtimes, sometimes even giving better solutions despite the lack of a provable effectiveness bound. We prove that we can bound the probability of bad solutions for a range of values of the tuning parameter φ. Indeed, the algorithm still returns good solutions even for values of the parameter below the provable bounds, while running faster. We give an example for which the approximation factor for MRG is tight. However, it is unclear how likely such cases are in practice. It would be of interest to find bounds on the probability that our algorithm gives a poor approximation. And what is the effectiveness when MRG needs more than two rounds? Im and Moseley [107] described a 3-round 2-approximation MapReduce procedure for the k-centre problem under the assumption that OPT is known, and announced a 4-round procedure that does not require prior knowledge of the optimal solution – these details have yet to appear. Recently Malkomes et al. [147] presented a parallel adaptation of the k-centre algorithm comparable to a special case of our approach. Currently all such approaches rely on the sequential algorithm of Gonzalez [90] – it would be interesting to determine the effect of incorporating an alternative 2-approximation algorithm, such as Hochbaum & Shmoys's [103], as the sequential procedure.

3 Parallel Coverings of Graphs

Motivated by vehicle routing problems, we consider a range of network optimisation problems that involve covering graphs efficiently. These problems relate to optimisation problems such as the travelling salesman problem, facility location and even machine scheduling. As is typical for many optimisation problems, covering problems on graphs are often computationally expensive and the RAM required to perform the calculations may be impractical. In Chapter 2 we demonstrated how the k-centre problem can be solved using the MapReduce framework for parallelisation. We now consider the problem of adapting several existing approximation algorithms for graph covering problems to the MapReduce paradigm.

3.1 MapReduce for Covering Problems

Covering is a broad category of combinatorial problems that involve selecting a minimal sub-collection of components from some collection in order to cover some other elements of that structure. A typical example of this would be the set cover problem: choose a minimal sub-collection of sets such that all elements are "covered". In this example, an element is considered covered when a set in our sub-collection includes it – that is, covering assumes some correspondence from items in our solution to elements of the collection we are covering. Blelloch et al. [29] gave a parallel approximation algorithm for set cover that involves finding a sub-collection of "nearly independent" sets.

The covering problems we will be considering are similar to the edge cover problem, in that we aim to select edges of the graph and to cover all of the vertices. However, we have additional restrictions on, for example, the number of connected components and the structure these components can have. We provide parallel solutions for several graph covering problems such as the k-path cover and k-star cover problems, by adapting algorithms of Arkin et al. [8] and Even et al. [74] to the MapReduce model. These problems are useful in modelling routing problems on networks, similar to the k-centre problem considered in the previous chapter.

3.2 Tree Covering Problems

Given a graph with positive rational edge weights, a tree cover of G = (V, E) is a forest of trees selected from the edges in E such that every vertex in V is incident to some tree. The weight of a tree T is the sum of the weights of the edges in T, and the cost of a tree cover is defined to be the maximum weight of a tree. Given an integer k, the k-tree cover problem involves finding a tree cover of G with no more than k trees such that the cost of the tree cover is minimised. The bounded version of this problem adds a bound B, and aims to determine whether a solution exists in which no tree has cost greater than B. Tree coverings of graphs can represent many different types of routing problem, and have been applied in the area of multi-agent routing [188].

Sequential Algorithm for k-Tree Cover

Algorithm 8 describes how to find either a k-tree cover of cost at most 4B or a proof that there is no k-tree cover of G with cost less than B [74]. This is generalised to give a 4-approximation to the k-tree cover problem by using a binary search over possible values of B. Khani and Salavatipour [122] improve upon this algorithm to obtain a 3-approximation.

The idea behind this algorithm is to find a minimum spanning forest using edges of weight less than B – if each of the connected components can be split into parts of bounded weight then we have a feasible solution. Even et al. [74] proved that this is a (4 + ε)-approximation for the problem.

Algorithm 8 k-Tree-Cover(G, k, B)
1: Let G_1, ..., G_ℓ be the components of the graph obtained by removing all edges of weight greater than B from G.
2: Let M_i = MST(G_i) and k_i = ⌊w(M_i)/2B⌋ for each i
3: if ∑_i (k_i + 1) > k then
4:   return Failure: B is too low.
5: SOL = ∅
6: for i ∈ [1 ... ℓ] do
7:   Let S^i_1, ..., S^i_c, L^i be the edge-decomposition of tree M_i, where w(S^i_j) ∈ [2B, 4B) and w(L^i) < 2B.
8:   SOL = SOL ∪ {S^i_1, ..., S^i_c, L^i}
9: return SOL.

The sequential algorithm for the k-tree problem given by Even et al. [74] obtains a solution with a 4-approximation ratio, and involves a binary search over a value B which bounds the minimum cost of a solution. For each value of B, the algorithm finds a minimum spanning tree (MST) of the input graph. If the number of trees required to get the weight below 2B is greater than k, then no solution exists for the given value of B. This follows by observing that, for each connected component G_i, the number of trees in the optimal solution that cover the vertices of G_i must be at least w(MST_i)/2B + 1/2 – otherwise we would be able to use the edges in the optimal tree cover to construct an MST of G_i with cost less than w(MST_i). At step 7 of Algorithm 8, each of the trees is decomposed into subtrees of bounded size. For each edge e = (u, v) in the tree T_r rooted at r, where u is the parent of v, let T_e consist of the edge (u, v) and the subtree rooted at v. Each subtree T_e with weight w(T_e) ∈ [B, 2B] can be split from T_r – we need only consider the case where all subtrees are under-weighted or over-weighted.

A subtree T_e is under-weighted if w(T_e) < B and over-weighted if w(T_e) ≥ 2B. If all subtrees are under-weighted, then T_r is the left-over tree L^r. If all subtrees of T_r are under- or over-weighted, there exists some over-weighted subtree T_e, e = (u, v), such that all subtrees rooted at v are under-weighted. If the children of v are e_1, ..., e_ℓ, we find the smallest index s such that ∑_{i=1}^{s} w(T_{e_i}) ≥ B and split away the subtree ∪_{i=1}^{s} T_{e_i} – the weight of this subtree will be in the range [B, 2B], as all of the subtrees T_{e_i} were under-weighted. To find a solution to the general version of the problem, we can use the procedure for the bounded version of the problem – this requires an appropriate value of B. Performing a binary search over possible values of the bound B can give us a solution to the k-tree cover problem. Even et al. [74] proved that this binary search procedure is polynomial in n, the number of vertices.
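As an illustration of how the bounded procedure drives the general one, here is a hedged Python sketch of the binary-search wrapper. The oracle name bounded_cover and the use of a numeric tolerance are our own simplifications – the thesis instead argues that the search can be restricted to polynomially many candidate values of B.

    def k_tree_cover(G, k, w_max, n, bounded_cover, tol=1e-6):
        """Binary search over the bound B, calling a bounded-cover oracle that,
        like Algorithm 8, returns a cover of cost at most 4B or None when B is
        infeasible.  B = w_max * n always admits a cover, so it is a valid
        upper end of the search interval."""
        lo, hi = 0.0, w_max * n
        best = bounded_cover(G, k, hi)
        while hi - lo > tol:
            mid = (lo + hi) / 2
            sol = bounded_cover(G, k, mid)
            if sol is None:        # B too small: no k-tree cover of cost < mid
                lo = mid
            else:
                best, hi = sol, mid
        return best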

MapReduce Algorithm for k-Tree Cover

We will now demonstrate how this algorithm can be adapted to a parallel framework, using the existing MST MapReduce algorithms. The input for the algorithm is a list of the edges in the graph. We arbitrarily partition the edge set, and send each partition to a reducer. Each reducer removes edges of weight greater than B, and returns the remaining edges. We can then apply the adapted version of the edge-based MST algorithm from Lattanzi et al. [130] to obtain a minimum spanning forest of the subgraph. This algorithm takes a constant number of rounds, and is in MRC⁰. The tree-splitting procedure can be added to the last reduce round of the MST algorithm, as the entire MST is present on a single processor. This allows us to solve the bounded k-tree cover problem in only a constant number of rounds. The k-tree cover problem is solved by running a binary search over possible values of B, increasing the required number of rounds to O(n) – as we can bound potential values of B by w_max · n, where w_max is the maximum cost of an edge.

3.3 The Minimum Path Cover Problem

Covering problems on graphs are often useful as a model for routing problems, where we aim to minimise the distance travelled or the number of agents required to traverse the network. Path covering problems are a special case of tree cover, where we have the added requirement that each component is a path. Path cover models have applications in areas such as calculating trajectories for cooperative patrolling tasks for robot agents [159], designing large-scale wireless sensor networks [61] and trajectory planning in wireless sensor networks [125].

Definition 33. Given a complete graph G with positive rational edge weights and a value B, the minimum path cover problem is to find a collection of paths in G with the minimal number of components, where the maximum cost of a path is no more than B.

This problem differs from k-tree cover in that we aim to minimise the number of components rather than their maximum cost. Arkin et al. [8] gave a 3-approximation to this problem that involves finding a series of minimum weight forests. They further proved that the best-possible approximation ratio is 2 unless P = NP, by a reduction from the travelling salesman problem. Nagarajan and Ravi refer to path cover as the unrooted distance-constrained vehicle routing problem, and introduced an alternative 3-approximation algorithm with better runtime complexity [155]. Their approach simplifies the techniques used in the algorithm of Arkin et al. [8]. This problem has also been solved using LP-rounding techniques with a constant integrality gap [156].

Algorithm 9 Min-Path-Cover(G, B)
1: M ← MST(G)
2: Let S_1, ..., S_ℓ be the connected components of the forest obtained by removing edges of weight greater than B/2 from M.
3: SOL = ∅
4: for i ∈ [1 ... ℓ] do
5:   Let T_i be the tour obtained by doubling the edges of S_i and shortcutting repeated vertices.
6:   while ∑_{e∈T_i} w(e) ≥ B do
7:     S = longest path starting from an end of T_i with weight no more than B.
8:     SOL = SOL ∪ {S}
9:     T_i = T_i \ S
10:  SOL = SOL ∪ {T_i}
11: return SOL

Algorithm 9 describes the 3-approximation algorithm by Nagarajan and Ravi [155]. During the loop at step 4, each tour T_i is split into at most 2·MST_i/B + 1 paths – the total number of paths returned is therefore at most (2/B)∑_{i≤ℓ} MST_i + ℓ. Note that cost(M) ≥ ∑_{i≤ℓ} MST_i + (B/2)·ℓ, since each of the ℓ components arises from removing only edges of cost more than B/2. Let OPT be the optimal set of paths, with size |OPT| – it can be shown that B|OPT| ≥ ∑_{P∈OPT} cost(P) ≥ ∑_{i≤ℓ} MST_i + (B/2)·ℓ − (B/2)·|OPT|. It follows that the total number of paths returned is at most 3|OPT|. The minimum path cover algorithm by Nagarajan and Ravi [155] can be implemented in MapReduce using a parallel MST algorithm. This implementation preserves the 3-approximation guarantee of the original algorithm.
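The tour-splitting step (steps 6–9 of Algorithm 9) amounts to greedily cutting a vertex sequence into segments of weight at most B. The following is a small Python sketch, assuming a distance function d and a tour given as a list of vertices (the names are ours):

    def split_tour(tour, d, B):
        """Greedily split a tour (a list of vertices visited in order) into
        paths whose total edge weight is at most B, taking the longest
        feasible prefix each time.  d(u, v) gives the (metric) edge weight;
        for shortcut edges an upper bound on the weight suffices."""
        paths, current, weight = [], [tour[0]], 0.0
        for u, v in zip(tour, tour[1:]):
            if weight + d(u, v) > B:    # edge would overflow: start a new path at v
                paths.append(current)
                current, weight = [v], 0.0
            else:
                current.append(v)
                weight += d(u, v)
        paths.append(current)
        return paths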

MapReduce k-path

We now describe how Algorithm 9 can be adapted to the MapReduce framework. Finding a minimum spanning tree using MapReduce can be achieved using the MST algorithm of Lattanzi et al. [130]. We can adapt the last round of their algorithm so that edges with weight greater than B/2 are removed from the MST. The final round of the MapReduce MST algorithm sends the entire MST to a single processor – this allows us to calculate the sequence of vertices in the tour T_i for each component S_i during the same reduce round. However, the edges used to "shortcut" past repeated vertices are not in the reducer. Because the weights are metric, we can use the triangle inequality to give an upper bound on the weight of an edge (u, v), by summing the weights on the path from u to v in S_i. We can perform the procedure of splitting a path with weight no more than B by using these upper bounds on the edge weights. The analysis by Nagarajan and Ravi [155] to obtain the 3-approximation ratio remains the same when using the upper bounds on the weights.

Lemma 34. Algorithm 9 can be adapted to MapReduce using a constant number of rounds.
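The shortcut upper bounds used above can be obtained from prefix sums of the tree-edge weights along each component's vertex sequence; a small Python sketch under our own naming (not the thesis implementation):

    def shortcut_bound(path, w):
        """Return bound(i, j): a triangle-inequality upper bound on the
        shortcut edge between path[i] and path[j], namely the total weight of
        the tree edges between them.  `path` is a vertex sequence and w(a, b)
        the weight of the tree edge (a, b)."""
        prefix = [0.0]
        for a, b in zip(path, path[1:]):
            prefix.append(prefix[-1] + w(a, b))
        def bound(i, j):
            i, j = min(i, j), max(i, j)
            return prefix[j] - prefix[i]
        return bound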

3.4 The k-star Covering Problem

In this section we consider a particular type of network optimisation problem in which all the vertices of an edge-weighted graph are to be covered by a set of at most k stars such that the maximum cost of a star is minimised. A star in a graph is a root vertex and a subset of the neighbours of the root along with the incident edges. The cost of a star is the sum of the weights of the edges in the star. We can alternately define a star as a complete bipartite graph K_{1,ℓ} for any ℓ ∈ Z – that is, a bipartite graph in which one of the parts contains a single vertex. In the k-star problem, we are given a complete graph G = (V, E) with weights d : E → Q⁺ on the edges that obey the metric property, and a bound k on the number of stars. For the minimum star cover problem, we are given a bound B, and aim to find a cover of G such that no star has weight greater than B and the number of stars is minimal.

3.5 Hardness of k-star Covering

One interesting special case of the k-star problem is the line metric, where all vertices lie on a line. It appears that this does not simplify the problem in any useful way, even in the case where vertex positions are integral. The problem is that the assignment of vertices to centres can still result in "crossings", so there is still no obvious way of determining which vertices should be roots. Ahmadian et al. [2] gave a polynomial-time approximation scheme for k-star cover on graphs with a line metric. They also proved that solving such instances is strongly NP-hard, indicating the difficulty of this problem even for seemingly simple cases. Furthermore, they demonstrate that the k-star problem on the Euclidean metric is APX-hard – there exists a PTAS reduction to Euclidean k-star cover from every problem that allows a polynomial-time constant-factor approximation.

Figure 3.1: Assignments may cross over entire stars, even for a line metric.

The difficulty in solving the k-star cover problem arises from the dual objectives of selecting centres and assigning points to clusters. Even once the optimal set of centres is selected, the assignment subproblem remains NP-hard. The rooted k-star problem has the set of centres predefined, so the objective is only to assign the remaining vertices to centres such that the maximum cluster weight is minimised.

Sequential algorithm

Even et al. [74] gave a bicriteria approximation to the k-star cover problem using LP-rounding. This technique gives a 4-approximation guarantee, but uses up to 4k stars. Given a complete graph with a metric on the edges and the vertex set partitioned into sets of clients and facilities, the k-median problem involves finding a set of at most k facilities such that the sum of the distances of clients to their nearest facilities is minimised. Arkin et al. [8] use approximate solutions for the k-median problem to obtain solutions for star covering problems. They give a (2α + 1)-approximation to the minimum star cover, where α is the approximation ratio of the k-median algorithm used as a subprocedure. This procedure finds a k-median solution for all k ∈ [1 ... n], and splits the stars to obtain a series of potential solutions SOL_k. Let OPT be an optimal solution, with k* stars – since OPT is a feasible k*-median solution, it follows that the solution given by the k*-median approximation algorithm has cost at most αBk*. Each time a star is split from the solution, it reduces this sum by at least B/2, thereby adding at most 2αk* stars – so when k = k*, we have a solution SOL_{k*} with no more than (2α + 1)k* stars. The algorithm returns a solution SOL_{k̄}, where k̄ = arg min_k |SOL_k| – it follows that k̄ ≤ (2α + 1)k*.

The bicriteria (3 + ε, 3 + ε)-approximation algorithm of Arkin et al. [8] for the k-star cover problem (which they refer to as min-max star cover) relies on their algorithm for the minimum star cover problem, which is described in Algorithm 10. This procedure solves the minimum star cover problem with a (2α + 1)-approximation ratio, where α is the approximation ratio of the k-median algorithm used. The minimum star cover problem takes as input a weighted graph G = (V, E) and a bound B – the goal is to find a star cover of G with a minimal number of stars such that no star has weight greater than B. The idea behind the algorithm of Arkin et al. [8] is to find a series of k-median solutions, and split each of the clusters into stars with weight less than B. This gives a series of potential star covers – the covering with the fewest stars is returned.

Algorithm 10 Min-Star-Cover(G = (V, E), B) by Arkin et al. [8].
1: for k ∈ [1 ... n] do
2:   Let c_1, c_2, ..., c_k be a k-median solution on G
3:   Let S_i = {v_1, v_2, ..., v_{|S_i|}} be the points assigned to c_i in the k-median solution, ordered such that d(v_j, c_i) ≤ d(v_{j+1}, c_i).
4:   SOL_k = ∅
5:   for i ∈ [1 ... k] do
6:     while ∑_{v∈S_i} d(v, c_i) > B do
7:       Let ℓ be the maximum index such that ∑_{j=ℓ}^{|S_i|} d(v_j, c_i) ≥ B/2.
8:       Let S̄ be a star with vertex set ∪_{j=ℓ}^{|S_i|} v_j and root v_ℓ.
9:       SOL_k = SOL_k ∪ S̄.
10:      S_i = S_i \ ∪_{j=ℓ}^{|S_i|} v_j.
11:    SOL_k = SOL_k ∪ a star with vertex set S_i and root c_i.
12: k̄ = arg min_k |SOL_k|.
13: return SOL_{k̄}.

Arkin et al. [8] proved that their minimum star cover algorithm can be used to find a k-star solution with a bicriteria (3, 3)-approximation – that is, a solution with at most 3k stars and with cost no more than 3 · OPT. This approximation ratio is based on the k-median (3 + 2/k)(1 + ε)-approximation algorithm by Arya et al. [12] – there have since been further improvements on the approximation ratio for k-median. The current best-known approximation is 2.611 + ε, by Byrka et al. [35]. Algorithm 10 can be used to obtain a k-star solution by performing a binary search over possible values to estimate an optimal solution size B*. Using a k-median algorithm with an α-approximation, we can run Algorithm 10 with B = αB* to find an α-approximation to the k-star instance with α · k stars.
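The inner loop of Algorithm 10 (steps 6–10) splits a single over-weight cluster into stars; the following Python sketch mirrors that splitting rule under our own naming, and is only an illustration rather than the full procedure.

    def split_cluster(points, centre, d, B):
        """Split one k-median cluster into stars: while the cluster's total
        distance to its centre exceeds B, peel off the suffix of farthest
        points whose distance-sum is at least B/2, rooted at the nearest
        point of that suffix."""
        pts = sorted(points, key=lambda v: d(v, centre))   # nearest to centre first
        stars = []
        while sum(d(v, centre) for v in pts) > B:
            suffix_sum, ell = 0.0, len(pts)
            while ell > 0 and suffix_sum < B / 2:          # largest index with suffix sum >= B/2
                ell -= 1
                suffix_sum += d(pts[ell], centre)
            stars.append((pts[ell], pts[ell:]))            # (root, vertices of the new star)
            pts = pts[:ell]
        stars.append((centre, pts))                        # remaining points stay with the centre
        return stars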

MapReduce Algorithm for k-star

As the k-star algorithm described above requires a k-median solution, adapting the algorithm to MapReduce would involve using a parallel k-median algorithm. Ene et al. [72] gave a MapReduce algorithm for the k-median problem, which uses a sequential weighted k-median algorithm as a subprocedure. Their algorithm is based on a procedure which, given a set of points and a metric, selects a sample of the points such that all the points in the initial set are relatively close to some point in the sample. The details of this algorithm are described in Chapter 2, and in Algorithm 6. The sample points and aggregate weights are then sent to a final reducer, which runs a weighted k-median algorithm and returns the approximate solution. If the approximation ratio of the weighted k-median algorithm is α, then this algorithm achieves a (10α + 3)-approximation to the k-median problem. The sequential algorithm for the k-star problem uses a k-median algorithm as a subprocedure, and therefore a MapReduce implementation of this approach will require the use of the MapReduce k-median algorithm of Ene et al. [72]. After obtaining a solution for the k-median problem, we need to determine the assignment of the points to their centres. We can split the points across the machines, and send the k-median solution to all machines – it would only require one round to find the assignment of points to their nearest centre. The procedure described by Arkin et al. [8] for splitting stars into smaller stars is independent across each of the stars – so we can assign stars to machines and perform the splitting in parallel. For dense graphs, the size of a star will be small enough to be processed on a single reducer. This gives a (20α + 7)-approximation to the minimum star cover problem, and a (10α + 3)-approximation to the k-star cover problem using (10α + 3) stars.

Approximating k-star with k-median

The k-star problem bears some relation to the k-median problem, as noted by Arkin et al. [8], who use it as a method for obtaining a k-star solution. However, it seems to be difficult to obtain a reasonable bound on the size of a k-star solution, as the optimal k-median solution may assign most of the vertices to a single centre. This is generally suboptimal for the k-star covering problem. Since the optimal k-median solution can be at most equal to the sum of the costs of the stars in the k-star problem, the maximum cost of a star in the k-star problem can be no less than the size of the optimal k-median solution divided by the number of stars. An optimal k-median solution can be no more than k times the optimal k-star cost, because a k-star solution is a feasible k-median solution and if the optimal k-star size is B then the total cost of all the stars is no more than k · B. This implies that if all stars in the optimal k-median solution have equal size, then it is also an optimal solution for the k-star problem. Let OPT = {S_1, ..., S_k} be an optimal k-median solution, and OPT* = {S*_1, ..., S*_k} an optimal k-star solution. As OPT* is a feasible k-median solution, d(OPT*) ≥ d(OPT). Noting that the maximum star size in OPT* can be no more than the maximum star size in OPT, we get the following inequalities.

k · max_{i≤k} d(S_i) ≥ k · max_{i≤k} d(S*_i) ≥ d(OPT*) ≥ d(OPT)

If all of the stars in the optimal k-median solution are of equivalent cost, then k · max_i d(S_i) = d(OPT), which makes the above chain of inequalities an equality and implies that our k-median solution is also optimal for the k-star problem.

We can use clustering problems such as k-median and k-centre to find candidates for the cluster centres in k-star. Once we have a set of optimal centres for the k-star problem, the remaining subproblem is equivalent to the rooted k-star problem. Even et al. [74] considered this in terms of a bin-packing problem, however it can also be modelled as a machine scheduling problem – we can model each of the centres as a machine, and aim to assign jobs to the machines to minimise the maximum processing time (the makespan). Interpreting each edge weight as the processing time of a job on a given machine, we can model the weight of a star as the total processing time of the corresponding machine. This gives us an independent parallel machine scheduling problem with machine-dependent processing times. The problem of minimising the makespan on parallel machines is NP-hard, but allows a 2-approximation [173]. Lenstra, Shmoys and Tardos [134] approximate the problem of scheduling unrelated parallel machines using linear programming. Given the relationship between the k-star cover problem and various different optimisation models – ranging from clustering to scheduling problems – it remains an interesting open problem with many possible avenues for research.
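For intuition on the scheduling view, the classical greedy list-scheduling rule – assign each job to the currently least-loaded machine – achieves a (2 − 1/m)-approximation for the makespan on identical machines. The rooted k-star assignment is the harder, machine-dependent setting, so the Python sketch below (with our own naming) only illustrates the simplest case and is not the algorithm cited above.

    import heapq

    def list_schedule(jobs, m):
        """Greedy list scheduling on m identical machines: each job goes to
        the machine with the smallest current load.  Returns the makespan and
        the per-machine job lists; a (2 - 1/m)-approximation for makespan."""
        loads = [(0.0, i) for i in range(m)]       # (current load, machine id)
        heapq.heapify(loads)
        assignment = [[] for _ in range(m)]
        for p in jobs:                             # p = processing time of the job
            load, i = heapq.heappop(loads)
            assignment[i].append(p)
            heapq.heappush(loads, (load + p, i))
        makespan = max(load for load, _ in loads)
        return makespan, assignment

    # e.g. six jobs on two machines
    print(list_schedule([3, 1, 4, 1, 5, 2], 2))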

4 Scheduling with Precedences

Scheduling is a fundamental problem in operations research and computer science with a range of applications. We saw in Chapter 3 that the k-star problem contains a subproblem related to scheduling parallel machines to minimise the makespan. There are many other variants of scheduling, which may include precedence constraints between jobs, weights that are dependent on the machine assigned or on the job ordering, and jobs that each achieve different sub-tasks. The majority of these problems are NP-complete, often even for simple cases such as single machine scheduling. In this chapter we will give an overview of the different types of scheduling problems and the results and bounds for these problems. We will also examine a more specific scheduling problem with applications in test suite prioritisation.

4.1 Test-case Prioritisation with Precedences

Measures for Evaluating Fault Detection Rates

We assume a set T of tests, and a precedence graph represented as a DAG P = (T, E) where the edge set E imposes pairwise ordering requirements i ≺ j on the tests i, j ∈ T. When a test i is required to be executed before a test j, we say j depends on i, and that there is a precedence constraint i ≺ j. Each job in T has unit processing time, and there is only a single processor.


Table 4.1: An example fault-coverage matrix F, indicating which test case can identify each fault.

Test case   Faults covered (f1–f5)
t1          x
t2          x  x
t3          x  x  x
t4          x
t5          x

Let F be a "fault matrix" indicating which faults are identified by each of the tests, and F[j] for j ∈ T be the set of faults covered by test j. Similarly, let L indicate which lines of code are covered by the tests, with L[j] a set indicating the code coverage of test j ∈ T. There are several ways we can measure the effectiveness of a schedule, depending on the testing model being used. For example, if we halt testing as soon as a defect is found, we might only be interested in the earliest cover time. However, if we are interested in finding all faults, we might instead want to minimise the time the last fault is detected. We focus on two primary metrics for measuring the fault-coverage performance of a schedule.

Measure 1: Average fault-coverage time

The first measure we consider for determining the effectiveness of our algorithms is the average rate at which faults are detected in each of the test suite orderings. As a measure of the rate of fault detection, we use the Average Percentage of Faults Detected (APFD) metric defined by Rothermel et al. [167], which measures the percentage of faults detected as a function of the number of test cases (jobs) that have been executed. The APFD is interpreted as a percentage, with higher values implying a higher rate of fault detection.

Definition 35. If there are m faults to be detected by n tests and C_i is the coverage time of fault i, then

APFD = 1 + 1/(2n) − (∑_i C_i)/(m · n).   (4.1)

Figure 4.1: Average rate of fault detection for the test ordering t1, t2, t3, t4, t5, drawn from the defect matrix in Table 4.1. (a) Coverage time for each defect; (b) rate of coverage over time.

We can visualise the rate at which tasks are completed in terms of a histogram in which tasks are ordered by completion time – the smaller the area under this histogram, the faster the jobs are completed. As an example, consider the fault-coverage matrix in Table 4.1. Consider the order t1-t2-t3-t4-t5, and Figure 4.1, which represents the number of faults detected against the number of test cases that have been executed as a histogram. After executing t1, 20% of faults will have been detected; after t2, this increases to 40%. The area under the histogram in Figure 4.1b corresponds with the rate of fault detection as measured by the APFD score. Under this test ordering, fault f1 is covered at time 3; fault f2 at time 1 – applying Equation (4.1) to these values, the APFD score is

1 + 1/10 − (3 + 1 + 2 + 3 + 3)/25 = (55 − 24)/50 = .62 .
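Equation (4.1) is simple to compute directly; a small Python sketch (the function name is ours) reproduces the worked example above.

    def apfd(coverage_times, num_tests, num_faults):
        """Average Percentage of Faults Detected, Equation (4.1):
        APFD = 1 + 1/(2n) - (sum of fault coverage times) / (m * n),
        for n tests and m faults."""
        n, m = num_tests, num_faults
        return 1 + 1 / (2 * n) - sum(coverage_times) / (m * n)

    # Ordering t1-t2-t3-t4-t5 covers the five faults at times 3, 1, 2, 3, 3:
    print(apfd([3, 1, 2, 3, 3], num_tests=5, num_faults=5))   # 0.62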

If we order tests in decreasing order of the number of defects covered, we get the schedule t3-t2-t1-t4-t5; after executing only the first test case, t3, 60% of faults have been detected. The area under the histogram in Figure 4.2b is larger than in Figure 4.1b, which corresponds with the intuition that the second test ordering detects faults at a faster rate.

Figure 4.2: Average rate of fault detection for the test ordering t3, t2, t1, t4, t5, drawn from the fault matrix in Table 4.1 and ordered by fault coverage. (a) Coverage time for each defect; (b) rate of coverage over time.

Under this alternative ordering, with tests scheduled in order of their defect coverage, the APFD score is

1 + 1/10 − (1 + 2 + 2 + 1 + 1)/25 = (55 − 14)/50 = .82 ,

which is notably higher than the APFD of the previous ordering, fitting with our expectation that this is a better ordering. Note that the only terms of the APFD score that depend on the schedule are the completion times C_i – the other terms are constant for any given instance – so the problem of maximising APFD is equivalent to minimising ∑_i C_i. This implies that maximising the APFD measure inversely relates to the min-sum set cover problem, which is NP-hard to approximate to within four times the optimal solution [77].

Measure 2: Last fault coverage time

As an alternative to the APFD, Haidry and Miller [98] defined the All Faults (AF) metric as C_max/m, where C_max is the time at which the final fault was identified and m is the total number of tests. As with APFD, the AF metric is expressed as a percentage of the test suite, and is adjusted based on the number of tests to be executed. Here the aim is to detect all of the faults as early as possible, and our measure of this is the proportion of tests that need to be performed to identify all faults.

Table 4.2: An example code-coverage matrix L. An “x” indicates that the given test case covers the corresponding line of code.

Test case   Lines of code covered (l1–l8)
t1          x  x  x
t2          x
t3          x  x  x  x
t4          x
t5          x  x

This objective differs from APFD in that scheduling tests which cover fewer defects earlier can be beneficial if they allow later tasks to be scheduled earlier. The purpose of this measure is to model the cost of executing the test suites.

Returning to the example test ordering t1-t2-t3-t4-t5, the AF score for this schedule is 3/5 = .6, as the last fault is detected by the third of five tests. Our alternative schedule t3-t2-t1-t4-t5 achieves an AF score of 2/5 = .4, as it detects all of the defects using only two tests. Again, the second ordering has a better score, which agrees with our intuition that this should be a better schedule.
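The AF computation is equally direct; a short Python sketch (our naming) checks the two scores above.

    def all_faults(coverage_times, num_tests):
        """All Faults (AF) metric: the position at which the last fault is
        detected, as a fraction of the test suite length -- lower is better."""
        return max(coverage_times) / num_tests

    print(all_faults([3, 1, 2, 3, 3], 5))   # 0.6 for t1-t2-t3-t4-t5
    print(all_faults([1, 2, 2, 1, 1], 5))   # 0.4 for t3-t2-t1-t4-t5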

This All Faults objective is closely related to the set cover problem – the tests correspond to sets, and the tasks to the set elements to be covered, so we wish to choose the minimum number of tests (sets) that cover all of our tasks (items). The minimum number of sets required to cover all elements is a lower bound on the length of our schedule. The primary distinction is that we consider cases in which there are precedences imposed on the tests, restricting which tests may be scheduled at any given time.

Set cover cannot be approximated to within a factor of (1 − o(1)) · ln n [76] in polynomial time unless P = NP. If we have an instance of the set cover problem with sets S = {S_1, ..., S_n} and universe U, we can convert it to a scheduling problem with n tests, where test i covers the elements in set S_i. The defects are the elements in U, and the precedence graph has no edges. If we find a solution with minimal AF score, we can use the prefix of that schedule up to the point where the final defect is identified as a solution to the set cover problem – as all defects are "covered" and the minimum number of tests have been used. From this, we have a bound on the approximability of the AF scores.

Figure 4.3: Average rate of fault detection for the test ordering t3, t1, t5, t2, t4, drawn from the coverage matrix in Table 4.2 and ordered by code coverage. (a) Coverage time for each defect; (b) rate of coverage over time.

Lemma 36. It is not possible to approximate AF scores to within a factor of (1 − o(1)) · ln n in polynomial time unless P = NP.

Code Coverage

For many cases, the defect coverage data shown in Table 4.1 is not available prior to running tests – we therefore seek alternative measures for estimating fault locations. One commonly available source of such data is code coverage – the number of lines of code covered by each test. Table 4.2 shows an example code-coverage matrix for tests t1–t5. We will explore the efficacy of code coverage as an estimator for defect locations in test case prioritisation.

Consider the schedule t3-t1-t5-t2-t4 that arises from ordering the tests in decreasing order of the number of lines of code they cover. We can see the usual histogram representation of faults covered for this ordering in Figure 4.3b.

While the tests are ordered based on the code-coverage matrix in Table 4.2, we score the schedule based on coverage of the defects in Table 4.1. The AF score for this test ordering is therefore 3/5 = .6 as all defects are covered after three tests are executed. The APFD score for this test ordering is given by

1 + 1/10 − (1 + 2 + 3 + 1 + 1)/25 = (55 − 16)/50 = .78 .

Note that, while this is not as good as the scores for the defect-prioritising schedule, these results are still better than the default ordering. We will test how effective code coverage can be as an estimator for defect detection by comparing the performance of our algorithms based on these different data sources.

Precedence Constraints

We specifically consider the problem of test-case prioritisation under precedence constraints.

Definition 37. A precedence graph is a directed acyclic graph (DAG), P = (T, E), in which T is a set of nodes – each representing a test case – and E is a set of arcs where i ≺ j represents j directly depending on i.

Under the precedence constraints given by Figure 4.4, the schedule t3-t2-t1-t4-t5 that we found earlier by ordering the tests in decreasing order of coverage is no longer feasible – as the test t3 cannot be processed before test t1. Instead, we can consider the test ordering t4-t5-t1-t3-t2 – which does meet the precedence constraints – for which the APFD score is

1 + 1/10 − (1 + 3 + 4 + 4 + 5)/25 = (55 − 34)/50 = .42 .

The AF score for this schedule is 5/5 = 1, as the last task was completed by the last test. While this test ordering does meet the precedence constraints, the scores are not particularly promising.

Alternatively, we could order the test suite t1-t3-t2-t4-t5 – which also meets the precedences. The rate of coverage can be seen in Figure 4.5b. The AF score for this schedule would be 3/5 = .6, as the last task was completed by the third test out of five.


Figure 4.4: An example precedence graph.

Figure 4.5: Average rate of fault detection for the test ordering t1, t3, t2, t4, t5, drawn from the defect matrix in Table 4.1 and meeting the precedences in Figure 4.4. (a) Coverage time for each defect; (b) rate of coverage over time.

The APFD score is

1 + 1/10 − (2 + 1 + 3 + 2 + 2)/25 = (55 − 20)/50 = .7 .

While this is not as good as our schedule based on defect coverage, it is the best we can find that meets the precedence constraints.

4.2 Our Contributions

For our contributions to this topic we consider two heuristics, each prioritising different aspects of this problem. We measure the effectiveness of these strategies over a range of real and synthetic datasets, and consider cases with ground-truth defect-coverage data as well as instances where code-coverage data is used as an estimator for defect locations.

The first of our algorithms focuses on processing the dependency graph to ensure that precedences are met, scheduling the 'best' of the jobs that are available at any given time. The result is a topological sorting of the jobs according to the precedence DAG, weighted locally according to priority. We refer to this approach as topological-sort greedy, to reflect the relation to topological sort. The second approach greedily chooses the highest-weighted job, independent of precedence relations; after obtaining an ordering, it then reshuffles the ordering to enforce dependencies. Our reshuffling method involves moving jobs forwards in the schedule to meet the requirements of their earliest dependent; this approach keeps highly prioritised jobs further ahead in the schedule. We refer to this procedure as coverage-first greedy. We also test different schemes for assigning weights to the jobs, such as lookahead strategies and updating the test weights after new tests are scheduled. We compare the performance of several different lookahead models, and find that a simple path-based lookahead outperforms more complex approaches.

In measuring the performance of these algorithms, we consider two different metrics – the Average Percentage of Faults Detected (APFD) and All Faults (AF) metrics described above. We find that updating the coverage information is highly effective in improving the performance of both greedy algorithms. The effectiveness of lookahead is less definitive – for many cases a single-step lookahead gives a notable improvement, but larger lookaheads can be misleading, and lead to over-fitting. Comparing different lookahead strategies, we find that a simple path-based lookahead is the most reliable, with the alternatives prone to over-weighting ineffective tests.

Overall, our experiments reveal that the code coverage of a test can be an effective estimate for the likelihood of identifying defects, performing better than both alternative algorithms and a random baseline. When comparing against the same underlying algorithm, schedules based on defect-coverage data revealed significantly better solutions under both AF and APFD measures than code-coverage based schedules generated by the same algorithm. We also find that many of the additional weighting approaches such as updating and lookahead can lead to over-fitting to the code-coverage data, leading to worse solutions.

4.3 Algorithms and Analysis

This section outlines the set of algorithms that we consider for prioritising test suites with precedence constraints. The problem of scheduling under precedence constraints to minimise the total weighted completion time is known to be NP-hard [132]. Several near-optimal approximation algorithms have been proposed [1; 46; 99; 149; 161; 170], as this problem is of significant interest to the research community. However, for the application of test case prioritisation we often need to rely on limited information to estimate the effectiveness of a test – we therefore consider simpler heuristic-based algorithms. These algorithms may not be as close to optimal for scheduling as other algorithms, nor do they necessarily come with approximation guarantees. Nonetheless, the code-coverage data is only a rough approximation of fault-finding ability in itself, so these simpler algorithms for prioritisation using code coverage provide an appropriate means of avoiding over-fitting. Li et al. [139] demonstrate that for the test-case prioritisation problem, simple greedy algorithms are often just as effective as more sophisticated approaches, such as hill climbing and genetic algorithms. Consequently, we focus on simple greedy approaches for our problem of prioritisation under precedence constraints.

Topological Sort Greedy algorithm

This algorithm (see Algorithm 11) is a version of the well-known best-first search algorithm [169], and is the starting point for all algorithms that we assess. Our procedure keeps a set of the currently available tests, based on whether the precedences for a given test have been met. Only available tests are evaluated when considering which test to add to the schedule next. The set of available tests is initially the set of independent tests in the precedence graph, and is updated as new tests are scheduled and more tests have their precedences met. The algorithm proceeds by scheduling the highest-weight test from among those available, and appending it to the end of the current schedule. Children of the newly scheduled test (stored in a list children[i] for each test i) can be added to the availability set iff they have no further precedences. The output is an ordered list of nodes representing a schedule of tests.

Algorithm 11 A topological-sort inspired algorithm.
Require: G = (V, E): The n-vertex m-arc graph corresponding to precedence constraints between test cases.
Require: w(·): A function mapping test cases (vertices) to their coverage values.
Ensure: RESULT: A test suite – that is, a permutation of the vertices – prioritised by coverage value, that observes the precedence constraints.
1: RESULT ← []
2: for i ∈ V do
3:   p[i] ← 0  {Number of parents i has}
4:   children[i] ← ∅  {Initialise the set {j | i ≺ j}}
5: for u ≺ v ∈ E do
6:   p[v] ← p[v] + 1
7:   Add v to children[u]
8: R ← {j ∈ V | p[j] = 0}
9: while R ≠ ∅ do
10:   i ← arg max_{i∈R} w(i)
11:   Append i to RESULT
12:   R.remove(i)
13:   for j ∈ children[i] do
14:     p[j] ← p[j] − 1
15:     if p[j] = 0 then
16:       Add j to R
17: return RESULT

In this algorithm, the coverage of a test case, w(·), is the number of entries in the coverage-matrix row: that is, how many lines of code the test will execute, ignoring the number of times each line is run. Viewing L[j] as a set containing the lines of code covered by test j ∈ T, our weight function is therefore w(j) = |L[j]|. This algorithm differs from a typical best-first search in that its goal is to evaluate all nodes, rather than to reach a goal node. As such, it is a complete search of the underlying precedence graph.

Let n denote both the number of rows in the coverage matrix and the number of nodes in the precedence graph. The complexity of calculating the coverage value w(·) of each test is O(n · c), in which c is the number of columns in the coverage matrix: for example, the number of statements in the program under test. If we let m denote the number of precedence constraints, or edges in the graph, then we can describe the running time of Algorithm 11 precisely. The initialisation loop at line 5 requires constant work per edge, while the previous loop is constant per vertex: so far, O(m + n). In the main loop, at line 9, there are at most n calls to find and remove the maximum-weight test, one update of a p[·] count per edge, and at most n additions to R. Since the numbers of removals from, additions to, and find-maximum operations on R are all O(n), we can use a priority queue whose running time is balanced across these operations: a heap, with O(log n) time per operation. Therefore, the overall running time is O(m + n log n). Typically, the number of edges in the graph, m, is O(n), meaning that the worst-case time complexity of this algorithm is O(n log n). We mention in passing that there are also algorithms for topologically sorting a graph – a key component of Algorithm 11 – that are based on depth-first search.
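To make the data structures concrete, the following is a minimal Python sketch of Algorithm 11 with a static weight function, using a binary heap for the availability set; the function and variable names are our own and the example weights are purely illustrative.

import heapq

def topsort_greedy(n, edges, w):
    """Schedule tests 0..n-1, always taking the highest-weight available test.
    edges contains arcs (u, v) meaning u must be scheduled before v."""
    parents = [0] * n                     # p[i]: number of unscheduled parents
    children = [[] for _ in range(n)]
    for u, v in edges:
        parents[v] += 1
        children[u].append(v)
    # heapq is a min-heap, so weights are negated to pop the maximum first.
    available = [(-w(i), i) for i in range(n) if parents[i] == 0]
    heapq.heapify(available)
    schedule = []
    while available:
        _, i = heapq.heappop(available)
        schedule.append(i)
        for j in children[i]:
            parents[j] -= 1
            if parents[j] == 0:           # all precedences of j are now met
                heapq.heappush(available, (-w(j), j))
    return schedule

# Example: test 2 has the highest coverage but depends on test 0.
# topsort_greedy(3, [(0, 2)], lambda i: [1, 2, 5][i]) returns [1, 0, 2].

The updating variant described later in this section would instead recompute w(i) as the number of not-yet-covered lines each time a test is scheduled, at the cost of the stored heap weights becoming stale between iterations.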

Coverage-first algorithm

The previous approach was driven by the precedence constraints, with the priority/coverage of a node driving the choice in the topological sort. In contrast, Algorithm 12 greedily adds tests to the solution based on coverage; after an ordering is generated, it then locally fixes the solution to comply with the precedence constraints. The output is again an ordered list of nodes representing a schedule of tests. Later, we will see how this approach can be made more similar to the greedy algorithm for the Set Cover problem.

The idea behind this algorithm is to try to run the tests with the highest coverage as early as possible, regardless of how many precedences they may have. The initial sorting procedure is precedence agnostic, and the rearranging procedure attempts to maintain the initial ordering as much as possible. For example, consider the test with the highest coverage, which would be placed first by the sorting procedure. After the test ordering is rearranged to meet the precedence constraints, the only tests that will be run earlier than this test are those that are required to be run before it, due either to direct or to indirect precedences. We note in passing that if w remains the same between iterations, then the selection at line 5 can be implemented as a single descending sort of the tests. The worst-case time complexity of this algorithm is O(n²).

Algorithm 12 A coverage-first greedy algorithm.
Require: G = (V, E): The n-vertex m-arc graph corresponding to precedence constraints between test cases.
Require: w(·): A function mapping test cases (vertices) to their coverage values.
Ensure: RESULT: A test suite – that is, a permutation of the vertices – prioritised by coverage value, that observes the precedence constraints.
1: R ← V
2: RESULT ← []
3: j ← 0
4: while R ≠ ∅ do
5:   i ← arg max_{u∈R} w(u)   {The weight function w might change between iterations.}
6:   Append i to RESULT
7:   Remove i from R
8:   index[i] ← j
9:   j ← j + 1
10: for j ∈ V do
11:   children[j] ← ∅
12: for u ≺ v ∈ E do
13:   Add v to children[u]
14: i ← n − 1
15: while i ≥ 0 do
16:   v ← RESULT[i]
17:   if children[v] ≠ ∅ then
18:     m ← min_{u∈children[v]} index[u]
19:     if m < i then
20:       Move v to position m
21:       Shift “up” the items in positions m through i
22:       Update the array index for all those items
23:     else
24:       i ← i − 1
25:   else
26:     i ← i − 1
27: return RESULT

Ordering the tests by priority only involves sorting in decreasing order of weight, which is O(n log n).

Analysis of the Rearrangement Procedure

We can consider the index i as a pointer into the array. This pointer moves leftward by one step if the test in that position is located correctly with regards to its children – we will refer to such tests as fixed. Otherwise, that test is moved leftward, in front of its leftmost child, and the pointer remains in place (although it now points at the test that was previously in position i − 1). The number of times the pointer moves is clearly O(n), as it starts at the right-hand end of the array and moves to the left one step at a time.

We now analyse the number of item moves that occur – that is, the number of times the pointer index remains in place while an item is shifted forward. A test will no longer be moved when the subgraph of its descendants is correctly ordered. In particular, when the pointer reads a test v that has no descendants, or whose descendants have already been passed by the pointer, the pointer will step leftward to position i − 1 and test v will not be read again – its position in the schedule is fixed.

Consider now the number of times a test A can be moved. The first time it is moved, it is positioned immediately in front of its leftmost child, and (hence) ahead of all its children. At this point, A is the start of a sequence of (at least) two correctly ordered tests. For A to be read by the pointer again, every test positioned between it and its previous position, i, will need to be processed. This now includes all children of A that A “jumped” over. Therefore, before A is next read, each of those children of A will be pointed to at least once, and will either be placed directly in front of one of its own children, or become fixed. The second time A is moved, it must have had some child placed to the left of A’s previous position (otherwise it would be fixed). By the argument in the previous paragraph, that leftmost child will be the start of a correctly ordered sequence of length two. Hence A is the start of a sequence of tests in precedence order of length (at least) three. In general, the tth time test A is moved, it becomes the start of a sequence of at least t + 1 tests in correct order. This can be proved inductively, based on the fact that its leftmost child, moved at least t − 1 times, is the start of a sequence of t tests in ≺ order.

A test A becomes fixed once all of its direct descendants are fixed – the number of times a test will be moved before this happens is bounded by the height of the subgraph rooted at A. The worst case is when the graph is a chain of height n, sorted in reverse precedence order: A will be moved Θ(n) times before the correctly ordered sequence reaches a leaf. Therefore, the total number of moves is bounded by the sum of the heights of the tests in the precedence DAG, giving an O(n²) bound on the complexity of this procedure.
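A minimal Python sketch of this rearrangement phase, mirroring lines 14–26 of Algorithm 12; the function name and the toy example are ours.

def rearrange(order, children):
    """Move each test forward (leftward) until it precedes all of its children,
    preserving the initial coverage-based ordering as much as possible.
    order: list of test ids; children: dict mapping a test to its dependants."""
    index = {t: pos for pos, t in enumerate(order)}
    i = len(order) - 1
    while i >= 0:
        v = order[i]
        kids = children.get(v, [])
        if kids:
            m = min(index[u] for u in kids)
            if m < i:
                # v must run before its leftmost child: move it to position m
                # and shift the displaced items one slot to the right.
                order.pop(i)
                order.insert(m, v)
                for pos in range(m, i + 1):
                    index[order[pos]] = pos
                continue  # pointer stays in place (now reading the previous item)
        i -= 1
    return order

# Example: test 2 has the highest coverage but must follow tests 0 and 1.
# rearrange([2, 1, 0], {0: [2], 1: [2]}) returns [0, 1, 2].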

Updating algorithms

For each of the two greedy algorithms, we consider a range of different strategies for improving the quality of the returned schedules, at some cost to their runtime. In the updating versions of the algorithms, the coverage matrix is updated after each test is scheduled. For example, the initial coverage value of test t1 in Table 4.2 is 3, as it covers three lines of code. The weight is therefore w(t1) = 3. However, if t3 is selected as the first test in the prioritised test suite, thereby executing line l2 (among others), then the code-coverage value of t1 becomes 2. Line l2 will have been covered by test t3 before test t1 is executed – and to reflect this we update the weight w(t1).

This strategy involves updating the coverage after scheduling each new test, and recalculating the function w(·) to reflect the new coverage values. The approach applies to both Algorithms 11 and 12. It is particularly important for the latter scheme: maintaining up-to-date coverage information is a key part of the greedy Set Cover algorithm, which inspires it. The updating-coverage strategy has also been employed by other prioritisation algorithms in the research literature [71].

Considering time complexity, there are O(c · n + m) updates and, again, O(n) calls to find the item with the maximum w value, in some sort of priority queue. If c is constant, and m is proportional to n, then the running time of Algorithm 11 with updating will be O(n log n); otherwise, with a naive priority queue, it is O(n²) for both greedy algorithms.

k-lookahead greedy algorithm

So far, the algorithms have focused on what appears to be the best test at the time, greedily choosing the single test that has the best w(·) value. In the second main variant, the algorithm looks several steps ahead, considering the downstream effect of its local choice. We choose in advance the number of steps to look ahead, k. Then the “lookahead-value” function, wk, of a sequence of tests t1, t2,..., tk can be defined recursively as

wk(ti) = (k + 1) · w(ti) + max_{tj ∈ children[ti]} wk−1(tj) ,        (4.2)

in which w(ti) is the (non-lookahead) coverage value of test ti, typically given by coverage as w(ti) = |L[ti]|. Note that the weight for k = 0 is equivalent to no lookahead – that is, w0(t) = w(t).

In the expression for wk(·), w(ti) is multiplied by a factor which diminishes with each recursive step, which therefore assigns a heavier weight to tests earlier in the path. This encourages an increase in the rate of fault detection, not just the number of faults. For example, if k = 3, and there are two paths with the (non-lookahead) weights ⟨1, 3, 2⟩ and ⟨1, 2, 3⟩, we should choose the first path because both paths cover six items, but the first path finds them at a faster rate.
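A small recursive sketch of the lookahead value in Equation (4.2), checked against the two chains just mentioned; the dictionary-based DAG representation and the names are ours.

def lookahead_weight(t, k, w, children):
    """w_k(t) = (k+1)*w(t) + max over children of w_{k-1}(child); w_0(t) = w(t)."""
    value = (k + 1) * w[t]
    kids = children.get(t, [])
    if k > 0 and kids:
        value += max(lookahead_weight(c, k - 1, w, children) for c in kids)
    return value

# The two chains from the example above, with k = 3:
w = {'a1': 1, 'a2': 3, 'a3': 2, 'b1': 1, 'b2': 2, 'b3': 3}
children = {'a1': ['a2'], 'a2': ['a3'], 'b1': ['b2'], 'b2': ['b3']}
print(lookahead_weight('a1', 3, w, children))   # 4*1 + 3*3 + 2*2 = 17
print(lookahead_weight('b1', 3, w, children))   # 4*1 + 3*2 + 2*3 = 16

Both chains cover six lines in total, but the first receives the higher lookahead value because its coverage arrives earlier, matching the intuition above.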

Assuming that the wk(·) terms are calculated statically, rather than with updating w(·) components (as in coverage updating), the time taken to determine the k-lookahead values of every test is O(k · Nk−1 + m), where Nk−1 is the number of paths of length k − 1 in the graph. This running time is additional to the time taken to run Algorithm 11 or Algorithm 12. In our experimental evaluation (Section 4.5), we assess k = 1, k = 2 and k = 5. For most of our real dependency datasets, k = 5 amounts to near-complete lookahead.

Updating k-lookahead greedy algorithm

We can of course combine the “updating” and lookahead techniques. As before, we only calculate the updates when a new test is scheduled. Consequently, we do not update the calculation for subsequent steps on the lookahead path – this algorithm is simply the application of the coverage-updating strategy employed in the updating greedy algorithm, except with k-lookahead. The coverage matrix is recalculated after each test is scheduled, but the calculation of the k-lookahead value does not account for updates that would be made along the lookahead path. That is, if test t1 covers the same lines of code as its child t2, the lookahead weight for t1 will still weight t2 based on the coverage it would have achieved if t1 were not going to be scheduled first. With coverage updating, it is only after t1 has been scheduled that the weight assigned to t2 reflects the overlap in their coverage.

4.4 Special Cases

Bad cases for approximation

For the objective of completing all tasks as early as possible, the topsort greedy approach in Algorithm 11 can perform arbitrarily poorly. Consider the case where a single test tmax covers all lines of code (and, presumably, all defects), and has a single precedence from a test that covers only one line. If there are m − 1 other tests which each cover two lines, the greedy algorithm will always choose these tests over the test required to make tmax available. In the case where the weights are not updated after a test is scheduled, each of the m − 1 tests can cover the same two lines of code and still be given priority, despite adding no additional value. We therefore cover the majority of lines of code only with the very last test, when an optimal ordering would have achieved this after only two tests.

The coverage-first greedy algorithm in Algorithm 12 evades this case by ignoring precedences in the initial ordering. For the above example, this algorithm would identify the optimal test ordering by scheduling tmax first, and then moving its parent before it during the reshuffling. However, we can still construct equally poor examples for this strategy. Consider the following case: two tests t1 and t2 with no precedences, each covering half the lines of code, so that scheduling only these two tests would achieve complete coverage. We also have a test tx that covers all but one line of code, but has m − 3 parents in the precedence graph which each cover only one line. The initial ordering given by Algorithm 12 would schedule tx before t1 and t2, and the reshuffling would move all m − 3 of its parents to the beginning of the schedule. This schedule covers the last line of code with the final test, and potentially covers only one line of code until the third-from-last test – an optimal schedule would cover all lines of code with only two tests.

For the k-lookahead versions of these greedy algorithms, where the weight of an available test ti takes into consideration the coverage of tests dependent on ti for up to k stages, we can still construct similar examples. The example for Algorithm 12 remains the same – k-lookahead has little impact on this strategy, as the initial schedule ignores precedences. For Algorithm 11 we can consider the test tmax that covers all of the code to depend on a series of k tests that each cover a single line of code.

That is, test tmax can only be scheduled as the (k + 1)-th test on its dependency path. So rather than covering all lines of code by running just these k + 1 tests, this procedure instead schedules tests with smaller coverage first.
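To make the first construction concrete, the snippet below reuses the hypothetical topsort_greedy sketch from Section 4.3 (static weights, made-up coverage numbers) to build an instance of this kind; the greedy ordering defers the full-coverage test to the very end.

# Test 1 covers every line but has test 0 (one line) as its sole parent;
# tests 2-5 each cover two lines (possibly the same two).
coverage = {0: 1, 1: 100, 2: 2, 3: 2, 4: 2, 5: 2}
edges = [(0, 1)]
print(topsort_greedy(6, edges, lambda i: coverage[i]))
# Prints [2, 3, 4, 5, 0, 1]: the high-coverage test runs last, whereas the
# ordering starting 0, 1 would have covered everything after two tests.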

Approximation bounds for special cases

In the absence of precedence relations, our two greedy algorithms perform equivalently. Algorithm 11 is unconstrained as to which sets are selected, reducing it to the same behaviour as the loop at line 4 in Algorithm 12. Similarly, Algorithm 12 does not need to reorder tests to meet precedences. This means that both algorithms simply sort the tests in decreasing order of weight. If we use coverage updating, we obtain the standard greedy algorithm used for coverage problems such as set cover, for which it gives an O(ln n)-approximation [54].

When the coverage sets are disjoint, this problem reduces to single-machine scheduling with precedences and weights. The AF scores are equivalent to the makespan of the schedule. For this case the solution is trivial: regardless of the test ordering we will need to run all tests – the AF is always 1. The APFD scores relate to the sum of the flowtimes of the jobs/tests. This problem has a 2-approximation algorithm, which is optimal unless P = NP [46].

4.5 Experiments

In this section, we evaluate our algorithms using software systems built for industry – several of which are currently in use. Four of our systems have real precedence constraints, while the remainder are tested with synthetic precedences.

Table 4.3: Metrics for the systems being tested

Artifact     Type        Lines of code   Faults   Tests   Precedence constraints
gsm1         unit        385             15       51      65
gsm2         unit        975             14       51      65
czt          component   27,246          27       548     314
bash (×6)    system      ≈ 59,800        3–8      1061    461

The goal of this evaluation is to provide evidence to determine the effectiveness of the algorithms presented in Section 4.3 at improving the rate of fault detection. We compare our algorithms against the algorithms defined in the previous work by Haidry and Miller [98] on dependency-based prioritisation. In addition, we compare against a “random” algorithm, which is the same as the greedy algorithm presented in Algorithm 11, except that instead of choosing the best node at each step, it chooses a random node. This strategy is intended to represent the performance of an unprioritised test ordering that still meets the precedence constraints.

Objects of analysis

We performed the evaluation on four real systems, described here. We obtained test suites for three of these systems from a developer on the CZT project. The exceptions are the GSM test suites, which were used in the work of Haidry and Miller [98], and the test suites from the Software-artefact Infrastructure Repository (SIR)∗.

Table 4.3 describes the systems, including the number of disconnected tests (tests with no associated dependency relations), the depth of the graph, and its density (the proportion of potential edges present, m/[n(n − 1)/2]). There are in fact six versions of the Bash system – folded into one line in the table – so the lines of code, number of functions, and number of faults vary between versions.

There are two distinct classes of data used for our experiments: those with real precedence information and those for which precedences were randomly generated.

Real Precedences

∗See http://sir.unl.edu/ [67].

gsm1 and gsm2 The first software system we consider is GSM 11.11, a file system with security permissions on a Subscriber Identity Module (SIM) for mobile devices. When the correct codes are presented to the system, files can be accessed, read and updated, and the access permissions modified. The GSM 11.11 system is described in more detail in the initial work by Miller and Strooper [153]. We use the code-coverage and precedence data generated by Haidry and Miller [98], based on the two GSM implementations described by Miller and Strooper. One of these versions is used in the field as part of Sun’s Javacard Framework; the other was written solely for demonstration purposes. We use three versions of the fault-location data described by Miller and Strooper: two describe real faults, and one contains faults seeded by a developer.

czt The next system is based on a typechecker and parser from the Community Z tools† package for the Z and Object-Z specification languages. We again use the test suites generated by Haidry and Miller [98], who describe these datasets in greater detail. The precedence graph represents data dependencies between tests, with a precedence indicating that a test references a declaration made in a preceding test. Fault-location data is based on log messages in the project repository.

bash This system is based on the well-known Unix shell, the Bourne-again shell (Bash), obtained from the Software-artefact Infrastructure Repository (SIR). The package contains six versions of the Bash system, with the number of faults ranging from three to eight – the test suite is the same across all six versions. The precedence graph used for the bash datasets was put together by Haidry and Miller [98], who used information extracted from the test suites in the package. The precedence constraints are based on a coarse-grained test suite, with a precedence constraint defined when a variable is defined in one test and used in a subsequent test. Our test orderings are based on a fine-grained test suite in which each test is separate.

†See http://czt.sourceforge.net/.

Table 4.4: Metrics for the systems being tested from the Software-artefact Infrastructure Repository [67].

Artifact        Lines of code   Faults   Fault Type   Tests    Description
tcas            173             41       Seeded       1608     altitude separation
schedule        412             9        Seeded       2650     priority scheduler
replace         564             32       Seeded       5542     pattern replace
print tokens2   570             10       Seeded       4115     lexical analyser
tot info        565             23       Seeded       1052     information measure
space           9,564           35       Real         13,585   array definition interpreter

Synthetic Precedences

Here we describe datasets that lack real precedence constraints – synthetic precedences for these datasets were generated for the purposes of our experiments. These synthetic precedence graphs were randomly generated such that there would be O(n) edges in expectation for a test suite with n tests. As with the bash data described above, these programs were obtained from the Software-artefact Infrastructure Repository (SIR).
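The thesis does not specify the generator used; one simple way to obtain a random precedence DAG with O(n) edges in expectation – an illustrative assumption on our part – is to fix a random topological order and include each forward edge independently with probability c/n.

import random

def random_precedence_dag(n, c=2.0, seed=None):
    """Return arcs (u, v), u preceding v, over n tests; roughly c*(n-1)/2 arcs
    in expectation, i.e. O(n) for a fixed constant c. Illustrative only."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)                    # a random topological order
    p = c / n                             # probability of each forward edge
    return [(order[i], order[j])
            for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]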

Siemens The Siemens datasets are a widely used resource for research in fault detection, and represent a range of applications, from aircraft collision avoidance to lexical analysers. These datasets were created for the purpose of studying the effectiveness of fault-detection heuristics, by manually seeding the base programs with faults intended to represent the types of fault that appear in practice. Fault coverage is based on which tests identify faults across the different versions of the implementations. For each of these datasets we also consider a “two-bug” instance, based on combining pairs of the initial faults.

Space The space dataset represents a codebase for grammar-checking an array-based language. It consists of 9,564 lines of code (of which 6,218 are executable), and a test pool of 13,585 tests. The initial data had three versions, each containing a single fault: we use a two-fault version of the data based on combining pairs of the original faults.

Independent variables

The independent variables in the experiment were the techniques used to prioritise the test cases:

• Random ordering [rand]: A random test suite is considered as a baseline, and is intended to represent the case of a test suite that is prioritised arbitrarily. We generate 10 randomly ordered test suites and take the average result to represent the performance of random test suites.

• Code-coverage-based techniques: These techniques rely on the coverage data to estimate the locations of faults.

1. Topsort prioritisation with function coverage [ts-c]: This technique uses Algorithm 11, with the weight function measuring the value of a test, w, being based on its function coverage.

2. Coverage-first prioritisation [cf-c]: This technique uses the greedy algorithm, Algorithm 12, with the weight function measuring the value of a test being its function coverage.

3. Updating greedy prioritisation [uts-c, ucf-c]: These techniques use the “updating” algorithms from Section 4.3, and use function coverage to weight the tests.

4. Greedy with k-lookahead: These techniques use the k-lookahead greedy algorithm from Section 4.3 on top of the aforementioned coverage-based greedy schemes. We evaluate three additional values for k: k = 1, k = 2, and k = 5. The base algorithms without lookahead can be considered as k = 0.

• Defect-coverage-based techniques: These algorithms have access to the ground-truth data – so they know which tests will identify each fault.

1. Defect-aware greedy prioritisation [ts-d, cf-d, uts-d, ucf-d]: These are the same as the coverage-based greedy approaches, but with the “ground truth” defect data used in place of the coverage values for weighting tests.

• Dependency-based prioritisation (DSP) [98]: These techniques use the precedence data to estimate fault locations under the assumption that more complex dependency structures correlate with higher coverage. Both DSP techniques are static techniques: they do not use coverage information, such as statement and branch coverage of tests.

1. Dependency-based prioritisation by volume [dsp-v]: This technique, taken from the work of Haidry and Miller [98], uses the structure of the precedence graph to prioritise tests. Based on the assumption that precedences between tests represent interactions in the underlying systems, this technique prioritises complete sequences of tests based on the dependencies between them. Test sequences are prioritised by executing denser sub-graphs earlier. Tests are weighted based on their direct and indirect dependants. The available test with the highest weight is chosen, and a weighted DFS technique is used to schedule the tests in the subgraph dependent on that test.

2. DSP height [dsp-h]: Similar to DSP-volume, this technique prioritises scenarios by executing longer paths in the precedence graph earlier in the test sequence. The overall strategy is identical to DSP-volume but with tests weighted by the height of the subgraph rooted at that test. As with dsp-v, this algorithm is based on the work of Haidry and Miller [98].

Studies on prioritisation techniques often also consider optimal test orderings. These orderings use known fault information to provide an upper bound on the rate of fault detection; as such, they are not true prioritisation techniques, but are used for evaluation only. However, for test suites with precedence constraints, calculating the optimal value is NP-hard [132], so we do not consider optimal orderings here.

Setup of case studies

Figure 4.6: Combined APFD and AF results for real and synthetic dependencies

Test suites are based on those used by Haidry and Miller [98], who generated coverage data by executing the tests to determine which lines of code were executed, and which faults they identified. This gave us the defect and coverage data – details of how the dependency data was generated (where relevant) can be found in Section 4.5. We then developed scripts that use these three sources of information to automatically calculate the orderings for each of the techniques being applied. From these schedules, we generated the AF and APFD scores. For datasets with existing dependency data, we calculated orderings and scores for 10 random sub-graphs of the existing precedence graphs. Where real dependency data was unavailable, we instead generated 10 random dependency graphs. This gives us 10 schedules for each instance we are testing, allowing us to reduce the observational error of the prioritisation techniques.
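As an illustration of the scoring step, the following sketch computes APFD for a complete schedule using the standard formula from the prioritisation literature; the thesis’s own definitions of APFD and AF appear earlier in the chapter, so this is only an approximation of that setup, and the data layout and names are ours.

def apfd(schedule, detects):
    """schedule: all n tests in execution order.
    detects: maps each fault to the set of tests that reveal it.
    APFD = 1 - (sum of first-detection positions) / (n * m) + 1 / (2n)."""
    n, m = len(schedule), len(detects)
    position = {t: i + 1 for i, t in enumerate(schedule)}        # 1-based
    first = [min(position[t] for t in tests) for tests in detects.values()]
    return 1.0 - sum(first) / (n * m) + 1.0 / (2 * n)

# Toy example: three tests, two faults.
# apfd(['t1', 't2', 't3'], {'f1': {'t1'}, 'f2': {'t3'}}) = 1 - 4/6 + 1/6 = 0.5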

Figure 4.7: APFD box plots for 10 random sub-graphs of the precedence graph.

4.6 Results

In this section, we present the results of the experiments outlined in Section 4.5. Figure 4.6 compares the APFD and AF scores across all of our approaches based on the number of lookahead steps used. Note that lookahead values are not relevant for the random schedules (which do not weight the tests), or for the dependency-based techniques – which always use the entire dependency structure to weight tests. For datasets with precedence information, the box plots are the results of running the algorithms on 10 random sub-graphs of each precedence graph and taking APFD and AF measures for each schedule. For datasets without real precedence information, 10 synthetic precedence graphs were generated, with the expected number of edges being O(n) if there are n tests.

Figure 4.8: AF box plots for 10 random sub-graphs of the precedence graph.

Comparing the results for the greedy algorithms based on coverage (ts-c, cf-c, uts-c and ucf-c) against those with defect data (ts-d, cf-d, uts-d and ucf-d), we can see that the performance of both greedy algorithms increases significantly when the defect locations are known. However, our code-coverage-based solutions are still better than the dependency-based solutions, suggesting that code coverage is a better estimator of fault locations.

Accuracy of Coverage Information

We use two sources of data for assigning weights to the tests our greedy algorithms are prioritising. We test the performance of code-coverage data as an estimator for defect locations, using coverage matrices that indicate the lines of code covered by each of the tests. As a baseline to determine how much the performance of our algorithms is affected by the accuracy of the underlying data, we use the “ground truth” defect-coverage data.

Figure 4.9: APFD and AF results for real datasets with randomly generated synthetic dependencies.

The performance of these algorithms strongly depends on how well the information used in creating the ordering matches the “ground truth” of the defect locations. When using either coverage or dependency data as a stand-in for defect data, the quality of the results is affected significantly by how well that data matches the ground truth. The defect-coverage-based tests demonstrate that our algorithms can perform very well under ideal circumstances, where we have accurate information on which tests will identify particular faults.

Comparison of Greedy-Style Algorithms

In this section we compare the topsort and coverage-first greedy schemes introduced in Section 4.3, and the differences in their performance given coverage versus defect data.

We find that both the topsort and coverage-first greedy algorithms perform significantly better on all datasets when the true location of defects is known in advance. While this is unsurprising, the degree to which these algorithms can be misled by the coverage data is notable – many datasets (such as czt) perform worse with increased lookahead. Furthermore, we find that updating the coverage information gives worse results on the gsm1 dataset when using the coverage data, suggesting that the coverage data misrepresents the true fault locations. When the coverage data does not accurately model the fault locations, strategies that better fit the coverage data will deviate more from solutions that fit the ground truth.

Coverage Updating

In this section, we look at the impact of the updating scheme on our greedy procedures.

We find that coverage updating is the most consistent of the strategies used, and is particularly important in guaranteeing reasonable results for both of our greedy schemes. In particular, updating the coverage information is an important aspect of the set-cover greedy algorithm, and similarly we find it to be most useful for our coverage-first approach to scheduling. The exceptions to this are the gsm1, tot info and replace datasets, for which updating the coverage information gives consistently bad results for code-coverage-based test orderings. However, even for these instances the fault-coverage-based results are improved by coverage updating. This implies that the poor results were due to dissimilarities between the code-coverage data and the fault locations – updating the coverage information could lead to over-fitting to the incorrect data.

k-lookahead

The impact of lookahead on the greedy schemes seems to vary significantly. One explanation is that accounting for lookahead can prioritise tests with a larger number of descendants over those that achieve more immediate coverage. For the APFD metric in particular, which emphasises fast growth, this can be suboptimal. It would be of interest to consider alternative lookahead schemes, most likely with a lower emphasis on less immediate tests.

In Figure 4.7 there is little notable improvement in APFD when using lookahead with the coverage-updating greedy algorithm, particularly when the ground truth is known. For the topsort algorithms, there is notable improvement in both AF and APFD scores for one lookahead step; however, additional steps seem to have less impact. This is particularly notable for AF scores. In general, there seem to be diminishing returns from additional lookahead after the first step.

We find lookahead to be more detrimental for APFD scores than for AF scores. This is likely due to APFD scores placing greater value on immediate coverage, while lookahead weights tests by potential future coverage. For some datasets, such as gsm1 and czt, we find that increased lookahead decreases both APFD and AF scores – the rate of coverage is worse, but the final fault is identified earlier.

When the coverage information is not accurate (either because we do not have correct defect locations, or because we are not updating the current coverage state) the APFD can degrade with increased lookahead. This implies that we might be over-fitting our solutions to the inaccurate coverage information, as we also observed in our discussion of coverage updating. It is notable that the datasets where lookahead has the most negative impact on APFD (such as gsm1 and czt) also have the largest improvement in AF when using coverage updating and true defect-location information, which supports the hypothesis that our coverage information is less accurate in these instances.

Synthetic precedences

We consider our results for datasets with synthetic precedences, based on the Siemens and Space test suites. These datasets are significantly larger than those with real precedence information, giving some insight into the scalability of our algorithms.

The results for these datasets show many of the same trends as above. We get consistently strong results when the location of defects is known, and relatively good performance otherwise, depending on the accuracy of the coverage data. The updating procedure gives a notable improvement when ground-truth defect information is available, but can be misleading when relying only on function coverage. For all of the examples in Figure 4.9, coverage updating gives worse results when coverage data is used, and better results when defect locations are known.

We find that the coverage-first algorithms are generally unaffected by lookahead. This is not particularly surprising, as these algorithms only account for dependencies after the initial prioritisation stage, allowing high-coverage tests to be scheduled earlier without the need for expensive lookaheads.

In Figures 4.10 and 4.11, for the replace test suites (both single-bug and two-bug instances) from the Siemens datasets, the topsort algorithm shows a notable improvement in both APFD and AF with each additional lookahead step. For other datasets there is little impact from lookahead, other than the occasional slight improvement for single-step lookahead.

Because the precedence relations used were randomly generated, they do not represent any actual relations between the functions. Consequently, the dsp-v and dsp-h algorithms (which use the structure of the dependency graph to predict coverage) typically do not perform as well on these tests.

Heuristics and Lookahead Schemes

The performance of both of our greedy algorithms depends heavily on having appropriate weights assigned to the tests, so we test a variety of approaches to find the most appropriate for each of the algorithms.

Figure 4.10: APFD results for Siemens datasets with randomly generated synthetic dependencies.

Figure 4.11: AF results for Siemens datasets with randomly generated synthetic dependencies.

For weighting tests with lookahead, we compare three different schemes. For a given lookahead depth k at a test t, we might base the lookahead weight of t on the coverage of the depth-k subtree rooted at t, or we may instead only want to consider the tests in the maximum-weight depth-k path from t. Alternatively, we can use the next k tests that might be selected from the subtree rooted at t, at each step selecting the available test ti from the subtree that maximises the weight (either |L[ti]| or |F[ti]|, depending on whether code or fault coverage is used), then adding dependants of ti to the set of available tests. Formally, for a test t, lookahead depth k, and non-lookahead weight w(t) = |L[t]|, we have the following lookahead weights (a small code sketch comparing the three schemes follows the list):

• Path-based lookahead This is the default lookahead scheme used in previous tests. We can represent the weight function as:

wk(t) = (k + 1) · w(t) + max_{tj ∈ children[t]} wk−1(tj) .

The idea is to weight the current test t based on the highest coverage amongst the depth-k paths rooted at t in the dependency graph.

• Tree-based lookahead This lookahead scheme intends to capture all of the tests that are dependent on test t, rather than limiting lookahead to a path-like structure. The idea behind this is that a test t might have two children with high coverage, but a path-based lookahead will only account for one of them. We can represent this as:

wk(t) = (k + 1) · w(t) + ∑_{tj ∈ children[t]} wk−1(tj) .

• Best-first lookahead Instead of structuring the lookahead as a tree or path, we might weight test t based on what the greedy algorithm would select from the dependants of t. For example, after t is scheduled, we choose the highest-weighted test ti from the children of t – the next lookahead step will choose the highest-weighted test from the children of both t and ti. The idea is to find a balance between tree- and path-based lookahead – the highest-coverage sequence of tests might include multiple children of t, or be closer to a depth-k path. We represent this as:

wk(t) = (k + 1) · w(t) + ∑_{j=1}^{k} max_{tj ∈ Aj} (k + 1 − j) · w(tj) ,

where Aj represents the set of nodes that have been made available by the previous j choices in the lookahead sequence. The initial value of A1 is the set children[t], and after each lookahead step, when we select a test tj, the new set Aj+1 is defined by removing tj from Aj and adding children[tj], to reflect which tests might now be available.

Figure 4.12: Comparison of lookahead schemes (path, best-first and tree) on the gsm2 dataset: (a) APFD scores, (b) AF scores.
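As promised above, here is a minimal sketch of the three lookahead weights side by side; the recursive formulations follow the formulas above, and all names and the toy example are ours.

def path_lookahead(t, k, w, children):
    # w_k(t) = (k+1)*w(t) + max over children of w_{k-1}
    kids = children.get(t, []) if k > 0 else []
    best = max((path_lookahead(c, k - 1, w, children) for c in kids), default=0)
    return (k + 1) * w[t] + best

def tree_lookahead(t, k, w, children):
    # w_k(t) = (k+1)*w(t) + sum over children of w_{k-1}
    kids = children.get(t, []) if k > 0 else []
    return (k + 1) * w[t] + sum(tree_lookahead(c, k - 1, w, children) for c in kids)

def best_first_lookahead(t, k, w, children):
    # w_k(t) = (k+1)*w(t) + sum_{j=1..k} (k+1-j) * w(greedy choice from A_j)
    value, available = (k + 1) * w[t], set(children.get(t, []))
    for j in range(1, k + 1):
        if not available:
            break
        tj = max(available, key=lambda x: w[x])    # best test available at step j
        value += (k + 1 - j) * w[tj]
        available.remove(tj)
        available.update(children.get(tj, []))     # A_{j+1}
    return value

# A test with two heavy children: tree-lookahead counts both, the others one.
w = {'t': 1, 'a': 4, 'b': 4}
children = {'t': ['a', 'b']}
print(path_lookahead('t', 1, w, children),        # 2*1 + 4     = 6
      tree_lookahead('t', 1, w, children),        # 2*1 + 4 + 4 = 10
      best_first_lookahead('t', 1, w, children))  # 2*1 + 1*4   = 6

Note that none of these sketches account for coverage overlap along the lookahead sequence, which is one of the issues identified below.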

For the majority of datasets, there is little to no impact across the different lookahead schemes. In general, the tree-based lookahead pattern works slightly more poorly than the alternatives, which are mostly identical. For real datasets, there is negligible impact from different lookaheads – likely due to the small size of these datasets. An example can be seen in Figure 4.12, which shows the AF and APFD scores for all three lookaheads on the gsm2 dataset. There is very little change between the solutions for this dataset – the remaining real datasets have even less difference across the lookahead schemes.

For the larger datasets with synthetic precedences, there is more of an effect, although it varies significantly. On the replace dataset (Figure 4.14), the tree-based and best-first lookahead schemes perform better when coverage data is used, while path-based lookahead is most reliable when defect-coverage data is available.

On the tot info dataset (Figure 4.13), path-based lookahead has almost no effect. Meanwhile, best-first lookahead affects the scores only after 5 steps, while tree-based lookahead requires only a single step. However, the impact of these lookahead strategies is not always positive, and the variance almost always increases, indicating that these lookahead strategies are higher risk.

Figure 4.13: Comparison of lookahead schemes (path, best-first and tree) on the tot info single-fault dataset: (a) APFD scores, (b) AF scores.

The print tokens2 dataset (Figure 4.15) again shows the reliability of path-based lookahead over the alternatives – best-first lookahead has worse results after five steps of lookahead, while tree-based lookahead is nearly always worse than no lookahead for this dataset. This is likely because the lookahead schemes do not account for the elements covered earlier in the lookahead – this is particularly important for tree-based lookahead, where the same element can be counted many times, biasing the weights disproportionately towards elements that are easier to cover. For many cases where path-based lookahead has no impact, the other schemes do, suggesting that this lookahead strategy has less impact on the weights. Based on these results, we can identify some issues with these lookahead strategies.

• The weight assigned to child tests does not account for overlaps with the coverage of the parent (no updating). There might be significant overlap along the lookahead sequence, but this is not reflected in the weight that is assigned. It might be more effective to take the union of the coverages over the lookahead sequence, rather than to simply sum the weighted coverage values.

• These lookahead schemes do not account for multiple parents when considering tests as available. The highest-coverage child of a test t might have many other precedences, preventing it from being scheduled directly after t – this is not reflected in our lookahead schemes. It might be useful to constrain the lookahead sequence to tests that are immediately available, or to assign weights according to the proportion of precedences that will be met.

4.7 Conclusions

In this chapter, we defined a series of techniques for coverage-based test prioritisation for test suites with precedence constraints. Even in the absence of precedence constraints, we proved that the AF and APFD objectives are NP-hard to optimise, by relating them to set cover and min-sum set cover respectively. Scheduling with precedence constraints is also known to be NP-hard [132]. We have therefore focused on heuristic approaches for improving fault-detection rates. The prioritisation algorithms we have introduced are both based on a standard greedy algorithm, but rely on different methods to preserve the precedence constraints.

We performed an empirical evaluation of these techniques on four real systems, comparing them to the existing dependency-based prioritisation techniques presented in earlier work. In our evaluation of coverage-based techniques, code coverage is used to estimate the defect-coverage values of the test suites.

Our results have shown that we achieve higher rates of fault detection when fault locations are accurately modelled. For both our topsort and coverage-first algorithms, we found that the combination of k-lookahead with updating was the best technique when reliable coverage data was available. For topsort, typically k = 1 or k = 2 was sufficient, while coverage-first was less dependent on lookahead, and often performed well without it.

When coverage data is used to estimate fault locations, the results are more variable. We have found that both the topsort and coverage-first algorithms perform well with coverage data for the larger datasets (such as the Siemens datasets and space). However, for the smaller datasets there is no consistent winner, as both coverage updating and k-lookahead can either positively or negatively impact the results depending on the underlying data.

In general, we found that greedy prioritisation with coverage performed better than either of the dependency-based techniques, regardless of which of the coverage data sets was used. This suggests that, while coverage data might not accurately model fault locations, it is often a better estimator than dependency structures.

Future Work

Both of our greedy algorithms rely on a weight function to prioritise tests; it is of further interest to determine the impact that alternative weight functions have on performance. Our initial evaluations of different weight functions have focused on altering the lookahead strategy, which we have found to be prone to over-fitting. Our experiments on alternative lookahead strategies have revealed that, while the initial path-based lookahead scheme was the most reliable, alternative lookaheads can be of value. All of the lookahead schemes we considered used the same weighting strategy for tests based on lookahead depth – it would be of interest to measure the impact of increasing or decreasing the weight assigned to further lookahead steps. Our lookahead strategy also does not account for which defects had been covered by tests earlier in the lookahead path – while accounting for this would add to the complexity of the algorithms, it would likely improve performance. We speculate that lookahead strategies that consider the union of the coverage sets (rather than the sum of the individual magnitudes) might be more effective.

There is also potential to expand this work to parallel machine models, which have been widely studied in the machine scheduling context. Very little work has been done on extending test-case prioritisation models to allow for parallel machines – Qu et al. [166] introduced the problem of test-case prioritisation with multiple processing queues, along with a parallelised version of the APFD effectiveness measure, APFDp. Extending this model with precedence constraints remains open work.

Figure 4.14: Comparison of lookahead schemes (path, best-first and tree) on the replace single-fault dataset: (a) APFD scores, (b) AF scores.

Figure 4.15: Comparison of lookahead schemes (path, best-first and tree) on the print tokens2 single-fault dataset: (a) APFD scores, (b) AF scores.

5 Min-Sum Set-Cover with Precedences

Min-sum set cover is a set-ordering problem, in which the objective involves minimising the sum of cover times for set elements. It allows us to model scheduling problems in which each job is a collection of sub-tasks – rather than minimising job completion times, we consider the completion of the sub-tasks.

We can use this as a model of fault-coverage times, with sets representing the tests, and set elements indicating which faults are identified by a given test. With test-case prioritisation, we rely on heuristic approaches as we do not have complete information available. Assuming fault-coverage data is somehow available in advance, we can interpret the APFD metric in terms of minimising the sum of cover times for each of the (deterministic) defects. This is related to certain types of machine scheduling models, as well as to the well-established min-sum set cover (MSSC) problem. We can model our precedence-constrained variant of test-case prioritisation by adding precedences to the min-sum set cover setup – we will refer to this as precedence-constrained min-sum set cover (precMSSC).

The min-sum set cover problem is related to classic NP-hard problems such as set cover, machine scheduling and the linear arrangement problem. It has connections to scheduling [57], set covering [154] and vertex covering [77] frameworks in a range of contexts including machine learning, query optimisation and

software testing, hinting at the wide applicability of this problem.

Given a universe of elements and a collection of sets which together cover that universe, the standard set cover model involves finding a minimum-size collection of sets such that all elements are covered. For min-sum set cover (MSSC), the objective is to find an ordering of the sets such that the sum of coverage times for the individual elements is minimised.

Given an ordering Y = (A1, A2, ..., Am) of sets covering a universe U of n elements, let C(i) (the cover time of i) be the minimal index j such that Aj covers element i. As a reminder, we have the following definition:

Definition 38. Given a collection of sets S covering a universe U, the min-sum set cover problem involves finding an ordering of the sets in S such that the sum of cover times for the elements in the universe is minimised. The objective is to find a permutation of S such that ∑_{i∈U} C(i) is minimised.

That is, we are concerned with minimising the individual latencies totalled across all of the elements – this relates to minimising the sum of completion times in a machine scheduling setting, but with the additional complication of covering elements. Min-sum set cover admits a 4-approximation, using a greedy algorithm, and this is best possible for this problem [77].

We present a √(max(n, 2m))-approximation algorithm for precMSSC, along with improved approximations for some special cases. Furthermore, we prove that if we could obtain approximation ratios better than polynomial, then we would have a counter-argument to the planted dense subgraph conjecture.
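As a point of reference, here is a minimal Python sketch of the unconstrained MSSC greedy rule – repeatedly pick the set covering the most still-uncovered elements – together with the objective it is judged on; the function names and data layout are ours, purely for illustration.

def greedy_mssc(sets):
    """Order the sets by repeatedly choosing the one covering the most
    still-uncovered elements (the classic greedy rule for MSSC).
    `sets` maps a set id to a set of elements."""
    remaining = dict(sets)
    uncovered = set().union(*sets.values())
    order = []
    while remaining:
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        order.append(best)
        uncovered -= remaining.pop(best)
    return order

def sum_of_cover_times(order, sets):
    """Sum over elements of the (1-based) index of the first set covering them."""
    total, covered = 0, set()
    for j, s in enumerate(order, start=1):
        newly = sets[s] - covered
        total += j * len(newly)
        covered |= newly
    return total

# Example: greedy picks the large set first.
S = {'A': {1, 2, 3}, 'B': {3, 4}, 'C': {5}}
order = greedy_mssc(S)                       # ['A', 'B', 'C']
print(order, sum_of_cover_times(order, S))   # cover times 1,1,1,2,3 -> total 8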

Notation

We assume a universe U of n elements, and a collection of sets S = {S1, S2, ..., Sm} such that ∪i Si = U. Let u ≺ v for u, v ∈ S mean that u precedes v, i.e., that u must occur before v in every feasible ordering of the sets. We define a precedence digraph G = (S, E), where there is a directed edge from u to v in E for every pair u, v ∈ S where u ≺ v. Note that G is a DAG, in which each set is represented as a vertex, and an edge from u to v indicates that u ≺ v. Let u ⪯ v indicate that there is a precedence path from u to v – that is, u is an ancestor of v. For an ordering Y = (A1, A2, ..., Am) of S and an item u ∈ U, let C(u, Y) be the minimal index j of the set Aj ∈ Y such that u ∈ Aj. That is, C(u, Y) is the first time at which the item u is covered in the schedule Y. The objective of the precedence-constrained min-sum set cover problem is to find an ordering Y = (A1, A2, ..., Am) of S such that for all Ai, Aj ∈ Y with i ≤ j we have Aj ⋠ Ai, and ∑_{i∈U} C(i, Y) is minimised.

We will use parents[v] to indicate the sets with direct precedences to v – that is, the set {u : u ≺ v} – and children[v] to indicate the set {u : v ≺ u}. Furthermore, the set of ancestors of v, including v, is denoted P[v], and the set of all descendants of v (the set {u : v ⪯ u}) is D[v].

5.1 Our Contributions

We introduce a procedure for precMSSC that greedily chooses components of the graph to be scheduled, based on maximising the “effectiveness” at each step. The value of an individual set is considered to be the number of “new” elements that it covers. This is similar to the weighted shortest-processing-time rule used in machine scheduling [163]. However, for our problem the “processing time” is given by the number of sets included in the collection, and the “weights” are based on the size of the union of these (possibly overlapping) sets.

The greedy technique and the proof method we use are based on those of Feige et al. [77], extending their result to account for precedence constraints. We adapt their greedy method to select (sub)collections of sets rather than individual sets at each stage, as high-coverage sets might have precedences from lower-coverage sets – there is a trade-off between the number of elements covered by the sets and the number of sets in the (sub)collection.

We prove that precMSSC has a 4-approximation algorithm when the precedence graph is an in-tree, and an O(h)-approximation in the case of an out-tree (where h is the height of the tree). For the general problem of min-sum set cover with precedence constraints, with m sets covering n elements, we present a greedy algorithm that achieves an O(√(max(n, m)))-approximation. When selecting a (sub)collection of sets, we ensure that the precedences for each set in the collection can be met by only considering (sub)collections that are precedence closed.

Definition 39. A precedence-closed (sub)collection of sets A ⊆ S has the property that, for all v ∈ A and u ∈ S, if u ⪯ v then u ∈ A.

That is, given a collection of sets S with precedences G = (S, E), a precedence-closed set A is a subcollection of S for which there exists no edge (u ≺ v) ∈ E such that u ∈ S\A and v ∈ A. Our greedy algorithm for the precedence-constrained min-sum set cover problem schedules the locally ‘best’ precedence-closed subgraph at each stage of the procedure. This is similar to the approach used by Chekuri and Motwani [46] for scheduling precedence-constrained jobs to minimise the average weighted completion time – here, we have the additional complication that the “weights” relate to set coverage. In place of weights and processing times, we instead measure the value of a given collection of sets in terms of both its coverage and its size.

Definition 40. cov(A) is the set of elements covered by a (sub)collection of sets A – given as ∪_{Si∈A} Si.

Definition 41. For a (sub)collection of sets A, let r(A) = |A| / |cov(A)| denote the rank of A. In some cases, it is simpler to consider the inverse of the rank of a subgraph – we refer to the inverse rank as the density of a graph.

Definition 42. The density of a collection of sets A is the reciprocal of the rank, and is given by δ(A) = |cov(A)| / |A|.

Finding a precedence-closed subgraph of minimum rank is equivalent to finding one of maximum density. We will refer to this as the max-Density Precedence-Closed Subgraph (maxDPCS) problem. Sometimes we are concerned about the coverage of a collection of sets on a specific domain – to describe this, we overload the cov(·) notation.

Definition 43. cov(A, Y) is the set of elements in Y ⊆ U that are covered by sets in the (sub)collection A – that is, (∪_{Si∈A} Si) ∩ Y.

Before describing our algorithm and hardness results for precMSSC, we will introduce a series of results useful in finding precedence-closed subgraphs of maximal density. These will provide the backbone of our procedure for finding solutions to precedence-constrained MSSC.

5.2 Max-Density Precedence-Closed Subgraphs

The max-Density Precedence-Closed Subgraph problem (maxDPCS) is related to the Partially Ordered Knapsack (POK) problem, which involves selecting a precedence-closed set of items with maximal weight and bounded size. The primary distinction between these problems is that POK has a fixed bound k on the number of items that can be selected, and its values are additive rather than coverage-based. POK is a generalisation of densest-k-subgraph [126], for which there are no known constant-factor approximations.

We now consider the approximation ratio for maxDPCS on general precedence graphs. For every vertex Si ∈ S, we will use P[Si] to refer to the set of all ancestors of the set Si, including Si itself. For instances with m sets covering a universe of n elements, we will prove that a greedy algorithm can achieve a √(max(n, 2m))-approximation for this problem. Later we will prove that the procedure described in Algorithm 13 returns optimal solutions for instances where the precedence graph has an in-tree structure.

Algorithm 13 greedy-subgraph(P = (S, E), U)
1: max ← δ(S), SOL ← S
2: for Si ∈ S do
3:   if δ(P[Si]) ≥ max then
4:     max ← δ(P[Si])
5:     SOL ← P[Si]
6: return SOL
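A minimal Python sketch of Algorithm 13: it compares the density of the whole collection S against the density of every ancestor closure P[Si] and keeps the best; the data layout, names, and the small example are ours.

def greedy_subgraph(sets, parents):
    """Return the densest candidate among S and every ancestor closure P[Si].
    `sets` maps a set id to its elements; `parents` maps a set id to the ids
    of its direct predecessors in the precedence DAG."""
    def ancestors(s):
        closure, stack = {s}, [s]
        while stack:
            for p in parents.get(stack.pop(), []):
                if p not in closure:
                    closure.add(p)
                    stack.append(p)
        return closure

    def density(collection):
        covered = set().union(*(sets[s] for s in collection))
        return len(covered) / len(collection)

    best = set(sets)                       # start from S itself
    for s in sets:
        cand = ancestors(s)
        if density(cand) >= density(best):
            best = cand
    return best

# Example: set 'y' covers {1, 2, 3} but needs 'a' (covering nothing new),
# while 'x' covers {4} with no precedences.
# greedy_subgraph({'a': set(), 'y': {1, 2, 3}, 'x': {4}}, {'y': ['a']})
# returns {'a', 'y'} (density 3/2, beating {'x'} with density 1).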

Theorem 44. Algorithm 13 is a √(max(n, 2m))-approximation to maxDPCS for general precedence graphs.

Proof. Consider again the set P[Si] of the ancestors of a set Si – the minimal subgraph including Si that ensures that all precedences of Si have been met. Note that P[Si] is precedence closed for all Si. Algorithm 13 describes a simple greedy procedure that

finds the set P[Si] such that its density is maximal over all such sets – let δmax denote this density. First we consider the case where n ≥ m – since the density of S is n/m, this means that δmax ≥ 1.

Let OPT denote the collection of sets in the optimal solution, with size |OPT| = k and vertices S1, ..., Sk. If k > √n, then since there are only n elements to cover we have |cov(OPT)|/|OPT| < n/√n = √n; since even S is then a √n-approximation, certainly the greedy solution is.

We now consider the case where k ≤ √n. For all Si ∈ OPT, we know that

δ(P[Si]) ≤ δmax, as otherwise Algorithm 13 would return P[Si]. A simple rearrangement gives |cov(P[Si])| ≤ δmax |P[Si]|. Since the precedences of each Si ∈ OPT need to be met for OPT to be feasible, we know that P[Si] ⊆ OPT. If we sum over the vertices of OPT, we obtain the following relation:

|cov(OPT)| = |∪_{i=1}^{k} cov(P[Si])| ≤ ∑_{i=1}^{k} |cov(P[Si])| ≤ δmax ∑_{i=1}^{k} |P[Si]| .

It follows that if k ≤ √n, then |cov(OPT)| ≤ δmax k² ≤ δmax √n |OPT|, and therefore that the density of OPT is at most δmax √n; hence Algorithm 13 returns a √n-approximation.

Now consider the case where δmax < 1, which can only occur when m > n. We can increase the value of δmax by replacing each element by ⌈m/n⌉ copies of itself.

This gives us an instance with n · ⌈m/n⌉ ≥ m elements, ensuring that δmax ≥ 1, so the proof above still holds. However, as there are now between m and 2m elements, the approximation ratio for such instances is at most √(2m). Overall, we have proven that the approximation ratio of Algorithm 13 has a worst-case bound of √(max(n, 2m)).

Lemma 45. There are tight instances of maxDPCS where Algorithm 13 returns an Ω(√n)-approximation.

Proof. Let the sets a1, ..., aℓ each cover nothing, let the sets y1, ..., yℓ cover ℓ unique elements each, and let the set x cover a single element, also covered by one of the yi sets. So we have m = 2ℓ + 1 and n = ℓ² for some value ℓ. Assume the following precedence relations: ai ≺ yj for all i, j, so each yi set has ℓ precedences, while x has no precedences. This means that x would be chosen by Algorithm 13, with δmax = 1/1 = 1, as for each set yi we have

δ(P[yi]) = ℓ/(ℓ + 1) < 1. However, OPT would be the collection of all yi and aj sets, with density (ℓ · ℓ)/(ℓ + ℓ) = ℓ/2. The total number of sets is m = ℓ + ℓ + 1, and since ℓ = √n, δ(OPT) is √n/2. The ratio of the density of OPT to the solution of Algorithm 13 is therefore √n/2.

We can construct similar tight examples with no empty sets by allowing each set ai to cover the same element.

Figure 5.1: A bad precedence graph when δ < 1.

Lemma 46. There are tight examples where Algorithm 13 returns an Ω(√m)-approximation.

Proof. When m > n and δmax = max_{Si∈S} |cov(P[Si])|/|P[Si]| < 1, then the approximation ratio is at best in O(√m). Note that this can only occur if there are empty sets – if there are any sets without precedences that cover elements, then the density of such a set is no less than one.

If m > n, let the k − 1 sets a1, . . . , ak−1 and the k sets b1, . . . , bk be empty, while the sets y1, . . . , yn all cover a single unique element, and let set x cover an element also covered by a set yi – so there are 2k + n sets covering n elements. Between these sets we have the constraints a1 ≺ a2 ≺ · · · ≺ ak−1 ≺ x; b1 ≺ b2 ≺ · · · ≺ bk, and for all 1 ≤ i ≤ n, bk ≺ yi (see Figure 5.1). The subcollection ending in a single node with maximum density is a1, · · · , ak−1, x, with δ(P[x]) = 1/k – each yi subcollection has density 1/(k + 1) and all other sets have zero density. For this case OPT is the collection of the bi and yi sets, which gives a density of n/(k + n) ≤ 1. If k = n², then the ratio of the optimal density to δmax is kn/(k + n) = n³/(n² + n) → n as n → ∞. Since m = 2k + n = 2n² + n, as m, n → ∞, this ratio → √(m/2).

Further Greedy Algorithms for maxDPCS

We find that attempts to improve on the O(√n) result of greedy-subgraph give the same approximation guarantee. Algorithm 14 finds a solution to maxDPCS by greedily building a family of potential solutions, and returning the solution with the highest density. At each stage, we select a set that adds the most to the current solution – this allows us to account for overlap with elements in the existing solution set.

Algorithm 14 greedy-subgraph+(P = (S, E), U)
1: SOL ← ∅, F ← ∅
2: while |cov(F)| < |U| do
3:     A ← arg max_{Si∈S\F} δ(P[Si] ∪ F)
4:     F ← F ∪ P[A]
5:     Add F to SOL
6: return arg max_{Fi∈SOL} δ(Fi)

Algorithm 14 describes this procedure for finding a family of potential solutions by greedily adding to a solution F – of these solutions, the subcollection Fi that maximises the density δ(Fi) is returned as our result. We can construct examples where this algorithm performs poorly, similar to the approach used earlier.
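As a companion to the earlier sketch, the following is a minimal Python rendering of Algorithm 14, reusing the density() and ancestors() helpers defined above; the guard on |F| is ours, added so the loop terminates even if some element of U is not coverable.

    def greedy_subgraph_plus(sets, preds, cov, universe):
        # Algorithm 14: grow F by the closure with the best combined density,
        # record each intermediate F, and return the densest one
        universe = set(universe)
        F, covered, family = set(), set(), []
        while covered != universe and len(F) < len(sets):
            best = max((s for s in sets if s not in F),
                       key=lambda s: density(F | ancestors(s, preds), cov))
            F = F | ancestors(best, preds)
            covered = set().union(*(cov[s] for s in F))
            family.append(set(F))
        return max(family, key=lambda Fi: density(Fi, cov)) if family else set()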

Again, we assume sets a1, . . . , aℓ each covering the same single element and sets y1, . . . , yℓ that each cover ℓ unique elements. Let sets x1, . . . , x_{ℓ²} each cover one distinct element – where these ℓ² elements are the same ℓ² elements covered by the yi sets.

For the ai and yi sets, assume the same precedences as before: ai ≺ yj, ∀i, j, and let all xi sets be free of precedences.

In the first iteration, this gives us δ(yi) = ℓ/(ℓ + 1) ≤ 1 and δ(xi) = 1/1 for all i – Algorithm 14 would therefore select P[xi], for some i, as the first collection.

In the following iterations, we have δ(yi) ≤ ℓ/(ℓ + 1) (decreasing as new elements are covered) and, for all xi not currently in the collection, δ(xi) = 1 – the next ℓ² − 1 iterations will therefore select sets x2, . . . , x_{ℓ²}, with the yi sets added last. The maximum-density subgraph from the series of greedy solutions is δ(∪i xi) = 1 – at each of the first ℓ² steps, the density is one, and this decreases as the yi sets are added since they cover the same elements as the xi sets.

However, the optimal solution is given by all of the yi and ai sets, with none of the xi sets – this has density δ(OPT) = ℓ²/(ℓ + ℓ) = ℓ/2. So the optimal solution is ℓ/2 times the greedy solution. The total number of sets is m = 3ℓ, and the number of elements is n = ℓ² + 1 – so the approximation ratio is ℓ/2 = √(n − 1)/2. Since this algorithm is an extension of Algorithm 13, the previous approximation ratio still holds.

Lemma 47. The upper bound of √n on the approximation ratio for greedy-subgraph+ is tight up to a factor of 2.

We now consider two special cases in which we can obtain solutions in polynomial time. An out-tree is a directed graph whose underlying graph is a tree, with a fixed root vertex away from which all edges are directed; that is, every vertex in an out-tree either has exactly one parent or is the root vertex. An in-tree is similar to an out-tree, but with all edges directed towards the root node: every vertex in an in-tree either has exactly one parent or is the root vertex.

In-tree Solutions

Figure 5.2: An example in-tree precedence graph.

In this section, we will prove that Algorithm 13 for maxDPCS gives an optimal solution for precedence graphs with an in-tree structure.

Lemma 48. If the precedence graph is a forest of in-trees, then we can find a precedence-closed subgraph of maximal density in polynomial time.

Proof. For all vertices A and B in an in-tree precedence graph, we have either P[A] ⊆ P[B], P[B] ⊆ P[A] or P[A] ∩ P[B] = ∅. Consider OPT, an optimal solution to the problem. Let M(OPT) be the set of nodes in the optimal solution with no descendants in the optimal solution – so that OPT can be defined as ∪S∈M(OPT)P[S]. For a pair of vertices A and B in M(OPT), we must have P[A] ∩ P[B] = ∅ since no vertex in M(OPT) has a descendant in OPT. So the density of OPT can be bounded by:

δ(OPT) = |∪_{S∈M(OPT)} cov(P[S])| / ∑_{S∈M(OPT)} |P[S]| ≤ (∑_{S∈M(OPT)} |cov(P[S])|) / (∑_{S∈M(OPT)} |P[S]|) .

If every set in M(OPT) has the same density, then we can select an arbitrary set S ∈ M(OPT) as our solution, since δ(P[S]) will also be optimal. If there is some set S ∈ M(OPT) such that δ(P[S]) is larger than the density of the other collections in M(OPT) then there is a contradiction – because these P[S] sets are disjoint, we could get a better solution than OPT by selecting only P[S]. So there is an optimal solution defined by a single vertex and all of its precedences. By checking the density of P[A] for every node A, and selecting the set with the largest density, we can find a max-density precedence-closed subgraph in polynomial time. This is equivalent to the solution given by Algorithm 13, which finds the set A that maximises the density δ(P[A]).

Out-tree Solutions

For instances of maxDPCS where the precedence graph has an out-tree structure, we develop a recursive greedy approach whose solutions can be bounded in terms of the height h of the precedence tree. While the worst case for this is O(n), the average height of t-ary trees with n nodes is given by √(2πnt/(t − 1)) [78]. In general, we can expect the approximation ratio to be significantly better than the O(n) worst case

– for many applications (such as test-case prioritisation [98]) precedence graphs are shallow and sparse, and the approximation ratio h will be relatively small.

Figure 5.3: An example out-tree precedence graph.

This procedure is based on that used by Chekuri and Kumar [45] for the maximum coverage problem with group budget constraints. Let δ∗ be the density of the optimal tree (starting from the root), and note that |cov(OPT)| − δ∗|OPT| = 0, by the definition of density. Given a tree T rooted at v, a constant σ and a set Rv ⊆ U, we inductively assume we can (in polynomial time) find a subtree AT of T such that

σ|cov(AT, Rv)| − δ∗|AT| ≥ |cov(MRv(T), Rv)| − δ∗|MRv(T)| ,   (5.1)

where MRv(T) is the subtree of T defined by arg max_{t⊆T} |cov(t, Rv)| − δ∗|t|. We construct this solution recursively, starting with small trees (the leaves) where the optimal density is known and with σ = 1; we will show how to build a general solution. Starting at a vertex v, where the set of elements we want to cover is Rv, we will return a subtree AT of T rooted at v. Let children[v] be the set of children of v, and cv = |children[v]|.

At each step i of the algorithm, we make a series of recursive calls to find subtrees of each child of v (denoted A^i_{T1}, . . . , A^i_{Tcv}), and select from these the child A^i_{Tj} with the maximal value of σ|cov(A^i_{Tj}) ∩ R| − δ∗|A^i_{Tj}|, to be added to the current solution. Note that at each iteration of step 4 in Algorithm 15, the set of elements Rv that we are trying to cover can change based on the coverage of the previous iteration: Rv ← Rv \ cov(A^i_{Tj}).

Algorithm 15 subgraph-outtree(T = (S, E), v, Rv)
1: AT0 ← {v}
2: R ← Rv \ cov(v)
3: SOLS ← {AT0}
4: for iteration i ∈ 1 . . . cv do
5:     F ← ∅
6:     for j ∈ children[v] do
7:         if no subtree of Tj has yet been added to ATi−1 then
8:             Add A^i_{Tj} = subgraph-outtree(Tj, j, R) to F
9:     ATi ← ATi−1 ∪ {arg max_{A^i_{Tj} ∈ F} σ|cov(A^i_{Tj}, R)| − δ∗|A^i_{Tj}|}
10:    SOLS ← SOLS ∪ {ATi}
11:    R ← R \ cov(ATi)
12: if all these values are negative then
13:     return ∅
14: else
15:     return arg max_{AT ∈ SOLS} σ|cov(AT, R)| − δ∗|AT|

Lemma 49. Assuming inequality 5.1, Algorithm 15 returns a set ATi such that:

(σ + 1)|cov(ATi, Rv)| − δ∗|ATi| ≥ |cov(MRv(T), Rv)| − δ∗|MRv(T)| .

Proof. We prove this inductively. For our base case, we note that if MRv(T) is the empty tree, then the right-hand side of this inequality will be zero. As our algorithm has access to the empty tree, it is also able to return the empty set in this case. Now we prove the inductive case – for some vertex v with subtree T, if inequality 5.1 holds for the children of v, then we will prove that it follows for v also. Assume that the children of v are numbered in the order that they are selected by Algorithm 15: this ensures that in iteration i the subtree returned is A^i_{Ti}, which we can denote AT(i). Furthermore, let Ri denote the set R during iteration i – that is, the set of elements yet to be covered after the first i − 1 iterations.

Moreover, every time we select a subtree from tree Ti rooted at child i, the locally optimal tree MRv(T) selected a subtree from child i∗, where i∗ must be equal to i if the tree Ti contributed a subtree in MRv(T).

Let the projection of MRv(T) onto Ti be the subgraph of Ti induced by the vertices in MRv(T) ∩ Ti: denote this projection of MRv(T) onto Ti by MRv(T, Ti).

There are two cases we need to consider:

1. The tree AT(i) is MRv (T, Ti).

2. AT(i) is not MRv (T, Ti).

If AT(i) is not the projection of MRv(T) onto Ti (case (2)), then it must be because Algorithm 15 chose the subtree A^i_{Tj} that maximises

σ|cov(A^i_{Tj}, Ri)| − δ∗|A^i_{Tj}|

across all of the subtrees found during the loop at step 4. We can lower bound this value thus:

σ|cov(AT(i), Ri)| − δ∗|AT(i)| ≥ σ|cov(AT(i∗), Ri)| − δ∗|AT(i∗)|   (5.2)
                              ≥ |cov(MRv(Ti∗), Ri)| − δ∗|MRv(Ti∗)|   (5.3)
                              ≥ |cov(MRv(T, Ti), Ri)| − δ∗|MRv(T, Ti)|   (5.4)

where inequality 5.2 follows from our choice of AT(i) in round i, inequality 5.3 follows from induction and the definition of MRv(Ti∗), and inequality 5.4 follows from the optimality of MRv(Ti∗) on that particular input. As σ ≥ 1, these inequalities will also be true in the case that we do select MRv(T, Ti), covering case (1) described above.

Denote the number of children of v that are in MRv(T) by cv∗, and consider the state at iteration cv∗. The tree AT = A_{Tcv∗} will be the union of the root v and the cv∗ subtrees selected so far, and from the definition of how R was updated at each iteration, we have:

cov(AT, Rv) = cov(v, Rv) ∪ ∪_{i=1}^{cv∗} cov(AT(i), Ri) .

We also know that for every iteration i,

cov(MRv(T, Ti), Ri) ⊇ cov(MRv(T, Ti), Rv \ cov(AT)) ,

which follows from the definition of Ri – the set of elements in Rv not covered by ATi, the subtree of AT at time i. Therefore:

∑_{i=1}^{cv∗} |cov(MRv(T, Ti)) ∩ Rv| ≥ ∑_{i=1}^{cv∗} |cov(MRv(T, Ti)) ∩ (Rv \ cov(AT))|   (5.5)
                                     ≥ |∪_{i=1}^{cv∗} cov(MRv(T, Ti)) ∩ (Rv \ cov(AT))|   (5.6)
                                     ≥ |∪_{i=1}^{cv∗} cov(MRv(T, Ti)) ∩ Rv| − |cov(AT) ∩ Rv| .   (5.7)

Combining the result from inequalities 5.2–5.4 with the relation described by inequalities 5.5–5.7, we see that:

σ|cov(AT, Rv)| − δ∗|AT|   (5.8)
  = σ|cov(v, Rv) ∪ ∪_{i=1}^{cv∗} cov(AT(i), Rv)| − δ∗|AT|   (5.9)
  ≥ [|cov(v, Rv)| − δ∗|v|] + ∑_{i=1}^{cv∗} [|cov(MRv(T, Ti), Rv)| − δ∗|MRv(T, Ti)|]   (5.10)
  ≥ [|cov(MRv(T), Rv)| − δ∗|MRv(T)|] − |cov(AT, Rv)| .   (5.11)

The first relation arises from the way the subtrees of AT are greedily selected from each of the subtrees – it therefore follows that we can bound the value of the projection of MRv(T) on each subtree. The last inequality follows from applying inequalities 5.5–5.7, and the relation between the union of the MRv(T, Ti) subtrees and MRv(T). We conclude that

(σ + 1)|cov(AT, Rv)| − δ∗|AT| ≥ |cov(MRv(T), Rv)| − δ∗|MRv(T)| .

The number of children cv∗ in the optimal solution MRv(T) is unknown; however, the tree found by Algorithm 15 maximises (σ + 1)|cov(AT) ∩ Rv| − δ∗|AT|, and is therefore at least as good as the one found after cv∗ iterations. The tree from iteration cv∗ is one of the potential solutions considered at step 15.

Theorem 50. There is an h⁻¹-approximation algorithm for maxDPCS on out-trees, where h is the height of the input tree.

Proof. Applying the result of Lemma 49 recursively, starting with σ = 1 for each of the leaves and adding one for each recursive call, we obtain a solution tree ST such that at the root

h|cov(ST) ∩ U| − δ∗|ST| ≥ |cov(OPT) ∩ U| − δ∗|OPT| ,   (5.12)

where h refers to the height of the input tree. Since the right-hand side of inequality 5.12 will be zero (by definition of OPT), this can be rearranged to give |cov(ST)|/|ST| ≥ δ∗/h, so our solution has density no less than 1/h of the optimal solution. Note that we do not know the optimal density δ∗ – to find a solution guaranteed to meet this bound, we can try all feasible values of δ∗. This value must be in the set {a/b | a ∈ {1, . . . , |U|}, b ∈ {1, . . . , m}}, where m is the number of nodes in the tree (that is, the number of sets). This result also holds for the case where the precedence graph is a forest of out-trees – for such instances, we can convert the graph to a single out-tree by adding a dummy root with a precedence to the root of each tree in the forest.
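A minimal sketch of the δ∗-enumeration step described above, assuming the optimal density is always a ratio of an integer coverage count to an integer number of sets; the function name is ours.

    from fractions import Fraction

    def candidate_densities(universe_size, num_sets):
        # every feasible value of the optimal density delta* is a fraction a/b
        # with a in {1, ..., |U|} and b in {1, ..., m}
        return sorted({Fraction(a, b)
                       for a in range(1, universe_size + 1)
                       for b in range(1, num_sets + 1)})

Each candidate value would then be passed to Algorithm 15 as the guess for δ∗, keeping the densest subtree returned over all guesses.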

5.3 An Algorithm for precMSSC

In this section we describe an algorithm mssc-greedy that, given some algorithm maxDPCS(·) that returns an α-approximation to the min-rank precedence-closed subgraph problem, will return a 4α-approximate solution for the precedence-constrained min-sum set-cover problem. At each stage of the procedure, mssc-greedy selects the precedence-closed subgraph that minimises the rank over the set of currently uncovered elements, and adds it next in the schedule – the sets within the subgraph can be arbitrarily ordered such that the precedence constraints within the subgraph are met. Since we always choose a precedence-closed collection of sets, precedences outside the selected subgraph are already met.

Note that mssc-greedy finds a maximum-density precedence-closed subgraph at most n times, and that topologically sorting each of these subgraphs takes polynomial time [115; 181].

Algorithm 16 mssc-greedy(S, P, U)
1: A ← ∅
2: i ← 0, n ← |S|
3: while |A| ≠ n do
4:     Ai ← maxDPCS(S, P, U)
5:     Topologically sort Ai and add the result to the end of A
6:     P ← (S \ Ai, E), the subgraph of P induced by the vertices in S \ Ai
7:     S ← S \ Ai, U ← U \ cov(Ai), i ← i + 1
8: return A
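A minimal Python sketch of Algorithm 16, under the same data representation as the earlier sketches. Here max_dpcs(sets, preds, cov, universe) stands in for any (approximate) max-density precedence-closed subgraph routine – for instance a wrapper around the greedy_subgraph sketch above – and the empty-result guard is ours, not part of the pseudocode.

    from graphlib import TopologicalSorter

    def mssc_greedy(sets, preds, cov, universe, max_dpcs):
        sets, universe = set(sets), set(universe)
        preds = {s: set(preds.get(s, ())) & sets for s in sets}
        schedule = []
        while sets:
            # restrict coverage to the still-uncovered elements (U <- U \ cov(Ai))
            cov_i = {s: set(cov[s]) & universe for s in sets}
            A_i = set(max_dpcs(sets, preds, cov_i, universe)) or set(sets)
            # order the chosen subcollection so precedences inside it are respected
            order = TopologicalSorter({s: preds[s] & A_i for s in A_i}).static_order()
            schedule.extend(order)
            universe -= set().union(*(cov_i[s] for s in A_i))
            sets -= A_i
            preds = {s: preds[s] - A_i for s in sets}
        return schedule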

Lemma 51. Algorithm mssc-greedy runs in polynomial time.

Approximation Guarantee

Let Ai be the precedence-closed (sub)collection of sets found at iteration i of step 4 in mssc-greedy, and let Xi denote the set of elements first covered by Ai. Let Ri be the set of elements not covered prior to iteration i in the greedy solution. Note that the rank of Ai, r(Ai), is equal to |Ai|/|Xi|.

Theorem 52. Given an α-approximation to the precedence-closed min-rank subgraph problem, mssc-greedy gives a 4α-approximation to the precedence-constrained, weighted min-sum set cover problem.

Proof. Let OPT denote the optimal value of the min-sum set cover instance, subject to precedence constraints. Let SOLg denote the value of the solution returned by mssc-greedy.

Just before Xi was covered by Ai, there were |Ri| items remaining to be covered.

At most it takes |Ai| time to cover all of the |Xi| items covered by Ai, adding |Ai| to the processing time of all |Ri| items that had yet to be covered. So we can bound the solution to mssc-greedy by SOLg ≤ ∑i |Ri||Ai|.

Figure 5.4(a) represents the worst-case value of SOLg as the area under the histogram of cover-times for items. At each step i of mssc-greedy, all of the elements in Ri have at most |Ai| added to their possible completion time. In Figure 5.4(b) we can see the equivalent diagram for OPT. For each set Ai in SOLg, we map the row of area |Ai||Ri| to a column of area |Ai||Ri|/(4α), positioned between |Ri|/2 and |Ri+1|/2 units from the end of the plot for the schedule of OPT. This maps the area bounding SOLg to a space exactly 1/(4α) its initial size. The width of this column is |Xi|/2, so to achieve the given area the height must be hi = |Ai||Ri|/(2α|Xi|).

Figure 5.4: The sum of completion times for mssc-greedy is at most 4α times OPT.

If the area representing 1/(4α) of SOLg is less than the area of OPT, then we have a 4α-approximation. Let hi be the height of the i-th column in this re-sized version of SOLg. Let r = |Ri|/2 be the distance of the left side of this column from the rightmost side of the diagram for OPT.

We now show that at time ⌊hi⌋, OPT has covered at most |Ri|/2 elements (leaving at least |Ri|/2 elements still uncovered), so that the histogram for OPT at height h cannot be far enough right to undercut the scaled SOLg.

At time ⌊h⌋, mssc-greedy had just scheduled Ai−1, and the set of items yet to be covered was Ri. The precedence-closed set that mssc-greedy scheduled at time ⌊h⌋, Ai, has no more than α times the minimum rank among all available precedence-closed subsets. The rank of Ai at this point is |Ai|/|cov(Ai) ∩ Ri|, as only the elements in Ri are still available to be covered at this point. Let OPT≤t denote the segment of OPT up to time t. The segment of the optimal schedule up to time ⌊h⌋ has rank ⌊h⌋/|cov(OPT≤⌊h⌋) ∩ Ri| on the set Ri, which we know to be at least 1/α times the rank of Ai, as otherwise mssc-greedy would not have chosen Ai to be scheduled at step i.

So r(OPT≤⌊h⌋) = ⌊h⌋/|cov(OPT≤⌊h⌋) ∩ Ri| ≥ |Ai|/(α|Xi|). This gives us

α|Xi|⌊h⌋/|Ai| ≥ |cov(OPT≤⌊h⌋) ∩ Ri| .   (5.13)

At height ⌊h⌋, the column for SOLg is |Ri|/2 from the rightmost side of the diagram for OPT. However, since α⌊h⌋ ≤ |Ai||Ri|/(2|Xi|) we have |Ri|/2 = (|Xi|/|Ai|) · (|Ai||Ri|/(2|Xi|)) ≥ α|Xi|⌊h⌋/|Ai| ≥ |cov(OPT≤⌊h⌋) ∩ Ri| (by inequality 5.13), which means that OPT≤⌊h⌋ cannot cover more than |Ri|/2 elements of Ri.

This means that at time ⌊h⌋, there are at least |Ri|/2 elements that OPT has yet to cover, so at time ⌊h⌋ the curve for OPT is at least |Ri|/2 steps from the right. Therefore the scaled area representing SOLg fits under the curve for OPT, so SOLg is no more than 4α times OPT, meaning that mssc-greedy is a 4α-approximation.

5.4 Hardness Results

Our algorithm for precMSSC gives a √max(n, 2m)-approximation – we will now show that it is hard to obtain approximation ratios better than polynomial. In particular, we will prove that if we could obtain better approximations for precMSSC, then we would have a counter-argument to the so-called Planted Dense Subgraph Conjecture [44].

The planted dense subgraph problem involves distinguishing a plain Erdős–Rényi graph G(N, p) from one that has a planted dense component. In particular, an instance PDS(N, k, α, β) of the planted dense subgraph problem seeks an algorithm that, with high probability, is able to distinguish between a (so-called) unplanted random graph G(N, N^{α−1}), and a similar graph G(N, N^{α−1}) in which there is embedded a planted dense component G(k, k^{β−1}) [44]. The parameter N defines the order of the graph, k the order of the dense component, whereas α and β prescribe the log-densities of the two random graph components.
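Purely to make the two distributions concrete, here is a small Python sampler for PDS(N, k, α, β) as defined above; the function and parameter names are ours and the sketch is not part of any algorithm in this chapter.

    import random

    def sample_pds_instance(N, k, alpha, beta, planted, seed=0):
        # Unplanted case: G(N, N^(alpha-1)); the planted case additionally embeds
        # a denser G(k, k^(beta-1)) on a random subset of k vertices.
        rng = random.Random(seed)
        p = N ** (alpha - 1)
        edges = {(u, v) for u in range(N) for v in range(u + 1, N)
                 if rng.random() < p}
        if planted:
            q = k ** (beta - 1)
            component = rng.sample(range(N), k)
            for a in range(k):
                for b in range(a + 1, k):
                    u, v = sorted((component[a], component[b]))
                    if rng.random() < q:
                        edges.add((u, v))
        return edges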

Conjecture 53 (Planted Dense Subgraph Conjecture [44]). For every ε > 0, k ≤ √N, and β < α, no probabilistic polynomial-time algorithm can, with advantage greater than ε, solve PDS(N, k, α, β).

We will show how identifying a sequence of jobs with a high rate of coverage corresponds to finding a subset of the vertex set with a high density, so that solving precMSSC amounts to distinguishing between random graphs with or without planted dense components.

Theorem 54. Assuming the planted dense subgraph conjecture, it is not possible to approximate precMSSC to within O(m^{1/20}) or O(n^{1/16}) of optimality.

Proof. We reduce from a PDSC instance to a precMSSC instance. Assume we have N vertices in the PDSC instance, and let k = N^{1/2}. We have a graph G(N, 1/N^{1/2}), and we aim to determine whether there is a planted dense subgraph G(N^{1/2}, 1/N^{1/(4+γ)}). Assume that vertices cover no elements; we define the coverage of each edge based on the incident vertices as follows. For each vertex v in the graph, we have a set U[v] of elements associated with v (this is distinct from the coverage of v, since cov(v) = ∅). Each element in the universe U is associated with a vertex v independently with probability p – so the expected number of elements associated with each vertex is p|U|, and the expected number of vertices associated with an element is pN. That is, E(|U[v]|) = p|U|, and

E(Ni) = pN, where Ni is the number of vertex sets that include element i. Assume that there are N elements in U, so E(|U[v]|) = pN. For a pair of vertices (u, v), we define the set cov((u, v)) to be U[v] ∩ U[u], the intersection of the sets associated with the vertices; note that E(|cov((u, v))|) = p²|U|. In the precMSSC instance, vertices have no precedences; for each edge (u, v) we have a precedence from both incident vertices u and v. The expected number of vertices that a given element is associated with is Np, and therefore the expected number of vertex pairs that cover the element, given a complete graph, would be ≈ N²p²/2. We now elaborate on the cases with and without a planted dense component.
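The construction just described translates directly into code; the following is a minimal sketch (the names are ours, and the ℓ vertex copies introduced later in the proof are omitted). The resulting (cov, preds) pair is in the same representation assumed by the mssc_greedy sketch above.

    import random

    def build_precmssc_instance(edges, N, p, seed=0):
        # Associate each of the N elements with each vertex independently with
        # probability p; vertices cover nothing, an edge (u, v) covers U[u] & U[v],
        # and both incident vertices must precede the edge.
        rng = random.Random(seed)
        assoc = {v: {e for e in range(N) if rng.random() < p} for v in range(N)}
        cov, preds = {}, {}
        for v in range(N):
            cov[('vertex', v)] = set()
            preds[('vertex', v)] = set()
        for (u, v) in edges:
            e_id = ('edge', u, v)
            cov[e_id] = assoc[u] & assoc[v]
            preds[e_id] = {('vertex', u), ('vertex', v)}
        return cov, preds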

Planted Component

In the planted component, we have N^{1/2} vertices, with the probability of an edge being 1/N^{1/(4+γ)}. Hence, for each item the expected number of edges that cover that item is ≈ (N^{1/2}p)²/N^{1/(4+γ)} = N^{(3+γ′)/4}p², for some γ′ – we want this close to 1, in which case we need p ≈ N^{(−3+γ′)/8}. This allows us to cover all elements (in expectation) using N^{1/2} vertices and N^{(3+γ′)/4} edges, governing the MSSC cost. We will make this more precise below.

No Planted Component

In the non-dense graph, the expected number of edges that cover an element is (Np)²/N^{1/2} = N^{3/2}p². Considering the first N^{9/16} vertices in some ordering, the expected number of edges induced is N^{9/8}/N^{1/2} = N^{5/8}. The probability that an edge covers an element is p² = N^{(−3+γ′)/4}, so the expected number of elements covered by each edge is |U| · N^{(−3+γ′)/4} – since |U| = N, the expected number of elements in each edge-set is about N^{(1+γ′)/4}.

Suppose that the coverage sets for each edge are pairwise disjoint, which would give the greatest total coverage. In this case, if we only have N^{5/8} edges, the number of elements covered would be at most N^{(7+γ′)/8} in expectation. This means that after N^{9/16} vertices are selected, there are still N − N^{(7+γ′)/8} elements remaining to be covered. The MSSC cost of this solution is no less than N^{9/16}(N − N^{(7+γ′)/8}) = N^{25/16}(1 − N^{(−1+γ′)/8}).

Proof Details

Now that we have a framework for using precMSSC to find dense planted components, we will show that we can achieve this with high probability. Let p = c · N^{(−3+γ′)/8} log N for some constant c, and assume there are ℓ copies of each vertex (with the initial precedence constraints duplicated across the copies, so each edge depends on all copies of the incident vertices). We require multiple copies of the vertices to ensure that the MSSC cost is dominated by the vertices in the solution – for each element, the cost will primarily be based on the number of vertices before it in the ordering. Let the combined copies of a given vertex only cover one item – all of these copies must be before the incident edges for an ordering to be valid.

Assume that the ℓ copies of the N^{1/2} vertices in the dense component are first in the ordering; the number of the (initial) vertices represented in this part of the ordering that are associated with an element i (denote this Ni) is pN^{1/2} = c · N^{(1+γ′)/8} log N. In the dense component, we expect to be able to cover all elements with N^{1/2} vertices (though all ℓ copies of each vertex are required). The maximum cost per element is ℓ · N^{1/2} + N^{(3+γ′)/4}, where N^{(3+γ′)/4} is the number of edges used. Since we assumed N elements, and all are covered by edges in the dense component, this gives our upper bound on the min-sum cost as ℓ · N^{3/2} + N^{(7+γ′)/4}.

We now prove that, in fact, all elements are covered with high probability. We can use a Chernoff bound to find the probability that more than four times the expected number μ of vertex sets are associated with i:

P[Ni ≥ 4μ] ≤ e^{−3μ/3} = e^{−c·N^{(1+γ′)/8} log N} = N^{−c·N^{(1+γ′)/8}} .

The probability that less than half the expected number of vertex sets are associated with i is

P[Ni ≤ μ/2] ≤ e^{−μ/8} = e^{−(c/8)·N^{(1+γ′)/8} log N} = N^{−(c/8)·N^{(1+γ′)/8}} .
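For reference, these two displays are instances of the standard multiplicative Chernoff bounds; stating them explicitly (our restatement, with δ = 3 for the upper tail and δ = 1/2 for the lower tail):

    % Upper tail, valid for delta >= 1:
    \Pr[X \ge (1+\delta)\mu] \le e^{-\delta\mu/3}
      \quad\Longrightarrow\quad \Pr[N_i \ge 4\mu] \le e^{-\mu},
    % Lower tail, valid for 0 < delta < 1:
    \Pr[X \le (1-\delta)\mu] \le e^{-\delta^2\mu/2}
      \quad\Longrightarrow\quad \Pr[N_i \le \mu/2] \le e^{-\mu/8}.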

Both the bounds are very small, especially for c > 8. An element is covered by a vertex pair iff it is associated with both the incident vertex sets. Therefore the number of vertex pairs whose edge would cover element i is highly concentrated between μ²/4 and 16μ². Since μ = c · N^{(1+γ′)/8} log N, we have μ² = c²N^{(1+γ′)/4} log² N. The probability of each edge is N^{−1/(4+γ)}, so the expected number of actual edge sets in the dense component containing i is N^{−1/(4+γ)}μ² = c²N^{(1+γ′)/4−1/(4+γ)} log² N. Since (1+γ′)/4 − 1/(4+γ) = (4γ′+γ+γγ′)/(16+4γ) > 0, we have N^{(1+γ′)/4−1/(4+γ)} > 1. Letting γ″ = (4γ′+γ+γγ′)/(16+4γ), the probability that the number of edges covering i in the dense component is less than (c² log² N)/4 is no more than N^{−c²N^{γ″} log N/8}. For c² ≥ 64, the probability that a given element i remains uncovered is very small – at most N^{−8}. Let e(i) denote the number of edges in the dense component covering element i; taking the union bound over all N elements, the probability that some element is uncovered is ∑_{i∈U} P[e(i) ≤ 1] ≈ N · N^{−8} = N^{−7}, which is still very small. So we expect to cover all elements with the N^{3/4} edges in the dense component, w.h.p.

Unplanted

In the unplanted setting, consider the first N^{9/16} vertices scheduled. The expected number of edges is N^{5/8}, and the expected coverage of each vertex pair (u, v) is Np² = c²N^{(1+γ′)/4} log² N. The probability that a vertex pair covers more than four times this many elements is

P[|cov((u, v))| ≥ 4μ] ≤ e^{−3μ/3} = e^{−c²·N^{(1+γ′)/4} log² N} = N^{−c²·N^{(1+γ′)/4} log N} ,

which is tiny. Each of the N^{5/8} edges covers no more than 4c² · N^{(1+γ′)/4} log² N elements w.h.p., so the total number of covered elements is no more than 4c² · N^{(7+γ′)/8} log² N. This gives a lower bound on the precMSSC cost as

(ℓN^{9/16} + N^{5/8})(N − 4c²N^{(7+γ′)/8} log² N)

≈ (ℓN^{25/16} + N^{13/8})(1 − 4c² log² N/N^{1/8}) ,

as for each uncovered element, we incur a cost equivalent to the number of edges and the ℓ vertex copies that are already in the ordering. In the planted component, we can cover all elements with N^{1/2} vertices and N^{3/(4+γ′)} edges w.h.p., so the precMSSC solution is upper bounded by N(ℓ · N^{1/2} + N^{3/(4+γ′)}) ≈ ℓN^{3/2} + N^{7/4}.

Letting ℓ = N^{1/4}, we have a cost of about ℓN^{25/16} + N^{26/16} = N^{29/16} + N^{26/16} for the unplanted graph, and about ℓN^{24/16} + N^{28/16} = 2N^{28/16} for the dense component. If we can distinguish between solutions that differ by an Ω(N^{1/16}) multiplicative factor, then we have contradicted the planted dense subgraph conjecture.

Since n, the number of elements in U in the precMSSC instance, is equal to N, the number of vertices in the PDSC instance, this implies that we can achieve an approximation ratio no better than n^{1/16}/2 for precMSSC without providing a counter-argument to the planted dense subgraph conjecture. In terms of m, the number of sets, we have m = ℓN + E, where ℓN is the number of sets arising from vertices in the PDSC instance, and E is the number of edges. This gives us N ∈ O(m^{4/5}), implying that we cannot distinguish between precMSSC instances differing by O(m^{1/20}).

5.5 Conclusions

Motivated by applications in test-case prioritisation, we have considered the problem of min-sum set cover with the addition of precedence constraints (precMSSC). This problem blends precedence-constrained machine scheduling with the min-sum objective, allowing us to model scheduling where jobs have a coverage aspect. This problem relates to a broad range of optimisation models, in particular scheduling with AND/OR-precedences, a problem which is Label Cover-hard for the objectives of minimising makespan and average completion time.

We prove that, given a max-density precedence-closed subgraph (maxDPCS), we can find a bounded solution to precMSSC. For the case where precedences are an in-tree, we have an exact algorithm for maxDPCS, and therefore a 4-approximation to precMSSC. If the precedences are an out-tree, our approximation ratio is proportional to the height of the tree – which can be small for many applications. In general, we have an O(√n)-approximation algorithm for maxDPCS – and therefore also for the related precMSSC instance – along with proof that we can achieve no better than polynomial approximations for precMSSC without contradicting the planted dense subgraph conjecture.

Future Work

Assuming the veracity of the planted dense subgraph conjecture, there is limited room for improvement on the approximation ratio for precMSSC. However, there are special cases for which it would be of interest to explore possible improved solutions – for example, classes of (pseudo) series-parallel graphs with either the source or sink node removed. There is potential for the proof for in-trees to be expanded to cover such graphs, as it only relies on the independence of nodes in subtrees – which could be achieved by contracting series-parallel sections. It would also be of interest to explore the effect of precedence constraints on other variants of the min-sum set cover problem – such as with weighted sets, or restrictions on the elements in the sets. While we have focused on special cases related to the precedence constraints, there are many MSSC special cases that could be relevant in the precMSSC context.

6 Summary

Throughout this thesis, we have considered a seemingly wide range of combinatorial optimisation problems. Broadly defined, the types of problems we have considered are NP-complete graph problems with three types of objective – clustering, covering and ordering. But while the objectives of these problems seem varied, we found that in many ways they are related – most of the problems we consider relate two or more of these classifications.

Because these problems are fundamentally difficult, we have looked at a variety of approaches for efficiently obtaining solutions. In particular, we have considered two strategies to achieve this – approximation and parallelisation. We introduced approximation algorithms and heuristics for each of the problems considered, and in several cases demonstrated how these algorithms might be implemented in a parallel framework. Furthermore, we have looked at approaches for efficiently ordering test-suites to increase the effectiveness of software testing.

For the k-centre problem, we provided a simple 4-approximation algorithm that typically runs in only two MapReduce rounds. We also demonstrated experimentally that it returns solutions that compare well to those of a sequential 2-approximation algorithm. This approach is compared to an alternative 10-approximation sampling-based MapReduce procedure. This has comparable effectiveness, but has significantly larger runtimes.

To further improve runtimes without significantly impacting the solutions, we parameterised the sampling procedure and analysed the impact of this trade-off. We proved that we can bound the probability of bad solutions for a range of values of the tuning parameter. Indeed, the algorithm still returns good solutions even for values of the parameter below the provable bounds, while running faster.

We demonstrated that the approximation factor of four for our k-centre algorithm is tight – there are graphs on which, with a malicious assignment to machines and choice of seedings for the farthest-point subprocedure, our algorithm MRG gives a 4-approximation. It is unclear how likely such cases are in practice. It would be of interest to find bounds on the probability that our algorithm gives a poor approximation.

Further analysing parallel graph algorithms, we introduced strategies for solving a range of novel graph covering problems with MapReduce – such as k-star and k-tree cover. These problems relate to a wide range of combinatorial models, but are primarily of interest for modelling network routing problems.

Branching out from these graph covering models, we considered a variant on a hypergraph vertex cover problem with a min-sum objective and precedences. Motivated by problems in test-case prioritisation, we analysed this problem under a scheduling model and introduced a series of heuristics for finding good test orderings. We considered instances where there exist hard precedence constraints between test suites. Experimentally, we tested our algorithms against two different objective functions with both complete data (defect-coverage) and approximate data (code-coverage). Our results showed that with accurate coverage data, our algorithms can perform exceedingly well, but this diminishes when the prior data does not accurately match the evaluation data. However, even when relying on code-coverage data to approximate defect locations we still out-performed both random and algorithmic baselines. On top of our two algorithms, we considered a range of additional measures for improving solution quality. Based on our experimental results, we found that keeping an updated coverage matrix is the most effective approach for ensuring good solutions. Our analysis of lookahead measures found that, when relying on code-coverage data as an estimator, more complex approaches are prone to over-fitting to the input data, increasing the disparity with the ground truth. For accurate prior information these additional measures can greatly improve the rate at which defects are identified, suggesting that these measures are of greater use in testing applications where coverage information has been established by previous rounds of testing.

A more formalised model of these test-case prioritisation metrics led us to consider the addition of precedence constraints to the min-sum set cover problem. We design a √max(n, 2m)-approximation algorithm for this problem, where m is the number of sets, and n the size of the universe covered. We further prove that it is hard to design algorithms with better-than-polynomial approximations.

Bibliography

[1] Adolphson, D. (1977). Single machine job sequencing with precedence con- straints. SIAM Journal on Computing, 6(1), 40–54.

[2] Ahmadian, S., Behsaz, B., Friggstad, Z., Jorati, A., Salavatipour, M. R., & Swamy, C. (2014). Approximation algorithms for minimum-load k-facility location. In Leibniz International Proceedings in Informatics, volume 28: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

[3] Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2), 245–248.

[4] Ambühl, C. & Mastrolilli, M. (2009). Single machine precedence constrained scheduling is a vertex cover problem. Algorithmica, 53(4), 488.

[5] Ambühl, C., Mastrolilli, M., & Svensson, O. (2007). Inapproximability results for sparsest cut, optimal linear arrangement, and precedence constrained scheduling. In 48th Annual IEEE Symposium on Foundations of Computer Science (pp. 329–337).: IEEE.

[6] Applegate, D., Cook, W., Dash, S., & Rohe, A. (2002). Solution of a min-max vehicle routing problem. INFORMS Journal on Computing, 14(2), 132–143.

[7] Archer, A., Levin, A., & Williamson, D. P. (2008). A faster, better approximation algorithm for the minimum latency problem. SIAM Journal on Computing, 37(5), 1472–1498.

[8] Arkin, E. M., Hassin, R., & Levin, A. (2006). Approximations for minimum and min-max vehicle routing problems. Journal of Algorithms, 59(1), 1–18.

[9] Arora, S., Babai, L., Stern, J., & Sweedyk, Z. (1993). The hardness of approxim- ate optima in lattices, codes, and systems of linear equations. In 34th Annual IEEE Symposium on Foundations of Computer Science (pp. 724–733).: IEEE.


[10] Arthur, D. & Vassilvitskii, S. (2006). How slow is the k-means method? In Proceedings of the 22nd Annual Symposium on Computational Geometry (pp. 144– 153).: ACM.

[11] Arthur, D. & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the 18th annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1027–1035).: SIAM.

[12] Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., & Pandit, V. (2004). Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3), 544–562.

[13] Ausiello, G., Crescenzi, P., Gambosi, G., Kann, V., Marchetti-Spaccamela, A., & Protasi, M. (2012). Complexity and approximation: Combinatorial optimization problems and their approximability properties. Springer Science & Business Media.

[14] Azar, Y. & Gamzu, I. (2011). Ranking with submodular valuations. In Proceedings of the 22nd annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1070–1079).: SIAM.

[15] Azar, Y., Gamzu, I., & Yin, X. (2009). Multiple intents re-ranking. In Proceedings of the 41st annual ACM Symposium on Theory of Computing (pp. 669–678).: ACM.

[16] Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the 21st ACM SIGMOD-SIGACT- SIGART Symposium on Principles of Database Systems (pp. 1–16).: ACM.

[17] Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622–633.

[18] Bansal, N., Blum, A., & Chawla, S. (2004). Correlation clustering. Machine Learning, 56(1-3), 89–113.

[19] Bansal, N., Gupta, A., & Krishnaswamy, R. (2010). A constant factor approximation algorithm for generalized min-sum set cover. In Proceedings of the 21st annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1539–1545).: SIAM.

[20] Bansal, N. & Khot, S. (2010). Inapproximability of hypergraph vertex cover and applications to scheduling problems. In International Colloquium on Automata, Languages, and Programming (pp. 250–261).: Springer.

[21] Bansal, N. & Pruhs, K. (2010). The geometry of scheduling. In 51st Annual IEEE Symposium on Foundations of Computer Science (pp. 407–414).: IEEE.

[22] Bar-Noy, A., Bellare, M., Halldórsson, M. M., Shachnai, H., & Tamir, T. (1998). On chromatic sums and distributed resource allocation. Information and Computation, 140(2), 183–202.

[23] Bar-Noy, A., Halldórsson, M. M., & Kortsarz, G. (1999). A matched approximation bound for the sum of a greedy coloring. Information Processing Letters, 71(3-4), 135–140.

[24] Berend, D., Brafman, R., Cohen, S., Shimony, S. E., & Zucker, S. (2014). Optimal ordering of independent tests with precedence constraints. Discrete Applied Mathematics, 162, 115–127.

[25] Berenholz, U., Feige, U., & Peleg, D. (2006). Improved approximation for min-sum vertex cover. Technical report, Technical Report MCS06-07, Computer Science and Applied Mathematics, Weizmann Institute of Science.

[26] Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping Multidimensional Data (pp. 25–71). Springer.

[27] Bern, M. & Plassmann, P. (1989). The Steiner problem with edge lengths 1 and 2. Information Processing Letters, 32(4), 171–176.

[28] Bhaskara, A., Charikar, M., Chlamtac, E., Feige, U., & Vijayaraghavan, A. (2010). Detecting high log-densities: an O(n1/4) approximation for densest k-subgraph. In Proceedings of the 42nd annual ACM Symposium on Theory of Computing (pp. 201–210).: ACM.

[29] Blelloch, G. E., Peng, R., & Tangwongsan, K. (2011). Linear-work greedy parallel approximate set cover and variants. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (pp. 23–32).: ACM.

[30] Blelloch, G. E. & Tangwongsan, K. (2010). Parallel approximation algorithms for facility-location problems. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures (pp. 315–324).: ACM.

[31] Bonchi, F., Gionis, A., & Ukkonen, A. (2011). Overlapping correlation cluster- ing. In 11th IEEE International Conference on Data Mining (pp. 51–60).: IEEE.

[32] Burge, J., Munagala, K., & Srivastava, U. (2005). Ordering pipelined query operators with precedence constraints. Technical report, Stanford.

[33] Byrka, J. & Aardal, K. (2010). An optimal bifactor approximation algorithm for the metric uncapacitated facility location problem. SIAM Journal on Computing, 39(6), 2212–2231.

[34] Byrka, J., Grandoni, F., Rothvoss, T., & Sanità, L. (2013). Steiner tree approximation via iterative randomized rounding. Journal of the ACM, 60(1), 6.

[35] Byrka, J., Pensyl, T., Rybicki, B., Srinivasan, A., & Trinh, K. (2015). An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of the 26th annual ACM-SIAM Symposium on Discrete Algorithms (pp. 737–756).: SIAM.

[36] Ceccarello, M., Pietracaprina, A., Pucci, G., & Upfal, E. (2015). Space and time efficient parallel graph decomposition, clustering, and diameter approxima- tion. Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, (pp. 182–191).

[37] Chakaravarthy, V. T., Pandit, V., Roy, S., Awasthi, P., & Mohania, M. (2007). De- cision trees for entity identification: approximation algorithms and hardness results. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 53–62).: ACM.

[38] Chakaravarthy, V. T., Pandit, V., Roy, S., & Sabharwal, Y. (2009). Approximating decision trees with multiway branches. In International Colloquium on Automata, Languages, and Programming (pp. 210–221).: Springer.

[39] Chakrabarty, D. & Swamy, C. (2016). Facility location with client latencies: LP- based techniques for minimum-latency problems. Mathematics of Operations Research, 41(3), 865–883.

[40] Charikar, M., Chekuri, C., Feder, T., & Motwani, R. (1997). Incremental clustering and dynamic information retrieval. In Proceedings of the 29th annual ACM Symposium on Theory of Computing (pp. 626–635).: ACM.

[41] Charikar, M. & Guha, S. (2005). Improved combinatorial algorithms for facility location problems. SIAM Journal on Computing, 34(4), 803–824.

[42] Charikar, M., Guha, S., Tardos, É., & Shmoys, D. B. (1999). A constant-factor approximation algorithm for the k-median problem. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (pp. 1–10).: ACM.

[43] Charikar, M., Guruswami, V., & Wirth, A. (2005). Clustering with qualitative information. Journal of Computer and System Sciences, 71(3), 360–383.

[44] Charikar, M., Naamad, Y., & Wirth, A. (2016). On approximating target set selection. In Leibniz International Proceedings in Informatics, volume 60: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

[45] Chekuri, C. & Kumar, A. (2004). Maximum coverage problem with group budget constraints and applications. In 7th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (pp. 72–83). Springer.

[46] Chekuri, C. & Motwani, R. (1999). Precedence constrained scheduling to minimize sum of weighted completion times on a single machine. Discrete Applied Mathematics, 98(1), 29–38.

[47] Chekuri, C., Motwani, R., Natarajan, B., & Stein, C. (2001). Approximation techniques for average completion time scheduling. SIAM Journal on Comput- ing, 31(1), 146–166.

[48] Cheung, M., Mestre, J., Shmoys, D. B., & Verschae, J. (2016). A primal-dual approximation algorithm for min-sum single-machine scheduling problems. arXiv preprint arXiv:1612.03339.

[49] Cheung, M. & Shmoys, D. B. (2011). A primal-dual approximation algorithm for min-sum single-machine scheduling problems. In 14th International Work- shop on Approximation Algorithms for Combinatorial Optimization Problems (pp. 135–146). Springer.

[50] Chierichetti, F., Kumar, R., & Tomkins, A. (2010). Max-cover in map-reduce. In Proceedings of the 19th International Conference on World Wide Web (pp. 231–240).: ACM.

[51] Chlebík, M. & Chlebíková, J. (2008). The Steiner tree problem on graphs: Inapproximability results. Theoretical Computer Science, 406(3), 207–214.

[52] Chudak, F. A. & Hochbaum, D. S. (1999). A half-integral linear programming relaxation for scheduling precedence-constrained jobs on a single machine. Operations Research Letters, 25(5), 199–204.

[53] Chudak, F. A. & Shmoys, D. B. (2003). Improved approximation algorithms for the uncapacitated facility location problem. SIAM Journal on Computing, 33(1), 1–25.

[54] Chvátal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3), 233–235.

[55] Cohen, E., Fiat, A., & Kaplan, H. (2003). Efficient sequences of trials. In Proceedings of the 14th annual ACM-SIAM Symposium on Discrete Algorithms, volume 12 (pp. 737–746).: SIAM.

[56] Cook, S. A. (1985). A taxonomy of problems with fast parallel algorithms. Information and control, 64(1-3), 2–22.

[57] Coolen, K., Leus, R., & Nobibon, F. T. (2013). Complexity analysis of the discrete sequential search problem with group activities. In IEEE International Conference on Industrial Engineering and Engineering Management (pp. 738–742).: IEEE.

[58] Correa, J. R. & Schulz, A. S. (2005). Single-machine scheduling with precedence constraints. Mathematics of Operations Research, 30(4), 1005–1021.

[59] Crescenzi, P. & Trevisan, L. (2000). On approximation scheme preserving reducibility and its applications. Theory of Computing Systems, 33(1), 1–16.

[60] Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., & Von Eicken, T. (1993). LogP: Towards a realistic model of parallel computation. In ACM Sigplan Notices, volume 28 (pp. 1–12).: ACM.

[61] Dai, H., Wu, X., Chen, G., Xu, L., & Lin, S. (2014). Minimizing the number of mobile chargers for large-scale wireless rechargeable sensor networks. Com- puter Communications, 46, 54–65.

[62] Das, A. S., Datar, M., Garg, A., & Rajaram, S. (2007). Google news per- sonalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web (pp. 271–280).: ACM.

[63] Dean, J. & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

[64] Desrosiers, J., Soumis, F., & Desrochers, M. (1984). Routing with time windows by column generation. Networks, 14(4), 545–565.

[65] Dhillon, I. S. & Modha, D. S. (2002). A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining (pp. 245–260). Springer.

[66] Dinur, I. & Safra, S. (2005). On the hardness of approximating minimum vertex cover. Annals of Mathematics, (pp. 439–485).

[67] Do, H., Elbaum, S., & Rothermel, G. (2005). Supporting controlled experi- mentation with testing techniques: An infrastructure and its potential impact. Empirical Software Engineering: An International Journal, 10(4), 405–435.

[68] Drineas, P., Frieze, A., Kannan, R., Vempala, S., & Vinay, V. (2004). Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3), 9–33.

[69] Dyer, M. E. & Frieze, A. M. (1985). A simple heuristic for the p-centre problem. Operations Research Letters, 3(6), 285–288.

[70] Edmonds, J. (1965). Paths, trees, and flowers. Canadian Journal of Mathematics, 17(3), 449–467.

[71] Elbaum, S., Malishevsky, A., & Rothermel, G. (2002). Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2), 159–182.

[72] Ene, A., Im, S., & Moseley, B. (2011). Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 681–689).

[73] Erlebach, T., Kääb, V., & Möhring, R. H. (2003). Scheduling AND/OR-networks on identical parallel machines. In International Workshop on Approximation and Online Algorithms (pp. 123–136).: Springer.

[74] Even, G., Garg, N., Könemann, J., Ravi, R., & Sinha, A. (2003). Covering graphs using trees and stars. In 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, volume 2764: Springer.

[75] Fakcharoenphol, J., Harrelson, C., & Rao, S. (2003). The k-traveling repairman problem. In Proceedings of the 14th annual ACM-SIAM Symposium on Discrete Algorithms (pp. 655–664).: SIAM.

[76] Feige, U. (1998). A threshold of ln n for approximating set cover. Journal of the ACM, 45(4), 634–652.

[77] Feige, U., Lovász, L., & Tetali, P. (2004). Approximating min sum set cover. Algorithmica, 40(4), 219–234.

[78] Flajolet, P. & Odlyzko, A. (1982). The average height of binary trees and other simple trees. Journal of Computer and System Sciences, 25(2), 171–213.

[79] Fokkink, R., Lidbetter, T., & Végh, L. A. (2016). Submodular search is scheduling. arXiv preprint arXiv:1607.07598.

[80] Fujii, M., Kasami, T., & Ninomiya, K. (1969). Optimal sequencing of two equivalent processors. SIAM Journal on Applied Mathematics, 17(4), 784–789.

[81] Gabow, H. N. (1982). An almost-linear algorithm for two-processor scheduling. Journal of the ACM, 29(3), 766–780.

[82] Gairing, M., Monien, B., & Woclaw, A. (2007). A faster combinatorial ap- proximation algorithm for scheduling unrelated parallel machines. Theoretical Computer Science, 380(1), 87–99.

[83] Gallo, G., Grigoriadis, M. D., & Tarjan, R. E. (1989). A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1), 30–55.

[84] Gary, M. R. & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness.

[85] Gavril, F. (1977). Testing for equality between maximum matching and minimum node covering. Information processing letters, 6(6), 199–202.

[86] Ghasemzadeh, H. & Jafari, R. (2011). Physical movement monitoring using body sensor networks: A phonological approach to construct spatial decision trees. IEEE Transactions on Industrial Informatics, 7(1), 66–77.

[87] Giotis, I. & Guruswami, V. (2006). Correlation clustering with a fixed number of clusters. In Proceedings of the 17th annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1167–1176).: SIAM.

[88] Goemans, M. & Kleinberg, J. (1998). An improved approximation ratio for the minimum latency problem. Mathematical Programming, 82(1), 111–124.

[89] Goldwasser, M. & Motwani, R. (1997). Intractability of assembly sequencing: Unit disks in the plane. Algorithms and Data Structures, (pp. 307–320).

[90] Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 293–306.

[91] Graham, R. L. (1966). Bounds for certain multiprocessing anomalies. Bell Labs Technical Journal, 45(9), 1563–1581.

[92] Graham, R. L. (1969). Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2), 416–429.

[93] Graham, R. L., Lawler, E. L., Lenstra, J. K., & Kan, A. R. (1979). Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of Discrete Mathematics, 5, 287–326.

[94] Guha, S. & Khuller, S. (1998). Greedy strikes back: Improved facility location algorithms. In Proceedings of the 9th annual ACM-SIAM Symposium on Discrete Algorithms (pp. 649–657).: SIAM.

[95] Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3), 515–528.

[96] Gupta, A., Nagarajan, V., & Ravi, R. (2010). Approximation algorithms for optimal decision trees and adaptive TSP problems. In International Colloquium on Automata, Languages, and Programming (pp. 690–701).: Springer.

[97] Gupta, A. & Tangwongsan, K. (2008). Simpler analyses of local search al- gorithms for facility location. arXiv preprint arXiv:0809.2554.

[98] Haidry, S. & Miller, T. (2013). Using dependency structures for prioritization of functional test suites. IEEE Transactions on Software Engineering, 39(2), 258–275.

[99] Hall, L. A., Schulz, A. S., Shmoys, D. B., & Wein, J. (1997). Scheduling to minimize average completion time: Off-line and on-line approximation algorithms. Mathematics of Operations Research, 22(3), 513–544.

[100] Hall, L. A., Shmoys, D. B., & Wein, J. (1996). Scheduling to minimize average completion time: Off-line and on-line algorithms. In Proceedings of the 7th annual ACM-SIAM Symposium on Discrete Algorithms, volume 96 (pp. 142–151).: SIAM.

[101] Hassin, R. & Levin, A. (2005). An approximation algorithm for the minimum latency set cover problem. In Proceedings of the 13th annual European Symposium on Algorithms (pp. 726–733).: Springer.

[102] Hochbaum, D. S. (1996). Approximation algorithms for NP-hard problems. PWS Publishing Co.

[103] Hochbaum, D. S. & Shmoys, D. B. (1985). A best possible heuristic for the k- center problem. Mathematics of Operations Research, 10(2), 180–184.

[104] Hochbaum, D. S. & Shmoys, D. B. (1988). A polynomial approximation scheme for scheduling on uniform processors: Using the dual approximation approach. SIAM journal on computing, 17(3), 539–551.

[105] Horn, W. (1972). Single-machine job sequencing with treelike precedence ordering and linear delay penalties. SIAM Journal on Applied Mathematics, 23(2), 189–202.

[106] Hsu, W.-L. & Nemhauser, G. L. (1979). Easy and hard bottleneck location problems. Discrete Applied Mathematics, 1(3), 209–215.

[107] Im, S. & Moseley, B. (2015). Brief announcement: Fast and better distributed MapReduce algorithms for k-center clustering. Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, (pp. 65–67).

[108] Im, S., Nagarajan, V., & van der Zwaan, R. (2012). Minimum latency submod- ular cover. In International Colloquium on Automata, Languages, and Programming (pp. 485–497).: Springer.

[109] Im, S., Sviridenko, M., & Van Der Zwaan, R. (2014). Preemptive and non- preemptive generalized min sum set cover. Mathematical Programming, 145(1-2), 377–401.

[110] Iwata, S., Fleischer, L., & Fujishige, S. (2001). A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM, 48(4), 761–777.

[111] Iwata, S., Tetali, P., & Tripathi, P. (2012). Approximating minimum linear ordering problems. In 15th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (pp. 206–217). Springer.

[112] Jain, K., Mahdian, M., Markakis, E., Saberi, A., & Vazirani, V. V. (2003). Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM, 50(6), 795–824.

[113] Jain, K., Mahdian, M., & Saberi, A. (2002). A new greedy approach for facility location problems. In Proceedings of the 34th annual ACM symposium on Theory of computing (pp. 731–740).: ACM.

[114] Jain, K. & Vazirani, V. V. (2001). Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48(2), 274–296.

[115] Kahn, A. B. (1962). Topological sorting of large networks. Communications of the ACM, 5(11), 558–562.

[116] Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2002). A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry (pp. 10–18).: ACM.

[117] Kaplan, H., Kushilevitz, E., & Mansour, Y. (2005). Learning with attribute costs. In Proceedings of the 37th annual ACM Symposium on Theory of Computing (pp. 356–365).: ACM.

[118] Karloff, H., Suri, S., & Vassilvitskii, S. (2010). A model of computation for MapReduce. In Proceedings of the 21st annual ACM-SIAM Symposium on Discrete Algorithms (pp. 938–948).: SIAM.

[119] Karp, R. M. (1972). Reducibility among combinatorial problems. In Complexity of computer computations (pp. 85–103). Springer.

[120] Katsavounidis, I., Jay Kuo, C.-C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1(10), 144–146.

[121] Kaufman, L. & Rousseeuw, P. (1987). Clustering by means of medoids. North- Holland.

[122] Khani, M. R. & Salavatipour, M. R. (2014). Improved approximation algorithms for the min-max tree cover and bounded tree cover problems. Algorithmica, 69(2), 443–460.

[123] Khot, S. & Regev, O. (2008). Vertex cover might be hard to approximate to within 2 − ε. Journal of Computer and System Sciences, 74(3), 335–349.

[124] Khuller, S., Vishkin, U., & Young, N. (1994). A primal-dual parallel approxima- tion technique applied to weighted set and vertex covers. Journal of Algorithms, 17(2), 280–289.

[125] Kim, D., Uma, R., Abay, B. H., Wu, W., Wang, W., & Tokuta, A. O. (2014). Minimum latency multiple data mule trajectory planning in wireless sensor networks. IEEE Transactions on Mobile Computing, 13(4), 838–851.

[126] Kolliopoulos, S. G. & Steiner, G. (2007). Partially ordered knapsack and applications to scheduling. Discrete Applied Mathematics, 155(8), 889–897.

[127] König, D. (1916). Über Graphen und ihre Anwendung auf Determinantentheorie und Mengenlehre. Mathematische Annalen, 77(4), 453–465.

[128] Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), 48–50.

[129] Lagoudakis, M. G., Markakis, E., Kempe, D., Keskinocak, P., Kleywegt, A. J., Koenig, S., Tovey, C. A., Meyerson, A., & Jain, S. (2005). Auction-based multi-robot routing. In Robotics: Science and Systems, volume 5 (pp. 343–350).

[130] Lattanzi, S., Moseley, B., Suri, S., & Vassilvitskii, S. (2011). Filtering: A method for solving graph problems in MapReduce. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (pp. 85–94).: ACM.

[131] Lawler, E. & Sivazlian, B. (1978). Minimization of time-varying costs in single-machine scheduling. Operations Research, 26(4), 563–569.

[132] Lawler, E. L. (1978). Sequencing jobs to minimize total weighted completion time subject to precedence constraints. Annals of Discrete Mathematics, 2, 75–90.

[133] Lenstra, J. K., Kan, A. R., & Brucker, P. (1977). Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1, 343–362.

[134] Lenstra, J. K., Shmoys, D. B., & Tardos, É. (1990). Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46(1-3), 259–271.

[135] Levi, R., Shmoys, D. B., & Swamy, C. (2012). LP-based approximation algorithms for capacitated facility location. Mathematical Programming, 131(1-2), 365–379.

[136] Li, C.-L., Simchi-Levi, D., & Desrochers, M. (1992). On the distance constrained vehicle routing problem. Operations Research, 40(4), 790–799.

[137] Li, S. (2011). A 1.488 approximation algorithm for the uncapacitated facility location problem. International Colloquium on Automata, Languages, and Programming, (pp. 77–88).

[138] Li, S. & Svensson, O. (2013). Approximating k-median via pseudo-approximation. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (pp. 901–910).: ACM.

[139] Li, Z., Harman, M., & Hierons, R. (2007). Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering, 33(4), 225–237.

[140] Lichman, M. (2013). UCI Machine Learning Repository.

[141] Lin, J.-H. & Vitter, J. S. (1992). ε-approximations with minimum packing constraint violation. In Proceedings of the twenty-fourth annual ACM Symposium on Theory of Computing (pp. 771–782).: ACM.

[142] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.

[143] Ma, A. & Sethi, I. K. (2007). Distributed K-median clustering with application to image clustering. In Proceedings of the 7th International Workshop on Pattern Recognition in Information Systems (pp. 215–220).

[144] MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

[145] Mahajan, M., Nimbhorkar, P., & Varadarajan, K. (2009). The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation (pp. 274–285). Springer.

[146] Mahdian, M., Ye, Y., & Zhang, J. (2006). Approximation algorithms for metric facility location problems. SIAM Journal on Computing, 36(2), 411–432.

[147] Malkomes, G., Kusner, M. J., Chen, W., Weinberger, K. Q., & Moseley, B. (2015). Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems (pp. 1063–1071).

[148] Marcus, D., Harwell, J., Olsen, T., Hodge, M., Glasser, M., Prior, F., Jenkinson, M., Laumann, T., Curtiss, S., & Van Essen, D. (2011). Informatics and data mining tools and strategies for the human connectome project. Frontiers in Neuroinformatics, 5, 4.

[149] Margot, F., Queyranne, M., & Wang, Y. (2003). Decompositions, network flows, and a precedence constrained single-machine scheduling problem. Operations Research, 51(6), 981–992.

[150] Mazdeh, M. M., Zaerpour, F., Zareei, A., & Hajinezhad, A. (2010). Parallel machines scheduling to minimize job tardiness and machine deteriorating cost with deteriorating jobs. Applied Mathematical Modelling, 34(6), 1498–1510.

[151] Mestre, J. & Verschae, J. (2014). A 4-approximation for scheduling on a single machine with general cost function. arXiv preprint arXiv:1403.0298.

[152] Micali, S. & Vazirani, V. V. (1980). An O(V^{1/2}E) algorithm for finding maximum matching in general graphs. In 21st Annual IEEE Symposium on Foundations of Computer Science (pp. 17–27).: IEEE.

[153] Miller, T. & Strooper, P. (2012). A case study in model-based testing of specifications and implementations. Software Testing, Verification, and Reliability, 22(1), 33–63.

[154] Munagala, K., Babu, S., Motwani, R., & Widom, J. (2005). The pipelined set cover problem. In International Conference on Database Theory (pp. 83–98).: Springer.

[155] Nagarajan, V. & Ravi, R. (2006). Minimum vehicle routing with a common deadline. In 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (pp. 212–223). Springer.

[156] Nagarajan, V. & Ravi, R. (2012). Approximation algorithms for distance constrained vehicle routing problems. Networks, 59(2), 209–214.

[157] Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1), 265–294.

[158] Papadimitriou, C. H. & Yannakakis, M. (1991). Optimization, approximation, and complexity classes. Journal of Computer and System Sciences, 43(3), 425–440.

[159] Pasqualetti, F., Franchi, A., & Bullo, F. (2010). On optimal cooperative patrolling. In 49th IEEE Conference on Decision and Control (pp. 7153–7158).: IEEE.

[160] Phillips, C., Stein, C., & Wein, J. (1998). Minimizing average completion time in the presence of release dates. Mathematical Programming, 82(1), 199–223.

[161] Pisaruk, N. (1992). The boundaries of submodular functions. Computational Mathematics and Mathematical Physics, 32(12), 1769–1783.

[162] Pisaruk, N. (2003). A fully combinatorial 2-approximation algorithm for precedence-constrained scheduling a single machine to minimize average weighted completion time. Discrete Applied Mathematics, 131(3), 655–663.

[163] Posner, M. E. (1990). Reducibility among single machine weighted completion time scheduling problems. Annals of Operations Research, 26(1), 90–101.

[164] Potts, C. N. (1980). An algorithm for the single machine sequencing problem with precedence constraints. Combinatorial Optimization II, (pp. 78–87).

[165] Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell Labs Technical Journal, 36(6), 1389–1401.

[166] Qu, B., Nie, C., & Xu, B. (2008). Test case prioritization for multiple processing queues. In International Symposium on Information Science and Engineering, volume 2 (pp. 646–649).: IEEE.

[167] Rothermel, G., Untch, R. H., Chu, C., & Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10), 929–948.

[168] Rothkopf, M. H. (1966). Scheduling independent tasks on parallel processors. Management Science, 12(5), 437–447.

[169] Russell, S. & Norvig, P. (2010). Artificial intelligence: a modern approach. Prentice Hall.

[170] Schulz, A. S. (1996). Scheduling to minimize total weighted completion time: Performance guarantees of LP-based heuristics and lower bounds. In International Conference on Integer Programming and Combinatorial Optimization (pp. 301–315).: Springer.

[171] Selim, S. Z. & Ismail, M. A. (1984). k-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1), 81–87.

[172] Shayman, M. A. & Fernandez-Gaucherand, E. (2001). Risk-sensitive decision- theoretic diagnosis. IEEE Transactions on Automatic Control, 46(7), 1166–1171.

[173] Shmoys, D. B. & Tardos, É. (1993). An approximation algorithm for the generalized assignment problem. Mathematical Programming, 62(1-3), 461–474.

[174] Sidney, J. B. (1975). Decomposition algorithms for single-machine sequencing with precedence relations and deferral costs. Operations Research, 23(2), 283–298.

[175] Silva, J. A., Faria, E. R., Barros, R. C., Hruschka, E. R., de Carvalho, A. C., & Gama, J. (2013). Data stream clustering: A survey. ACM Computing Surveys, 46(1), 13.

[176] Skutella, M. & Williamson, D. P. (2011). A note on the generalized min-sum set cover problem. Operations Research Letters, 39(6), 433–436.

[177] Slavík, P. (1996). A tight analysis of the greedy algorithm for set cover. In Proceedings of the twenty-eighth annual ACM Symposium on Theory of Computing (pp. 435–441).: ACM.

[178] Smith, W. E. (1956). Various optimizers for single-stage production. Naval Research Logistics Quarterly, 3(1-2), 59–66.

[179] Swamy, C. (2004). Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the 15th annual ACM-SIAM Symposium on Discrete Algorithms (pp. 526–527).: SIAM.

[180] Swamy, C. (2016). Improved approximation algorithms for matroid and knapsack median problems and applications. ACM Transactions on Algorithms, 12(4), 49.

[181] Tarjan, R. E. (1976). Edge-disjoint spanning trees and depth-first search. Acta Informatica, 6(2), 171–185.

[182] Valiant, L. G. (1990). A bridging model for parallel computation. Communica- tions of the ACM, 33(8), 103–111.

[183] Williamson, D. P. & Shmoys, D. B. (2011). The design of approximation algorithms. Cambridge University Press.

[184] Woeginger, G. J. (2001). On the approximability of average completion time scheduling under precedence constraints. In International Colloquium on Automata, Languages, and Programming (pp. 887–897). Springer.

[185] Yang, W.-H. (2009). Scheduling jobs on a single machine to maximize the total revenue of jobs. Computers & Operations Research, 36(2), 565–583. Scheduling for Modern Manufacturing, Logistics, and Supply Chains.

[186] Zhang, S., Jalali, D., Wuttke, J., Muşlu, K., Lam, W., Ernst, M. D., & Notkin, D. (2014). Empirically revisiting the test independence assumption. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (pp. 385–396).: ACM.

[187] Zhao, W., Ma, H., & He, Q. (2009). Parallel k-means clustering based on MapReduce. In IEEE International Conference on Cloud Computing (pp. 674–679).: Springer.

[188] Zheng, X., Koenig, S., Kempe, D., & Jain, S. (2010). Multirobot forest coverage for weighted and unweighted terrain. IEEE Transactions on Robotics, 26(6), 1018– 1031.
