Chapter

Set cover and its application to

shortest sup erstring

Understanding the area of approximation algorithms involves identifying cornerstone problems

problems whose study leads to discovering techniques that b ecome general principles in the area

and problems that are general enough that other problems can b e reduced to them Problems such

as maximum ow shortest path and minimum spanning tree are cornerstone problems

in the design of exact algorithms In approximation algorithms the picture is less clear at present

Even so problems such as minimum set cover and minimum Steiner tree can already b e said to

o ccupy this p osition

In this chapter we will rst analyse a natural for the minimum set cover

problem We will then show an unexp ected use of set cover to solve the minimum sup erstring

problem An algorithm with a much b etter approximation guarantee will b e presented in Chapter

for the latter problem the p oint here is to illustrate the wide applicabili ty of the set cover problem

Minimum set cover

Problem Minimum set cover Given a universe U of n elements and a collection of

subsets of U S S with nonnegative costs sp ecied the minimum set cover problem asks for

1 k

a minimum cost collection of sets whose union is U

Perhaps the rst algorithm that comes to mind for this problem is one based on the greedy

strategy of iteratively picking the most costeective set and removing the covered elements until

all elements are covered Let C b e the set of elements already covered at the b eginning of an

iteration During this iteration dene the costeectiveness of a set S to b e the average cost at

cost(S )

Dene the price of an element to b e the average cost at which it covers new elements ie

jS C j

which it is covered Equivalently when a set S is picked we can think of its cost b eing distributed

equally among the new elements covered to set their prices

Set cover and its application to shortest superstring

Algorithm Greedy set cover algorithm

C

while C U do

Find the most costeective set in the current iteration say S

cost(S )

ie the costeectiveness of S Let

jS C j

Pick S and for each e S C price e

Output the picked sets

Numb er the elements of U in the order in which they were covered by the algorithm resolving

ties arbitrarily Let e e b e this numb ering

1 n

OPT

Lemma For each k f ng price e

k

nk +1

Pro of In any iteration the left over sets of the optimal solution can cover the remaining

elements at a cost of at most OPT Therefore there must b e a set having costeectiveness at most

OPT

In the iteration in which element e was covered C contained at least n k elements

k

jC j

Since e was covered by the most costeective set in this iteration it follows that

k

OPT OPT

price e

k

n k

jC j

2

From Lemma we immediately obtain

Theorem The greedy algorithm is an H factor for the minimum set

n

1 1

cover problem where H

n

2 n

Pro of Since the cost of each set picked is distributed among the new elements covered the

P

n

total cost of the set cover picked is equal to price e By Lemma this is at most

k

k =1

1 1

OPT 2

2 n

Example Following is a tight example

... 1+ε

1/n 1/(n-1) 1

The greedy algorithm outputs the cover consisting of the n singleton sets since in each iteration

some singleton is the most costeective set So the algorithm outputs a cover of cost

H

n

n n

On the other hand the optimal cover has a cost of 2

Surprisingly enough the obvious algorithm given ab ove is essentially the b est one can hop e for

for the minimum set cover problem it is known that an approximation guarantee b etter than ln n

is not p ossible assuming P NP

In Chapter we p ointed out that nding a go o d lower b ound on OPT is a basic starting p oint

in the design of an approximation algorithm for a minimization problem At this p oint the reader

may b e wondering whether there is any truth to this claim We will show in Chapter that the

correct way to view the greedy set cover algorithm is in the setting of LPduality theory this

will not only provide the lower b ound on which this algorithm is based but will also help obtain

algorithms for several generalizations of this problem

Exercise The is the following Given a universe U of n elements

with nonnegative weights sp ecied a collection of subsets of U S S and an integer k pick k

1 l

sets so as to maximize the weight of elements covered Show that the obvious algorithm of greedily

picking the b est set in each iteration until k sets are picked achieves an approximation factor of

k

k e

Solving shortest sup erstring via set cover

Let us motivate the shortest sup erstring problem The human DNA can b e viewed as a very

long string over a four letter alphab et Scientists are attempting to decipher this string Since it

is very long several overlapping short segments of this string are rst sequenced Of course the

lo cations of these segments on the original DNA are not known It is hyp othesised that the shortest

string which contains these segments as substrings is a go o d approximation to the original DNA

string

Problem Shortest sup erstring Given a nite alphab et and a set of n strings



S fs s g nd a shortest string s that contains each s as a substring Without loss

1 n i

of generality we may assume that no string s is a substring of another string s j i

i j

This problem is NPhard Perhaps the rst algorithm that comes to mind for nding a short



sup erstring is the following greedy algorithm Dene the overlap of two strings s t as the

maximum length of a sux of s that is also a prex of t The algorithm maintains a set of strings

T initially T S At each step the algorithm selects from T two strings that have maximum

overlap and replaces them with the string obtained by overlapping them as much as p ossible After

n steps T will contain a single string Clearly this string contains each s as a substring

i

This algorithm is conjectured to have an approximation factor of To see that the approximation

k k

factor of this algorithm is no b etter than consider an input consisting of strings ab b c and

k +1

b If the rst two strings are selected in the rst iteration the greedy algorithm pro duces the

k k +1 k +1

string ab cb This is almost twice as long as the shortest sup erstring ab c

We will obtain a H factor approximation algorithm by a reduction to the minimum set cover

n

problem The set cover instance denoted by S is constructed as follows For s s S and k

i j

if the last k symb ols of s are the same as the rst k symb ols of s let b e the string obtained

i j ij k

by overlapping these k p ositions of s and s Let I contain the strings for all valid choices of

i j ij k

i j k The set cover instance S consists of S as the universal set The sp ecied subsets of S are

for each string S I dene set fsj s S s is a substring of g The cost of this set is