Set Cover Algorithms For Very Large Datasets ∗ Graham Cormode Howard Karloff Anthony Wirth AT&T Labs–Research AT&T Labs–Research Department of Computer
[email protected] [email protected] Science and Software Engineering The University of Melbourne
[email protected] ABSTRACT 1. INTRODUCTION The problem of Set Cover—to find the smallest subcol- The problem of Set Cover arises in a surprisingly broad lection of sets that covers some universe—is at the heart number of places. Although presented as a somewhat ab- of many data and analysis tasks. It arises in a wide range stract problem, it captures many different scenarios that of settings, including operations research, machine learning, arise in the context of data management and knowledge planning, data quality and data mining. Although finding an mining. The basic problem is that we are given a collec- optimal solution is NP-hard, the greedy algorithm is widely tion of sets, each of which is drawn from a common universe used, and typically finds solutions that are close to optimal. of possible items. The goal is to find a subcollection of sets However, a direct implementation of the greedy approach, so that their union includes every item from the universe. which picks the set with the largest number of uncovered Places where Set Cover occurs include: items at each step, does not behave well when the in- • In operations research, the problem of choosing where put is very large and disk resident. The greedy algorithm to locate a number of facilities so that all sites are must make many random accesses to disk, which are unpre- within a certain distance of the closest facility can be dictable and costly in comparison to linear scans.