Sorting Noisy Data with Partial Information
Total Page:16
File Type:pdf, Size:1020Kb
Sorting Noisy Data with Partial Information Konstantin Makarychev Yury Makarychev∗ Aravindan Microsoft Research TTIC Vijayaraghavan CMU ABSTRACT of the vertices, which minimizes the number of backward In this paper, we propose two semi-random models for the edges (according to the ordering). Minimum Feedback Arc Set Problem and present approxi- Definition 1.1 Given a directed graph G(V,E) and a ver- mation algorithms for them. In the first model, which we call tex ordering π : V 1,...,n , we say that an edge (arc) the Random Edge Flipping model, an instance is generated (u, v) E is a forward→ { edge w.r.t} π if π(u) < π(v); other- as follows. We start with an arbitrary acyclic directed graph wise, we∈ say that the edge is a backward edge w.r.t. π. and then randomly flip its edges (the adversary may later un-flip some of them). In the second model, which we call Definition 1.2 The Minimum Feedback Arc Set problem is the Random Backward Edge model, again we start with an as follows. Given a directed graph G =(V,E), find a vertex arbitrary acyclic graph but now add new random backward ordering π : V 1,...,n that minimizes the number of edges (the adversary may delete some of them). For the first backward edges w.r.t.→ { π. The} value of the problem equals the model, we give an approximation algorithm that finds a so- number of backward edges w.r.t. the optimal ordering. lution of cost (1 + δ) opt-cost +n polylog n, where opt-cost is the cost of the optimal solution. For the second model, The history of the problem dates back to as early as we give an approximation algorithm that finds a solution of 1957, when Unger [Ung57] considered it in his paper on cost O(planted-cost)+n polylog n, where planted-cost is the asynchronous logical feedback networks. The interest in cost of the planted solution. FAS stems from its numerous applications in simplifying Additionally, we present an approximation algorithm for the analysis of feedback systems (removing feedback loops) semi-random instances of Minimum Directed Balanced Cut. [Ung57, SR61], finding inconsistencies and cyclic dependen- cies [You63], and ranking based on pairwise information [KS63, ACN08]. It is the complementary problem of the Categories and Subject Descriptors Maximum Acyclic Subgraph [BS90, CMM07, GMR08], which F.2 [Theory of Computation]: Analysis of Algorithms asks to find the largest acyclic graph (w.r.t. the number of edges) in a given digraph. Keywords The problem has been extensively studied in the litera- ture. Karp [Kar72] showed that the problem is NP-hard Minimum Feedback Arc Set, semi-random model, average- in 1972, and Kann [Kan92] showed that the problem is case analysis, approximation algorithm APX-hard in 1992. Furthermore, Guruswami, Manokaran, and Raghavendra [GMR08] showed that there is no con- 1. INTRODUCTION stant factor approximation for FAS, assuming the Unique Games Conjecture. In their seminal paper, Leighton and The Minimum Feedback Arc Set (FAS) problem is one Rao [LR99] gave the first polylogarithmic approximation al- of the most basic optimization problems in the context of gorithm for the problem with approximation factor O(log2 n). directed graphs. The problem asks to find the minimum set This result was later improved by Seymour [Sey95], who of edges (arcs) whose removal makes a given graph a directed used a clever divide and conquer approach to get an approxi- acyclic graph (DAG). As any directed acyclic graph can be mation factor of O(log n log log n). His technique was gener- topologically sorted, this is equivalent to finding an ordering alized by Even, Naor, Rao, and Schieber to a more general family of ordering problems by the method of spreading- ∗Supported in part by NSF CAREER grant CCF-1150062. metrics [ENRS00]. Several researchers studied the special case of tournaments, directed graphs in which every pair of vertices is connected by a directed edge [ACN08, CFR06, Permission to make digital or hard copies of all or part of this work for KMS07]; Kenyon-Mathieu and Schudy gave a polynomial- personal or classroom use is granted without fee provided that copies are time approximation scheme (PTAS) for FAS in tournaments not made or distributed for profit or commercial advantage and that copies [KMS07]. To summarize, despite our best efforts we only bear this notice and the full citation on the first page. To copy otherwise, to know polylogarithmic approximation algorithms for the prob- republish, to post on servers or to redistribute to lists, requires prior specific lem, and, assuming the Unique Games Conjecture, we can- permission and/or a fee. ITCS'13, January 9–12, 2012, Berkeley, California, USA. not get a constant factor approximation for worst-case in- Copyright 2013 ACM 978-1-4503-1859-4/13/01 ...$15.00. stances. However, our primary interest in such optimization prob- pages, based on the user feedback1. It is natural to assume lems comes from many different applications they arise in. that there is a true underlying order among these web pages, The instances that we encounter in the real world are not and our goal is to find it. The pairwise comparisons that we worst-case instances. Besides, instances that are produced see are likely to be noisy perturbations of the true ordering. in hardness reductions are often contrived, and we are un- Note that it is unrealistic to expect that we have compar- likely to encounter them in practice. Also it may be too isons for all pairs of web pages. hard, if not impossible, to design an algorithm that gives This motivates the following formal definition for our first a good approximation guarantee for every instance of FAS. semi-random model: This motivates the study of the average-case complexity of Definition 1.3 (Random Edge Flipping Model) the problem. We are given a set of vertices V of size n, and a parameter Question 1 Can we design algorithms with better provable ε. A semi-random instance of the Random Edge Flipping approximation guarantees for realistic average-case models model is generated as follows. for the Minimum Feedback Arc Set problem? 1. Adversary chooses a directed acyclic graph on V , i.e. Braverman and Mossel [BM08] studied the problem of she chooses a set of edges E so that G(V,E) is acyclic. sorting in the presence of random noise. The problem corre- 2. Every edge in E is flipped (its direction is reversed) sponds to FAS on tournaments with random noise. In their with probability ε independently at random. The set of random model, the instance is generated as follows. There flipped edges is denoted by E . is a planted (complete) ordering of the vertices. An edge is R added between every pair of vertices: the edge is directed to 3. The adversary chooses a subset of edges in ER (pos- 1 respect the ordering with probability 2 + ε and it is directed sibly empty) and un-flips them (restores the original 1 against the ordering with probability 2 ε. Braverman and direction). Mossel show that with high probability,− we can find the op- Our second semi-random model for FAS is called the Ran- timal ordering, and the optimal ordering is at most O(n) far dom Backward Edge model. Informally speaking, the in- from the planted ordering (in ` distance). This result re- 1 stance consists of an adversarial directed acyclic graph and quires that we have complete (but noisy) information; that a set of random backward edges, i.e. random edges directed is, the digraph must be a tournament. The result does not opposite to the ordering. The planted solution to FAS con- apply to more general instances of FAS. sists of these random backward edges. Our goal is to find a In this paper, we propose and study two new semi-random solution whose cost is comparable to the cost of the planted models for FAS, which we believe capture many properties of solution. real world instances. We develop new algorithms for semi- random instances of FAS which give (1 + δ) and constant Definition 1.4 (Random Backward Edge Model) We factor approximations in these two models (see below for are given a set of vertices V of size n, and a parameter ε. A the precise statements). In the process, we also develop a semi-random instance of the Random Backward Edge model constant factor approximations for a natural semi-random is generated as follows. model for Minimum Directed Balanced Cut. Our models are geared towards the two chief applications 1. The adversary chooses an ordering π of V . of FAS: noisy topological sorting and breaking cyclic de- 2. Nature adds a random set of backward edges to G: for pendencies. In both of these semi-random models, a semi- every pair (u, v) with π(u) >π(v), nature adds an edge random instance can be seen as an acyclic digraph (a FAS (u, v) to G with probability ε. The choices for all edges instance with cost 0), augmented with random noise. are independent. The set of random edges is denoted 1.1 Semi-random Models by ER. We now describe the two semi-random models for FAS 3. The adversary adds an arbitrary set of forward edges that we study in this paper. w.r.t. π. She also removes some set (possibly empty) Our first semi-random model is called the Random Edge of backward edges added by nature. Flipping model. It models the problem of topological sort- (Note that this model is stronger than a model where the ing in the presence of a noisy oracle.