Open Main.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
The Pennsylvania State University The Graduate School EFFICIENT COMBINATORIAL METHODS IN SPARSIFICATION, SUMMARIZATION AND TESTING OF LARGE DATASETS A Dissertation in Computer Science and Engineering by Grigory Yaroslavtsev c 2014 Grigory Yaroslavtsev Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy May 2014 The dissertation of Grigory Yaroslavtsev was reviewed and approved∗ by the following: Sofya Raskhodnikova Associate Professor of Computer Science and Engineering Chair of Committee and Dissertation Advisor Piotr Berman Associate Professor of Computer Science and Engineering Jason Morton Assistant Professor of Mathematics and Statistics Adam D. Smith Associate Professor of Computer Science and Engineering Raj Acharya Head of Department of Computer Science and Engineering ∗Signatures are on file in the Graduate School. Abstract Increasingly large amounts of structured data are being collected by personal computers, mobile devices, personal gadgets, sensors, etc., and stored in data centers operated by the government and private companies. Processing of such data to extract key information is one of the main challenges faced by computer scientists. Developing methods for construct- ing compact representations of large data is a natural way to approach this challenge. This thesis is focused on rigorous mathematical and algorithmic solutions for sparsi- fication and summarization of large amounts of information using discrete combinatorial methods. Areas of mathematics most closely related to it are graph theory, information theory and analysis of real-valued functions over discrete domains. These areas, somewhat surprisingly, turn out to be related when viewed through the computational lens. In this thesis we illustrate the power and limitations of methods for constructing small represen- tations of large data sets, such as graphs and databases, using a variety methods drawn from these areas. The primary goal of sparsification, summarization and sketching methods discussed here is to remove redundancy in distributed systems, reduce storage space and save other resources by compressing the representation, while preserving the key structural properties of the original data or system. For example, the size of the network can be reduced by removing redundant nodes and links, while (approximately) preserving the distances or connectivities between the nodes. This thesis serves as an overview of results in [31, 30, 152, 38, 171, 33]. It also contains an extension of results in [152] obtained jointly with David Woodruff. iii Table of Contents List of Figures viii List of Tables ix Acknowledgmentsx Chapter 1 Introduction1 1.1 Organization ................................... 1 1.2 Overview ..................................... 2 I Approximate Sparse Network Design6 Chapter 2 Approximate Sparsification of Directed Graphs7 2.1 Introduction.................................... 7 2.1.1 Relation to Previous Work ....................... 8 2.1.2 Our Techniques.............................. 8 2.1.3 Directed Steiner Forest ......................... 9 2.1.4 Organization ............................... 10 p 2.2 An O~( n)-Approximation for Directed k-Spanner . 10 2.2.1 Sampling ................................. 11 2.2.2 Randomized Rounding.......................... 11 2.2.3 Antispanners............................... 12 2.2.4 LP, Separation Oracle and Overall Algorithm............. 13 2.2.4.1 Separation Oracle ....................... 14 2.2.4.2 Overall Algorithm for Directed k-Spanner . 16 2.3 LP and Rounding for Graphs with Unit-Length Edges............ 16 iv 2.4 An O~(n1=3)-Approximation for Directed 3-Spanner with Unit-Length Edges ....................................... 18 2.5 An O(n2=3+)-Approximation for Directed Steiner Forest . 24 2.6 Conclusion .................................... 27 Chapter 3 Sparsification of Node-Weighted Planar Graphs 28 3.1 Preliminaries ................................... 30 3.1.1 Uncrossable families of cycles and proper functions.......... 31 3.2 Algorithm..................................... 33 3.2.1 Generic local-ratio algorithm ...................... 33 3.2.2 Minimal pocket violation oracles .................... 34 3.3 Proof of 18/7 approximation ratio with pocket oracle............. 35 3.3.1 Complex witness cycles and decomposition of the debit graph . 36 3.3.2 Pruning.................................. 37 3.3.3 Envelopes................................. 38 3.3.4 Tight examples.............................. 40 II Concise Representations of Real Functions in Property Testing 42 Chapter 4 Concise Representations of Submodular Functions 43 4.1 Introduction.................................... 43 4.1.1 Related work............................... 46 4.2 Structural result ................................. 48 4.3 Generalized switching lemma for pseudo-Boolean DNFs ........... 50 4.4 Learning pseudo-Boolean DNFs......................... 54 Chapter 5 Transitive-Closure Spanners and Testing Functions on Hypergrids 59 5.1 Introduction.................................... 59 5.1.1 Results .................................. 61 5.1.1.1 Steiner 2-TC-spanners of Directed d-dimensional Grids. 61 5.1.2 Applications ............................... 63 5.2 Definitions and Observations .......................... 64 5.3 Lower Bound for 2-TC-spanners of the Hypergrid............... 67 5.4 Our Lower Bound for k-TC-spanners of d-dimensional Posets for k > 2 . 70 5.4.1 The Case of d = 2 ............................ 71 5.4.2 The Case of Constant d ......................... 73 v III Communication Complexity Methods in Summarization 76 Chapter 6 Lower bounds for Testing of Functions on Hypergrids 77 6.1 Introduction.................................... 77 6.2 Preliminaries ................................... 81 6.3 Lower bounds on the line ............................ 82 6.3.1 Monotonicity............................... 83 6.3.2 Convexity................................. 84 6.3.3 The Lipschitz property.......................... 86 6.4 Lower bounds on the hypergrid......................... 87 6.4.1 Monotonicity............................... 89 6.4.2 Convexity................................. 90 6.4.3 The Lipschitz property.......................... 92 Chapter 7 Beyond Direct-Sum with Application to Sketching 94 7.1 Introduction.................................... 94 7.2 The Direct Sum Theorem ............................100 7.3 Lower Bounds for Protocols with Abortion ..................105 7.3.1 Equality Problem.............................105 7.3.2 Augmented Indexing...........................106 7.4 Applications....................................109 7.4.1 Hard Problem...............................109 7.4.2 Estimating Multiple `p Distances....................110 7.4.3 Other Applications............................113 Chapter 8 Optimal Round-Complexity of the Set Intersection Problem 116 8.1 Definitions and preliminaries ..........................120 8.2 Upper bound ...................................121 8.2.1 Auxiliary protocols............................121 8.2.2 Main protocol...............................122 Bibliography 125 Chapter A Appendix 139 A.1 Concentration results...............................139 A.2 Omitted proofs..................................140 A.2.1 Lemma A.2.1...............................140 A.2.2 Analysis of the generic local-ratio algorithm . 141 A.2.3 Uncrossing proper sets (Lemma 3.1.2) . 141 vi A.3 Proof of 12/5 approximation ratio with triple pocket oracle . 142 A.4 Converting a learner into a proper learner...................143 A.5 Information Cost When Amplifying Success Probability . 143 A.6 Auxiliary Results for Lower Bounding Applications . 144 A.6.1 Generic Indexing problems .......................144 A.6.2 Encoding of Indexing Over Augmented Set Indexing . 146 A.7 Proof for Other Applications ..........................148 A.7.1 Proof of Theorem 7.4.10.........................148 A.7.2 Proof of Theorem 7.4.11.........................149 A.7.3 Proof of Theorem 7.4.12.........................150 A.7.4 Proof of Theorem 7.4.13.........................150 A.8 Lowerp bound ...................................151 A.9 O( k)-round protocol with linear communication . 153 A.10 Communication in the Set Intersection protocol . 154 vii List of Figures 2.1 Linear program for the arbitrary-length case, LP-A. A is the set of all minimal antispanners for thin edges....................... 14 2.2 Linear program for the unit-length case, LP-U................. 17 2.3 Linear program LP-DSF for the case jD − Cj > jDj=2 ............ 26 3.1 Family of instances of Subset Feedback Vertex Set with approximation factor 18=7 for the primal-dual algorithm with oracle Minimal-Pocket- Violation .................................... 41 3.2 Family of instances of Subset Feedback Vertex Set with approxima- tion factor 12=5 for primal-dual algorithm with oracle Minimal-3-Pocket- Violation .................................... 41 5.1 Box partition BP(2) and jumps it generates................... 72 viii List of Tables 3.1 General graphs.................................. 29 3.2 Planar graphs................................... 29 5.1 The size of the sparsest Steiner k-TC-spanner for d-dimensional posets on n vertices for d ≥ 2 ............................... 62 d 6.1 Query complexity bounds for testing properties of the function f :[n] ! Z (top) and of the function f :[n] ! [r] (bottom). All the bounds are for nonadaptive tests with two-sided error unless marked otherwise. 78 ix Acknowledgments First and