Efficient and Private Distance Approximation in The

Efficient and Private Distance Approximation in the Communication and Streaming Models by David P. Woodruff Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2007 © Massachusetts Institute of Technology 2007. All rights reserved. Author............................................................................ Department of Electrical Engineering and Computer Science August 31, 2007 Certified by........................................................................ Piotr Indyk MIT Professor Thesis Supervisor Accepted by....................................................................... Arthur C. Smith Chairman, Department Committee on Graduate Students Efficient and Private Distance Approximation in the Communication and Streaming Models by David P. Woodruff Submitted to the Department of Electrical Engineering and Computer Science on August 31, 2007, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract This thesis studies distance approximation in two closely related models - the streaming model and the two-party communication model. In the streaming model, a massive data stream is presented in an arbitrary order to a randomized algorithm that tries to approximate certain statistics of the data with only a few (usually one) passes over the data. For instance, the data may be a flow of packets on the internet or a set of records in a large database. The size of the data necessitates the use of extremely efficient randomized approximation algorithms. Problems of interest include approximating the number of distinct elements, approximating the surprise index of a stream, or more generally, approximating the norm of a dynamically-changing vector in which coordinates are updated multiple times in an arbitrary order. In the two-party communication model, there are two parties who wish to efficiently compute a relation of their inputs. We consider the problem of approximating Lp distances for any p ≥ 0. It turns out that lower bounds on the communication complexity of these re- lations yield lower bounds on the memory required of streaming algorithms for the problems listed above. Moreover, upper bounds in the streaming model translate to constant-round protocols in the communication model with communication proportional to the memory required of the streaming algorithm. The communication model also has its own applica- tions, such as secure datamining, where in addition to low communication, the goal is not to allow either party to learn more about the other's input other than what follows from the output and his/her private input. We develop new algorithms and lower bounds that resolve key open questions in both of these models. The highlights of the results are as follows. 1. We give a 1-pass O~(m1−2=p)poly(1/)-space streaming algorithm for (1±)-approximating the Lp norm of an m-dimensional vector presented as a data stream for any p ≥ 2. This improves the previous O~(m1−1=(p−1))poly(1/) bound, and is optimal up to poly- logarithmic factors. As a special case our algorithm can be used to approximate the frequency moments Fp of a data stream with the same optimal amount of space. This resolves the main open question of the 1996 paper by Alon, Matias, and Szegedy. 2. We give an Ω(1/2) lower bound for approximating the number of distinct elements of a data stream in one pass to within a (1 ± ) factor with constant probability, as well as the p-th frequency moment Fp for any p ≥ 0. This is tight up to very small factors, and greatly improves upon the earlier Ω(1/) lower bound for these problems. It also 2 gives the same quadratic improvement for the communication complexity of 1-round protocols for approximating the Lp distance for any p ≥ 0. 3. In the two-party communication model, we give a protocol for privately approximating the Euclidean distance (L2) between two m-dimensional vectors, held by different parties, with only polylog m communication and O(1) rounds. This tremendously improves upon the earlier protocol of Feigenbaum, Ishai, Malkin, Nissim, Strauss, p and Wright, which achieved O( m) communication for privately approximating the Hamming distance only. This thesis also contains several previously unpublished results concerning the second item above, including new lower bounds for the communication complexity of approximating the Lp distances when the vectors are uniformly distributed and the protocol is only correct for most inputs, as well as tight lower bounds for the multiround complexity for a restricted class of protocols that we call linear. Thesis Supervisor: Piotr Indyk Title: MIT Professor 3 Acknowledgments First, I would like to thank my advisor Piotr Indyk for making all of this possible. Piotr has always helped me find important and interesting problems and has had very insightful comments about all of my work. His constant presence in the lab was extremely useful whenever I had a question. He was always willing to listen to my ideas, even if they were not related to his main research interests. Next, I would like to thank my Master's thesis advisor Ron Rivest and my coauthor Marten van Dijk. Marten and Ron introduced me to cryptographic research and shaped me in my early days. This influence has helped me ever since, and kept my passion for cryptography alive. I am also very grateful for a few fantastic internships. I thank Jessica Staddon at PARC for introducing inference control to me and putting up with my endless emails on privacy- preserving data mining. I also thank the tag team Craig Gentry and Zulfikar Ramzan at DoCoMo Labs. With their talent and contagious enthusiasm, we made a lot of progress on broadcast encryption that summer. I was also extremely fortunate to visit Andy Yao for a year at Tsinghua University in Beijing. The hospitality and support of Andy was perfect for my development. The students there - Hongxu Cai, Lan Liu, Daniel Preda, Xiaoming Sun, Hoeteck Wee, Jing Zhang, and many others - made for an amazing learning experience. On a personal note, I would like to thank Andris, Hoeteck, Lan, and especially Zhang Yi for inspiring me and helping me stay alive in Beijing. I would like to thank all of my coauthors for their hard work - Arnab Bhattacharyya, Marten van Dijk, Craig Gentry, Robert Granger, Elena Grigorescu, Piotr Indyk, Kyomin Jung, Dan Page, Zully Ramzan, Sofya Raskhodnikova, Karl Rubin, Alice Silverberg, Jessica Staddon, Martijn Stam, Xiaoming Sun, Sergey Yekhanin, and Hanson Zhou. I also thank many others for enlightening discussions - Alex Andoni, Victor Chen, Khanh DoBa, Nick Harvey, Mohammad Taghi Hajiaghayi, Yuval Ishai, Jon Kelner, Swastik Kop- party, Silvio Micali, Vahab Mirrokni, Payman Mohassel, Jelani Nelson, Mihai Patrascu, Chris Peikert, Seth Pettie, Benny Pinkas, Ronitt Rubinfeld, Tasos Sidiropoulos, Adam Smith, Madhu Sudan, Vinod Vaikuntanathan, Grant Wang, and those whose names I can- not remember at the moment. 4 Finally, I'd like to thank my parents, David and Marsha, for unflagging encouragement and support. Financially, I would like to thank Akamai and the Department of Defence for fellow- ships (Presidential and NDSEG, respectively) that gave me the freedom to pursue my own interests as a graduate student. 5 Contents 1 Introduction 8 2 Streaming Algorithms 16 3 Upper Bounds for Lp Distances 19 3.1 Preliminaries . 20 3.2 The Idealized Algorithm . 23 3.3 Analysis . 25 3.4 A 2-pass Algorithm . 32 3.4.1 CountSketch ................................ 32 3.4.2 The new algorithm . 37 3.4.3 Conditioning on a few natural events . 38 3.4.4 Walking through the previous proofs . 39 3.5 The 1-pass Algorithm . 44 3.6 Reducing the Randomness . 47 4 Lower Bounds for Lp Distance and Frequency Moments 49 4.1 Communication Complexity . 49 4.2 The Gap Hamming Problem . 54 4.3 The Randomized 1-way Lower Bound . 58 4.4 A Lower Bound for the Uniform Distribution . 61 4.5 A Multiround Lower Bound for Linear Protocols . 67 4.5.1 Linear-algebraic Lemmas . 70 4.5.2 The Lower Bound for Linear Protocols . 71 6 5 Private Protocol for the Euclidean Distance 78 5.1 Cryptographic Tools . 79 5.2 Privacy . 80 5.3 Private Euclidean Distance . 82 6 Private Protocols for Efficient Matching 89 6.1 Exact Problems . 90 6.1.1 Private Near Neighbor Problem . 90 6.1.2 Private Near Neighbor for `2 and Hamming Distance . 91 6.1.3 Private Near Neighbor for Generic Distance Functions . 92 6.1.4 Private Near Neighbor for n =1 .................... 92 6.1.5 Private All-Near Neighbors . 93 6.2 Approximate Near Neighbor Problems . 96 6.2.1 Private c-approximate Near Neighbor Problem . 96 6.2.2 Reducing the Dependence on d for Private c-approximate NN . 97 6.2.3 c-approximate NN Protocol . 98 6.2.4 Private c-approximate NN Based on Locality Sensitive Hashing . 101 6.2.5 c-approximate NN Protocol Leaking k Bits . 105 7 Chapter 1 Introduction Consider the following scenario: there are two players, Alice and Bob, holding inputs x and y respectively, who wish to compute a function f(x; y). To do this, they need to communicate with each other. They'd like to do this by transmitting as few bits as possible. This is the classical two-party communication model introduced by Yao [69, 53]. Depending on the function f and the resources available, this task may require a lot of communication or very little. Examples of such resources include space and time complexity, as well as the ability to flip random coins. One class of functions of considerable interest is the class of distance functions. In this case, x and y are finite strings of length m, with characters drawn from an alphabet Σ, and f(x; y) measures how similar the strings are.

Efficient and Private Distance Approximation in The

The One-Way Communication Complexity of Hamming Distance

Fault Tolerant Computing CS 530 Information Redundancy: Coding Theory

Sequence Distance Embeddings

Dictionary Look-Up Within Small Edit Distance

Searching All Seeds of Strings with Hamming Distance Using Finite Automata

Distance-Sensitive Hashing∗

Arxiv:2102.08942V1 [Cs.DB]

Are All Distributions Easy?

B-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches

SIGMOD Flyer

Lower Bounds on Lattice Sieving and Information Set Decoding

Large Scale Hamming Distance Query Processing