I dedicate this thesis to my husband, Amiel, for his constant support and unconditional love.

Acknowledgements

I would like to express my special appreciation and thanks to my PhD advisors, Professors Shlomi Dolev and Zvi Lotker, for supporting me during these past five years. You have been tremendous mentors for me. I would like to thank you for encouraging my research and for allowing me to grow as a research scientist. Your scientific advice, knowledge, and many insightful discussions and suggestions have been priceless. I would also like to thank my committee members, Professor Eitan Bachmat, Professor Chen Avin, and Professor Amnon Ta-Shma, for their helpful comments and suggestions. A heartfelt thanks to the really supportive and active BGU community here in Beer-Sheva and to all my friends who made the research experience something special, in particular, Ariel, Dan, Nisha, Shantanu, Martin, Nova, Eyal, Guy and Marina. Special thanks to Daniel and Elisa for proofreading my final draft. A special thanks to my family. Words cannot express how grateful I am to my mother-in-law, father-in-law, my mother, and father for all of the sacrifices that you've made on my behalf. Finally, I would like to acknowledge the most important person in my life, my husband Amiel. He has been a constant source of strength and inspiration. There were times during the past five years when everything seemed hopeless and I didn't have any hope. I can honestly say that it was only his determination and constant encouragement (and sometimes a kick on my backside when I needed one) that ultimately made it possible for me to see this project through to the end.

Abstract

The abundance of data is forcing us to redefine many scientific and technological fields, with the affirmation of any environment of Big Data as a potential source of data. The advent of Big Data is introducing important innovations: the availability of additional external data sources, previously unknown dimensions and questionable consistency pose new challenges to computer scientists, demanding a general reconsideration that involves tools, software, methodologies and organizations. This thesis investigates the problem of big data abstraction in the scope of exploration, interpolation and extrapolation. The driving vision of data abstraction is to turn the information overload into an opportunity: the goal of the abstraction is to make our way of processing data and information transparent for an analytic discourse, as well as a tool to complete missing information, predict unknown features, and filter noise and outliers. We confront three aspects of the abstraction problem with gradual levels of generalization. First, we focus on a specific data type and propose a novel solution for exploring the connectivity threshold of wireless data when the number of sensors approaches infinity. Second, we consider how to use polynomials to effectively and succinctly interpolate general data functions that tolerate noise as well as a bounded number of maliciously corrupted outliers. Third, we show how to represent a high-dimensional data set with incomplete information in a way that fulfills the demands of predictive modeling. Our main contribution lies in rethinking these problems in the context of massive amounts of data, which dictate large volumes and high dimensionality. Information extraction, exploration and extrapolation have a major impact on our society. We believe that the topics investigated in this thesis can have a great practical influence.

Table of contents

List of figures

1 Introduction
  1.1 The Information Age
  1.2 Big Data Abstraction
    1.2.1 Big Data Exploration
    1.2.2 Big Data Interpolation
    1.2.3 Big Data Extrapolation

2 Probabilistic Connectivity Threshold for Directional Antenna Widths
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Notations
    2.2.2 Probability and the relation between Uniform and Poisson distributions
    2.2.3 Covering and Connectivity
  2.3 Centered Angles
    2.3.1 Finding the Connectivity Threshold
  2.4 Random Angle Direction
  2.5 Discussion
  2.6 Appendix

3 Big Data Interpolation using Functional Representation
  3.1 Introduction
  3.2 Discrete Finite Noise
    3.2.1 Handle the discrete noise
    3.2.2 Multidimensional Data
  3.3 Random Sample with Unrestricted Noise
    3.3.1 Polynomial fitting to noisy data
    3.3.2 Byzantine Elimination
  3.4 Discussion

4 Mending Missing Information in Big-Data
  4.1 Introduction
  4.2 Preliminaries
  4.3 k-flats Clustering
  4.4 Algorithm
  4.5 Experimental Studies of k-Flat Clustering
  4.6 Clustering with different group sizes
  4.7 Sublinear and distributed algorithms
  4.8 Discussion
  4.9 Conclusion
  4.10 Appendix: The probability of flats intersection

5 Conclusions

Nomenclature

References

List of figures

2.1 Directional antenna model.
2.2 The communication graph over the disk and the disk's boundary.
2.3 Covering vs. connectivity problems.
2.4 Project nodes from the disk onto antipodal pair on the boundary.
2.5 Transforming the antipodal pair to a node on the boundary.
2.6 Projection a node from the boundary to a node on the disk.
2.7 A node and its intercepted arc.
2.8 The disk's cover expansion.
2.9 Represent the three dimensional variable of the graph using a torus.
2.10 Represent the minimal coverage area by an annulus.
2.11 The possible directions that induce adjacency.
2.12 Generalize to convex fat objects with curvature > 0

4.1 Two dimensional pair of flats intersecting a disk
4.2 The distance between the midpoint and the ball's center
4.3 Eliminate the irrelevant midpoints
4.4 Almost orthogonal flats pairwise intersection

Chapter 1

Introduction

1.1 The Information Age

Almost 35 years ago, Alvin Toffler [54] published his book “The Third Wave” where he described three phases of human society's development based on the concept of ‘waves’, with each wave pushing the older societies and cultures aside. According to Toffler, civilization can be divided into three major phases: The First Wave, referred to as the settled Agricultural society, which replaced the first hunter-gatherer cultures. The symbol of this age is the hoe, and the profile of the wealthy person is the land owner. Battles were typically carried out with swords. The Second Wave is the Industrial age society, symbolized by the machine, beginning with the industrial revolution. At this time, the wealthy were the factory owners and machines were used during times of war - tanks, aircraft, etc. The Third Wave is the post-industrial society. Toffler says that since the late 1950s, most countries have been transitioning into the Information age. The symbol now is obviously the computer. The wealthy and powerful people are those that develop or collect the data and sell others the privilege to use it, and one of the main threats in this modern age is the cyber attack.

At the beginning of the 80's, no one could have imagined the significance and power that data would play in everyday life. Today, Big data - a large pool of data that can be captured, communicated, aggregated, stored and analyzed - is part of every sector and function of the global economy [36]. The use of big data can create significant value for the world economy, enhancing the productivity and competitiveness of companies and the public sector, and creating a substantial economic surplus for consumers.

Big data challenges.

There are many different definitions for the term ‘Big Data’. Generally, this term refers to a massive amount of data, the size of which is beyond the ability of typical database software tools to capture, store, manage, and analyze. A popular abbreviation which is commonly used to characterize it is the three V's: Volume, Variety and Velocity. By Volume, we usually mean the sheer size of the data, which is of course the major challenge, and the most easily recognized. By Variety, we mean heterogeneity of data types, representation, and semantic interpretation. The meaning of Velocity is both the rate at which data arrives and the speed at which it needs to be processed, for example, to perform fraud detection at the point of sale. Another important feature of big data includes not only the huge number of items but also their ‘wideness’, i.e., each item maintains many fields. Hence, it is common to describe these items by objects in a high-dimensional space. High-dimensional objects have a number of unintuitive properties that are sometimes referred to as the ‘curse of dimensionality’ [6]. Multiple dimensions are hard to think in, impossible to visualize, and due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. One manifestation of the ‘curse’ is that in high dimensions, almost all pairs of points are equally far away from one another and almost any two vectors are nearly orthogonal. Another manifestation is that high dimensional functions tend to have more complex features than low-dimensional functions, and are hence harder to estimate. Moreover, in order to obtain a statistically sound and reliable result, e.g., to estimate multivariate functions with the same accuracy as functions in low dimensions, we require that the sample size grow exponentially with the dimension. The emerging field of data science addresses these aspects of big data and provides solutions that are fundamentally multidisciplinary. Platform engineers try to build parallel data processing platforms where their goal is to make it easy for developers of big data applications to write programs, as they would on a single node computational environment, and to rapidly deploy those applications on tens or hundreds of nodes. From the Machine Learning and Understanding point of view, the challenge is to develop classification and clustering applications in a wide spectrum of domains. The privacy and security challenges encourage the development of technologies and policies for protecting and allowing people to retain control over their data [1]. In the algorithmic field, scientists design algorithms that can deal with very large volumes of data. These include parallel implementations of a range of known algorithms including matrix computations, as well as statistical operations like regression and optimization methods [35]. There is a need to produce new algorithms for encoding, comparing, and searching massive data sets. Another algorithmic aspect is the need for sublinear algorithms. Here, the goal is to develop powerful algorithmic sampling techniques which allow one to estimate parameters of the data by viewing only a small portion of it. Such parameters may be combinatorial, algebraic, or distributional [21, 46, 47].
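The distance-concentration and near-orthogonality effects mentioned above are easy to observe numerically. The following minimal Python sketch (our own illustration, not part of the referenced literature; the sample sizes and dimensions are arbitrary choices) compares the relative spread of pairwise distances and the typical angle between random vectors as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=200):
    """Relative spread of pairwise distances for n uniform points in [0,1]^d."""
    pts = rng.random((n, d))
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(n, k=1)]
    return (dists.max() - dists.min()) / dists.mean()

def mean_abs_cosine(d, n=200):
    """Average |cos(angle)| between random unit vectors in R^d."""
    vecs = rng.standard_normal((n, d))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    cos = vecs @ vecs.T
    return np.abs(cos[np.triu_indices(n, k=1)]).mean()

for d in (2, 10, 100, 1000):
    # the spread shrinks and vectors become nearly orthogonal as d grows
    print(d, round(distance_spread(d), 3), round(mean_abs_cosine(d), 3))
```

As d grows, the relative spread of distances shrinks and the mean |cos| tends to zero, which is exactly the “almost equidistant, almost orthogonal” behaviour described above.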

1.2 Big Data Abstraction

This thesis examines another aspect of the big data challenge which is part of the data modeling idea. Modeling is the process of creating a simpler representation of something, or the design of representative structures at various levels of abstraction, from conceptual to physical. In big data abstraction, we try to capture the essence of the data using a succinct representation which maintains the main features of the data. We identify three different aspects of the abstraction: Exploration, Interpolation and Extrapolation, each approached in a different manner. In the Exploration part, we deal with a specific data aggregation task, establishing communication among two-dimensional wireless sensors that randomly scatter over the unit disk, where our goal is to cover the disk with minimal communication resources. In the Interpolation part, we relate to more general data: multi-variate functions that can be approximated using polynomials. The interpolation provides a concise representation and estimates unknown values within the range of the known points. Finally, in the Extrapolation phase, we model general high-dimensional data using algebraic objects in a way that helps us estimate, beyond the original observation range, unknown features of one object on the basis of its relationship with another object. Below we give a short description of these three parts and summarize our main results, while in the following three chapters (Chapters 2, 3 and 4) we describe the entire study in detail. Finally, in Chapter 5, we give our concluding remarks.

1.2.1 Big Data Exploration

Sensor networks and the IoT are a growing source of big data. Communication network coverage is one of the most important issues in network design, since it enables the accumulation of the data. This part of the research is focused on establishing connected computer networks that cover the entire relevant space using minimal communication resources. Consider the task of maintaining connectivity in a wireless network where the network nodes are equipped with directional antennas. Nodes correspond to points on the unit disk and each uses a directional antenna covering a sector of a given angle α.

To calculate the width required for connectivity, we have to find necessary and sufficient conditions on α that guarantee connectivity when the antennas' locations are uniformly distributed and the orientation of each antenna's sector is either random or fixed. We show that when the number of network nodes is big enough, the required α approaches zero. Specifically, on the unit disk, assuming uniform orientation, it holds with high probability that the threshold for connectivity is α = Θ((log n/n)^(1/4)). This is shown by the use of Poisson approximation and geometrical considerations. Moreover, when the model is relaxed, assuming that the antennas' orientation is directed towards the center of the disk, we demonstrate that α = Θ(log n/n) is a necessary and sufficient condition. This work, described in the following chapter, was also presented at the 20th International Colloquium on Structural Information and Communication Complexity (SIROCCO) [14] and published in the Theoretical Computer Science journal [15].

1.2.2 Big Data Interpolation

Given a large set of data measurements, in order to identify a simple function that captures the essence of the data, we suggest representing the data by an abstract function, in particular by polynomials. We interpolate the data points to define a polynomial that represents the data succinctly. The interpolation is challenging, since in practice the data can be noisy and even Byzantine, where Byzantine data represents an adversarial value that is not limited to being close to the correct measured value. We present two solutions: one extends the Welch-Berlekamp technique [56] to eliminate the appearance of outliers in the case of multidimensional data, coping with discrete noise and Byzantine data, and the other is based on the Arora and Khot [4] method for handling noisy data, generalizing it to the case of multidimensional noisy and Byzantine data. The details of this part are presented in Chapter 3, while an extended abstract of this study was published in the International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics (ALGOSENSORS) [13]; the full version is now under review at the ACM Transactions on Knowledge Discovery from Data.

1.2.3 Big Data Extrapolation

Consider a high-dimensional data set in which, for every data point, there is incomplete information. Each object in the data set represents a real entity, which is described by a point in high-dimensional space. We model the lack of information for a given object as an affine subspace in R^d with dimension k.

Our goal in this part is to find clusters of objects. The main problem is to cope with partial information. We studied a simple algorithm, which we call Data clustering using flats minimum distances, under the following assumptions: (i) There are m clusters. (ii) Each cluster is modeled as a ball in R^d. (iii) All k-dimensional affine subspaces that belong to the same cluster intersect the ball of the cluster. (iv) Each k-dimensional affine subspace that belongs to a cluster is selected uniformly among all k-dimensional affine subspaces that intersect the cluster's ball. A data set that satisfies these assumptions will be called separable data. Our suggested algorithm calculates pair-wise projections of the data, and we use probabilistic considerations to prove the algorithm's correctness. These probabilistic results are of independent interest, and can serve to better understand the geometry of high-dimensional objects. Chapter 4 concludes this work; an abstract of this work was published at The Haifa 2nd Security Research Seminar 2015 and the whole study is under review at the IEEE Transactions on Big Data.
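As a concrete illustration of the geometric primitive behind such an algorithm, the sketch below (our own, hypothetical example; it is not the algorithm analyzed in Chapter 4) computes the closest pair of points, and hence the midpoint, between two k-dimensional affine flats in R^d by ordinary least squares. Midpoints of this kind are the quantities a pairwise-projection step would operate on.

```python
import numpy as np

def closest_points(a1, B1, a2, B2):
    """Closest pair of points between the affine flats x = a1 + B1 t and
    y = a2 + B2 s (the columns of B1, B2 span the flats), via least squares."""
    A = np.hstack([B1, -B2])
    coef, *_ = np.linalg.lstsq(A, a2 - a1, rcond=None)
    k1 = B1.shape[1]
    p = a1 + B1 @ coef[:k1]          # closest point on the first flat
    q = a2 + B2 @ coef[k1:]          # closest point on the second flat
    return p, q, np.linalg.norm(p - q)

# hypothetical example: two 2-dimensional flats in R^6
rng = np.random.default_rng(1)
a1, a2 = rng.normal(size=6), rng.normal(size=6) + 3.0
B1, B2 = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
p, q, dist = closest_points(a1, B1, a2, B2)
midpoint = (p + q) / 2               # the point a midpoint-based clustering would use
print(dist, midpoint)
```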

Chapter 2

Probabilistic Connectivity Threshold for Directional Antenna Widths

2.1 Introduction

Communication among wireless devices is of great interest in current wireless technology, where devices are part of sensor networks, mobile ad-hoc networks and RFID systems that take part in the emerging ubiquitous computing, and even satellite networks. These communication networks are usually extremely dynamic, where devices frequently join and leave (or crash) and, therefore, require probabilistic techniques and analysis. Imagine, for example, sensor networks that use directional antennas (saving energy and increasing communication capacity) among the sensors that should be connected even though they are deployed by an airplane that drops them from the air (just as in a smart dust scenario). What is the density of those sensors needed to ensure their connectivity? Is there a way to renew connectivity after some portion of the sensors stops functioning, perhaps by deploying only an additional fraction, uniformly distributed in the area with random orientation of the antennas? In this work, we try for the first time to suggest and analyze ways to ensure connectivity in such probabilistic scenarios. Namely, we have studied the problem of arranging randomly scattered wireless sensor antennas in a way that guarantees the connectivity of the induced communication graph. The main challenge here is to minimize energy consumption while preserving node connectivity. In order to save power, increase transmission capacity and reduce interference [33], the unbounded-range antennas do not communicate information in all directions but rather inside a wedge-shaped area. Namely, the coverage area of a directional antenna located at point p of angle α is two opposite sectors of angle α of the unit disk centered at p (see Figure 2.1). Throughout the chapter we will call α the communication angle. The smaller the angle is, the better it is in terms of energy saving. When knowing nothing about the future positioning of the antennas, each antenna may be directed in a random direction that may stay fixed forever. Therefore, we wish to find the minimum α > 0 so that no matter what finite set of locations the antennas are given, with high probability they can communicate with each other. Our goal is to specify necessary and sufficient conditions for the width of wireless antennas that enable one to build a connected communication network when antenna locations and directions are randomly and uniformly chosen. Throughout this chapter, we refer to an undirected graph where the nodes are the antennas, and two nodes are connected by an edge if and only if their corresponding antennas are located in each other's transmission area. However, our calculations hold for the directed case as well. Specifically, Theorems 2 and 3 hold for both cases, and the result proven by Theorems 4 and 5 also implies a connectivity threshold for the directed graph case. Previous results that handle wireless directional networks [2, 12] assume coordinated locations and orientations for the antennas. They show that a connected network can be built with antennas of width α = π/3. The same model's assumptions were used by [5] to study graph connectivity in the presence of interference and by [31] to optimize the transmission range as well as the hop-stretch factor of the communication network. A different model of a directed graph of directional antennas of bounded transmission range was studied in [11, 16].
In contrast to the above worst-case approaches, to the best of our knowledge, we consider for the first time the connectivity problem from a probabilistic perspective. Namely, we are interested in the minimal communication angle that implies a high probability for the graph to be connected, as a function of the number of nodes. This approach significantly reduces the required communication angle and is more general in the sense that we do not need the directing procedures employed by the algorithms of [2, 5, 12]. The probabilistic setting of the problem is related to other research in the field of continuum percolation theory [37]. The model for the points here is a Poisson point process, and the focus is on the existence of a connected component under different models of connections. For example, [57] studied the number of neighbors that implies connectivity. Papers [23, 40] focus on the minimum number r such that two points are connected if and only if their metric distance is ≤ r. In [30] the authors generalized the results in [23, 40] and proved that for a fractal in R^d, it holds with high probability that r ≈ (log n/n)^(1/d), where ≈ means that the ratio of the two quantities is between two absolute constants. Our main results (summarized in Theorems 2, 3, 4 and 5) discuss two different models. The first is related to the case where all the antennas are directed to one reference point

(specifically, we use the center of the disk). The second model generalizes the results by dealing with randomly chosen locations and directions. Assuming that the number of nodes is big enough, we show in both cases that with high probability, the threshold approaches zero. We believe that these results are important both for their combinatorial and geometric perspectives and for their implications on the design of wireless networks. This work was published in [14, 15].

2.2 Preliminaries

Let P = {p1(x,y,θ),..., pn(x,y,θ)} be a set of n points (or nodes). The point location (x,y) is chosen independently from the uniform distribution over the unit disk D (or over the unit disk boundary ∂D) in the plane. The antenna direction θ of each point is either fixed or chosen independently and uniformly at random from [0,π], and the antenna range is unbounded. Each point represents a communication station by a pair of opposite wedges (since, assuming unbounded range, the sector becomes a wedge) of angle α with direction θ at each node (see Figure 2.1).


Fig. 2.1 Fixing at each point pi two opposite wedges of angle α with direction θ.

Given nodes u and v with communication angle α, we say that u sees v if u lies in the coverage area (on the intercepted arc) of v.

Definition 1 (The communication graph) The communication graph G(P,E,α) is an undi- rected graph that consists of the set P, its communication angle α and the set of edges E = {(u,v)|u sees v and v sees u}. (Since E is defined by P(x,y,θ) and α, we omit E in the sequel from the graph notation, i.e., G(P,E,α) = G(P,α)).

Figure 2.2 illustrates examples of the communication graph over the disk and over the disk's boundary. Our model assumes that the communication between two nodes is limited only by the angle aperture (and not by the distance).


Fig. 2.2 The communication graph G = (V,E) = ({u,v,w},{(u,v)}) on the disk (on the left) and on the boundary (on the right). Note that G is not connected.

Hence, when α = 2π it is clear that the induced graph is connected with probability 1. In contrast, when α = 0 the probability of achieving a connected graph equals zero. As α increases, the probability that the induced graph is connected monotonically increases.
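A small simulation makes this monotone behaviour concrete. The sketch below (our illustration; the parameters and the use of the networkx library are our own choices, not part of the thesis) samples n antennas uniformly on the unit disk with uniform orientations, builds the undirected communication graph of Definition 1, and estimates the probability that it is connected for a few values of α.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(2)

def sees(p, theta, q, alpha):
    """True if q lies in the double wedge of opening angle alpha
    around the line through p with direction theta."""
    phi = np.arctan2(q[1] - p[1], q[0] - p[0])
    diff = (phi - theta) % np.pi          # opposite wedges: work modulo pi
    return min(diff, np.pi - diff) <= alpha / 2

def communication_graph(n, alpha):
    r = np.sqrt(rng.random(n))             # uniform points in the unit disk
    ang = 2 * np.pi * rng.random(n)
    pts = np.c_[r * np.cos(ang), r * np.sin(ang)]
    theta = np.pi * rng.random(n)          # uniform orientations in [0, pi)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sees(pts[i], theta[i], pts[j], alpha) and \
               sees(pts[j], theta[j], pts[i], alpha):
                G.add_edge(i, j)
    return G

for alpha in (0.05, 0.2, 0.5, 1.0):
    trials = [nx.is_connected(communication_graph(200, alpha)) for _ in range(20)]
    print(alpha, sum(trials) / len(trials))   # empirical connection probability
```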

Definition 2 (The critical communication angle) Given the set P, the critical communica- tion angle αˇ of the graph G(P,α) is the minimal angle such that G will be connected with probability 1 − o(1) as n tends to infinity iff α ≥ αˇ .

Having these definitions, we are ready to present the main results we obtained in this chapter:

1. Let P = {p1(x,y,θ),..., pn(x,y,θ)} be a set of n points such that their location (x,y) is chosen independently from the uniform distribution over the unit disk D, and their orientation θ is directed to the center of the disk. Then, there exist two constants C > c > 0, such that for all ε > 0 and for any sufficiently large n (depending on ε):

(a) if α > C·log n/n then the graph G(n,α) is connected with probability at least 1 − ε;
(b) if α < c·log n/n then the graph G(n,α) is connected with probability at most ε.

2. Let P = {p1(x,y,θ),..., pn(x,y,θ)} be a set of n points such that their location (x,y) is chosen independently from the uniform distribution over the unit disk D, and their orientation θ is chosen independently and uniformly over [0,π]. Then, there exist two constants C > c > 0, such that for all ε > 0 and for any sufficiently large n (depending on ε):

(a) if α > C·(log n/n)^(1/4) then the graph G(n,α) is connected with probability at least 1 − ε;
(b) if α < c·(log n/n)^(1/4) then the graph G(n,α) is connected with probability at most ε.

Remark 1 From the calculations in the proofs of Theorems 2, 3, 4, and 5, one can choose any c < 1/2 and C > π².

Remark 2 The connectivity threshold in the random direction settings (Theorems 4 and 5) holds for an undirected graph. We also prove that for a directed graph, α = Θ((log n/n)^(1/3)) is the necessary and sufficient threshold for graph connectivity, though not necessarily for graph strong connectivity.

2.2.1 Notations

• The intercepted arc of a point u with angle α is the part of the circle that lies between the two rays of α that intersect it.

• Let arc(u) be the intercepted arc of u and let |arc(u)| denote its length.

• Let wedge(u) be the coverage area of u.

• Let bisector(u) be the line that passes through the apex of wedge(u), that divides it into two equal angles.

• Let ∗ be an equivalence relation such that for a point u ∈ P, u∗ is the antipodal point of u (i.e., u and u∗ are opposite through the center).

• Throughout, we use the term w.h.p. as a shortcut for “with probability 1 − o(1) as n tends to infinity.”

• In the following, we sometimes use the term “random point” instead of “random variable.”

• Let D denote the unit disk (in R^2) and let ∂D denote its boundary.

• Let ∂D2 be the space of pairs {u, u*} of antipodal points in the unit disk boundary ∂D.

• Let X = {X1,...,Xn} be a set of uniform random variables defined over D.

• Let Y = {Y1,...,Yn} be a set of uniform random variables defined over ∂D.

• Let Z = {Z1,...,Zn} be a set of uniform random variables defined over ∂D2.

• Let (Y1,R1),...,(Yn,Rn) be pairs of uniform random variables defined over ∂D × [0,1].

• Let (Z1, χ1),...,(Zn, χn) be pairs of uniform random variables defined over ∂D2 × {0,1}.

2.2.2 Probability and the relation between Uniform and Poisson distributions

Throughout this chapter we use standard tools from continuum percolation and refer the reader to [37, 38] for a general introduction of the topic. The problem of finding the critical communication angle can be translated into the mathematical framework of “balls and bins”. We have n balls that are thrown independently and uniformly at random into b bins. The “balls” represent the set P of n nodes distributed over the disk (the disk’s boundary), and the “bins” are slices of the disk area (boundary) defined by the coverage area (by the intercepted arc) of the nodes. Note that the bins in this setting are not disjoint, and we will refer to this later. The distribution of the number of balls in a given bin is approximately a Poisson variable with a density parameter λ = n/b. Moreover, [38] shows (Chapter 5, Theorems 5.6 − 5.10) that the joint distribution of the number of balls in all the bins is well approximated by assuming the load at each bin is an independent Poisson random variable with λ = n/b (the precise definition and properties of the Poisson distribution also appear in that chapter). Let us call the scenario in which the number of balls in the bins are taken to be independent Poisson random variables with mean λ = n/b the Poisson case, and the scenario where n balls are thrown into b bins independently and uniformly at random the exact case. We justify the use of Poisson approximation instead of calculating the exact uniform case by using the following Theorem:

Theorem 1 ([38]) Let Λ be an event whose probability is either monotonically increasing or monotonically decreasing in the number of balls. If Λ has probability p in the Poisson case, then Λ has probability at most 2p in the exact case.

The theorem and its proof appear in [38] as Corollary 5.11. Another basic probability tool we are going to use is the “coupon collector” principle. Using Theorem 5.13 of [38], we get that the number of balls that need to be thrown until all b bins have at least one ball w.h.p. is b log b. In the sequel, we will use this result in a variety of settings.
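Both facts are easy to verify by simulation. The following minimal sketch (our illustration, with arbitrary parameters) compares the empirical fraction of empty bins with the Poisson prediction e^(−n/b), and estimates how many balls are needed until every bin is occupied, which should be roughly b log b.

```python
import numpy as np

rng = np.random.default_rng(3)

def empty_bins(n, b, trials=200):
    """Empirical fraction of empty bins vs. the Poisson prediction e^(-n/b)."""
    frac = np.mean([(np.bincount(rng.integers(0, b, n), minlength=b) == 0).mean()
                    for _ in range(trials)])
    return frac, np.exp(-n / b)

def coupon_collector(b, trials=200):
    """Average number of balls thrown until all b bins are hit (about b log b)."""
    totals = []
    for _ in range(trials):
        seen, thrown = set(), 0
        while len(seen) < b:
            seen.add(int(rng.integers(0, b)))
            thrown += 1
        totals.append(thrown)
    return np.mean(totals), b * np.log(b)

print(empty_bins(200, 100))       # empirical fraction vs. e^(-2)
print(coupon_collector(100))      # empirical count vs. 100 * log(100)
```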

2.2.3 Covering and Connectivity

The final issue we would like to address in these preliminaries is the connection between the problem of covering the disk (or the disk boundary) and the problem of ensuring a connected graph. The two problems are defined below:

Problem 1 Covering problem: We would like to find the minimal communication angle ᾱ, such that the coverage area (intercepted arc) induced by the set P defined over D (over ∂D) covers the whole disk (boundary): ∪_{u∈P} wedge(u) = D (or ∪_{u∈P} arc(u) = ∂D) with high probability.

Problem 2 Connectivity problem: We would like to find the minimal communication angle αˇ , such that the nodes’ coverage area (intercepted arc) induces a connected graph with high probability.

The disk cover is a necessary but not a sufficient condition for the graph to be connected, as illustrated in Figure 2.3.


Fig. 2.3 The disk is covered by the nodes; however, the induced graph is not connected (the graph contains only two edges).

The relation between the covering and the connectivity problems is given by the following lemma, which is explicitly proven in Lemma 2.2 of [30].

Lemma 1 ([30]) Given ᾱ, the minimal angle that induces a cover with high probability, α̌ = 3ᾱ is the expected connectivity threshold.

2.3 Centered Angles

In this section, we consider the case where the antennas' communication angle α is directed to the center of the disk. We define three different models, prove their equivalence and use one of them to resolve the connectivity threshold. The diagram below illustrates the way we prove that the three models are equivalent up to O(·) notation. The equivalence of these three models implies the following observation:

• Let α̌_G(Y,α) be the critical α of G(Y,α); then there exists a constant c > 0 such that the critical α of G(X,α) is α̌_G(X,cα) = Θ(α̌_G(Y,α)).

We prove each of the three equivalences separately, producing a new set of points from the given one, as follows:

G(X,Θ(α)) → G(Z,Θ(α)) (Lemma 2),  G(Z,Θ(α)) → G(Y,Θ(α)) (Lemma 3),  G(Y,Θ(α)) → G(X,Θ(α)) (Lemma 4),

where an arrow from one graph to another indicates that connectivity of the first implies connectivity of the second (up to constant factors in α).

If G(X,α) is connected ⇒ G(Z,α) is connected

Definition 3 Let φ : D −→ ∂D2 be a function defined as follows. φ projects every point u ∈ D to the antipodal pair {ū, ū*} ∈ ∂D2 located on the intersection of ∂D and the line that goes through u and o (the center of the disk). Note that φ is not defined over the center of the disk, i.e., the center is not transformed to any place on ∂D. This does not change the correctness of the function, since the probability that a random point will fall exactly on the center is 0.

Claim 1 Given the set {X1,...,Xn}, one can produce the set {Z1,...,Zn} by Zi ≡ Xi ◦ φ⁻¹, which implies that Zi is independently identically uniformly distributed over ∂D2.

Proof The claim is immediate from the definitions. Precisely, the set {X1,...,Xn} is defined over D, i.e., Xi : D −→ R. Similarly, Zi : ∂D2 −→ R. One can produce the set {Z1,...,Zn} by Zi : ∂D2 −→ D −→ R (first applying φ⁻¹ and then Xi), which is Zi ≡ Xi ◦ φ⁻¹.

Claim 2 φ defines for every edge (u,v) ∈ G(X,α) a connected path: Pū,v̄ = ((ū, v̄*),(v̄*, v̄),(v̄, ū*),(ū*, ū)) ∈ G(Z,α).


Fig. 2.4 The nodes u,v ∈ G(X,α) and their projected nodes ū, ū*, v̄, v̄* ∈ G(Z,α) defined by φ. Note that we did not sketch wedge(ū*) and wedge(v̄) due to visualization considerations.

Proof The existence of the edges (ū,ū*),(v̄,v̄*) is immediate from the antipodal definition. We show that (ū,v̄*) is in G(Z,α), and the proof for (v̄,ū*) is symmetric. W.l.o.g. we say that dist(ū*,v) ≤ dist(ū,v) and dist(v̄*,v) ≤ dist(v̄,v) (see Figure 2.4). Since ū lies on bisector(u), we get that wedge(u) ⊆ wedge(ū). Notice that (u,v) ∈ G(X,α) implies u sees v; hence, ū sees v. Since v and v̄* share the same bisector, it must hold that ū sees v̄*, i.e., (ū,v̄*) is in G(Z,α). Note that for any nodes u,v in G(Z,α) (as well as in G(X,α) or in G(Y,α)), if u sees v it must hold that v sees u (since all the angles are directed to the center).

Lemma 2 Given the communication graphs G(X,α) and G(Z,α) such that Zi = Xi ◦ φ⁻¹, if G(X,α) is connected, then G(Z,α) is connected.

Proof Given that G(X,α) is connected, for every two nodes u,v ∈ G(X,α) there exists a path Puv = ((u,x),...,(y,v)) ∈ G(X,α) that connects them. To show that G(Z,α) is connected, we provide a path Pū,v̄ by replacing every edge (x,y) ∈ Puv by the path Px̄,ȳ ∈ G(Z,α) using Claim 2, i.e., Pū,v̄ = (Pū,x̄,...,Pȳ,v̄).

If G(Z,α) is connected ⇒ G(Y,3α) is connected

Definition 4 Let ϕ : ∂D2 × {0,1} −→ ∂D be a function that maps the pair {u,u*} ∈ ∂D2 and a bit b to one node u′ ∈ ∂D from the pair, e.g., ϕ({u,u*,0}) = u and ϕ({u,u*,1}) = u*.

Claim 3 Given the set {(Z1, χ1),...,(Zn, χn)}, one can produce the set {Y1,...,Yn} by Yi ≡ (Zi, χi) ◦ ϕ⁻¹, which implies that Yi is independently identically uniformly distributed.

(The proof is similar to the proof of Claim 1.)

Lemma 3 Given the communication graphs G(Z,α) and G(Y,cα), for a constant c ≥ 3, such that Yi = (Zi, χi) ◦ ϕ⁻¹. If G(Z,α) is connected, then w.h.p. G(Y,cα) is connected.

Proof We will show that every covered arc in G(Z,α) is covered in G(Y,3α). Given the nodes u, u* ∈ G(Z,α), ϕ produces (w.l.o.g.) the node u′ = u ∈ G(Y,3α). arc(u) is covered in G(Y,3α) since u′ ∈ G(Y,3α). However, since u* ̸∈ G(Y,3α), we need to show that arc(u*) is covered in G(Y,3α). Partition arc(u′) and arc(u*) into two bins each, denoted (from left to right) by b′ℓ, b′r and bℓ, br, respectively (see Figure 2.5). Note that if there exists a node ℓ at b′ℓ, then br is covered, and if there exists a node r at b′r, then bℓ is covered. We prove that ℓ and r do exist using probabilistic considerations.


Fig. 2.5 Transforming the points u, u* ∈ G(Z,α) to the point u′ ∈ G(Y,3α). In G(Y,3α) the nodes ℓ, r cover arc(u*) = bℓ ∪ br.

After throwing n balls at random into b bins, the distribution of the number of balls in a given bin is approximately Poisson with mean n/b (see Section 2.2.2). The balls in our case are the nodes and the bins are the arcs. Hence, the density of a given bin (i.e., arc(u)) in G(Z,α) is λZ = #balls/#bins = 2n/(π/α) = 2nα/π. The density of a given bin of G(Y,3α) is λY = n/(π/(3α)) = 3nα/π. To finish the proof, we argue that ℓ and r do exist in G(Z,α); otherwise, u is connected only to u*, which implies that the graph is not connected (or contains only one antipodal pair). Since λZ < λY, we get that w.h.p. in G(Y,3α) there exists a node at b′ℓ and at b′r, which implies a cover.

If G(Y,α/2) is connected ⇒ G(X,α) is connected

Definition 5 Let ψ : ∂D × [0,1] −→ D be a function defined as follows. For every point ū ∈ ∂D, ψ maps a radius ru to the point u ∈ D located on the line going through ū and o, such that dist(u, ū) = √ru.

Claim 4 Given the set (Y1,R1),...,(Yn,Rn), one can produce the set X1,...,Xn by Xi ≡ (Yi,Ri) ◦ ψ⁻¹, which implies that Xi is independently identically uniformly distributed over D.

(The proof is similar to the proof of Claim 1.)

Lemma 4 Given the communication graphs G(Y,α/2) and G(X,α) such that Xi = (Yi,Ri) ◦ ψ⁻¹, and given that G(Y,α/2) is connected, then G(X,α) is connected.


Fig. 2.6 The point ū and its projected point u defined by ψ, illustrated on the left. Note that the distance between the two points is √ru for uniformly distributed ru. On the right, one can observe that if ū sees v̄, then u sees v.

Proof Let EY and EX denote the edge sets of G(Y,α/2) and G(X,α), respectively. We prove the connectedness dependency by establishing that EY ⊆ EX. Given the nodes ū, v̄ ∈ Y(α/2) such that (ū,v̄) ∈ EY, let u,v ∈ D be the nodes produced by applying ψ on ū, v̄, respectively (see Figure 2.6). Due to the geometrical relation between angles and intercepted arcs, we know that |arc(ū)| = α, and α ≤ |arc(u)| ≤ 2α. Since ū lies on bisector(u), we get that arc(ū) ⊆ arc(u). However, v̄ lies in arc(ū) (because

(ū,v̄) ∈ EY); hence, u sees v̄. Since v and v̄ share the same bisector, u sees v. Symmetric considerations imply that v sees u; hence, the edge (u,v) is in EX.

2.3.1 Finding the Connectivity Threshold

Given the set Y of n uniformly distributed points on ∂D, such that the angle α of every node is directed to the center, we show that the threshold for G(Y,α) is α̌(Y) = Θ(log n/n), as presented in the following two theorems.

Theorem 2 (Sufficient condition for connectivity) Given the set (Y,α), there exists a constant c > 0 such that when α ≥ c·log n/n, the communication graph G(Y,α) is connected w.h.p.

Proof The proof provides a cover of the disk's boundary, and then by Lemma 1 we can achieve the expected connectivity. Given a node u ∈ G(Y,α), we partition arc(u) into three equal-length bins denoted (from left to right) by bℓ, bmid and br. Since |arc(u)| = 2α, the length of each bin is 2α/3. Let ℓ be a node at bℓ, and uℓ be a node that lies at arc(ℓ). The angles are directed to the center; hence, the nodes u, ℓ and uℓ are connected. In the same way, we expand the cover to the left (with relation to arc(u)); see Figure 2.8. The same considerations are valid for the right side of the


Fig. 2.7 The node u with the angle α and its intercepted arc of length 2α. Note that every node v ∈ arc(u) is connected to u, i.e., (u,v) ∈ E.

arc and for u, r and ur, respectively. Note that the existence of ℓ, r, uℓ and ur is an outcome of the connectedness of G(Y,α). By the coupon collector principle (see Section 2.2.2), we know that the number of balls that need to be thrown until all bins have at least one ball with high probability is b log b, where b is the number of bins. Similarly, when placing n balls, n/log n bins suffice to guarantee that in each bin there will be at least one ball. Dividing the circumference into cells of length 2α/3, we have 2πr/(2α/3) = 3π/α bins (note that r = 1). Assigning 3π/α = n/log n bins, we get that α = O(log n/n) as expected.


Fig. 2.8 The left figure demonstrates that the existence of a node at the left side of the intercepted arc (like ℓ) expands the cover to the right (with respect to u). We partition each cell of length 2α into 3 segments to ensure that the left node ℓ and the right node r exist, expanding the cover to the right and to the left, respectively.
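The bin-counting argument in the proof of Theorem 2 can also be checked empirically. The sketch below (our illustration, with arbitrarily chosen constants and sample size) places n uniform points on the circle, cuts the circumference into arcs of length 2α/3 with α = c·log n/n, and reports how often every arc receives at least one point; the fraction moves from 0 towards 1 as the constant c grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def all_arcs_hit(n, c, trials=100):
    """Fraction of trials in which n uniform points on the circle hit every
    arc of length 2*alpha/3, where alpha = c * log(n) / n."""
    alpha = c * np.log(n) / n
    b = int(np.ceil(3 * np.pi / alpha))          # number of arcs of length 2*alpha/3
    hits = 0
    for _ in range(trials):
        angles = 2 * np.pi * rng.random(n)
        idx = (angles * b / (2 * np.pi)).astype(int) % b
        hits += len(np.unique(idx)) == b
    return hits / trials

for c in (0.5, 4.0, 16.0):
    print(c, all_arcs_hit(20000, c))     # rises from 0 towards 1 as c grows
```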

Theorem 3 (Necessary condition for connectivity) Given the set (Y,α), there exists a constant c > 0 such that if α < c·log n/n, then w.h.p. the induced communication graph G(Y,α) is not connected.

Proof When α < c·log n/n, every node induces an intercepted arc of length ≤ log n/n. We will show that there exists an arc that does not contain any node; hence, its antipodal arc is not covered, which yields by Lemma 1 that G(Y,α) is not connected. Let I = {Ii}, i = 1,...,n/(c log n), be a set of n/(c log n) disjoint arc intervals induced by n/(c log n) nodes of G(Y,α). The existence of I is proven in Claim 9 in the Appendix.

The load in every bin i is an independent Poisson random variable Xi with mean λ = n/|I| (see the explanation in Section 2.2.2). Let χi be an indicator random variable that is 1 when the i-th bin is empty and 0 otherwise. The density parameter of Xi is λ = n/|I| = n/(n/(c log n)) = c·log n. Thus, the probability that a bin i is empty is Pr(χi = 1) = Pr(Xi = 0) = e^(−c·log n) = (1/n)^c. Consequently, the probability that a bin i is not empty is Pr(χi = 0) = 1 − (1/n)^c. Using the independence property of the Poisson variables, the probability that every bin contains at least one ball is

Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ|I| = 0) = (1 − (1/n)^c)^|I| = (1 − (1/n)^c)^(n/(c log n))    (2.1)

When setting 0 < c < 1, we get that lim_(n→∞) (1 − (1/n)^c)^(n/(c log n)) = 0, which implies that w.h.p. there exists an empty bin, i.e., an arc fragment that is not covered.

2.4 Random Angle Direction

Given the set P of n points uniformly distributed over the unit disk, we now assume that the direction of the antenna is a uniform random variable θi. Hence, each point can be represented by three parameters (x,y,θ), where x and y indicate the point location, distributed over D, and θ, distributed over [0,π], is the direction of the antenna. Since the problem has three dimensions, it makes sense to apply a transformation from two dimensions to three-dimensional space. Observe that the set P lies over a torus T in R^3, such that the unit disk is swept around an axis of length 2π (all the possible directions¹); see Figure 2.9. In this setting, the points are distributed over the volume of T, which is VT = πr²·2πR = {R = r = 1} = 2π². To obtain a probability space, we normalize this number to equal one.

¹ Note that in this setting, θ is distributed over [0,2π] instead of [0,π]. However, these are equal in O(·) notation.

Note that the transformation from the planar disk to the 3D torus does not refer to the center of the disk, i.e., the center is not transformed to any place in the torus. This does not change the correctness of the transformation, since the probability that a random point will fall exactly on the center is 0.

Fig. 2.9 The torus T in R^3. The blue circle (the “minor radius”) is swept around the axis, defining the red circle. The radius R of the red circle (the “major radius”) is the distance from the center of the tube to the center of the torus.

Our goal is to find the minimal angle that suffices to guarantee that the induced communication graph is connected; hence, we would like to find the set of points that induces the minimal coverage area and ensure that these points induce a cover (which in turn yields a connected graph, see Lemma 1). Observing that when the node is located on the boundary and the node's direction is close to the tangent direction, the coverage area is minimal, we focus on the set B of points which is defined as follows: B =

{(xi,yi,θi) ∈ Tex ∩ P : θi is the tangent direction}, where Tex is the external ring (a.k.a. annulus) of T (see Figure 2.10). Namely, B is the set of points whose location is close to the disk's boundary and whose direction is close to the tangent direction.

Claim 5 For any constant c ≥ 4 there exists a natural n0 such that for all n ≥ n0, a tangent angle induces a coverage area of size ≤ c(α(n))³.

Proof Calculating the area of the circle's sector and subtracting the unnecessary area (namely the triangle ∆uot in Figure 2.10) results in ½(2α − sin 2α)r². Since the radius is r = 1 and sin α ≥ α − α³/3! (by Taylor expansion), for any sufficiently large n we get ½(2α − (2α − (2α)³/3!)) ≤ cα³.
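As a quick numerical sanity check of the Taylor bound used in this proof (our own illustration, not part of the original argument), one can compare the exact area ½(2α − sin 2α) with cα³ for small α; the ratio approaches 2/3, well below the constant c ≥ 4 in the claim.

```python
import numpy as np

for alpha in (0.5, 0.1, 0.01):
    exact = 0.5 * (2 * alpha - np.sin(2 * alpha))   # sector area minus triangle, r = 1
    print(alpha, exact, exact / alpha ** 3)         # ratio approaches 2/3 < 4
```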

Claim 6 For any constant c ≥ 2 there exists a natural n0 such that for all n ≥ n0, B lies over a volume VB ≥ c(α(n))².


Fig. 2.10 When the node is located on the boundary and the node's direction is close to the tangent direction, the coverage area is minimal. We calculate the volume by eliminating the interior torus with r = 1 − α from VT.

Proof We are interested in the points that are located close to the boundary and whose direction is close to the tangent direction. Namely, in the three-dimensional representation, these are the points lying on the external ring of T. Denote the interior torus with radius 1 − α by Tin and the external ring (a.k.a. annulus) by Tex, such that T = Tex ∪ Tin. The volume of Tex is VTex = VT − VTin = 1 − π(1 − α)²·2πR/(2π²) = 1 − (1 − 2α + α²) = 2α − α². We multiply this by α to insert the direction constraint (see Claim 8 below) and achieve VB = 2α² − α³ = Ω((α(n))²). (Note that α is a function of n, such that when n → ∞ then α → 0.)

Each node i = (xi,yi,θi) ∈ B defines a ball (spherical cap) Hi of the set of nodes that can communicate with i: Hi = {(xj,yj,θj) ∈ P ∩ Tex : θj ∈ [θi − α, θi + α]} (i.e., Hi is the 3D shape symmetric to the 2D sector defined by wedge(i)).

Claim 7 For any constant c ≥ 4 there exists a natural n0 such that for all n ≥ n0, the volume of the ball Hi is VHi ≤ c(α(n))⁴.

Proof From Claim 5, we find that the area of wedge(i) is ≤ cα³ (for c ≥ 4). We multiply this by α to insert the direction constraint (see Claim 8) and achieve the expected volume.

For the directed communication setting, we can avoid multiplying by α, since the nodes in the ball Hi may not be directed towards the nodes that reside in B. Hence, the volume VHi is ≤ α³, and the connectivity threshold becomes α̌ = Θ((log n/n)^(1/3)).

Claim 8 Given the nodes u, v, each with a communication angle of size α, the range of possible directions that induce adjacency between v and u is α.


Fig. 2.11 The range of possible directions that induce adjacency between v and u is α.

Proof Fix u's direction and rotate v's direction. We can observe that the range of v's directions that preserves the adjacency of v and u (i.e., u and v see each other) is [0,α].

Theorem 4 (Sufficient condition for connectivity) Given n nodes that are uniformly distributed on the unit disk, there exists a constant c ≥ 4 such that if α ≥ c·(log n/n)^(1/4), then w.h.p. the induced communication graph is connected.

Proof The set B represents the nodes that induce the minimal communication volume

(VHi). To achieve the space cover, we can partition it into disjoint cells of size VHi and use probabilistic considerations to guarantee that w.h.p. there exists at least one ball in every cell. Claim 7 implies that for a constant c ≥ 4, VHi ≤ cα⁴; hence, we partition the whole volume into 1/(cα⁴) cells. By the coupon collector principle (see Section 2.2.2), since for any α ≥ c·(log n/n)^(1/4) the number of nodes n is bigger than (1/(cα⁴))·log(1/(cα⁴)), it yields that with high probability there is no empty cell.

Theorem 5 (Necessary condition for connectivity) Given n nodes that are uniformly distributed on the unit disk, there exists a constant c < 4 such that if α < c·(log n/n)^(1/4), then w.h.p. the induced communication graph is not connected.

Proof We show that when α = c·(log n/n)^(1/4), the graph G is not connected, since B induces a volume that is not covered w.h.p. From Claim 7 we get that each node induces a ball Hi with volume ≤ α⁴. Claims 10 and 11 argue that there exist √n such disjoint balls. Let H = ∪_{Hi ∩ Hj = ∅} Hi be the set of these disjoint balls. To complete the proof, we show that w.h.p. there exists a ball Hi ∈ H that does not contain any node j, i.e., the node i ∈ B is not connected. When throwing n balls (nodes) at random into b bins (balls, the set H), the distribution of the number of balls in a given bin is approximately Poisson with λ = n/b (see Section 2.2.2).

Let χi denote an indicator random variable that is 1 when the ball Hi is empty and 0 otherwise. Let Xi be a Poisson random variable over the number of nodes in Hi. The density parameter of Xi is λ = n·VHi = n·α⁴ = c·log n. Therefore, the probability that a given ball Hi is empty is

Pr(χi = 1) = Pr(Xi = 0) = e^(−λ) = e^(−c·log n) = (1/n)^c

Using the independence property of the Poisson variables, we get that the probability that all of the balls in H are not empty is

Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ√n = 0) = Pr(χ1 = 0)·Pr(χ2 = 0) ⋯ Pr(χ√n = 0)    (2.2)
= (1 − (1/n)^c)^√n    (2.3)

Since lim_(n→∞) (1 − (1/n)^c)^√n = 0, it implies that, with high probability, there exists an empty ball. Hence, the graph is not connected.

2.5 Discussion

In this study, we have analyzed the connectivity threshold for directional antennas. Our results show that if one can adjust the direction of the antennas, then in order to guarantee network connectivity the angle should be Θ(log n/n). In contrast, if the direction of the antenna is a random variable, then the angle should be Θ((log n/n)^(1/4)). This gives a polynomial gap between the two models. One of the simplest ways to increase the capacity of the network is by directional antennas. Our work defines theoretical bounds on how small the angle of the directional antennas can be in order to maintain connectivity. If we compare the classical results on connectivity in the unit disk graph model to our results, we find two main differences. The first difference is that in the unit disk graph model, the minimal graph that maintains the connectivity of the network is the Euclidean minimal spanning tree. With directional antennas, the minimal graph is closer in nature to a Hamiltonian cycle. More accurately, if our points are located on the boundary of a disk, then it is the Hamiltonian cycle. Moreover, our analysis indicates that near the critical angle the network is totally different from the usual network graph implied by the unit disk; while in the unit disk graph model the communication moves over short distances, in the directional antenna model the communication prefers long distances, i.e., from a point p to the antipodal point p*. Throughout the study, we have assumed that the wireless antennas are scattered on a unit disk. We believe that the disk assumption can be relaxed to accommodate a more general and realistic setting; namely, our results hold for any convex fat object with curvature > 0, denoted by S. The fatness constraint means that there are concentric balls Bin ⊂ S ⊂ Bout such that diam(Bin)/diam(Bout) ≥ β, where diam(A) denotes the diameter of A; see Figure 2.12. Given that the antenna nodes are uniformly distributed on S, the fatness property provides that the number nin of nodes in Bin is proportional to the number nout of points that lie in Bout; namely, there exists a constant c > 0 such that nin = c·nout. The convexity property of S implies that every two nodes u, v that lie in S see each other if each lies in the wedge of the other. We believe that these three properties yield that the connectivity threshold we presented for the unit disk D suffices for S.


Fig. 2.12 The concentric balls Bin ⊂ S ⊂ Bout. If S is fat, then diam(Bin)/diam(Bout) ≥ β.

2.6 Appendix

Claim 9 Let I = {Ii}, i = 1,...,n, be a set of arc intervals of length |Ii| ≤ log n/n. There exists a positive constant c and a natural n0 such that for all n ≥ n0 and for all I, there exists a subset I′ = {Ii_j}, j = 1,...,n/(c log n), of I consisting of n/(c log n) disjoint intervals induced by the nodes of G(Y,α).

Proof Given the set of intervals I, we partition the set into 1/(c|Ii|) disjoint cells. Let I′ ⊆ I denote the set of arcs that fall in the non-neighboring cells of I. By the construction, it holds that I′'s arcs are disjoint. We argue that I′ contains n/(c log n) disjoint fragments. After throwing n balls at random into b bins, the distribution of the number of balls in a given bin is approximately Poisson with mean n/b (see Section 2.2.2). Let Xi be the Poisson

random variable over the number of balls in bin i. Let χi denote an indicator random variable that is 1 when the i-th bin is empty and 0 otherwise. The density parameter of Xi is λ = n/(1/(c|Ii|)) = c·log n; hence, the probability that a given bin i is empty is Pr(χi = 1) = Pr(Xi = 0) = e^(−λ)·λ⁰/0! = e^(−λ) = (1/n)^c. Consequently, the probability that the i-th bin contains at least one ball is Pr(χi = 0) = 1 − (1/n)^c. Using the independence property of the Poisson variables, the probability that there is at least one ball in all the bins is

Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ|I′| = 0) = (1 − (1/n)^c)^|I′| = (1 − (1/n)^c)^(n/(c log n))    (2.4)

Claim 10 There exist a constant c > 0 and natural n0 such that for all n ≥ n0, there are at least cn1/2 nodes in the set B.

2 Proof Claim6 implies that B lie at a volume of VB = Ω((α) ). The assumption of Poisson distribution implies the independent parts of the volume, hence, we can multiply this volume with n, achieving that there exist a constant c > 0 such that the number of points at this volume is p ncα2 = nc(logn/n)1/2 = cn1/2 logn = Ω(n1/2)

2 p √ Claim 11 Given the volume VB = α = logn/n and partition this volume into n balls 4 with volume VH α = logn/n, there exist a constant c > 0 and natural n > 0 such that for i √ 0 all n ≥ n0, with high probability there is a subset of at least c n disjoint balls induced by the graph nodes.

Proof After throwing n balls at random into b bins, the distribution of the number of balls in a given bin is approximately Poisson with mean n/b (see Section 2.2.2). Let Xi be a Poisson √ 4 √ random variable with parameter λ = c n/(1/V ) = clogn/ n and χi be an indicator that is 1 when the ith bin is empty and 0 otherwise. The probability that a given bin i is empty is √ −λ c/ n Pr(χi = 1) = Pr(Xi = 0) = e ≤ (1/n) . 26 Probabilistic Connectivity Threshold for Directional Antenna Widths

√ 1 c/ n By the union bound, the probability that the bin i is not empty is Pr(χi = 0) = 1− n . The probability that all bins are not empty is

  √c !|A| √ 1 n  − √c  n Pr χ = 0 ∩ χ = 0 ∩ ... ∩ χ = 0 = 1 − = 1 − n n (2.5) 1 2 |A| n

Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ|A| = 0) = (1 − (1/n)^(c/√n))^|A| = (1 − n^(−c/√n))^(√n)    (2.5)

Since for every c, lim_(n→∞) (1 − n^(−c/√n))^(√n) = 0, w.h.p. our chosen O(√n) cells are not empty, i.e., there exist nodes in the graph that induce these O(√n) disjoint balls.

Chapter 3

Big Data Interpolation using Functional Representation

3.1 Introduction

Consider the task of representing information in an error-tolerant way, such that it can be formulated even if it contains noise or even if the data is partially corrupted and destroyed. In this chapter we present the concept of data interpolation in the scope of data aggregation and representation, as well as the new big data challenge, where abstraction of the data is essential in order to understand the semantics and usefulness of the data. For example, a sensor network may interact with the physical environment, while each node in the network may sense the surrounding environment (e.g., temperature, humidity, etc.). The environmentally measured values should be transmitted to a remote repository or remote server. The concept of ‘smart dust’ [28], where plenty of tiny devices are scattered around, is a motivation for representing information in a concise manner. Note that the environmental values usually contain noise and there can be malicious inputs, i.e., part of the data may be corrupted. In contrast to distributed data aggregation, where the resulting computation is a function such as count, sum and average (e.g. [20, 27, 43]), in distributed data interpolation our goal is to represent every value of the data by an abstracting function. Our computational model consists of sampling the data and estimating the missing information using polynomial manipulations. Our motivation comes from big data systems management, where big data abstraction becomes one of the most important tasks in the presence of the enormous amount of data currently produced. The popular abbreviation ‘VVV’ summarizes the main challenges that arise while handling big data: Volume, Velocity and Variety (see e.g. [8]). Volume emphasizes the big mass of information that is clearly the major challenge, and is the one that is most easily recognized. Velocity refers to both the rate at which data arrives and the quick time in which it needs to be processed. By Variety, we mean heterogeneity of data types and their representation, as well as noise and outliers. The idea of a concise function that captures the essence of the information faces these three aspects as well; there is no need to maintain the whole data set, as the correct information is succinctly represented by the function. This function can serve as a description of the essential semantics of the data. Most of all, we address the problem of noisy, corrupted, and adversarial data. An extended abstract of this work was published at the International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics [13].

Polynomial Fitting to Noisy and Adversarial Data.

Generally, this study proposes the use of polynomials to represent large amounts of data. The process works by sampling the data, then using this sample to construct a polynomial whose distance (according to the ℓ∞ metric) from the polynomial constructed using the whole data set is small. Our model assumes that we are given a set of points {(x_1i,...,x_ki)}_{i=1}^N in a finite domain [a,b]^k ⊆ R^k. Additionally, for every point we are given a corresponding real value of an unknown continuous function, y_i = f(x_1i,...,x_ki). Let S denote the set of pairs (x̄_i, y_i), where x̄ is the vector (x_1,...,x_k). Formally, we examine the following problem in our study:

Definition 6 (Polynomial Fitting to Noisy and Byzantine Data Problem) Given the set S as defined above, a noise parameter δ > 0 and a threshold ρ > 0, we would like to find a polynomial p of total degree d satisfying

p(x̄) ∈ [y − δ, y + δ] for at least a ρ fraction of the pairs (x̄, y) ∈ S   (3.1)

Such a polynomial will be called a (ρ,δ)-fitting polynomial.
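As a concrete illustration of Definition 6, the following small Python sketch tests whether a candidate polynomial is (ρ,δ)-fitting on a sample; the function name and the toy data are our own illustrative choices, not part of the formal definition.

import numpy as np

def is_rho_delta_fitting(p, S, rho, delta):
    """Definition 6: p is (rho, delta)-fitting if p(x) lies in [y - delta, y + delta]
    for at least a rho fraction of the pairs in S.  `p` is any callable taking a
    point x (a 1-D array) and returning a float; `S` is a list of (x, y) pairs."""
    hits = sum(1 for x, y in S if y - delta <= p(np.asarray(x)) <= y + delta)
    return hits >= rho * len(S)

# Hypothetical usage: f(x1, x2) = x1 + x2 sampled with small noise and one outlier.
S = [((0.1, 0.2), 0.31), ((0.5, 0.4), 0.88), ((0.9, 0.1), 1.02), ((0.3, 0.3), 5.0)]
p = lambda x: x[0] + x[1]
print(is_rho_delta_fitting(p, S, rho=0.75, delta=0.05))   # True: 3 of 4 points fit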

Given that the function f is real-valued and continuous on the finite domain, by the Weierstrass approximation theorem [41] we know that for any given ε > 0 there exists a polynomial p such that ∥f − p∥∞ < ε on [a,b]^k. This tells us that our desired polynomial p exists (i.e., taking ε = δ and ρ = 1 satisfies Eq. 3.1). One obvious candidate for constructing an approximating polynomial is interpolation at equidistant points. However, the sequence of interpolation polynomials does not converge uniformly to f, due to Runge's phenomenon [17]. Chebyshev interpolation (i.e., interpolating f at the points defined by the Chebyshev polynomial) minimizes Runge's oscillation, but it forces us to know the value at specific locations, which is not reasonable under the underlying model assumptions. Taylor polynomials are also not appropriate; even setting aside questions of convergence, they are applicable only to functions that are infinitely differentiable, and not to all continuous functions. Considering other classical curve-fitting and approximation theories [45], most research has used the ℓ2 norm of the noise, as in the least-squares error method. These approaches do not suffice for the adversarial noise we have assumed here. To our knowledge, only Arora and

Khot [4] addressed the ℓ∞ noise model that fits our problem. In Section 3.3 we introduce the polynomial-fit generalization, where we provide a polynomial-time algorithm dealing with multivariate data, based on Arora and Khot's results. The polynomial fitting problem as defined above can also be approached via error-correcting code theory. From that point of view, an extensive literature deals with the appearance of outliers in otherwise noise-free data (i.e., δ = 0 and ρ < 1). In Section 3.2 we present an algorithm that handles Byzantine noisy data based on the Welch-Berlekamp [56] error-elimination method, where we suggest a means to deal with corrupted data appearing in multi-dimensional inputs.

Our contributions.

The polynomial fitting to noisy and Byzantine data problem, as defined above, lies between error-correcting codes and the classical theory of approximation and curve fitting. As such, the study of the properties and solutions of the polynomial fitting problem intersects with aspects of both fields, especially in the scope of big data research. The primary results of the chapter can be summarized in the following theorems, while the precise definitions and the detailed proofs are presented in the sequel.

Theorem 6 There is an algorithm that reconstructs (ρ,δ)-fitting k-dimensional polynomials for discrete-valued noise. The algorithm reconstructs the unknown total-degree-d polynomial within ℓ∞ error in time polynomial in δ, d and k.

Theorem 7 There is an algorithm that reconstructs a (ρ,δ)-fitting k-dimensional polynomial where ρ is bounded away from 1/2. The algorithm reconstructs the unknown total-degree-d polynomial within ℓ∞ error, with high probability over the sample choice, in time polynomial in δ, d and k.

Basic Definitions.

• For two multivariate polynomials p : [a,b]^k → R and q : [a,b]^k → R, their ℓ∞ distance is defined to be ∥p − q∥∞ = sup_{x̄ ∈ [a,b]^k} |p(x̄) − q(x̄)|.

• A monomial in a collection of variables x_1,...,x_n is a product x_1^(α_1) x_2^(α_2) ··· x_n^(α_n), where the α_i are non-negative integers.

• The total degree of a multivariate polynomial p is the maximum degree of any term in p, where the degree of a particular term is the sum of its variable exponents.

• A polynomial q is a δ-approximation to p if ∥p − q∥∞ ≤ δ.

• Let ∆ : R² → {0,1} denote the difference indicator between two values, defined as follows: ∆(x,y) = 1 if x ≠ y, and ∆(x,y) = 0 otherwise.

3.2 Discrete Finite Noise

In this section, we study one aspect of the polynomial fitting problem posed in Definition 6, where the data contains a bounded number of Byzantine values (bounded via ρ) and the noise constraint (denoted by ∆) is relaxed to be discrete. First, we treat two-dimensional data using the Welch-Berlekamp (WB) algorithm [56] and extend it to cope with (discrete) noisy data. In the second part of this section, we generalize the WB algorithm to handle multidimensional data.

3.2.1 Handling the discrete noise

Welch and Berlekamp (WB) addressed the problem of polynomial reconstruction in their decoding algorithm for Reed-Solomon codes [56]. The main idea of their algorithm is to describe the received (partially corrupted) data as a product of polynomials. Their solution holds for the noise-free case with a limited fraction of corrupted data (δ = 0, ρ > 1/2). Almost 30 years later, Sudan's list-decoding algorithm [52] relaxed the Byzantine constraint (δ = 0, ρ can be less than 1/2) by using bivariate polynomial interpolation. Those concepts do not hold up well in the noisy case, since they rely on the roots of the polynomial and on the divisibility of one polynomial by another, methods that are problematic for noisy data (as shown in [4], Section 1.2). Here, we use the WB algorithm [56] as a “black box” to obtain an algorithm that handles the discrete-noise variant of the polynomial-fitting problem. Given a data set (x,y) = (x_i, y_i)_{i=1}^N that is within a distance of t = (1 − ρ)N from some polynomial p(x) of degree < d (i.e., ∑_{i=1}^N ∆(y_i, p(x_i)) ≤ t), the WB approach to eliminating the irrelevant data is to use the roots of an object called the error-locating polynomial, e. In other words, we want e(x_i) = 0 whenever p(x_i) ≠ y_i. This is done by defining another polynomial q(x) ≡ p(x)e(x). To resolve these polynomials we need to solve the linear system

q(x_i) = y_i e(x_i) for all i.   (3.2)

Welch and Berlekamp show that e(x) divides q(x) and that p(x) can be recovered as the ratio p(x) = q(x)/e(x), in O((d + 2t + 1)^ω) running time (where ω is the matrix-multiplication exponent). In Algorithm 1, we use the WB method as a subroutine to manage the noisy, corrupted data.
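To make the black-box step concrete, the sketch below sets up the linear system q(x_i) = y_i e(x_i) for the univariate, noise-free case and recovers p = q/e; it is a minimal illustration of the WB idea under the stated sample size N = d + 2t + 1, not the thesis's implementation, and the helper name is ours.

import numpy as np

def welch_berlekamp(xs, ys, d, t):
    """Minimal Welch-Berlekamp sketch (univariate, delta = 0): find q of degree
    <= d + t and a monic error-locator e of degree t with q(x_i) = y_i * e(x_i)
    for all i, then return p = q / e.  Assumes N = d + 2t + 1 samples with at
    most t corrupted values."""
    nq, ne = d + t + 1, t                              # unknown coefficients of q and e
    A = np.zeros((len(xs), nq + ne))
    b = np.zeros(len(xs))
    for i, (x, y) in enumerate(zip(xs, ys)):
        A[i, :nq] = [x ** j for j in range(nq)]        # q(x_i)
        A[i, nq:] = [-y * x ** j for j in range(ne)]   # -y_i * (e(x_i) - x_i^t)
        b[i] = y * x ** t                              # monic leading term of e
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    q = np.polynomial.Polynomial(sol[:nq])
    e = np.polynomial.Polynomial(np.append(sol[nq:], 1.0))
    p, remainder = divmod(q, e)                        # remainder should be ~0
    return p

# Hypothetical usage: p(x) = 1 + 2x of degree d = 1, one corrupted value (t = 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, -7.0, 7.0]                             # the value at x = 2 is corrupted
print(welch_berlekamp(xs, ys, d=1, t=1))               # ~ 1 + 2x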

Algorithm 1 Reconstruct (ρ,δ)-fitting polynomial where δ is discrete

Input: d, S, t, δ, {∆_i}
Output: a polynomial p_i(x) which is the (ρ,δ)-fitting polynomial
 i ← 0
 repeat
   i ← i + 1
   S_i ← S + ∆_i
   p_i(x) ← WB(S_i, d, t)
 until p_i(x) ∈ [y − δ, y + δ] for at least a ρ fraction of S
 Return p_i(x)

The algorithm receives the degree d > 0 of the goal polynomial; a set S of N = d + 2t + 1 datapoints (x_i, y_i)_{i=1}^N; the Byzantine appearance bound 0 < t = (1 − ρ)N; the noise bound δ; and the set of vectors {∆_i} = {(a_1,...,a_{d+2t+1}) : a_j ∈ [−δ,δ]} of all possible (discrete) noise.

At every step i, we add to S a different noise vector ∆_i and reconstruct the polynomial p_i using the WB algorithm. The resulting polynomial p_i is tested, where the criterion is that p_i is within δ of all nodes except the Byzantine ones (according to the maximal number of Byzantine values as defined by ρ). Note that if the desired polynomial's degree d is not given, we can search for the minimal degree of a polynomial that fits δ and ρ in a binary-search fashion. In addition, note that Ar et al. [3] also suggest a way to reconstruct a (ρ,δ)-fitting polynomial where δ is discrete. They also insert all possible (discrete) noise vectors sequentially, but use a different ‘black box’ procedure and do not deal with multivariate polynomials as we present in the sequel.
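Under the assumption that the noise takes values from a small finite set, the control flow of Algorithm 1 can be sketched as a brute-force loop over candidate noise vectors around the black box; the snippet below builds on the hypothetical welch_berlekamp helper sketched above, and is exponential in N exactly as the enumeration of {∆_i} suggests.

import itertools
import numpy as np

def algorithm1(S, d, t, delta_values, rho):
    """Sketch of Algorithm 1: try every discrete noise vector Delta_i, add it to S,
    run WB, and keep the first polynomial that is (rho, delta)-fitting on S.
    `delta_values` is the finite set of possible per-point noise values;
    welch_berlekamp is the sketch from Section 3.2.1."""
    xs = [x for x, _ in S]
    delta = max(abs(v) for v in delta_values)
    for noise in itertools.product(delta_values, repeat=len(S)):
        ys = [y + eps for (_, y), eps in zip(S, noise)]      # S_i <- S + Delta_i
        try:
            p = welch_berlekamp(xs, ys, d, t)                # WB as a black box
        except Exception:
            continue
        hits = sum(1 for x, y in S if abs(p(x) - y) <= delta)
        if hits >= rho * len(S):                             # acceptance test
            return p
    return None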

3.2.2 Multidimensional Data.

To generalize the WB algorithm to handle multidimensional data, we need to reformulate it to deal with multivariate polynomials. This is a challenging task due to the infinitely many roots those polynomials may have (for example, P(x_1,x_2) = x_1x_2 has infinitely many roots in R²) and, as previously mentioned, the WB method is strictly based on the polynomial's roots. A suggested method to handle three-dimensional data is to assume that the values of the datapoints in one direction (e.g., the x-direction) are distinct. This assumption helps us to define the error-locating polynomial e in the x-direction only (or, symmetrically, over the y-axis). The assumption can be satisfied by using i.i.d. samples or by enlarging the sample size from N to N², which ensures (using the pigeonhole principle) that we have at least N distinct values. The 3-dimensional polynomial reconstruction is described in Algorithm 2.

Algorithm 2 Reconstruct the polynomial p(x,y) representing the true data
Input: d, t, (x_i, y_i, z_i)_{i=1}^N
Output: Polynomial p(x,y) of total degree at most d, or fail.
Step 1:
  Compute a non-zero univariate polynomial e(x) of degree exactly t and a bivariate polynomial q(x,y) of total degree d + t such that:
    z_i e(x_i) = q(x_i, y_i),  1 ≤ i ≤ N.
  if e(x) and q(x,y) do not exist then
    output fail.
  end if
End

Step 2:
  if e(x) does not divide q(x,y) then
    output fail.
  else
    compute p(x,y) = q(x,y)/e(x).
  end if
  if ∑_i ∆(z_i, p(x_i, y_i)) > t then
    output fail.
  else
    output p(x,y).
  end if
End

The algorithm receives the total degree d > 0 of the goal polynomial, the Byzantine appearance bound 0 < t = (1 − ρ)N and a set of N triples (x_i, y_i, z_i) with distinct x_i's. In the first step we define the error-locating polynomial e and the product polynomial q. In the next step we check whether e divides q and whether the distance between the resulting polynomial and the z_i's does not exceed the bound t. Next we argue that this algorithm fulfills its goal, and illustrate its behavior with an example.
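A numerical sketch of Algorithm 2 can be written with NumPy and SymPy: Step 1 is a linear system over the monomials of q and the (monic) e, and Step 2 is a polynomial division. The helper name and the use of floating-point least squares are our own illustrative choices; an exact implementation would use rational arithmetic.

import numpy as np
import sympy as sp

def algorithm2(points, d, t):
    """Sketch of Algorithm 2: given triples (x_i, y_i, z_i) with distinct x_i and at
    most t corrupted z_i, find a monic univariate e(x) of degree t and a bivariate
    q(x, y) of total degree <= d + t with z_i e(x_i) = q(x_i, y_i) (Step 1), then
    return p = q / e (Step 2)."""
    x, y = sp.symbols('x y')
    q_mons = [x**a * y**b for a in range(d + t + 1) for b in range(d + t + 1 - a)]
    A = np.zeros((len(points), len(q_mons) + t))
    rhs = np.zeros(len(points))
    for i, (xi, yi, zi) in enumerate(points):
        A[i, :len(q_mons)] = [float(m.subs({x: xi, y: yi})) for m in q_mons]
        A[i, len(q_mons):] = [-zi * xi**a for a in range(t)]   # lower-order terms of e
        rhs[i] = zi * xi**t                                    # e is monic: e = x^t + ...
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    q = sum(c * m for c, m in zip(sol[:len(q_mons)], q_mons))
    e = x**t + sum(c * x**a for a, c in enumerate(sol[len(q_mons):]))
    p, r = sp.div(sp.expand(q), sp.expand(e), x, y)            # r should be (numerically) 0
    return sp.expand(p)

# Hypothetical usage: p(x, y) = x + y, d = 1, t = 1, seven points, one corrupted.
pts = [(1, 1, 2), (2, 2, 4), (6, 1, 7), (4, 3, 7), (8, 2, 0), (9, 1, 10), (3, 7, 10)]
print(algorithm2(pts, d=1, t=1))          # ~ 1.0*x + 1.0*y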

Theorem 8 Let p be an unknown polynomial of total degree d in two variables. Given a threshold t > 0 and a sample S ⊂ R³ of N = \binom{d+t+2}{d+t} + t distinct points (x_i, y_i, z_i) such that z_i ≠ p(x_i, y_i) for at most t values, the algorithm above reconstructs p in O(N^ω) running time (where ω is the matrix-multiplication exponent).

Proof The proof of the Theorem above follows from the subsequent claims.

Claim 12 (Correctness) There exists a pair of polynomials e(x) and q(x,y) satisfying Step 1 such that q(x,y) = p(x,y)e(x).

Take the error-locator polynomial e(x) and q(x,y) = p(x,y)e(x), where deg(q) ≤ deg(p) + deg(e) ≤ t + d. By definition, e(x) is a degree-t polynomial with the following property:

e(x_i) = 0 iff z_i ≠ p(x_i, y_i)

We now argue that e(x) and q(x,y) satisfy the equation z_i e(x_i) = q(x_i, y_i). Note that if e(x_i) = 0, then q(x_i, y_i) = z_i e(x_i) = 0. When e(x_i) ≠ 0, we know p(x_i, y_i) = z_i and so we still have p(x_i, y_i)e(x_i) = z_i e(x_i), as desired.

Claim 13 (Uniqueness) If two distinct solutions (q_1(x,y), e_1(x)) ≠ (q_2(x,y), e_2(x)) satisfy Step 1, then they satisfy q_1(x,y)/e_1(x) = q_2(x,y)/e_2(x).

It suffices to prove that q_1(x,y)e_2(x) = q_2(x,y)e_1(x). Multiplying by z_i and substituting x,y with x_i,y_i, respectively, gives q_1(x_i,y_i)e_2(x_i)z_i = q_2(x_i,y_i)e_1(x_i)z_i. We know that for all i ∈ [N], q_1(x_i,y_i) = e_1(x_i)z_i and q_2(x_i,y_i) = e_2(x_i)z_i. If z_i ≠ 0, then we are done. Otherwise, if z_i = 0, then q_1(x_i,y_i) = q_2(x_i,y_i) = 0, which implies q_1(x_i,y_i)e_2(x_i) = q_2(x_i,y_i)e_1(x_i). Since the polynomials q_1e_2 and q_2e_1 agree on more points than their degree, they are identical. Finally, since e_1(x) ≠ 0 and e_2(x) ≠ 0, this implies q_1(x,y)/e_1(x) = q_2(x,y)/e_2(x), as desired.

Claim 14 (Time complexity) Given N = \binom{d+t+2}{d+t} + t data samples, we can reconstruct p(x,y) in O(N^ω) running time.

Generally, for a k-variate polynomial of degree d, there are \binom{d+k}{d} terms [48]; thus, it is a necessary condition that we have N = \binom{d+t+2}{d+t} + t distinct points for q and e to be uniquely defined. We have N linear equations in at most N variables, which we can solve, e.g., by Gaussian elimination in time O(N^ω) (where ω is the matrix-multiplication exponent). Finally, Step 2 can be implemented in time O(N²) by “long division” [55]. Note that the general problem of deciding whether one multivariate polynomial divides another is related to computational algebraic geometry (specifically, this can be done using Gröbner bases). However, since the divisor is a univariate polynomial, we can mimic long division, where we consider x to be the “variable” and y to just be some “number.”
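The “x is the variable, y is a number” long division can be made concrete with a small coefficient-matrix routine; the 2-D array representation and the function name below are our own illustrative choices.

import numpy as np

def divide_by_univariate(Q, e):
    """Long division of a bivariate polynomial by a univariate one, treating x as
    'the variable' and y as 'a number'.  Q[i, j] is the coefficient of x^i y^j;
    e[k] is the coefficient of x^k, with e monic.  Returns (P, R) with
    Q = P * e + R and deg_x(R) < deg(e)."""
    Q = Q.astype(float).copy()
    t = len(e) - 1
    nx, ny = Q.shape
    P = np.zeros((max(nx - t, 1), ny))
    for i in range(nx - 1, t - 1, -1):     # eliminate the leading power of x
        coef = Q[i, :].copy()              # a polynomial in y ("a number")
        P[i - t, :] = coef
        for k in range(t + 1):
            Q[i - t + k, :] -= coef * e[k]
    return P, Q                            # Q now holds the remainder

# Example: q(x,y) = x^2 + xy - 8x - 8y divided by e(x) = x - 8 gives p = x + y.
Q = np.array([[0.0, -8.0],                 # x^0:       -8y
              [-8.0, 1.0],                 # x^1: -8x + xy
              [1.0, 0.0]])                 # x^2:  x^2
P, R = divide_by_univariate(Q, np.array([-8.0, 1.0]))
print(P)                                   # [[0. 1.] [1. 0.]]  ->  y + x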

Example 1. Suppose the unknown polynomial is p(x,y) = x + y. Given the parameters d = 1 (degree of p), m = 2 (number of p's variables) and t = 1 (number of corrupted inputs), and the set of \binom{d+t+2}{d+t} + t = 7 points:

(1,1,2), (2,2,4), (6,1,7), (4,3,7), (8,2,0), (9,1,10), (3,7,10) that (except for one corrupted point) lie on z = p(x,y). Following the algorithm, we define deg(e) = 1, deg(q) = 2 and

q_i = α_1x_i² + α_2x_iy_i + α_3y_i² + α_4x_i + α_5y_i + α_6 = z_i(x_i + β) for coefficients α_1,...,α_6, β and 1 ≤ i ≤ 7. Note that e(x) is defined to be monic (i.e., its leading coefficient equals 1) to ensure that it is different from zero. Thus, we derive the linear system:

α1 + α2 + α3 + α4 + α5 + α6 = 2β + 2

4α1 + 4α2 + 4α3 + 2α4 + 2α5 + α6 = 4β + 8

36α1 + 6α2 + α3 + 6α4 + α5 + α6 = 7β + 42

16α1 + 12α2 + 9α3 + 4α4 + 3α5 + α6 = 7β + 28

64α1 + 16α2 + 4α3 + 8α4 + 2α5 + α6 = 0

81α1 + 9α2 + α3 + 9α4 + α5 + α6 = 10β + 90

9α1 + 21α2 + 49α3 + 3α4 + 7α5 + α6 = 10β + 30

Solving the system gives q(x,y) = x² + xy − 8x − 8y and e(x) = x − 8. Dividing those polynomials, we get the solution q(x,y)/e(x) = p(x,y) = x + y.

Corollary 1 (Multivariate Polynomial Reconstruction) Let p be an unknown polynomial of total degree d in k variables. Given a threshold ρ > 0, a noise parameter δ and a sample S ⊂ R^k of N distinct points (x_1i,...,x_ki, y_i)_{i=1}^N such that y_i ∈ [p(x_1i,...,x_ki) − δ, p(x_1i,...,x_ki) + δ] for at least a ρ fraction of S, p can be reconstructed using N = \binom{d+k+ρk}{d+k} + ρk datapoints.

Proof Following Theorem 8, we can rewrite its proof for the multidimensional generalization. Following Algorithm 2, we define the error-locating polynomial to contain only one variable, such that at Step 2 we divide a multivariate polynomial by a univariate one, an action that costs O(N log N) time (again, by mimicking long division).

An interesting question is whether there is any advantage in defining the error-locating polynomial e to be multivariate (instead of univariate, as previously presented). The size of the input data is strictly determined by the given bound on the corrupted data (t = (1 − ρ)N) and the degree of the goal polynomial (d). Thus, it does not matter whether the unknown coefficients come from a higher-degree polynomial or from a higher-dimensional one, i.e., both options require the same sample size, as illustrated in the example below.

Polynomial Reconstruction Examples

Given that the goal unknown polynomial p has m = 2 variables, deg(p) = 1, and that the data contains t = 2 Byzantine appearances, we can define the error-locating polynomial e to be a univariate polynomial with deg(e) = 2 and obtain the linear equation:

α_1x³ + α_2x²y + α_3xy² + α_4y³ + α_5x² + α_6xy + α_7y² + α_8x + α_9y + α_10 = z(x² + β_1x + β_2)   (3.3)

When substituting the given data

(1,2,2), (−2,6,0), (2,2,4), (6,1,7), (4,3,7), (9,1,10), (3,7,10), (5,7,12), (7,4,11), (10,3,13), (11,2,13), (12,4,16) into (Eq. 3.3), we get:

q_1(x,y) = x³ + x²y + x² + xy − 2x − 2y
e_1(x) = x² + x − 2

Or, by defining e to be a bivariate polynomial with deg(e) = 1:

α_1x³ + α_2x²y + α_3xy² + α_4y³ + α_5x² + α_6xy + α_7y² + α_8x + α_9y + α_10 = z(x + β_1y + β_2), whose solution is:

q_2(x,y) = x² + 7xy/4 + 3y²/4 − 5x/2 − 5y/2

e2(x,y) = x + 3y/4 − 5/2

In both cases the polynomial division results are equal and give the expected solution:

q_1(x,y)/e_1(x) = q_2(x,y)/e_2(x,y) = x + y = p(x,y).

3.3 Random Sample with Unrestricted Noise

Motivated by applications in vision, Arora and Khot [4] studied univariate polynomial fitting to noisy data. They examine the problem of finding all (ρ,δ)-fitting polynomials (see Definition 6), and show that it admits no polynomial-time algorithm. However, for a given random set of points (x_i, y_i) such that y_i ∈ [p(x_i) − δ, p(x_i) + δ], where p(x) is an unknown polynomial, [4] suggest an O(d²) algorithm that reconstructs a δ-approximation polynomial to p(x) (where d is the given polynomial degree). In this section, we generalize these results to k-dimensional data and give a solution for the Byzantine appearance, which completes the proof of Theorem 7.

In the first step, we focus on bivariate polynomial reconstruction, and generalize the results to the multivariate case afterward. The underlying assumption here is that x_i, y_i ∈ [−1,1]. Also, since we solve the problem for a fixed value of δ, we need a restriction on the ℓ∞ norm of the polynomials. Here we assume that ∥f∥∞ ≤ 1. Alternatively, given that ∥f∥∞ = α, we can set the noise parameter to αδ.

3.3.1 Polynomial fitting to noisy data

Given the datapoints (x_k, y_k, z_k)_{k=1}^N, such that z_k = f(x_k, y_k) ± δ, we approach the reconstruction problem by defining a linear programming system with the fitting polynomial as its solution, i.e., z_k − δ ≤ ∑_{i+j≤d} c_ij x_k^i y_k^j ≤ z_k + δ for k = 1,...,N, where the c_ij are the polynomial coefficients.

Unfortunately, feasible solutions to this LP need not give a δ-approximation to the unknown polynomial. We have to incorporate the constraint that the unknown polynomial takes values in [−1,1]. To do this we move to the Chebyshev representation of the polynomial and bound its coefficients, i.e., p(x,y) = ∑_{i+j≤d} c_ij T_i(x) T_j(y), where T_i(·), T_j(·) are the Chebyshev polynomials and |c_ij| ≤ √2.

Let I be a set of d⁶ equally spaced points that cover [−1,1] × [−1,1] (where d is the given total degree of the unknown polynomial). Given a random sample S of size N = (d²/δ) log(d/δ), we define the following LP minimization:

minimize δ  s.t.

z_k − δ ≤ ∑_{i+j≤d} c_ij T_i(x_k) T_j(y_k) ≤ z_k + δ,   k = 1,...,N

|c_ij| ≤ √2,   i + j ≤ d

|∑_{i+j≤d} c_ij T_i(x) T_j(y)| ≤ 1,   ∀x,y ∈ I    (3.4)
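A minimal Python sketch of this LP, using scipy.optimize.linprog and NumPy's Chebyshev basis, is shown below. The grid size, solver choice and function name are our own illustrative assumptions; the analysis itself uses |I| = d⁶ grid points and N = O((d²/δ) log(d/δ)) samples.

import numpy as np
from numpy.polynomial import Chebyshev as T
from scipy.optimize import linprog

def fit_chebyshev_lp(samples, d, grid_size=25):
    """Sketch of LP (3.4): minimize delta subject to
       |sum c_ij T_i(x_k) T_j(y_k) - z_k| <= delta for every sample,
       |c_ij| <= sqrt(2), and |p(x, y)| <= 1 on an equally spaced grid I.
    Returns (coefficient dictionary, delta)."""
    pairs = [(i, j) for i in range(d + 1) for j in range(d + 1 - i)]
    M = len(pairs)

    def row(x, y):                      # values of the basis T_i(x) T_j(y)
        return [T.basis(i)(x) * T.basis(j)(y) for i, j in pairs]

    A_ub, b_ub = [], []
    for xk, yk, zk in samples:          # z_k - delta <= p(x_k, y_k) <= z_k + delta
        r = row(xk, yk)
        A_ub.append(r + [-1.0]); b_ub.append(zk)
        A_ub.append([-v for v in r] + [-1.0]); b_ub.append(-zk)
    g = np.linspace(-1.0, 1.0, grid_size)
    for xg in g:                        # |p(x, y)| <= 1 on the grid I
        for yg in g:
            r = row(xg, yg)
            A_ub.append(r + [0.0]); b_ub.append(1.0)
            A_ub.append([-v for v in r] + [0.0]); b_ub.append(1.0)
    c = [0.0] * M + [1.0]               # objective: minimize delta
    bounds = [(-np.sqrt(2), np.sqrt(2))] * M + [(0, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds,
                  method="highs")
    return dict(zip(pairs, res.x[:M])), res.x[M]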

The following Theorem presents our main result for solving the polynomial fitting problem:

Theorem 9 Let f be an unknown polynomial of total degree d in two variables, such that f(x,y) ∈ [−1,1] when x,y ∈ [−1,1]. Given a noise parameter δ > 0 and a sample S of N = O((d²/δ) log(d/δ)) random points x_i, y_i, z_i ∈ [−1,1] such that z_i ∈ [f(x_i,y_i) − δ, f(x_i,y_i) + δ], with probability at least 1/2 (over the choice of S), any feasible solution p to the above LP is a 3δ-approximation of f.

Proof For our proof, we need the Bernstein-Markov inequality which we state below.

Theorem 10 (Bernstein-Markov [18]) For a polynomial P_d of total degree d, a direction ξ and a bounded convex set S ⊂ R^k,

∥∂P_d/∂ξ∥∞ ≤ c_S d² ∥P_d∥∞    (3.5)

where c_S is independent of d (and dependent on the geometric structure of S).

Let p = p(x,y) be the total-degree-d polynomial obtained by taking any solution to the above LP. We know p exists, i.e., the LP is feasible, because the coefficients of f satisfy it. Let n and m denote the degree of p in the x and y directions, respectively. Note that n + m ≤ d.

Claim 15 ∥p∥∞ ≤ 1 + O((n³m + m³n)/|I|).

For x ∈ [−1,1], since ∥T_i(x)∥∞ = 1, using the Bernstein-Markov theorem we have |T_i′(x)| ≤ O(i²). Thus:

∥p′_x∥∞ ≤ ∑_{i=0}^{n} ∑_{j=0}^{m} |c_ij| · |(T_i(x)T_j(y))′_x| ≤ ∑_{i=0}^{n} ∑_{j=0}^{m} √2 · O(i²) ≤ O(n³m)

By a symmetric consideration, we get ∥p′_y∥∞ ≤ O(nm³). The distance between successive points of I is 2√2/|I| (I is equidistant). By construction, p takes values in [−1,1] at all points of I. The derivative p′ by definition gives the rate of change of p; hence, ∥p∥∞ ≤ 1 + O((n³m + m³n)/|I|). (Namely, between two successive points of I, whose values are at most 1 in absolute value, the possible change over an interval of length 2√2/|I| with p′ = O(n³m + m³n) is O((n³m + m³n)/|I|).)

Claim 16 ∥p′_x∥∞, ∥p′_y∥∞ ≤ O(d²).

The claim follows from the Bernstein-Markov theorem (Theorem 10) and the estimate ∥p∥∞ = 1 + O(1). To complete the proof of Theorem 9, denote by ε the largest distance between two successive points among (x_1,y_1),...,(x_|S|,y_|S|). Every interval of size ε contains at least one of the datapoints (forming an ε-net). With high probability, ε = O(log N/N) = O(δ/d²). Now, p and f are functions satisfying ∥p′_x∥∞, ∥f′_x∥∞, ∥p′_y∥∞, ∥f′_y∥∞ ≤ O(d²) (from the previous claim); hence, ∥(p − f)′_x∥∞, ∥(p − f)′_y∥∞ ≤ O(d²). Due to the LP constraints, p and f differ by at most 2δ on the points of the ε-net, so we get

∥p − f∥∞ ≤ 2δ + O(εd²) = 3δ, as needed.

Remarks:

• When the sample S is uniformly spaced to cover [−1,1] × [−1,1], we only need N = Ω(d²/δ).

• If we know that the derivatives are bounded by C (i.e., ∥f′_x∥∞, ∥f′_y∥∞ ≤ C), the above proof implies that N > (C/δ) log(C/δ) suffices.

• The Bernstein-Markov theorem also holds for multivariate trigonometric polynomials (see [18]); thus, we can generalize the above proof to this class of functions as well. This generalization is important in the scope of wireless sensor networks, since trigonometric functions are an appropriate way to represent sensor data behavior (e.g., temperature).

• If f(x,y) were not in [−1,1] and ∥f∥∞ = L, then the sample size N has to be increased by a factor of L.

• The presented method holds only when we assume equidistant or random sampling (as opposed to Section 3.2, which handles any given sample). Otherwise, when the dataset is dense, the δ perturbation we allow can cause a sharp slope in the resulting function, even though the original data is close to constant over the sampling interval.

Corollary 2 Given a set S of N = O((d²/δ) log(d/δ)) k-dimensional random datapoints and a constant c(S,k) dependent only on the geometry and the dimension of the data, we can reconstruct the unknown polynomial within c(S,k)δ error in the ℓ∞ norm, with high probability over the choice of the sample.

Proof The two-dimensional proof holds for the general dimension, where the approximation accuracy depends on the constant c(S,k) coming from Theorem 10. This constant is independent of the polynomial degree, but dependent on the set of data points (see [18]). Note that c(S,k) increases exponentially when increasing the dimension.

3.3.2 Byzantine Elimination.

The method suggested above (Theorem 9) does not deal with Byzantine inputs, i.e., outliers that are not bounded by the ℓ∞ noise parameter δ; however, the method of [4] presented in Section 6 can be rewritten to eliminate corrupted data, such that the input datapoints will contain only true (but noisy) values.

Theorem 11 There exists a polynomial-time algorithm to reconstruct a (ρ,δ)-fitting polynomial when ρ is bounded away from 1/2.

Proof Assume that a ρ fraction of the data is uncorrupted. For any point x_i, y_i ∈ [−1,1], consider a small square Λ = [x_i − δ/d³, x_i + δ/d³] × [y_i − δ/d³, y_i + δ/d³] (where d is the total degree of the polynomial we need to find). For a sample of (d⁴/δ) log(1/δ) points, with high probability Ω(log(d)) of the samples lie in this square. We are given that a ρ fraction of these sample points give an approximate value of f(x_i,y_i), i.e., the correct value lies in the interval [f(x_i,y_i) − δ, f(x_i,y_i) + δ], and the rest of the sample is corrupted, i.e., NOT in [f(x_i,y_i) − δ, f(x_i,y_i) + δ]. As shown in Claim 16, the derivatives are bounded by O(d²); thus, the value of the polynomial is essentially constant over Λ. Hence, at least a ρ fraction of the values seen in this square will lie in [f(x_i,y_i) − δ, f(x_i,y_i) + δ], and the rest is irrelevant corrupted data. Thus, at every point (x_i,y_i) we can reconstruct f(x_i,y_i). The sample is large enough that we can reconstruct the values of the polynomial at, say, d²/δ equally spaced points. Now, applying the techniques presented in Section 3.3 enables us to recover the polynomial. Note that we need the assumption that ρ is bounded away from 1/2 to break the symmetry between the true and the Byzantine data.

To conclude this section, we summarize the presented results in the following Algorithm:

Algorithm 3 Reconstruct the polynomial p(x,y) representing the true data
Input: S, ρ, d, δ
 S′ ← ∅
 for each point (x_i, y_i, z_i) ∈ S such that (x_i, y_i, z_i) ∉ S′ do
   Λ ← [x_i − δ/d³, x_i + δ/d³] × [y_i − δ/d³, y_i + δ/d³]
   S_i ← {(x_j, y_j, z_j) | (x_j, y_j) ∈ Λ, z_j ∈ [z_i − δ, z_i + δ]}
   if |S_i| ≥ ρ fraction of the points in Λ then
     S′ ← S′ ∪ S_i
   end if
 end for
 p(x,y) ← LP minimization using the set S′
 Return p(x,y)

The algorithm requires a data sample S of (d⁴/δ) log(1/δ) points, the true-data fraction ρ, the total degree bound d and the noise parameter δ. At every step we look in a small square around the point (x_i, y_i) and collect all the points that are close to the value z_i, ensuring that we have at least a ρ fraction of the points in the square (if not, the point is Byzantine and we ignore it). We repeat this process until we collect enough true datapoints, i.e., at least Ω(d²/δ) points. The collected set (denoted S′ in the algorithm) is the input for the linear-programming equations, which finally give us the expected polynomial as in the proof of Theorem 9.
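A direct Python transcription of the filtering step of Algorithm 3 can look as follows; the LP call at the end refers to the hypothetical fit_chebyshev_lp sketch from Section 3.3, and the data structures are illustrative.

import numpy as np

def algorithm3(S, rho, d, delta):
    """Sketch of Algorithm 3: around every point look in a small square Lambda,
    keep the point's neighbourhood only if at least a rho fraction of the square's
    points agree with z_i up to delta, then feed the surviving points to the LP."""
    S = np.asarray(S, dtype=float)              # rows are (x_i, y_i, z_i)
    side = delta / d**3
    kept = set()
    for i, (xi, yi, zi) in enumerate(S):
        if i in kept:
            continue
        in_square = np.where((np.abs(S[:, 0] - xi) <= side) &
                             (np.abs(S[:, 1] - yi) <= side))[0]
        agree = [j for j in in_square if abs(S[j, 2] - zi) <= delta]
        if len(agree) >= rho * len(in_square):  # otherwise (x_i, y_i, z_i) is Byzantine
            kept.update(agree)
    S_prime = S[sorted(kept)]
    return fit_chebyshev_lp([tuple(r) for r in S_prime], d)   # LP from Section 3.3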

3.4 Discussion

We have presented the concept of data abstraction in the scope of data interpolation and representation, as well as the new big data challenge, where abstraction of the data is essential in order to understand the semantics and usefulness of the data. Interestingly, we found that classical techniques used in numerical analysis and function approximation, such as the Welch-Berlekamp efficient removal of corrupted data, the Arora-Khot algorithm and the like, relate to the data abstraction problem. For the first time, we have extended existing classical techniques to the case of three or more function dimensions, finding polynomials that approximate the data in the presence of noise and a limited portion of completely corrupted data. Our underlying assumption is that the data can be represented by a degree-d polynomial (with bounded noise and outliers). However, this assumption can be relaxed or tuned in a hierarchical fashion, meaning that we can increase the polynomial degree while decreasing the noise to fit our model to a desired accuracy. In addition, we can maintain several polynomials to represent different dimensions or different slices of the data. We believe that the mathematical techniques we have presented have applications beyond the scope of data collection or big data, in addition to being an interesting problem that lies between the fields of error-correcting codes and the classical theories of approximation and curve fitting. Throughout our research we have distinguished two different approaches to the polynomial fitting to Byzantine noisy data problem: the first being the Welch-Berlekamp generalization for discrete-noise multidimensional data, and the second being the linear-programming evaluation for multivariate polynomials. Approaching the problem with error-correcting code methods, we have suggested a way to represent noisy, malicious input with a multivariate polynomial. This method assumes that the noise is discrete. When the noise is unrestricted, based on the Bernstein-Markov theorem and the Arora and Khot algorithm, we have suggested a method to reconstruct an algebraic or trigonometric polynomial that traverses a ρ fraction of the noisy multidimensional data. We suggest using polynomials to represent abstract data, since polynomials are dense in the function space on bounded domains (i.e., they can approximate other functions arbitrarily well) and have a simple and compact representation, as opposed to splines (e.g., [39]) or other image-processing methods. Directions for further investigation might include the use of interval computation for representing the noisy data with interval polynomials, or using other series that can represent any function instead of the Taylor series; the Fourier series is a good example.

Chapter 4

Mending Missing Information in Big-Data

4.1 Introduction

One of the main challenges that arises while handling Big-Data is not only the large volume, but also the high dimensionality of the data. Moreover, part of the information in the different dimensions may be missing. Assuming that the true (unknown) data consists of d-dimensional points, we suggest representing each given data point (which may lack information in different dimensions) as a k-affine space embedded in the Euclidean d-dimensional space R^d. Denote the affine Grassmannian, the set of all k-affine spaces embedded in the Euclidean d-dimensional space, by A(d,k). This means that a point in our data set is a point in the affine Grassmannian A(d,k). A data object that is incomplete in one or more features corresponds to an affine subspace (called a flat, for short) in R^d, whose dimension is the number of missing features. This representation yields algebraic objects, which help us to better understand the data, as well as study its properties. A central property of the data is clustering. Clustering refers to the process of partitioning a set of objects into subsets consisting of similar objects. Finding a good clustering is a challenging problem. Due to its wide range of applications, the clustering problem has been investigated for decades, and continues to be actively studied not only in theoretical computer science, but in other disciplines, such as statistics, data mining and machine learning. A motivation for cluster analysis of high-dimensional data, as well as an overview of some applications where high-dimensional data occurs, is given in [32]. Our underlying assumption is that the original data points, the real entities, can be divided into different groups according to their distance in R^d. We assume that every group of

points lies in the same d-dimensional ball B^d (a.k.a. a solid sphere), since the distance between a flat and a point (the center of the ball) is well-defined. The classic clustering problems, such as k-means or k-centers (see [24], Chapter 8), can be defined on a set of flats. The clustering problem, when the data consists of k-flats, is to find the centers of the balls that minimize the sum of the distances between the k-flats and the centers of their groups, where each group's center is the nearest center among all centers. However, Lee & Schulman [34] argue that the running time of an approximation algorithm, with any approximation ratio, cannot be polynomial in even one of m (the number of clusters) and k (the dimension of the flats), unless P = NP. We overcome this obstacle by approaching the problem differently. Using a probabilistic assumption about the distribution of the data, we obtain a polynomial algorithm, which we use to identify the flats' groups. Moreover, the presented probability arguments can help us better understand the geometric distribution of high-dimensional data objects, which is of major interest and importance in the scope of Big Data research.

Our contributions.

We face the challenge of mending the missing information in different dimensions by representing the objects as affine subspaces. In particular, we work within the framework of flats in R^d, where the missing features correspond to the (intrinsic) dimension of the flat. This representation is accurate and flexible, in the sense that it preserves all the features of the original data; it also allows for algebraic calculation over the objects. In this chapter, we study the pairwise distances between the flats and, based on our probabilistic and geometrical results, we develop a polylogarithmic algorithm that achieves clustering of the flats with high probability. The main result of the study is summarized in the following theorem, while the precise definitions and the detailed proof are presented in the sequel.

Theorem 12 Given a separable data set P of n affine subspaces in R^d, for any ε > 0 and for sufficiently large d (depending on ε), with probability 1 − ε we can cluster P according to the balls B^d, using the pairwise distance projection, in poly(n,k,d) time.

Remarks:

• In addition to proving good performance for high dimensions as required in the scope of big-data, we also show that the algorithm works well for low dimensions.

• Using sampling, we achieve a poly-logarithmic running time.

• We show that we can relax the model assumption of identically sized clusters to clusters of arbitrary sizes.

To enhance the readability of our text, Section 4.2 contains the basic notions, from convex and stochastic geometry, which are needed in the following. In particular, we recall the notion of flats and provide the model assumptions. We prove our main result in Section 4.3, and summarize the suggested Algorithm 4 in Section 4.4. We supplement our theoretical results with experimental data in Section 4.5, and generalize our results to clusters of different sizes in Section 4.6. In Section 4.7 we illustrate how one can modify our algorithm to work in sublinear time and how to implement it in a distributed fashion. Finally, in Section 4.8, we discuss the geometric and algebraic representation, comparing our approach against others' proposals.

4.2 Preliminaries

General notation.

Throughout the following, we work in the d-dimensional Euclidean space R^d, d ≥ 2, with scalar product ⟨·,·⟩ and norm ∥·∥. Hence, ∥x − y∥ is the Euclidean distance of two points x,y ∈ R^d, and dist(X,Y) := inf{∥x − y∥ : x ∈ X, y ∈ Y} is the distance of two sets X,Y ⊆ R^d. We refer to any set S ⊆ X which is closest to Y, i.e., which satisfies dist(S,Y) = dist(X,Y), as a projection of Y on X. In general, there can be more than one projection of Y on X, i.e., several subsets of X closest to Y.

Grassmannians.

For k ∈ {1,...,d}, we denote by G(d,k) and A(d,k) the spaces of k-dimensional linear and affine subspaces of R^d, respectively, both supplied with their natural topologies (see, e.g., [50]). The elements of A(d,k) are also called k-flats (for k = 0, points; for k = 1, lines; for k = 2, planes; and for k = d − 1, hyperplanes). Recall that two subspaces L ∈ G(d,k_1) and M ∈ G(d,k_2) are said to be in general position if the span of L ∪ M has dimension k_1 + k_2 when k_1 + k_2 < d, or if L ∩ M has dimension k_1 + k_2 − d when k_1 + k_2 ≥ d. We also say that two flats E ∈ A(d,k_1) and F ∈ A(d,k_2) are in general position if this is the case for L(E) and L(F), where L(E) is the linear subspace parallel to E.

Geometric and Probabilistic definitions.

Let P = {P_1, P_2,...,P_n} be the set of n random flats that we want to cluster. For the sake of simplicity, we consider the situation where all of them are of dimension k, where k is taken to be the greatest dimension of any flat in P. Hence, every flat P is represented by a set of d − k linear equations, each with d variables. Alternatively, we can represent any k-flat using a parametric notation, such that P is given by a set of d linear equations, each with k parameters. When there is no flat with a fixed ith coordinate, we call the ith coordinate trivial. We can assume that no coordinate is trivial, since otherwise simply removing this coordinate from all flats will decrease k and d by 1, while not affecting the clustering cost. For c ∈ R^d, let B^d_c be the unit ball of dimension d centered at c, and let B^d_{c_0} denote the unit ball centered at the origin. Two balls, B_{c_i} and B_{c_j}, are ∆-distinct if dist(c_i, c_j) ≥ ∆. The ball B^d_c intersects the subset of flats P̄ = {P_1,...,P_j} if it intersects each flat in P̄. We denote by P^c_i ∈ P a k-flat intersecting the unit ball B^d_c, and by P_i(r) ∈ P a k-flat in R^d passing through the point (r,0,...,0). Let ∗ be an equivalence relation such that for a point u ∈ B^d_c, u∗ is the antipodal point of u (i.e., u and u∗ are opposite through the center c). For a k-flat P^c intersecting the unit ball B^d_c at one point only (i.e., tangent to the ball's surface), P^{c∗} denotes its antipodal k-flat. If E and F are in general position, there are unique points x_E ∈ E and x_F ∈ F such that dist(E,F) = ∥x_E − x_F∥. We call the point p = midpoint(E,F) := (x_E + x_F)/2 the midpoint of E and F. Probability, expectation and variance will be denoted by the common notations Pr(·), E(·) and V(·), respectively. For a random variable A dependent on d, we denote by

A →_p c the fact that A “converges in probability” to c, namely, ∀ε > 0, lim_{d→∞} Pr(∥A − c∥ ≤ ε) = 1.

Model assumptions.

Throughout the chapter we assume that the data is separable, namely, satisfying the following assumptions:

• Two independent random flats E,F ∈ A(d,k), with distribution Q, are in general position with probability one.

• 1 ≤ k ≤ ⌊d/2⌋, which ensures that the flats do not intersect each other, with probability one.

• The (unknown) balls B^d_{c_1},...,B^d_{c_m} are ∆-distinct with probability one.

• The given flats set P is a superset of m groups P = {P_1,...,P_m}, such that every group P_i ∈ P contains n/m flats that intersect the ball B^d_{c_i}. Moreover, each flat P ∈ P_i has a normally distributed location and direction at the ball B^d_{c_i}. We model this assumption by normally distributed coefficients. The parametric representation of a k-flat P is:

P = At + a = ( α_{0,1} + α_{1,1}t_1 + α_{2,1}t_2 + ... + α_{k,1}t_k , ... , α_{0,d} + α_{1,d}t_1 + α_{2,d}t_2 + ... + α_{k,d}t_k )ᵀ

where α_{ij} ∼ N(µ,σ) and t is a k-dimensional parameter vector.
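The model assumption above can be simulated directly; the snippet below draws k-flats with normally distributed parametric coefficients around a chosen ball centre. The function name, the use of NumPy, and the specific choice of placing the constant term a around the centre are our own illustrative reading of the assumption.

import numpy as np

def random_flat(center, k, sigma=1.0, rng=None):
    """Draw one k-flat P = A t + a with normally distributed coefficients,
    located around `center` (the centre of the ball B^d_c): a ~ N(center, sigma^2 I)
    and the entries of A ~ N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    d = center.size
    a = rng.normal(loc=center, scale=sigma, size=d)       # alpha_{0,*}: location
    A = rng.normal(loc=0.0, scale=sigma, size=(d, k))     # alpha_{j,*}: directions
    return A, a                                           # points of the flat are A @ t + a

# Hypothetical usage: two clusters of 10 flats each, as in the experiments of Section 4.5.
d, k = 30, 10
cluster1 = [random_flat(np.r_[-100.0, np.zeros(d - 1)], k) for _ in range(10)]
cluster2 = [random_flat(np.r_[+100.0, np.zeros(d - 1)], k) for _ in range(10)]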

4.3 k-flats Clustering

Given the set P of n k-flats in R^d, our goal is to cluster the flats according to the unknown set of balls, namely, to separate P into m groups such that every group P_i ∈ P contains n/m flats that intersect the same unit ball B^d_{c_i}. We suggest the following procedure (summarized below in Algorithm 4) for the clustering process. The first step is to find the distance and the midpoint between every pair of flats in P. Next, we filter the irrelevant midpoints using their corresponding distances, such that midpoints with a distance greater than two are dropped and those with a distance ≤ 2 are grouped together. In the final step we check which groups contain O(n/m) flats and output those groups. We argue that these simple steps provide the expected clustering procedure with high probability. In this section, we establish its correctness using geometric and probabilistic arguments, which appear in the following Propositions and Lemmas. As mentioned above, we start our procedure by calculating the pairwise projection of P, namely, finding the distance and the midpoint between every pair in P. Let P_i = {x ∈ R^d : Ex = e} and P_j = {y ∈ R^d : Fy = f} be a pair of k-flats in P. Note that the matrix dimensions are Dim(E) = Dim(F) = (d − k) × d, since each flat P ∈ P is represented by d − k equations with d variables. The suggested algorithm calculates the minimum-distance point (i.e., the midpoint) between the pair using Euclidean norm minimization:

minimize ∥Ax − b∥    (4.1)

where A = [E; F] (the matrix E stacked on top of F), x = (x_1,...,x_d), and b = [e; f]. Since the norm is always nonnegative, we can just as well solve the least squares problem

minimize ∥Ax − b∥²    (4.2)

The problems are clearly equivalent; however, the objective in the first one is not differentiable at any x with Ax − b = 0, whereas the objective in the second is differentiable for all x.

Proposition 1 The least squares minimization (Eq. 4.2) gives a unique solution p such that p = midpoint(P_i, P_j).

Proof Using the equation

minimize ∥Ax − b∥² = (Ax − b)ᵀ(Ax − b), this problem is simple enough to have a well-known analytical solution: a point p minimizes the function f = xᵀAᵀAx − 2xᵀAᵀb + bᵀb if and only if

∇f = 2AᵀAp − 2Aᵀb = 0, i.e., if and only if p satisfies the normal equations

AᵀAp = Aᵀb, which always has a solution (note that the system is square or over-determined, since 2(d − k) ≥ d for 1 ≤ k ≤ d/2). The columns of A correspond to the different coordinates of the two flats; hence they are independent and yield a unique solution: p = (AᵀA)⁻¹Aᵀb.

Proposition 2 Using the midpoint p = midpoint(P_i, P_j), one can find the distance between the two flats, dist(P_i, P_j).

Proof Theorem 1 in [22] calculates the Euclidean distance between two affine subspaces using the matrices' range and null space. Alternatively, since we already have the midpoint p between the flats, we can find the distance between them by projecting p onto the flats and then calculating the distance between the projected points. This projection can be carried out by a constrained least squares method, more precisely, by solving the two optimization problems min{∥p − x∥² : Ex = e} and min{∥p − x∥² : Fx = f}, or by any other efficient orthogonal projection method (e.g., [42]).
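The computations of Propositions 1 and 2 can be sketched directly with NumPy: stack the two flats' equation systems, solve the least-squares problem for the midpoint, then project the midpoint back onto each flat to obtain the distance. The helper name is ours, and the orthogonal projection uses the standard pseudo-inverse formula rather than reference [42].

import numpy as np

def midpoint_and_distance(E, e, F, f):
    """Sketch of Propositions 1-2: the flats are P_i = {x : Ex = e} and
    P_j = {x : Fx = f}.  The least-squares solution of the stacked system is taken
    as the midpoint; projecting it onto each flat gives the distance."""
    A = np.vstack([E, F])
    b = np.concatenate([e, f])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)       # solves A^T A p = A^T b
    # Orthogonal projection of p onto {x : Mx = m}: p - M^+ (Mp - m).
    proj = lambda M, m: p - np.linalg.pinv(M) @ (M @ p - m)
    xi, xj = proj(E, e), proj(F, f)
    return p, np.linalg.norm(xi - xj)

# Hypothetical usage: two lines in R^3 (k = 1, so E and F are 2 x 3 matrices).
E = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]); e = np.array([0.0, 0.0])   # the x-axis
F = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]); f = np.array([0.0, 3.0])   # y-axis shifted to z = 3
p, dist = midpoint_and_distance(E, e, F, f)
print(p, dist)    # midpoint ~ [0, 0, 1.5], distance ~ 3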

Having the midpoint and the distance for all pairs, we filter the irrelevant midpoints using their corresponding distances, as shown in the following Lemmas. First we argue that the flats' pairwise projection helps to identify the original balls, namely, that the midpoints arising from the same ball are centered around that ball:

Lemma 5 Let P̄ = {P^c_1, P^c_2,...,P^c_j} ⊆ P be a set of k-flats in R^d intersecting the ball B^d_c. Let p = {p_{12}, p_{13},...,p_{1j},...,p_{(j−1)j}} be the set of the midpoints of all \binom{j}{2} pairs of P̄. The mean of this set, E[p], equals c (the center of B^d_c), and the variance V[p] is bounded.

Proof Let P^c_i, P^c_j ∈ P̄ be two flats intersecting the ball B^d_c whose distance midpoint is p_{ij}. Denote by p*_{ij} the antipodal point of p_{ij}. Since the directions and locations of the flats in P̄ are normally distributed around c (see the model assumptions in Section 4.2), the probability that p_{ij} ∈ p equals the probability that p*_{ij} ∈ p, which implies that their expected value is E[{p_{ij}, p*_{ij}}] = c. This geometric-probabilistic consideration holds for the whole set p; hence we get that E[p] = c. To prove that the variance is bounded, we argue in Propositions 6 and 7 (which appear at the end of this section) that, for all i, j, the distance r_{ij} between p_{ij} and the center c of the ball is bounded, which implies that V[p] is bounded around c.

At this point, for every pair of flats (P_i, P_j) we have the corresponding midpoint and distance (p_{ij}, d_{ij}). We would like to show that if we eliminate all the midpoints p_{ij} whose distance d_{ij} is greater than 2, we are left with those that arise from the same cluster. The following Lemma argues that this is the case when d is large enough:

Lemma 6 Let P_i, P_j ∈ P be a pair of k-flats in R^d.

1. If P_i and P_j intersect the same ball B^d_c, then the probability that the distance between them is less than 2 is Pr(dist(P_i, P_j) ≤ 2) = 1.
2. Otherwise, for any ε > 0, lim_{d→∞} Pr(dist(P_i, P_j) ≥ 2(∆ − ε)) = 1.

Proof When both flats intersect the same unit ball, the minimum distance between them is ≤ 2·radius(B^d_c) = 2, which implies the first part of the lemma. Applying Proposition 3 with dist(P_i, Q_i) ≤ 2 (by the first part of the Lemma), we get that for any ε the distance between the two flats approaches 2(∆ − ε).

Proposition 3 Let P_i, Q_i and R_j be flats intersecting the ∆-distinct balls B_{c_i} and B_{c_j}, respectively (P_i and Q_i intersect B_{c_i}, and R_j intersects B_{c_j}). Then, for any ε > 0, lim_{d→∞} Pr(dist(R_j, Q_i) ≥ (∆ − ε) dist(P_i, Q_i)) = 1.
Note: This proposition appears in [7] for random points. Here we reproduce a proof for the distance between flats.

Proof Let µ = E(dist(P_i, Q_i)), V = dist(P_i, Q_i)/µ and W = dist(R_j, Q_i)/µ. Using Lemma 5 and the weak law of large numbers we get that V →_p 1. Proposition 4 implies that W →_p ∆. Thus,

W/V = (dist(R_j, Q_i)/µ)/(dist(P_i, Q_i)/µ) = dist(R_j, Q_i)/dist(P_i, Q_i) →_p ∆ (see Corollary 1 in [9]). By the definition of convergence in probability, for any ε > 0, lim_{d→∞} Pr(|dist(R_j, Q_i)/dist(P_i, Q_i) − ∆| ≤ ε) = 1. So lim_{d→∞} Pr(∆ − ε ≤ dist(R_j, Q_i)/dist(P_i, Q_i) ≤ ∆ + ε) = 1, which implies lim_{d→∞} Pr(dist(R_j, Q_i) ≥ (∆ − ε) dist(P_i, Q_i)) = 1.

Lemma 6 implies the correctness of our algorithms when d → ∞. The following Propositions argue that for any dimension d, when we drop the midpoints whose corresponding distance is greater than 2, we eliminate at least a linear fraction λ of the noisy midpoints. Proposition 4 shows that the mean distance between flats is linear with respect to ∆. Next, we use this result to prove that we drop enough flats, as presented in Proposition 5.

Proposition 4 Let P_i and P_j be flats intersecting the ∆-distinct balls B_{c_i} and B_{c_j}, respectively; then E[dist(P_i, P_j)] is a linear function of ∆.

Proof Denote the mean distance integral between two k-flats in R^d by S = E[dist(P_i, P_j)]. Given that the probability density function of the flats is ρ, the expected value of the distance function is given by the inner product of the functions dist and ρ. E.g., for the d-dimensional lines P(1) = (α_1t_1 + 1, α_2t_1,...,α_dt_1) and P(−1) = (β_1t_2 − 1, β_2t_2,...,β_dt_2), with α_i, β_i ∼ N(µ,σ), the mean distance integral is

S = ∫_{−∞}^{∞} dist(P(1), P(−1)) ρ(α_i, β_i) dα_1 dα_2 ··· dα_d dβ_1 dβ_2 ··· dβ_d

Let S_0 be the solution of the integral for two k-flats intersecting the unit ball B^d_{c_0}, and let S_1 be the solution of S for two antipodal k-flats tangent to the surface of B^d_{c_0}; then by Proposition 8 below we get 0 < S_0 < S_1 ≤ 2. Observing that S_1 is the same for any antipodal pair of flats tangent to the surface of B^d_0, w.l.o.g. we use the pair of flats (P(−1), P(1)). Denote by S_1 and S_∆ the solutions of the integral S for the pairs (P(−1), P(1)) and (P(−∆), P(∆)), respectively. Proposition 9 (below) argues that the density function is invariant while the distance scales in only one direction, which implies that a linear change in ∆ scales the mean distance by ∆, which completes the proof.

Proposition 5 Let P(∆), P(−∆) ∈ P be two k-flats passing through the points (∆,0,...,0) and (−∆,0,...,0), respectively, and let X denote the random variable dist(P(∆), P(−∆)). The probability p that dist(P(∆), P(−∆)) > 2 is strictly greater than zero, i.e., p = Pr(X > 2) > 0.

Proof Among all the non-negative random variables Y whose mean is equal to S_1∆ and for which Pr(Y ≤ 2∆) = 1, we would like to find the one that maximizes the probability Pr(Y ≤ 2); hence, we define Y to take the value 2 if dist(P_i(∆), P_j(−∆)) ≤ 2 and 2∆ otherwise. Proposition 9 (below) implies S_∆ = S_1∆. Using the definition of expectation, and writing q = Pr(Y ≤ 2), we get E(Y) = 2q + 2∆(1 − q) = S_∆ = S_1∆. Solving the equation and generating a power series expansion for q, we get q = (1 − S_1/2) + (1 − S_1/2)(1/∆) + (1 − S_1/2)(1/∆²) + o(1/∆³). Proposition 10 below implies that S_1 < 2. Substituting this result into the power series expression yields 0 < q < 1. Since q is a bound on the probability to accept the flats P_i(∆) and P_j(−∆), it holds that the probability p = Pr(X > 2) to drop P_i(∆) and P_j(−∆) is p ≥ 1 − q > 0. That is, we drop a p fraction of the \binom{n}{2} pairs we have.

Note that Proposition 5 implies that the fraction λ of the flats we drop is at least linear for pairs of flats passing through the exact points (∆,0,...,0) and (−∆,0,...,0). The proof also holds for a pair of flats intersecting the balls centered at (∆,0,...,0) and (−∆,0,...,0), by adding the balls' radius. The following propositions were mentioned in the above proofs and appear here to enhance the readability of the text.

Proposition 6 Let ℓ′ and ℓ′′ be two random 2D lines that intersect the unit disk and p be their intersecting point. With probability 1 − O(1) the distance r from p to the origin is bounded.

Proof Observing that the maximum distance between the intersection point p and the origin occurs when ℓ′ and ℓ′′ are tangent to the disk, we consider only this case. Let φ ∈ [0,π] be the intersection angle between the two lines (see Figure 4.1). When φ = π the two lines coincide and r equals 1 (the disk radius). While reducing φ toward the zero angle, r increases toward infinity (i.e., when φ → 0 the lines are parallel and r → ∞). For example, when φ = π/2 then r = √2, for φ = π/3 we get r = 2, and generally r = 1/sin(φ/2). Since φ is uniformly distributed over [0,π], we get that with probability 1 − ε the distance r is ≤ r_0, where ε = 2arcsin(1/r_0)/π. E.g., with probability ≥ 2/3 we have r ≤ 2.

Proposition 7 Let P^{c_0}_i and P^{c_0}_j be two k-flats that intersect the unit ball B^d_{c_0}, and let p be their midpoint. With probability 1 − O(1), the distance r from p to the origin is bounded.

Proof Starting with three-dimensional space, the relation between the two flats can be expressed using the distance between them, their relative direction (azimuthal angle) and their relative orientation (polar angle). Fixing the orientation, the variation of the direction is described by the 2D case (see Proposition 6). When the two flats' directions cause a small distance between p and the origin, changing the orientation will not increase this distance (but may decrease it). Generally, changing the flat orientation will increase the probability that the distance from p to the origin is bounded.


Fig. 4.1 Two dimensional pair of flats intersecting a disk. While reducing the intersection angle φ toward the zero angle, the distance r between the intersecting point p and the center of the disk increases towards infinity.

For general d, using the same idea, the flats can be represented in spherical coordinates (i.e., the coordinates consist of a radial coordinate and d − 1 angular coordinates), which implies that the distance between the midpoint and the ball's center is bounded with a probability that increases as d increases; see the illustration in Figure 4.2.

Proposition 8 Let S0 and S1 be the mean distance integral solutions as defined above, then 0 < S0 < S1 ≤ 2.

Proof Since the dimension of the flats is ≤ d/2, the probability that the flats intersect is 0, which implies that 0 < S_0. The mean distance integral S contains a density function ρ(µ,σ) and a geometric distance dist(·,·). The density depends only on the mean and the variance of the coefficients, which are invariant. The distance function attains its maximum value for an antipodal pair, which implies that S_0 < S_1. Finally, since the two flats intersect the same unit ball, the minimum distance between them is ≤ 2·(the ball radius) = 2, which implies S_1 ≤ 2, as needed.

Proposition 9 Let S_1 and S_∆ be the integral solutions as defined above; then S_∆ = S_1∆.

Proof The mean distance of a pair of k-flats acts symmetrically on the two pairs (P(−1), P(1)) and (P(−∆), P(∆)). Namely, the density function is invariant while the distance scales in only one direction, which implies a linear change in ∆ in the solution of S, i.e.,

S∆ = ∆S1.

Proposition 10 Let S_1 be the integral solution as defined above; then S_1 < 2.

Fig. 4.2 The distance between the midpoint and the ball's center decreases as d increases. For a unit ball centered at the origin (dashed line), we plot the midpoints (first two coordinates) of a set of 50 flats with dimension d = 9 (red dots), d = 18 (yellow dots), d = 36 (green dots), d = 72 (light blue dots) and d = 144 (blue dots). Note that midpoints from higher dimensions are plotted above those from lower ones. We can observe that most of the midpoints are located inside the unit ball and centered around the origin. Moreover, as the dimension increases, the variance of the location of the midpoints decreases.

Proof By its definition, S_1 is the mean distance between two flats passing through the points (−1,0,...,0) and (1,0,...,0). Fixing the flat P(−1), we can observe that if P(1) intersects the ball B_(−1,0,...,0), then dist(P(−1), P(1)) ≤ 1. Let α denote the probability of this event, i.e., α = Pr(P(1) ∩ B_(−1,0,...,0) ≠ ∅). To complete the proof, it is enough to prove that α > 0 (since S_1 = E(dist(P(−1), P(1))) ≤ 1·α + 2·(1 − α)). Observing that the set of flats P(1) intersecting B_(−1,0,...,0) forms a spherical cap with nonzero volume (relative to the measure of all the flats), one can show that the probability that two random flats passing through (−1,0,...,0) and (1,0,...,0) have distance ≤ 1 is greater than zero, i.e., α > 0.

4.4 Algorithm

Algorithm 4 presents the pseudocode for the clustering procedure for a set of n random k-flats in R^d. In the first step, we call the procedure FINDMIDPOINTS to find the midpoints between all pairs of flats (using Proposition 1) and calculate the distance between every pair (as described in Proposition 2). We save only the midpoints whose corresponding distance is smaller than two.

Algorithm 4 Data clustering using flats' minimum distances
Input: a set P of n random k-flats in R^d, the number of clusters m.
Output: a set C of m clusters
1: p ← FINDMIDPOINTS(P)
2: C ← DEFINECLUSTERS(p)   ◃ density-based clustering algorithm on the set p, e.g., DBSCAN
3: M ← n/m   ◃ threshold for the size of every cluster
4: for each c_k ∈ C do
5:   if size(c_k) < M then
6:     C ← C \ c_k
7:   end if
8: end for
9: Return C

1: procedure FINDMIDPOINTS(P)
2:   p ← ∅
3:   for each (P_i, P_j) ∈ P do
4:     p_{ij} ← midpoint(P_i, P_j)
5:     d_{ij} ← dist(P_i, P_j)
6:     if d_{ij} ≤ 2 then
7:       p ← p ∪ {p_{ij}}
8:     end if
9:   end for
10:   Return p
11: end procedure

Using Lemmas 5 and 6, we explore the potential clusters by finding the high-density locations of midpoints. We can do this by using classical K-Means-like algorithms. However, since a (small) fraction of the midpoints are ‘noise’, i.e., derived from flats intersecting different balls, we would like to ignore those midpoints. Hence, we recommend using an algorithm with specialized noise handling, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), as described in [19].

Next, we use our assumption (see Section 4.2) about the equal size of the different clusters and define a threshold M equal to n/m. We then eliminate all the clusters whose density is low (as defined by the threshold M).

Note that the algorithm outputs a set of clusters C_res = {c_k}, such that each cluster contains midpoints c_k = {p_{ij}} indicating that the flats P_i, P_j are in the cluster c_k.
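A compact way to prototype Algorithm 4 is to combine the midpoint_and_distance sketch from Section 4.3 with scikit-learn's DBSCAN; the eps and min_samples values, the helper name, and the choice of thresholding by the number of distinct flats are illustrative assumptions rather than values prescribed by the analysis.

import numpy as np
from itertools import combinations
from sklearn.cluster import DBSCAN

def cluster_flats(flats, m, eps=1.0, min_samples=5):
    """Sketch of Algorithm 4: keep only midpoints of pairs at distance <= 2,
    cluster them with DBSCAN, and drop clusters smaller than n/m.
    `flats` is a list of (E, e) pairs describing {x : Ex = e}."""
    pairs, midpoints = [], []
    for (i, (E, e)), (j, (F, f)) in combinations(enumerate(flats), 2):
        p, dist = midpoint_and_distance(E, e, F, f)
        if dist <= 2.0:                                  # FINDMIDPOINTS filter
            pairs.append((i, j)); midpoints.append(p)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.array(midpoints))
    clusters = {}
    for (i, j), lab in zip(pairs, labels):
        if lab != -1:                                    # -1 marks DBSCAN noise
            clusters.setdefault(lab, set()).update((i, j))
    threshold = len(flats) / m
    return [sorted(c) for c in clusters.values() if len(c) >= threshold]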

4.5 Experimental Studies of k-Flat Clustering

As part of the main theorem proof, Lemma 6 tells us what happens when we take the dimensionality to infinity. In practice, it is interesting to know at what dimensionality we can anticipate that the flats' pairwise projection to midpoints implies a good separation into different clusters. In other words, Lemma 6 describes some convergence, but does not indicate the convergence rate. We addressed this issue through empirical studies. We ran the following experiments using a synthetic data set, producing the flats' inputs with normally distributed location and direction, as described in the model assumptions. Without loss of generality, we chose the balls' centers to be c_1 = (−100,0,...,0) and c_2 = (100,0,...,0), and k (the flats' dimension) equal to d/3. Each cluster contains 10 random flats; all together we have 20 random flats. Our algorithm computes the midpoint for all pairs of flats; all together we have 190 center points. See Figure 4.3, which shows four different experiments, each done for a different dimension. Those center points are divided into three groups: the first 45 are shown as a red dot close to the center c_1. Furthermore, they are close to one another, so that the eye cannot distinguish between them. The second group is also comprised of 45 points, shown as a red dot to the right, close to c_2. The third group has 100 points, centered around the 0 point. Those points are shown in black, with a distance of > 2. This means that the algorithm rejects all the points in the third group, as anticipated. The four images illustrate how the variance decreases as the dimension increases. This illustrates that our algorithm performs better for higher dimensions.

4.6 Clustering with different group sizes

Algorithm 4, presented above, works fine for a set of m clusters in which each cluster contains the same number of flats. We need this assumption to ensure that we will not identify ‘noisy’ midpoints (i.e., midpoints derived from flats intersecting different balls) as a true cluster. In this section we would like to relax the equal-size clusters assumption. For a sufficiently large dimension d we do not need the assumption concerning the equal size of the clusters, since we show in Lemma 6 that when d → ∞ we drop all the noisy midpoints (since w.h.p. their distance is larger than two). For a general dimension d, we show that we drop a fraction λ of the noise and argue (as in Proposition 5) that this fraction is at least linear. Hence, instead of assuming equal-size clusters, we can assume that the difference between the clusters is at most λ; we call these λ-close size clusters. Moreover, when the data also contains some very big clusters (i.e., of size greater than the joint number of all


Fig. 4.3 Given two sets of flats from two clusters located at B(−100, 0, ..., 0) and B(100, 0, ..., 0), the black points are the midpoints of all pairs and the red points indicate those that remain after eliminating pairs whose corresponding distance is greater than 2. The four panels correspond to d = 9, k = 3; d = 30, k = 10; d = 60, k = 20; and d = 90, k = 30.

Note that the other model assumptions hold (see Section 4.2), i.e., for the given set of k-flats P:

• Two independent random flats are in general position with probability one.

• 1 ≤ k ≤ ⌊d/2⌋.

• The (unknown) balls B^d_{c_1}, ..., B^d_{c_m} are ∆-distinct with probability one.

• P is a superset of m groups P = {P_1, ..., P_m}, such that every group P_i ∈ P contains flats that intersect the ball B^d_{c_i}. Moreover, each flat P ∈ P_i has normally distributed location and direction at the ball B^d_{c_i}.

Given a set P of n flats and the number of clusters m, we find the midpoint set using FINDMIDPOINTS (as in Algorithm 4). Next, we call the recursive procedure RECCLUSTERING, which defines the potential clusters (DEFINECLUSTERS, as described in Algorithm 4 above) and checks whether there exists a big cluster c1 whose size is greater than n/2 (where n denotes the number of remaining flats). If such a cluster is found, we eliminate all the flats belonging to it and recursively call the procedure again. Otherwise, we assume that all the clusters are λ-close-size, so the algorithm recognizes them in the same way as Algorithm 4.

Algorithm 5 Clustering with different set sizes
Input: a set P of n different k-flats in R^d, the number of clusters m, the fraction λ.
Output: a set Cres of clusters
1: p ← FINDMIDPOINTS(P)
2: Cres ← RECCLUSTERING(p, ∅, m, n, λ)
3: Return Cres

1: procedure RECCLUSTERING(p, Cres, m, n, λ)
2:   C ← DEFINECLUSTERS(p)
3:   c1 ← arg max_{c_k ∈ C} SIZE(c_k)
4:   if SIZE(c1) > n/2 then
5:     Cres ← Cres ∪ c1
6:     p ← p \ {p_ij : p_ik ∈ c1 or p_kj ∈ c1}
7:     n ← n − SIZE(c1)
8:     Cres ← RECCLUSTERING(p, Cres, m − 1, n, λ)
9:   else
10:    for each c_k ∈ C do
11:      if SIZE(c_k) < λn/m then
12:        C ← C \ c_k
13:      end if
14:    end for
15:  end if
16:  Return Cres
17: end procedure

Proposition 11 argues the correctness of the above approach.

Proposition 11 Given the set P of n different k-flats in R^d, let s1 denote the largest set of flats that intersect the same unit ball. When the size of s1 is greater than n/2, Algorithm 5 correctly identifies s1's flats as a true cluster.

Proof Let n1 be the number of flats in s1. Algorithm 5 finds a cluster c1 containing α·n1(n1 − 1)/2 midpoints, all of which are produced by pairs of s1's flats, and also a cluster c2 containing β·n1(n − n1)/2 midpoints produced by a mixture of flats from s1 (i.e., c2 contains the midpoints {p_ij : P_i ∈ s1, P_j ∉ s1}). Lemmas 5 and 6 imply that w.h.p., for a sufficiently large dimension, α > β. Using the assumption about the size of s1 we get that n1(n1 − 1)/2 > n1(n − n1)/2; hence c1 is the largest cluster and is correctly identified (in Line 4), while the wrong cluster c2 is dropped (see Line 6).
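For readers who prefer code to pseudocode, the following Python sketch mirrors RECCLUSTERING of Algorithm 5; the helper define_clusters is a hypothetical stand-in for DEFINECLUSTERS (e.g., DBSCAN over the midpoints), and the midpoints are kept as a dictionary keyed by flat-index pairs.

def rec_clustering(midpoints, clusters_out, m, n, lam, define_clusters):
    # midpoints: dict mapping a flat-index pair (i, j) to its midpoint
    # define_clusters: returns clusters as collections of (i, j) pairs
    clusters = define_clusters(midpoints)
    if not clusters:
        return clusters_out
    c1 = max(clusters, key=len)                        # line 3: the largest cluster
    if len(c1) > n / 2:                                # line 4: a dominant cluster exists
        clusters_out.append(c1)                        # line 5
        peeled = {i for pair in c1 for i in pair}      # flats belonging to c1
        midpoints = {pair: mp for pair, mp in midpoints.items()
                     if not set(pair) & peeled}        # line 6: drop their midpoints
        return rec_clustering(midpoints, clusters_out, m - 1,
                              n - len(c1), lam, define_clusters)   # lines 7-8
    # lines 10-14: otherwise keep only the lambda-close-size clusters
    clusters_out.extend(c for c in clusters if len(c) >= lam * n / m)
    return clusters_out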

4.7 Sublinear and distributed algorithms

Given a set of n k-flats in R^d, the algorithms we presented above find the distance and the midpoint of every pair in O((kd)^ω) time (where ω is the matrix-multiplication exponent), using the least squares method. Doing this for all n(n − 1)/2 pairs yields a poly(n, k, d) running time algorithm. One can achieve polylogarithmic time using sampling: instead of running the algorithms on the whole set of n flats, we apply them to a sample of log n flats picked uniformly at random. The main reason we can use sampling is our assumption about the normally distributed data, namely, that the given set of flats P is a superset of m groups P = {P_1, ..., P_m}, such that every group P_i ∈ P contains flats P ∈ P_i that have normally distributed location and direction at the ball B^d_{c_i}. Another way to improve efficiency is to execute the algorithm in a distributed fashion. We describe the distributed algorithm in the procedure DISTRIBUTEDFINDMIDPOINTS, which replaces the procedure FINDMIDPOINTS of Algorithm 5. Given a set of q processors, each with access to the whole set of flats P, every processor repeatedly picks a random pair of flats and calculates their midpoint. If the distance between the pair is less than two, the processor saves the midpoint in shared memory (stored in the set p). The processors continue this procedure until enough midpoints have been collected, as defined by the threshold τ. The clustering itself can then be done by any of the processors, as described in Algorithm 5. The correctness of this algorithm follows from the birthday paradox, which promises that with high probability there will be no overlap between the processors due to the small fraction being sampled.
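A minimal numpy sketch of the pairwise least-squares computation is given below; representing a k-flat by an anchor point a and a d x k basis U is our assumption about the input encoding, not necessarily the exact representation used in the implementation.

import numpy as np

def flat_pair_midpoint(a1, U1, a2, U2):
    # A k-flat is {a + U t : t in R^k}; the closest pair of points is found by
    # the least-squares solve of [U1 | -U2] z ~ a2 - a1 (cf. Section 4.7).
    A = np.hstack([U1, -U2])
    z, *_ = np.linalg.lstsq(A, a2 - a1, rcond=None)
    s, t = z[:U1.shape[1]], z[U1.shape[1]:]
    x1, x2 = a1 + U1 @ s, a2 + U2 @ t                  # closest points on each flat
    return np.linalg.norm(x1 - x2), (x1 + x2) / 2      # distance and midpoint

# toy check with two lines (1-flats) in R^3: distance 1, midpoint (0, 0, 0.5)
dist, mid = flat_pair_midpoint(np.zeros(3), np.array([[1.], [0.], [0.]]),
                               np.array([0., 1., 1.]), np.array([[0.], [1.], [0.]]))
print(dist, mid)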

1: procedure DISTRIBUTEDFINDMIDPOINTS(P, p, τ)
2:   while SIZE(p) < τ do
3:     randomly pick a pair of flats P_i, P_j ∈ P
4:     p_ij ← midpoint(P_i, P_j)
5:     d_ij ← dist(P_i, P_j)
6:     if d_ij ≤ 2 then
7:       p ← p ∪ p_ij
8:     end if
9:   end while
10: end procedure

Note that we can also replace the sequential DEFINECLUSTERS procedure in Algorithm 5 with a distributed one; for the DBSCAN algorithm there also exists a distributed version (see, e.g., [26]).
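Since every iteration of DISTRIBUTEDFINDMIDPOINTS is independent, the sampling loop below (a sequential sketch that reuses flat_pair_midpoint from the previous sketch) parallelizes trivially across processors writing to a shared set; the parameter tau plays the role of the threshold τ.

import numpy as np

def sample_midpoints(flats, tau, rng):
    # flats: list of (a, U) pairs; pairs at distance greater than 2 are rejected,
    # mirroring the test in line 6 of DISTRIBUTEDFINDMIDPOINTS.
    collected = []
    while len(collected) < tau:
        i, j = rng.choice(len(flats), size=2, replace=False)
        dist, mid = flat_pair_midpoint(*flats[i], *flats[j])
        if dist <= 2:
            collected.append(((i, j), mid))
    return collected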

4.8 Discussion

The probability of flats' intersections appears in different settings in [29] and [49]. Using a polar representation, [29] measures the probability that k-flats passing through a ball intersect each other inside the ball. E.g., for d = 2 and k = 1, two random lines intersecting a disk will intersect each other inside the disk with probability 1/2, and for d = 3 and k = 2, three planes intersecting a convex region K will have their common point inside K with probability π²/48. These results are generalized in [49] for n randomly chosen subspaces f_{k_i} (i = 1, 2, ..., n) in E^d, with k_1 + k_2 + ... + k_n ≥ (n − 1)d, that intersect a convex body K. The probability that f_{k_1} ∩ f_{k_2} ∩ ... ∩ f_{k_n} ∩ K ≠ ∅ is formalized by the integral ∫_{f_{k_1} ∩ f_{k_2} ∩ ... ∩ f_{k_n} ∩ K ≠ ∅} df_{k_1} ∧ df_{k_2} ∧ ... ∧ df_{k_n}. In [49] (13.39), (14.2) it is shown that the measure of all k-flats f_k that intersect a convex body K in E^d is (O_{d−1} ··· O_{d−k−1}) / ((d − k) O_{k−1} ··· O_0), where O_d denotes the surface area of the d-dimensional unit sphere. Another related result one can extract from [49] is the probability that a hyperplane L_{d−1} and a line L_1 that intersect a ball have their intersection inside the ball, which equals 1/d. A detailed description of the above results appears in the Appendix below.

The studies of [29] and [49] rely on a polar representation of the data (i.e., coordinates consisting of a radial coordinate and d − 1 angular coordinates), which gives high weight to the first coordinate while the weight of the following coordinates decreases (since the coefficients are multiples of sines and cosines). Hence, our assumption of a normal distribution over the different coordinates is not fulfilled.


Fig. 4.4 Pairwise intersection of two-dimensional almost-orthogonal flats from three disjoint balls. Given a set of 1000 random lines passing through the three unit balls B²(−100,−100), B²(100,−10), B²(−20,100), we plot the intersection point of every almost-orthogonal pair, i.e., every pair of lines whose intersection angle is in [π/2 ± ε], where ε = 0.001. One can identify the three original centers (see red arrows) by the points concentrating around their regions. In addition, the rest of the points are located along the structure connecting the centers.

Another direction we examined was to find the unknown balls (which define the clusters) using the intersection of orthogonal flats. The justification for focusing on orthogonal sets comes from the "curse of dimensionality" phenomenon [6], where one manifestation of the "curse" is that in high dimensions almost any two vectors are almost orthogonal [44]. Starting with the two-dimensional case, we generated a random set of flats intersecting disjoint balls and picked all the almost-orthogonal pairs, namely, pairs whose intersection angle is in [π/2 ± ε] (where ε depends on n, the number of flats). Interestingly, as presented in Figure 4.4, the pairwise intersections can be described by distinguishing two sets: those arising from lines passing through the same ball, and those arising from different balls. The first set concentrates around the original balls' centers, while the second set creates a structural figure corresponding to the relative geometric positioning of the original balls. This geometric structure might help in defining the original unit balls, but the exact definition of ε and the generalization to higher dimensions should be examined in further research.
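The two-dimensional experiment behind Figure 4.4 can be reproduced roughly as follows; the way a "random line through a ball" is sampled here (a Gaussian point near the center plus a uniformly distributed direction) is an assumption on our part, and ε = 0.001 is taken from the figure caption.

import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[-100., -100.], [100., -10.], [-20., 100.]])
eps = 0.001                                   # angular tolerance around pi/2

points, dirs = [], []
for c in centers:
    for _ in range(50):                       # random lines through each unit ball
        points.append(c + rng.normal(size=2))
        theta = rng.uniform(0, np.pi)
        dirs.append([np.cos(theta), np.sin(theta)])
points, dirs = np.array(points), np.array(dirs)

hits = []
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        ang = np.arccos(np.clip(abs(dirs[i] @ dirs[j]), 0, 1))
        if abs(ang - np.pi / 2) > eps:
            continue                          # keep only almost-orthogonal pairs
        A = np.column_stack([dirs[i], -dirs[j]])
        t, s = np.linalg.solve(A, points[j] - points[i])
        hits.append(points[i] + t * dirs[i])  # intersection point of the pair
print(len(hits), "almost-orthogonal intersection points")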

Canas et al. [10] study the problem of estimating a manifold from random k-flats. Given collections of k-flats in R^d, their (Lloyd-type) algorithm, analogous to k-means, aims at finding the set of k-flats that minimizes an empirical reconstruction error over the whole collection. Although they also deal with k-flat inputs, their framework and goals are different from ours and, specifically, impractical for the clustering task. The distance between pairs of k-flats, as well as the geometry of the midpoints, was studied in [51] and generalized in [25]. Although these papers consider the probabilistic aspects of flat intersections, as we do, they focus only on stationary processes (such as Poisson processes) that do not satisfy the uniform and Gaussian distributions we assume here. As mentioned above, Lee & Schulman [34] presented algorithms and hardness results for clustering general k-flats in R^d. After proving that the exponential dependence on k (the internal dimension of the flat) and m (the number of clusters) is inevitable, they suggest an algorithm which runs in time exponential in k and m but linear in n and d. Their theoretical results are based on the assumption that the flats are axis-parallel. Our model overcomes their exponential bounds thanks to the randomness assumption.

4.9 Conclusion

The analysis of incomplete data is one of the major challenges in the scope of big data. Typically, data objects are represented by points in R^d; we suggest that incomplete data corresponds to affine subspaces. With this motivation we studied the problem of clustering k-flats, where two objects are similar when the Euclidean distance between them is small. This chapter presented a simple clustering algorithm for k-flats in R^d, as well as a study of the probability of pairwise intersection of these objects. The key idea of our algorithm is to represent pairs of flats by midpoints, which preserve distance features. This way, the geometric location of midpoints that arise from the same cluster identifies the center of the cluster with high probability (as shown in Lemma 5). Moreover, we also show (Lemma 6) that when the dimension d is big enough, the corresponding distance of flats that arise from different clusters approaches the mean distance of the clusters' centers. Using this, we can eliminate the irrelevant midpoints with high probability. For low dimensions, we did not identify the exact probability that we drop all the irrelevant flats (i.e., those that arise from different clusters); however, we do show that we eliminate a linear fraction λ of those irrelevant flats. In addition, using experimental results, we support our claim that the algorithm works well in low dimensions as well.

Finally, we showed that we can achieve a polylogarithmic running time using sampling; we also illustrated a distributed version of the algorithm. Future work includes proving that λ → 1 for a general dimension d (we show this only for d → ∞). Obtaining this result will make our algorithm practical for any mixture of cluster sizes.

4.10 Appendix: The probability of flats intersection

The probability of flats' intersection appears in different settings in [29] and [49]. Due to the "Bertrand paradox" (see the explanation in the introduction of [29]), the most natural coordinates to use for the description of flats in the d-dimensional Euclidean space E^d are polar coordinates. Starting with the two-dimensional case, a line in the plane is determined by its distance p from the origin and the angle θ of the normal with the x axis. The equation of the line is

x cos θ + y sin θ − p = 0

The measure of the set of all lines L_1 intersecting a bounded convex set K is [49] (3.12):

m(L_1; L_1 ∩ K ≠ ∅) = ∫_{L_1 ∩ K ≠ ∅} dp ∧ dθ = L = 2π    (4.3)

where L is the length of ∂K (the perimeter of K; for the unit disk it equals 2π). The measure for two random chords of K to intersect inside K is [29] (3.9):

∫ x dp dθ = 2πA    (4.4)

where A is the area of K. Since the measure of each line intersecting K is L, and the lines are taken to be independent, the appropriate measure for a pair of lines is L². This implies that the probability for random lines intersecting a disk to intersect each other inside the disk is

p = 2πA / m(L_1; L_1 ∩ K ≠ ∅)² = 2π·π / (2π)² = 1/2.    (4.5)

This result is unchanged when the radius of the disk changes. (Note: the probability that all the intersection points of n lines lie inside K is less than (n!/(2n)!)(bL/2)^n, where b is the maximal value of the curvature of ∂K; see [53].)
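The value 1/2 in (4.5) is easy to confirm by Monte Carlo simulation; in the sketch below, lines meeting the unit disk are sampled from the kinematic measure dp dθ (p uniform in [0, 1], θ uniform in [0, 2π)).

import numpy as np

rng = np.random.default_rng(1)
trials, inside = 100_000, 0
for _ in range(trials):
    p = rng.uniform(0, 1, size=2)             # two random lines x cos(th) + y sin(th) = p
    th = rng.uniform(0, 2 * np.pi, size=2)
    A = np.array([[np.cos(th[0]), np.sin(th[0])],
                  [np.cos(th[1]), np.sin(th[1])]])
    if abs(np.linalg.det(A)) < 1e-12:
        continue                              # parallel lines (probability zero)
    x = np.linalg.solve(A, p)                 # their intersection point
    inside += np.dot(x, x) <= 1.0             # does it fall inside the unit disk?
print(inside / trials)                        # approaches 1/2, matching (4.5)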

Random planes in 3D space.

In three-dimensional space, the original set of flats may consist of lines or planes. We continue here by assuming a set of planes (the probabilities for lines and for a mixture of lines and planes appear as part of the general case). The appropriate definition for planes is given by the polar equation [29] (4.1):

x sin θ cos φ + y sin θ sin φ + z cos θ = p    (4.6)

and the element of measure is

sin θ dθ dφ dp    (4.7)

where 0 ≤ θ ≤ π and 0 ≤ φ ≤ 2π.

To calculate the probability that three planes intersecting K have their common point inside K, we need the value of the integral

m(L_{2_i}, L_{2_j}, L_{2_ℓ}; L_{2_i} ∩ L_{2_j} ∩ L_{2_ℓ} ∩ K ≠ ∅) = ∫_{L_{2_i} ∩ L_{2_j} ∩ L_{2_ℓ} ∩ K ≠ ∅} dL_{2_i} ∧ dL_{2_j} ∧ dL_{2_ℓ}

As in the planar case, we first compute the measure of all planes L_2 that meet a convex region K, which is

m(L_2; L_2 ∩ K ≠ ∅) = ∫_{L_2 ∩ K ≠ ∅} dL_2 = 4π    (4.8)

The proof is given by Minkowski; see [29], Section 4.7.

Now we calculate the measure of triples of planes L_{2_i}, L_{2_j}, L_{2_ℓ} that meet K and intersect each other inside K. Suppose two of the planes intersect inside K, and denote the length of the intersection chord by L, i.e., L_{2_i} ∩ L_{2_j} ∩ K = L. The measure of all planes which intersect L is πL [29] (4.3). The integral of L over all positions of one of these planes is (1/2)π²A [29] (4.7), where A is the area of the intersection of the other plane with K. In turn, the integral of A, the area of intersection, over all intersecting planes is 2πV. Hence, the measure of all such triples is π⁴V, and the required probability is

p = π⁴V / m(L_2; L_2 ∩ K ≠ ∅)³ = π⁴·(4/3)π / (4π)³ = π²/48    (4.9)

Note that, as in the planar case, this probability does not depend on the radius R of the sphere: the numerator scales as R³ (through the volume V = (4/3)πR³), and so does the denominator (through m(L_2; L_2 ∩ K ≠ ∅) = 4πR).
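The constant π²/48 ≈ 0.2056 in (4.9) can be checked in the same way; here planes meeting the unit ball are sampled from the measure sin θ dθ dφ dp, i.e., a uniform unit normal and p uniform in [0, 1].

import numpy as np

rng = np.random.default_rng(2)
trials, inside = 100_000, 0
for _ in range(trials):
    N = rng.normal(size=(3, 3))                     # three random plane normals ...
    N /= np.linalg.norm(N, axis=1, keepdims=True)   # ... uniform on the unit sphere
    p = rng.uniform(0, 1, size=3)                   # planes n . x = p meeting the unit ball
    if abs(np.linalg.det(N)) < 1e-12:
        continue
    x = np.linalg.solve(N, p)                       # the common point of the three planes
    inside += np.dot(x, x) <= 1.0
print(inside / trials, np.pi**2 / 48)               # both close to 0.2056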

Random r-planes in E^d.

Given n randomly chosen subspaces L_{r_i} (i = 1, 2, ..., n), such that r_1 + r_2 + ... + r_n ≥ (n − 1)d, that intersect the d-dimensional ball B^d, we would like to find the probability that L_{r_1} ∩ L_{r_2} ∩ ... ∩ L_{r_n} ∩ B^d ≠ ∅, namely, to solve the integral:

m(L_{r_1}, L_{r_2}, ..., L_{r_n}; L_{r_1} ∩ L_{r_2} ∩ ... ∩ L_{r_n} ∩ K ≠ ∅) = ∫_{L_{r_1} ∩ ... ∩ L_{r_n} ∩ K ≠ ∅} dL_{r_1} ∧ dL_{r_2} ∧ ... ∧ dL_{r_n}

Mimicking the approach used in the low-dimensional cases, we have to calculate the measure of all r-planes L_r that intersect B^d, and also to find the measure of configurations whose common intersection is interior to B^d. Let O_d denote the surface area of the d-dimensional unit sphere and κ_d the volume of the d-dimensional unit ball. Their values are:

O_d = 2π^((d+1)/2) / Γ((d+1)/2);    κ_d = O_{d−1}/d = 2π^(d/2) / (d Γ(d/2))    (4.10)

where Γ is the Gamma function. For instance, O_0 = 2, O_1 = 2π, O_2 = 4π, O_3 = 2π². The measure of all r-planes L_r that intersect B^d appears in [49] (13.39), (14.2):

m(L_r; L_r ∩ B^d ≠ ∅) = (O_{d−1} ··· O_{d−r−1}) / ((d − r) O_{r−1} ··· O_0)    (4.11)

Santaló [49] also shows that

p(L_p ∩ L_q ∩ B^d ≠ ∅; p + q = d) = p! q! O_{d−1} κ_d / ((d − 1)! O_{p−1} O_{q−1})

p(L_p ∩ L_q ∩ B^d ≠ ∅; p + q > d) = 2 (p − 1)! (q − 1)! O_{2d−p−q+1} / ((p + q − d − 1)! (d − 1)! O_{d−p+1} O_{d−q+1})    (4.12)

These results help us when we use the intersection of pairs of flats to locate the ball's center.

Another result that we can extract from the work of [49] is the probability that a hyperplane L_{d−1} and a line L_1 that intersect a ball have their intersection inside the ball:

p(L_1, L_{d−1}; L_1 ∩ L_{d−1} ∩ B^d ≠ ∅) = 1/d    (4.13)

This result can be useful for pairs of records in which one record is missing all but one of its coordinates (L_{d−1}) while the other record is missing only one coordinate (L_1).

Chapter 5

Conclusions

We are experiencing a deluge of big data, as a result not only of the internetization and computerization of our society, but also of the fast development of affordable and powerful data collection and storage devices. The importance of big data is widely recognized both in academia and in industry. The abundantly growing data, both in size and form, poses a fundamental challenge for big data analytics: how to efficiently handle and analyze such data in order to bridge the gap between data and information. Data abstraction is an emerging approach in the data analysis field that can be used to capitalize on massive data sets of different types. It not only plays an important role in big data analytics, but also provides a powerful framework for addressing a wide range of data-intensive applications over modern massive data sets.

Our work builds on previous research in data approximation and error-correcting code theory (mainly Chapter 3), as well as geometry, algebraic geometry and probability (Chapters 2 and 4). This interdisciplinary approach is typical of this research area, since big data is heterogeneous and is best modeled in different ways. To fully exploit the power of data algorithms we need to take into account the characteristics of their input data and approach them accordingly. Therefore, designing solutions for large-scale abstraction is an interesting and challenging research goal.

This thesis sheds some light on different aspects of information abstraction and addresses three challenges in various domains: exploration, interpolation and extrapolation. In Chapter 2, we illustrate the exploration challenge for a specific data type: directional wireless sensors. The benefit of the huge number of sensors is reflected in the narrow (tending to zero) communication range each sensor must maintain in order to induce connectivity. In that work we offered a probabilistic solution to a previously studied geometric problem and proved bounds on the threshold that guarantees a connected graph.

Regarding the interpolation aspect, described in Chapter 3, we model the data using algebraic functions, namely polynomials. This representation achieves a succinct description of the data and also provides information on missing features in the set. The difficulty that arises when computing such a representative function is the presence of noise and outliers in the data. We distinguished two different solutions for the noisy and Byzantine appearances. Approaching the problem with error-correcting-code methods, we suggested a way to represent a noisy-malicious input with a multivariate polynomial; this method assumes that the noise is discrete. When the noise is unrestricted, based on the Bernstein-Markov theorem and the Arora-Khot algorithm, we suggested a method to reconstruct algebraic polynomials that handles multidimensional corrupted data. In addition to the algorithmic and mathematical aspects of this work, we believe that its main contribution is in the representation aspect: the idea of translating big data into a succinct representation using well-known algebraic objects.

Following this direction, we continued with the extrapolation study (Chapter 4). Here we are particularly interested in situations where the data is very large and the space is high-dimensional. We work with incomplete data, so that computing a clustering over it supplies a completion of the missing information. We modeled the data so that every data point in R^d corresponds to an affine subspace of dimension k embedded in the d-dimensional Euclidean space. Using probabilistic assumptions and proofs, we presented a simple algorithm that offers a method for discovering clusters in data. These probabilistic results are of independent interest and can serve to better understand the features and behaviors of high-dimensional objects. Information extraction, exploration and prediction have a major impact on our society, and we believe that the contributions presented in this thesis will be valuable for future researchers in the field.

Nomenclature

Greek Symbols

R Real numbers

D The unit disk

∂D The boundary of the unit disk

B^d_c The unit ball of dimension d, centered at c

Pr(·) Probability

E(·) Expectation

V(·) Variance

log Natural logarithm

⟨·,·⟩ Scalar product

Acronyms / Abbreviations

WB Welch and Berlekamp

w.h.p. With high probability, i.e., with probability 1 − o(1) as n tends to infinity

References

[1] Abbe, E., Khandani, A., and Lo, A. (2012). Privacy-preserving methods for sharing financial risk exposures. The American Economic Review, 102(3):65–70.

[2] Ackerman, E., Gelander, T., and Pinchasi, R. (2013). Ice-creams and wedge graphs. Computational Geometry, 46(3):213–218.

[3] Ar, S., Lipton, R. J., Rubinfeld, R., and Sudan, M. (1992). Reconstructing algebraic functions from mixed data. In FOCS, pages 503–512. IEEE Computer Society.

[4] Arora, S. and Khot, S. (2003). Fitting algebraic curves to noisy data. Journal of Computer and System Sciences, 67(2):325–340.

[5] Aschner, R., Katz, M., and Morgenstern, G. (2012). Do directional antennas facilitate in reducing interferences? In Fomin, F. and Kaski, P., editors, Algorithm Theory – SWAT 2012, volume 7357 of Lecture Notes in Computer Science, pages 201–212. Springer Berlin Heidelberg.

[6] Bellman, R. E. (1961). Adaptive control processes: a guided tour, volume 4. Princeton University Press, Princeton.

[7] Bennett, K. P., Fayyad, U., and Geiger, D. (1999). Density-based indexing for approxi- mate nearest-neighbor queries. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 233–243. ACM.

[8] Bertino, E., Bernstein, P., Agrawal, D., Davidson, S., Dayal, U., Franklin, M., Gehrke, J., Haas, L., Halevy, A., Han, J., et al. (2011). Challenges and opportunities with big data.

[9] Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is nearest neighbor meaningful? In Database Theory - ICDT '99, pages 217–235. Springer.

[10] Canas, G., Poggio, T., and Rosasco, L. (2012). Learning manifolds with k-means and k-flats. In Advances in Neural Information Processing Systems, pages 2465–2473.

[11] Caragiannis, I., Kaklamanis, C., Kranakis, E., Krizanc, D., and Wiese, A. (2008). Communication in wireless networks with directional antennas. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 344– 351. ACM.

[12] Carmi, P., Katz, M. J., Lotker, Z., and Rosén, A. (2011). Connectivity guarantees for wireless networks with directional antennas. Computational Geometry, 44(9):477–485.

[13] Daltrophe, H., Dolev, S., and Lotker, Z. (2013a). Big data interpolation: An efficient sampling alternative for sensor data aggregation. ALGOSENSORS 2012, pages 66–77.

[14] Daltrophe, H., Dolev, S., and Lotker, Z. (2013b). Probabilistic connectivity threshold for directional antenna widths (extended abstract). In Structural Information and Communication Complexity - 20th International Colloquium, SIROCCO 2013, Ischia, Italy, July 1-3, 2013, Revised Selected Papers, pages 225–236.

[15] Daltrophe, H., Dolev, S., and Lotker, Z. (2015). Probabilistic connectivity threshold for directional antenna widths. Theor. Comput. Sci., 584:103–114.

[16] Damian, M. and Flatland, R. (2010). Spanning properties of graphs induced by directional antennas. In Electronic Proc. 20th Fall Workshop on Computational Geometry. Stony Brook University, Stony Brook.

[17] Davis, P. (1975). Interpolation and approximation. Dover Publications.

[18] Ditzian, Z. (1992). Multivariate Bernstein and Markov inequalities. Journal of Approximation Theory, 70(3):273–283.

[19] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.

[20] Fasolo, E., Rossi, M., Widmer, J., and Zorzi, M. (2007). In-network aggregation techniques for wireless sensor networks: a survey. Wireless Communications, IEEE, 14(2):70–87.

[21] Goldreich, O. (2011). A brief introduction to property testing. In Studies in Complexity and Cryptography, pages 465–469.

[22] Gross, J. and Trenkler, G. (1996). On the least squares distance between affine subspaces. Linear Algebra and its Applications, 237:269–276.

[23] Gupta, P. and Kumar, P. (1998). Critical power for asymptotic connectivity. In Decision and Control, 1998. Proceedings of the 37th IEEE Conference on, volume 1, pages 1106–1110. IEEE.

[24] Hopcroft, J. and Kannan, R. (2014). Foundations of data science.

[25] Hug, D., Thäle, C., and Weil, W. (2015). Intersection and proximity of processes of flats. Journal of Mathematical Analysis and Applications.

[26] Januzaj, E., Kriegel, H.-P., and Pfeifle, M. (2004). Scalable density-based distributed clustering. In Knowledge Discovery in Databases: PKDD 2004, pages 231–244. Springer.

[27] Jesus, P., Baquero, C., and Almeida, P. (2011). A survey of distributed data aggregation algorithms. arXiv preprint arXiv:1110.0725.

[28] Kahn, J. M., Katz, R. H., and Pister, K. S. (1999). Next century challenges: mobile networking for smart dust. In Proceedings of the 5th annual ACM/IEEE international conference on Mobile computing and networking, pages 271–278. ACM.

[29] Kendall, M. G. and Moran, P. A. P. (1963). Geometrical probability. Griffin London.

[30] Kozma, G., Lotker, Z., and Stupp, G. (2010). On the connectivity threshold for general uniform metric spaces. Information Processing Letters, 110(10):356–359.

[31] Kranakis, E., MacQuarrie, F., and Morales-Ponce, O. (2012). Stretch factor in wireless sensor networks with directional antennae. In Lin, G., editor, Combinatorial Optimization and Applications, volume 7402 of Lecture Notes in Computer Science, pages 25–36. Springer Berlin Heidelberg.

[32] Kriegel, H.-P., Kröger, P., and Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1.

[33] Kumar, U., Gupta, H., and Das, S. (2006). A topology control approach to using directional antennas in wireless mesh networks. In Communications, 2006. ICC ’06. IEEE International Conference on, volume 9, pages 4083–4088.

[34] Lee, E. and Schulman, L. J. (2013). Clustering affine subspaces: hardness and algo- rithms. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 810–827. SIAM.

[35] Liberty, E. (2013). Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581–588. ACM.

[36] Lynch, C. (2008). Big data: How do your data grow? Nature, 455(7209):28–29.

[37] Meester, R. and Roy, R. (1996). Continuum percolation. Cambridge University Press.

[38] Mitzenmacher, M. and Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.

[39] Nürnberger, G. (1989). Approximation by spline functions. Berlin: Springer, 1989, 1.

[40] Penrose, M. (1999). A strong law for the longest edge of the minimal spanning tree. The Annals of Probability, 27(1):246–260.

[41] Pinkus, A. (2000). Weierstrass and approximation theory. Journal of Approximation Theory, 107(1):1 – 66.

[42] Plesník, J. (2007). Finding the orthogonal projection of a point onto an affine subspace. Linear algebra and its applications, 422(2):455–470.

[43] Rajagopalan, R. and Varshney, P. (2006). Data aggregation techniques in sensor networks: A survey.

[44] Rajaraman, A. and Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press.

[45] Rivlin, T. (2003). An introduction to the approximation of functions. Dover Publications.

[46] Ron, D. (2009). Algorithmic and analysis techniques in property testing. Foundations and Trends in Theoretical Computer Science, 5(2):73–205.

[47] Rubinfeld, R. and Shapira, A. (2011). Sublinear time algorithms. SIAM J. Discrete Math., 25(4):1562–1588.

[48] Saniee, K. (2008). A simple expression for multivariate Lagrange interpolation.

[49] Santaló, L. A. (2004). Integral geometry and geometric probability. Cambridge University Press.

[50] Schneider, R. and Weil, W. (2008). Stochastic and integral geometry. Springer Science & Business Media.

[51] Schulte, M. and Thäle, C. (2014). Distances between Poisson k-flats. Methodology and Computing in Applied Probability, 16(2):311–329.

[52] Sudan, M. (1997). Decoding of Reed-Solomon codes beyond the error-correction bound. Journal of Complexity, 13(1):180–193.

[53] Sulanke, R. (1965). Schnittpunkte zufälliger Geraden. Archiv der Mathematik, 16(1):320–324.

[54] Toffler, A., Longul, W., and Forbes, H. (1981). The third wave. Bantam Books, New York.

[55] Ullman, J. D., Aho, A. V., and Hopcroft, J. E. (1974). The design and analysis of computer algorithms. Addison-Wesley, Reading, 4:1–2.

[56] Welch, L. and Berlekamp, E. (1986). Error correction for algebraic block codes. US Patent 4,633,470.

[57] Xue, F. and Kumar, P. (2004). The number of neighbors needed for connectivity of wireless networks. Wireless Networks, 10(2):169–181.