I dedicate this thesis to my husband, Amiel, for his constant support and unconditional love.
Acknowledgements
I would like to express my special appreciation and thanks to my PhD advisors, Professors Shlomi Dolev and Zvi Lotker, for supporting me during these past five years. You have been tremendous mentors for me. I would like to thank you for encouraging my research and for allowing me to grow as a research scientist. Your scientific advice and knowledge and many insightful discussions and suggestions have been priceless. I would also like to thank my committee members, Professors Eitan Bachmat, Chen Avin, and Amnon Ta-Shma, for their helpful comments and suggestions. A heartfelt thanks to the really supportive and active BGU community here in Beer-Sheva and to all my friends who made the research experience something special, in particular Ariel, Dan, Nisha, Shantanu, Martin, Nova, Eyal, Guy and Marina. Special thanks to Daniel and Elisa for proofreading my final draft. A special thanks to my family. Words cannot express how grateful I am to my mother-in-law, father-in-law, my mother, and father for all of the sacrifices that you've made on my behalf. Finally, I would like to acknowledge the most important person in my life: my husband, Amiel. He has been a constant source of strength and inspiration. There were times during the past five years when everything seemed hopeless, and I can honestly say that it was only his determination and constant encouragement (and sometimes a kick on my backside when I needed one) that ultimately made it possible for me to see this project through to the end.
Abstract
The abundance of data is forcing us to redefine many scientific and technological fields, with almost any environment becoming a potential source of Big Data. The advent of Big Data introduces important innovations: the availability of additional external data sources, previously unknown dimensions, and questionable consistency pose new challenges to computer scientists, demanding a general reconsideration that involves tools, software, methodologies and organizations.

This thesis investigates the problem of big data abstraction in the scope of exploration, interpolation and extrapolation. The driving vision of data abstraction is to turn the information overload into an opportunity: the goal of the abstraction is to make our way of processing data and information transparent for an analytic discourse, as well as to provide a tool for completing missing information, predicting unknown features, and filtering noise and outliers.

We confront three aspects of the abstraction problem with gradual levels of generalization. First, we focus on a specific data type and propose a novel solution for exploring the connectivity threshold of wireless data when the number of sensors approaches infinity. Second, we consider how to use polynomials to effectively and succinctly interpolate general data functions while tolerating noise as well as a bounded number of maliciously corrupted outliers. Third, we show how to represent a high-dimensional data set with incomplete information in a way that fulfills the demands of predictive modeling.

Our main contribution lies in rethinking these problems in the context of massive amounts of data, which dictate large volumes and high dimensionality. Information extraction, exploration and extrapolation have a major impact on our society. We believe that the topics investigated in this thesis can have a great practical influence.
Table of contents
List of figures

1 Introduction
  1.1 The Information Age
  1.2 Big Data Abstraction
    1.2.1 Big Data Exploration
    1.2.2 Big Data Interpolation
    1.2.3 Big Data Extrapolation
2 Probabilistic Connectivity Threshold for Directional Antenna Widths
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Notations
    2.2.2 Probability and the relation between Uniform and Poisson distributions
    2.2.3 Covering and Connectivity
  2.3 Centered Angles
    2.3.1 Finding the Connectivity Threshold
  2.4 Random Angle Direction
  2.5 Discussion
  2.6 Appendix
3 Big Data Interpolation using Functional Representation
  3.1 Introduction
  3.2 Discrete Finite Noise
    3.2.1 Handle the discrete noise
    3.2.2 Multidimensional Data
  3.3 Random Sample with Unrestricted Noise
    3.3.1 Polynomial fitting to noisy data
    3.3.2 Byzantine Elimination
  3.4 Discussion
4 Mending Missing Information in Big-Data
  4.1 Introduction
  4.2 Preliminaries
  4.3 k-flats Clustering
  4.4 Algorithm
  4.5 Experimental Studies of k-Flat Clustering
  4.6 Clustering with different group sizes
  4.7 Sublinear and distributed algorithms
  4.8 Discussion
  4.9 Conclusion
  4.10 Appendix: The probability of flats intersection
5 Conclusions
Nomenclature
References

List of figures
2.1 Directional antenna model.
2.2 The communication graph over the disk and the disk's boundary.
2.3 Covering vs. connectivity problems.
2.4 Project nodes from the disk onto antipodal pair on the boundary.
2.5 Transforming the antipodal pair to a node on the boundary.
2.6 Projection a node from the boundary to a node on the disk.
2.7 A node and its intercepted arc.
2.8 The disk's cover expansion.
2.9 Represent the three dimensional variable of the graph using a torus.
2.10 Represent the minimal coverage area by an annulus.
2.11 The possible directions that induce adjacency.
2.12 Generalize to convex fat objects with curvature > 0.
4.1 Two dimensional pair of flats intersecting a disk.
4.2 The distance between the midpoint and the ball's center.
4.3 Eliminate the irrelevant midpoints.
4.4 Almost orthogonal flats pairwise intersection.
Chapter 1
Introduction
1.1 The Information Age
Almost 35 years ago, Alvin Toffler [54] published his book "The Third Wave", in which he described three phases of human society's development based on the concept of 'waves', with each wave pushing the older societies and cultures aside. According to Toffler, civilization can be divided into three major phases. The First Wave is the settled agricultural society, which replaced the first hunter-gatherer cultures. The symbol of this age is the hoe, the profile of the wealthy person is the land owner, and battles were typically carried out with swords. The Second Wave is the industrial age society, symbolized by the machine, beginning with the industrial revolution. At this time, the wealthy were the factory owners, and machines (tanks, aircraft, etc.) were used during times of war. The Third Wave is the post-industrial society. Toffler says that since the late 1950s, most countries have been transitioning into the Information age. The symbol now is obviously the computer. The wealthy and powerful people are those that develop or collect the data and sell others the privilege to use it, and one of the main threats in this modern age is the cyber attack.
At the beginning of the 80's, no one could have imagined the significance and power that data would play in everyday life. Today, big data, a large pool of data that can be captured, communicated, aggregated, stored and analyzed, is part of every sector and function of the global economy [36]. The use of big data can create significant value for the world economy, enhancing the productivity and competitiveness of companies and the public sector, and creating a substantial economic surplus for consumers.
Big data challenges.
There are many different definitions of the term 'Big Data'. Generally, the term refers to a massive amount of data, the size of which is beyond the ability of typical database software tools to capture, store, manage, and analyze. A popular characterization is the three V's: Volume, Variety and Velocity. By Volume, we usually mean the sheer size of the data, which is of course the major challenge, and the most easily recognized. By Variety, we mean heterogeneity of data types, representations, and semantic interpretations. Velocity refers both to the rate at which data arrives and to the speed at which it needs to be processed; for example, performing fraud detection at a point of sale. Another important feature of big data is not only the huge number of items but also their 'wideness', i.e., each item maintains many fields. Hence, it is common to describe these items as objects in a high-dimensional space. High-dimensional objects have a number of unintuitive properties that are sometimes referred to as the 'curse of dimensionality' [6]. Multiple dimensions are hard to think in and impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. One manifestation of the 'curse' is that in high dimensions, almost all pairs of points are equally far away from one another and almost any two vectors are nearly orthogonal. Another manifestation is that high-dimensional functions tend to have more complex features than low-dimensional functions, and are hence harder to estimate. Moreover, in order to obtain a statistically sound and reliable result, e.g., to estimate multivariate functions with the same accuracy as functions in low dimensions, we require that the sample size grow exponentially with the dimension.
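The near-orthogonality of random high-dimensional vectors is easy to check empirically. The following Python sketch (our own illustration, not part of the thesis) estimates the average |cos| of the angle between two independent random directions in R^d; in high dimension it is close to zero, i.e., the directions are nearly orthogonal:

```python
import math
import random

def random_unit_vector(d, rng):
    # A direction uniform on the (d-1)-sphere: normalise a Gaussian vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(d, trials, seed=0):
    # Average |cos(angle)| between pairs of independent random directions.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = random_unit_vector(d, rng)
        v = random_unit_vector(d, rng)
        total += abs(sum(a * b for a, b in zip(u, v)))
    return total / trials

low_dim = mean_abs_cosine(3, 2000)     # roughly 0.5 in R^3
high_dim = mean_abs_cosine(300, 2000)  # shrinks like 1/sqrt(d)
```

In R^3 the average is about 0.5, while in R^300 it drops to roughly sqrt(2/(300π)) ≈ 0.05, illustrating the first manifestation of the 'curse'.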
The emerging field of data science relates these aspects of big data and provides them with different solutions, which are fundamentally multidisciplinary. Platform engineers try to build parallel data processing platforms whose goal is to make it easy for developers of big data applications to write programs, as they would in a single-node computational environment, and to rapidly deploy those applications on tens or hundreds of nodes. From the machine learning and understanding point of view, the challenge is to develop classification and clustering applications in a wide spectrum of domains. The privacy and security challenges encourage the development of technologies and policies for protecting, and allowing people to retain control over, their data [1]. In the algorithmic field, scientists design algorithms that can deal with very large volumes of data. These include parallel implementations of a range of known algorithms, including matrix computations, as well as statistical operations like regression and optimization methods [35]. There is a need to produce new algorithms for encoding, comparing, and searching massive data sets. Another algorithmic aspect is the need for sublinear algorithms. Here, the goal is to develop powerful algorithmic sampling techniques which allow one to estimate parameters of the data by viewing only a small portion of it. Such parameters may be combinatorial, algebraic, or distributional [21, 46, 47].
1.2 Big Data Abstraction
This thesis examines another aspect of the big data challenge, which is part of the data modeling idea. Modeling is the process of creating a simpler representation of something, or the design of representative structures at various levels of abstraction, from conceptual to physical. In big data abstraction, we try to capture the essence of the data using a succinct representation which maintains the main features of the data. We identify three different aspects of the abstraction, Exploration, Interpolation and Extrapolation, each approached in a different manner. In the Exploration part, we deal with a specific data aggregation task: establishing communication in the scope of two-dimensional wireless sensors that are randomly scattered over the unit disk, where our goal is to cover the disk with minimal communication resources. In the Interpolation part, we relate to more general data: multivariate functions that can be approximated using polynomials. The interpolation provides a concise representation as well as estimates of unknown values within the range of the known points. Finally, in the Extrapolation phase, we model general high-dimensional data using algebraic objects in a way that helps us estimate, beyond the original observation range, unknown features of one object on the basis of its relationship with another object. Below we give a short description of these three parts and summarize our main results, while in the following three chapters (Chapters 2, 3, and 4) we describe the entire study in detail. Finally, in Chapter 5, we give our concluding remarks.
1.2.1 Big Data Exploration
Sensor networks and the IoT are a growing source of big data. Communication network coverage is a central issue in network design, since it is what allows accumulation of the data. This part of the research is focused on establishing connected computer networks that cover the entire relevant space using minimal communication resources. Consider the task of maintaining connectivity in a wireless network where the network nodes are equipped with directional antennas. Nodes correspond to points on the unit disk and each uses a directional antenna covering a sector of a given angle α.
To calculate the width required for the connectivity problem, we have to find the necessary and sufficient conditions on α that guarantee connectivity when the antennas' locations are uniformly distributed and the orientation of each antenna's sector is either random or fixed. We show that when the number of network nodes is big enough, the required α approaches zero. Specifically, on the unit disk, assuming uniform orientation, it holds with high probability that the threshold for connectivity is α = Θ(((log n)/n)^{1/4}). This is shown by the use of Poisson approximation and geometrical considerations. Moreover, when the model is relaxed, assuming that the antennas' orientation is directed towards the center of the disk, we demonstrate that α = Θ((log n)/n) is a necessary and sufficient condition. This work, described in the following chapter, was also presented at the 20th International Colloquium on Structural Information and Communication Complexity (SIROCCO) [14] and published in the journal Theoretical Computer Science [15].
1.2.2 Big Data Interpolation
Given a large set of data measurements, in order to identify a simple function that captures the essence of the data, we suggest representing the data by an abstract function, in particular by polynomials. We interpolate the data points to define a polynomial that represents the data succinctly. The interpolation is challenging, since in practice the data can be noisy and even Byzantine, where Byzantine data represents an adversarial value that is not limited to being close to the correct measured data. We present two solutions: one extends the Welch-Berlekamp technique [56] to eliminate the appearance of outliers in the case of multidimensional data, and copes with discrete noise and Byzantine data; the other is based on the Arora and Khot [4] method for handling noisy data, generalizing it to the case of multidimensional noisy and Byzantine data. The full details of this part are presented in Chapter 3, while an extended abstract of this study was published in the International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics (ALGOSENSORS) [13]; the study is now under review at the ACM Transactions on Knowledge Discovery from Data.
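To give a concrete feel for the first ingredient, the sketch below (our own illustration, with function names of our choosing) implements the classical single-variable Welch-Berlekamp decoder over a small prime field: given n ≥ k + 2e + 1 evaluations of a degree-at-most-k polynomial, at most e of which are arbitrarily (Byzantine) corrupted, it recovers the polynomial by solving the linear system Q(x_i) = y_i·E(x_i) and dividing. The thesis's multidimensional and noisy extensions go well beyond this sketch.

```python
PRIME = 97  # small prime field for the demonstration

def solve_mod(A, b, p=PRIME):
    # Gaussian elimination over GF(p); returns one solution of A x = b
    # (free variables are set to zero).
    n, m = len(A), len(A[0])
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    piv_cols, r = [], 0
    for c in range(m):
        piv = next((i for i in range(r, n) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], p - 2, p)
        M[r] = [(x * inv) % p for x in M[r]]
        for i in range(n):
            if i != r and M[i][c]:
                f = M[i][c]
                M[i] = [(x - f * y) % p for x, y in zip(M[i], M[r])]
        piv_cols.append(c)
        r += 1
    x = [0] * m
    for i, c in enumerate(piv_cols):
        x[c] = M[i][m]
    return x

def poly_eval(coeffs, x, p=PRIME):
    # Evaluate a polynomial (low-order coefficients first) at x over GF(p).
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def poly_div(num, den, p=PRIME):
    # Exact polynomial division over GF(p) (low-order coefficients first).
    num, out = num[:], [0] * (len(num) - len(den) + 1)
    inv = pow(den[-1], p - 2, p)
    for i in range(len(out) - 1, -1, -1):
        out[i] = (num[i + len(den) - 1] * inv) % p
        for j, d in enumerate(den):
            num[i + j] = (num[i + j] - out[i] * d) % p
    return out

def berlekamp_welch(xs, ys, k, e, p=PRIME):
    # Recover a degree-<=k polynomial from n >= k + 2e + 1 evaluations,
    # at most e of which are Byzantine: solve Q(x_i) = y_i * E(x_i) with
    # deg Q <= k + e and E monic of degree e, then return P = Q / E.
    rows, rhs = [], []
    for x, y in zip(xs, ys):
        q_part = [pow(x, j, p) for j in range(k + e + 1)]
        e_part = [(-y * pow(x, j, p)) % p for j in range(e)]
        rows.append(q_part + e_part)
        rhs.append((y * pow(x, e, p)) % p)
    sol = solve_mod(rows, rhs, p)
    Q, E = sol[: k + e + 1], sol[k + e + 1:] + [1]
    return poly_div(Q, E, p)
```

For example, with k = 2 and e = 2 one needs n = 7 points; corrupting two of the seven values of 5x² + 3 over GF(97) still recovers the coefficient list [3, 0, 5].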
1.2.3 Big Data Extrapolation
Consider a high-dimensional data set in which, for every data point, there is incomplete information. Each object in the data set represents a real entity, which is described by a point in high-dimensional space. We model the lack of information for a given object as an affine subspace of dimension k in R^d.
Our goal in this part is to find clusters of objects. The main problem is to cope with partial information. We studied a simple algorithm, which we call Data clustering using flats minimum distances, under the following assumptions: (i) There are m clusters. (ii) Each cluster is modeled as a ball in R^d. (iii) All k-dimensional affine subspaces that belong to the same cluster intersect the ball of the cluster. (iv) Each k-dimensional affine subspace that belongs to a cluster is selected uniformly among all k-dimensional affine subspaces that intersect the cluster's ball. A data set that satisfies these assumptions will be called separable data. Our suggested algorithm calculates pair-wise projections of the data. We use probabilistic considerations to prove the algorithm's correctness. These probabilistic results are of independent interest, and can serve to better understand the geometry of high-dimensional objects. Chapter 4 concludes this work; an abstract of this work was published in The Haifa 2nd Security Research Seminar 2015, and the whole study is under review at the IEEE Transactions on Big Data.
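The pair-wise projection step can be illustrated for two flats. The sketch below (our own illustration; the function names are ours, and it uses plain least squares via the normal equations rather than the thesis's full algorithm) computes the closest pair of points between two affine flats x = p + B·t and returns their midpoint together with the flats' distance:

```python
def solve_lin(A, b):
    # Solve a small square linear system by Gauss-Jordan elimination
    # with partial pivoting.
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def flat_point(p, B, t):
    # The point p + B t on the flat through p spanned by the columns of B.
    return [pi + sum(B[i][j] * t[j] for j in range(len(t)))
            for i, pi in enumerate(p)]

def closest_midpoint(p1, B1, p2, B2):
    # Minimise ||(p1 + B1 t1) - (p2 + B2 t2)|| via the normal equations
    # of the stacked system [B1 | -B2] t = p2 - p1 (assumes the combined
    # direction vectors are linearly independent), and return the midpoint
    # of the closest pair together with the distance between the flats.
    d = len(p1)
    cols = [[B1[i][j] for i in range(d)] for j in range(len(B1[0]))] \
         + [[-B2[i][j] for i in range(d)] for j in range(len(B2[0]))]
    rhs = [p2[i] - p1[i] for i in range(d)]
    G = [[sum(u[i] * v[i] for i in range(d)) for v in cols] for u in cols]
    g = [sum(u[i] * rhs[i] for i in range(d)) for u in cols]
    t = solve_lin(G, g)
    k1 = len(B1[0])
    q1 = flat_point(p1, B1, t[:k1])
    q2 = flat_point(p2, B2, t[k1:])
    mid = [(a + b) / 2 for a, b in zip(q1, q2)]
    dist = sum((a - b) ** 2 for a, b in zip(q1, q2)) ** 0.5
    return mid, dist
```

For two skew lines in R^3, one through the origin along e1 and one through (0,0,1) along e2, the midpoint is (0, 0, 1/2) at distance 1; midpoints of flats from the same cluster concentrate inside the cluster's ball.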
Chapter 2
Probabilistic Connectivity Threshold for Directional Antenna Widths
2.1 Introduction
Communication among wireless devices is of great interest in current wireless technology, where devices are part of sensor networks, mobile ad-hoc networks, RFID systems that take part in the emerging ubiquitous computing, and even satellite networks. These communication networks are usually extremely dynamic, with devices frequently joining and leaving (or crashing), and therefore require probabilistic techniques and analysis. Imagine, for example, sensor networks that use directional antennas (saving energy and increasing communication capacity) among sensors that should be connected even though they are deployed by an airplane that drops them from the air (just as in a smart dust scenario). What is the density of sensors needed to ensure their connectivity? Is there a way to renew connectivity after some portion of the sensors stops functioning, maybe by deploying only an additional fraction, uniformly distributed in the area, with random orientation of the antennas? In this work, we try for the first time to suggest and analyze ways to ensure connectivity in such probabilistic scenarios. Namely, we have studied the problem of arranging randomly scattered wireless sensor antennas in a way that guarantees the connectivity of the induced communication graph. The main challenge here is to minimize energy consumption while preserving node connectivity. In order to save power, increase transmission capacity and reduce interference [33], the unbounded-range antennas do not communicate information in all directions but rather inside a wedge-shaped area. Namely, the coverage area of a directional antenna of angle α located at point p is two opposite sectors of angle α of the unit disk centered at p (see Figure 2.1). Throughout the chapter we will call α the communication angle. The smaller the angle, the better in terms of energy saving.
Knowing nothing about the future positioning of the antennas, each antenna may be directed to a random direction that may stay fixed forever. Therefore, we wish to find the minimum α > 0 so that, no matter what finite set of locations the antennas are given, with high probability they can communicate with each other. Our goal is to specify necessary and sufficient conditions for the width of wireless antennas that enable one to build a connected communication network when the antennas' locations and directions are randomly and uniformly chosen. Throughout this chapter, we refer to an undirected graph where the nodes are the antennas, and two nodes are connected by an edge if and only if their corresponding antennas are located in each other's transmission area. However, our calculations hold for the directed case as well. Specifically, Theorems 2 and 3 hold for both cases, and the result proven by Theorems 4 and 5 also implies a connectivity threshold for the directed graph case. Previous results that handle wireless directional networks [2, 12] assume coordinated locations and orientations for the antennas. They show that a connected network can be built with antennas of width α = π/3. The same model assumptions were used by [5] to study graph connectivity in the presence of interference, and in [31] to optimize the transmission range as well as the hop-stretch factor of the communication network. A different model, of a directed graph of directional antennas with bounded transmission range, was studied in [11, 16]. In contrast to the above worst-case approaches, to the best of our knowledge, we consider for the first time the connectivity problem from a probabilistic perspective. Namely, we are interested in the minimal communication angle that implies high probability for the graph to be connected, as a function of the number of nodes.
This approach significantly reduces the required communication angle and is more general in the sense that we do not need directing procedures such as those employed in the algorithms of [2, 5, 12]. The probabilistic setting of the problem is related to other research in the field of continuum percolation theory [37]. The model for the points here is a Poisson point process, and the focus is on the existence of a connected component under different models of connections. For example, [57] studied the number of neighbors that implies connectivity. Papers [23, 40] focus on the minimum number r such that two points are connected if and only if their metric distance is ≤ r. In [30] the authors generalized the results in [23, 40] and proved that for a fractal in R^d, it holds with high probability that r ≈ ((log n)/n)^{1/d}, where ≈ means that the ratio of the two quantities is between two absolute constants. Our main results (summarized in Theorems 2, 3, 4 and 5) discuss two different models. The first is related to the case where all the antennas are directed to one reference point
(specifically, we used the center of a disk). The second model generalizes the results by dealing with randomly chosen locations and directions. Assuming that the number of nodes is big enough, we show in both cases that with high probability the threshold approaches zero. We believe that these results are important both for their combinatorial and geometric perspectives and for their implications on the design of wireless networks. This work was published in [14, 15].
2.2 Preliminaries
Let P = {p1(x,y,θ),..., pn(x,y,θ)} be a set of n points (or nodes). Each point's location (x,y) is chosen independently from the uniform distribution over the unit disk D (or over the unit disk boundary ∂D) in the plane. The antenna direction θ of each point may be fixed or chosen independently and uniformly at random from [0,π], and the antenna range is unbounded. Each point represents a communication station, modeled by a pair of opposite wedges (since, assuming unbounded range, the sector becomes a wedge) of angle α with direction θ at each node (see Figure 2.1).
Fig. 2.1 Fixing at each point pi two opposite wedges of angle α with direction θ.
Given nodes u and v with communication angle α, we say that u sees v if v lies in the coverage area (on the intercepted arc) of u.
Definition 1 (The communication graph) The communication graph G(P,E,α) is an undi- rected graph that consists of the set P, its communication angle α and the set of edges E = {(u,v)|u sees v and v sees u}. (Since E is defined by P(x,y,θ) and α, we omit E in the sequel from the graph notation, i.e., G(P,E,α) = G(P,α)).
Figure 2.2 illustrates examples of the communication graph over the disk and over the disk's boundary. Our model assumes that the communication between two nodes is limited only by the angle aperture (and not by the distance); hence, when α = 2π it is clear that the induced
Fig. 2.2 The communication graph G = (V,E) = ({u,v,w},{(u,v)}) on the disk (on the left) and on the boundary (on the right). Note that G is not connected.
graph is connected with probability 1. In contrast, when α = 0 the probability of achieving a connected graph equals zero. While increasing α, the probability that the induced graph is connected monotonically increases.
Definition 2 (The critical communication angle) Given the set P, the critical communica- tion angle αˇ of the graph G(P,α) is the minimal angle such that G will be connected with probability 1 − o(1) as n tends to infinity iff α ≥ αˇ .
Having these definitions, we are ready to present the main results obtained in this chapter:
1. Let P = {p1(x,y,θ),..., pn(x,y,θ)} be a set of n points such that their location (x,y) is chosen independently from the uniform distribution over the unit disk D, and their orientation θ is directed to the center of the disk. Then, there exist two constants C > c > 0, such that for all ε > 0 and for any sufficiently large n (depending on ε):
(a) if α > C (log n)/n, then the graph G(n,α) is connected with probability at least 1 − ε;
(b) if α < c (log n)/n, then the graph G(n,α) is connected with probability at most ε.
2. Let P = {p1(x,y,θ),..., pn(x,y,θ)} be a set of n points such that their location (x,y) is chosen independently from the uniform distribution over the unit disk D, and their orientation θ is chosen independently and uniformly over [0,π]. Then, there exist two constants C > c > 0, such that for all ε > 0 and for any sufficiently large n (depending on ε):

(a) if α > C ((log n)/n)^{1/4}, then the graph G(n,α) is connected with probability at least 1 − ε;
(b) if α < c ((log n)/n)^{1/4}, then the graph G(n,α) is connected with probability at most ε.

Remark 1 From the calculations in the proofs of Theorems 2, 3, 4, and 5, one can choose any c < 1/2 and C > π^2.
Remark 2 The connectivity threshold in the random direction setting (Theorems 4 and 5) holds for an undirected graph. We also prove that for a directed graph, α = Θ(((log n)/n)^{1/3}) is the necessary and sufficient threshold for graph connectivity, though not necessarily for graph strong connectivity.
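The threshold behaviour can be probed numerically. The sketch below (our own illustration, not part of the proofs; all function names are ours) samples n antennas with uniform positions and orientations, builds the mutual-visibility graph for a given communication angle α, and estimates the probability that the graph is connected:

```python
import math
import random

def sees(u, v, alpha):
    # u = (x, y, theta).  u sees v iff the direction from u to v lies
    # within alpha/2 of u's antenna axis, modulo pi (the coverage area
    # is a pair of opposite wedges of angle alpha).
    phi = math.atan2(v[1] - u[1], v[0] - u[0])
    d = (phi - u[2]) % math.pi
    return d <= alpha / 2 or d >= math.pi - alpha / 2

def random_node(rng):
    # Uniform point on the unit disk, with a uniform direction in [0, pi).
    r = math.sqrt(rng.random())
    t = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(t), r * math.sin(t), rng.uniform(0.0, math.pi))

def is_connected(nodes, alpha):
    # Depth-first search over the undirected graph with an edge {u, v}
    # whenever u sees v and v sees u.
    n = len(nodes)
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j not in seen and sees(nodes[i], nodes[j], alpha) \
                    and sees(nodes[j], nodes[i], alpha):
                seen.add(j)
                stack.append(j)
    return len(seen) == n

def connectivity_rate(n, alpha, trials, seed=1):
    # Empirical probability that the communication graph is connected.
    rng = random.Random(seed)
    return sum(is_connected([random_node(rng) for _ in range(n)], alpha)
               for _ in range(trials)) / trials
```

With n = 60 nodes, a wide angle such as α = 3.0 yields a connected graph in almost every trial, while a very narrow angle such as α = 0.05 almost never does, consistent with the monotone threshold behaviour described above.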
2.2.1 Notations
• The intercepted arc of a point u with angle α is the part of a circle that lies between the two rays of α that intersect it.
• Let arc(u) be the intercepted arc of u and let |arc(u)| denote its length.
• Let wedge(u) be the coverage area of u.
• Let bisector(u) be the line that passes through the apex of wedge(u), that divides it into two equal angles.
• Let ∗ be an equivalence relation such that for a point u ∈ P, u∗ is the antipodal point of u (i.e., u and u∗ are opposite through the center).
• Throughout, we use the term w.h.p. as a shortcut for “with probability 1 − o(1) as n tends to infinity.”
• In the following, we sometimes use the term "random point" instead of "random variable."
• Let D denote the unit disk (in R^2) and let ∂D denote its boundary.
• Let ∂D2 be the space of pairs {u, u*} of antipodal points on the unit disk boundary ∂D.
• Let X = {X1,...,Xn} be a set of uniform random variables defined over D.
• Let Y = {Y1,...,Yn} be a set of uniform random variables defined over ∂D.
• Let Z = {Z1,...,Zn} be a set of uniform random variables defined over ∂D2.
• Let (Y1,R1),...,(Yn,Rn) be pairs of uniform random variables defined over ∂D × [0,1].
• Let (Z1, χ1),...,(Zn, χn) be pairs of uniform random variables defined over ∂D2 × {0,1}.
2.2.2 Probability and the relation between Uniform and Poisson distributions
Throughout this chapter we use standard tools from continuum percolation and refer the reader to [37, 38] for a general introduction of the topic. The problem of finding the critical communication angle can be translated into the mathematical framework of “balls and bins”. We have n balls that are thrown independently and uniformly at random into b bins. The “balls” represent the set P of n nodes distributed over the disk (the disk’s boundary), and the “bins” are slices of the disk area (boundary) defined by the coverage area (by the intercepted arc) of the nodes. Note that the bins in this setting are not disjoint, and we will refer to this later. The distribution of the number of balls in a given bin is approximately a Poisson variable with a density parameter λ = n/b. Moreover, [38] shows (Chapter 5, Theorems 5.6 − 5.10) that the joint distribution of the number of balls in all the bins is well approximated by assuming the load at each bin is an independent Poisson random variable with λ = n/b (the precise definition and properties of the Poisson distribution also appear in that chapter). Let us call the scenario in which the number of balls in the bins are taken to be independent Poisson random variables with mean λ = n/b the Poisson case, and the scenario where n balls are thrown into b bins independently and uniformly at random the exact case. We justify the use of Poisson approximation instead of calculating the exact uniform case by using the following Theorem:
Theorem 1 [38] Let Λ be an event whose probability is either monotonically increasing or monotonically decreasing in the number of balls. If Λ has probability p in the Poisson case, then Λ has probability at most 2p in the exact case.
The theorem and its proof appear in [38] as Corollary 5.11. Another basic probability tool we are going to use is the "coupon collector" principle. Using Theorem 5.13 of [38], we get that the number of balls that need to be thrown until all b bins have at least one ball is, w.h.p., b log b. In the sequel, we will use this result in a variety of settings.
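The b log b bound is easy to verify by simulation. The sketch below (our own illustration) throws balls into b bins until every bin is occupied and compares the empirical mean with the coupon-collector expectation b·H_b, which is asymptotically b log b:

```python
import random

def throws_until_full(b, rng):
    # Throw balls uniformly into b bins until no bin is empty;
    # return the number of throws used.
    hit, empty, throws = [False] * b, b, 0
    while empty:
        throws += 1
        i = rng.randrange(b)
        if not hit[i]:
            hit[i] = True
            empty -= 1
    return throws

def mean_throws(b, trials, seed=2):
    # Average number of throws over independent repetitions.
    rng = random.Random(seed)
    return sum(throws_until_full(b, rng) for _ in range(trials)) / trials

b = 200
m = mean_throws(b, 300)
# The exact expectation is b * H_b = b * (1 + 1/2 + ... + 1/b),
# roughly 1176 for b = 200, close to b log b ~ 1060.
```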
2.2.3 Covering and Connectivity
The final issue we would like to address in these preliminaries is the connection between the problem of covering the disk (or the disk boundary) and the problem of ensuring a connected graph. The two problems are defined below:
Problem 1 Covering problem: We would like to find the minimal communication angle α¯ such that the coverage area (intercepted arc) induced by the set P defined over D (over ∂D) covers the whole disk (boundary), i.e., ⋃_{u∈P} wedge(u) = D (or ⋃_{u∈P} arc(u) = ∂D), with high probability.
Problem 2 Connectivity problem: We would like to find the minimal communication angle αˇ , such that the nodes’ coverage area (intercepted arc) induces a connected graph with high probability.
The disk cover is a necessary but not a sufficient condition for the graph to be connected, as illustrated in Figure 2.3. The relation between the covering and the connectivity problems
Fig. 2.3 The disk is covered by the nodes; however, the induced graph is not connected (the graph contains only two edges).

is given by the following lemma, which is explicitly proven as Lemma 2.2 of [30].
Lemma 1 [30] Given α¯, the minimal angle that induces a cover with high probability, αˇ = 3α¯ is the expected connectivity threshold.
2.3 Centered Angles
In this section, we consider the case where the antennas' communication angle α is directed to the center of the disk. We define three different models, prove their equivalence, and use one of them to resolve the connectivity threshold. The diagram below illustrates the way we prove that the three models are equivalent up to O(·) notation. The equivalence of these three models implies the following observation:
• Let αˇ_{G(Y,α)} be the critical α of G(Y,α); then there exists a constant c > 0 such that the critical α of G(X,α) is αˇ_{G(X,cα)} = Θ(αˇ_{G(Y,α)}).

We prove each of the three equivalences separately, producing a new set of points from the given one, as follows:
G(X,Θ(α)) —(Lemma 2)→ G(Z,Θ(α)) —(Lemma 3)→ G(Y,Θ(α)) —(Lemma 4)→ G(X,Θ(α))
If G(X,α) is connected ⇒ G(Z,α) is connected
Definition 3 Let φ : D → ∂D² be a function defined as follows. φ projects every point u ∈ D to the antipodal pair {ū, ū*} ∈ ∂D² located on the intersection of ∂D and the line going through u and o (the center of the disk). Note that φ is not defined over the center of the disk, i.e., the center is not transformed to any place on ∂D. This does not affect the correctness of the function, since the probability that a random point falls exactly on the center is 0.
Claim 1 Given the set {X1,...,Xn}, one can produce the set {Z1,...,Zn} by Zi ≡ Xi ∘ φ⁻¹, which implies that Zi is independently identically uniformly distributed over ∂D².
Proof The claim is immediate from the definitions. Precisely, the set {X1,...,Xn} is defined over D, i.e., Xi : D → R. Similarly, Zi : ∂D² → R. One can produce the set {Z1,...,Zn} by composing: first apply φ⁻¹ : ∂D² → D, then Xi : D → R, which is Zi ≡ Xi ∘ φ⁻¹.
Claim 2 φ defines for every edge (u,v) ∈ G(X,α) a connected path Pū,v̄ = ((ū,v̄*),(v̄*,v̄),(v̄,ū*),(ū*,ū)) ∈ G(Z,α).
Fig. 2.4 The nodes u,v ∈ G(X,α) and their projected nodes ū, ū*, v̄, v̄* ∈ G(Z,α) defined by φ. Note that we did not sketch wedge(ū*) and wedge(v̄) due to visualization considerations.
Proof The existence of the edges (ū,ū*), (v̄,v̄*) is immediate from the antipodal definition. We show that (ū,v̄*) is in G(Z,α); the proof for (v̄,ū*) is symmetric. W.l.o.g. we assume that dist(ū*,v) ≤ dist(ū,v) and dist(v̄*,v) ≤ dist(v̄,v) (see Figure 2.4). Since ū lies on bisector(u), we get that wedge(u) ⊆ wedge(ū). Notice that (u,v) ∈ G(X,α) implies that u sees v; hence, ū sees v. Since v and v̄* share the same bisector, it must hold that ū sees v̄*, i.e., (ū,v̄*) is in G(Z,α). Note that for any nodes u,v in G(Z,α) (as well as in G(X,α) or in G(Y,α)), if u sees v then v sees u, since all the angles are directed to the center.
Lemma 2 Given the communication graphs G(X,α) and G(Z,α) such that Zi = Xi ∘ φ⁻¹, if G(X,α) is connected, then G(Z,α) is connected.
Proof Given that G(X,α) is connected, for every two nodes u,v ∈ G(X,α) there exists a path Puv = ((u,x),...,(y,v)) ∈ G(X,α) that connects them. To show that G(Z,α) is connected, we provide a path Pū,v̄ by replacing every edge (x,y) ∈ Puv with the path Px̄,ȳ ∈ G(Z,α) given by Claim 2, i.e., Pū,v̄ = (Pū,x̄,...,Pȳ,v̄).
If G(Z,α) is connected ⇒ G(Y,3α) is connected
Definition 4 Let ϕ : ∂D² × {0,1} → ∂D be a function that maps the pair {u,u*} ∈ ∂D² and a bit b to one node u′ ∈ ∂D from the pair, e.g., ϕ({u,u*},0) = u and ϕ({u,u*},1) = u*.
Claim 3 Given the set {(Z1,χ1),...,(Zn,χn)}, one can produce the set {Y1,...,Yn} by Yi ≡ (Zi,χi) ∘ ϕ⁻¹, which implies that Yi is independently identically uniformly distributed.
(The proof is similar to the proof of Claim 1.)
Lemma 3 Given the communication graphs G(Z,α) and G(Y,cα), for a constant c ≥ 3, such that Yi = (Zi,χi) ∘ ϕ⁻¹: if G(Z,α) is connected, then w.h.p. G(Y,cα) is connected.
Proof We will show that every covered arc in G(Z,α) is covered in G(Y,3α). Given the nodes u,u* ∈ G(Z,α), ϕ produces (w.l.o.g.) the node u′ = u ∈ G(Y,3α). arc(u) is covered in G(Y,3α) since u′ ∈ G(Y,3α). However, since u* ∉ G(Y,3α), we need to show that arc(u*) is covered in G(Y,3α). Partition arc(u′) and arc(u*) into two bins each, denoted (from left to right) by bℓ, br and b′ℓ, b′r, respectively (see Figure 2.5). Note that if there exists a node ℓ at bℓ, then b′r is covered, and if there exists a node r at br, then b′ℓ is covered. We prove that ℓ and r do exist using probabilistic considerations.
Fig. 2.5 Transforming the points u,u* ∈ G(Z,α) to the point u′ ∈ G(Y,3α). In G(Y,3α) the nodes ℓ,r cover arc(u*) = b′ℓ ∪ b′r.
After throwing n balls at random into b bins, the distribution of the number of balls in a given bin is approximately Poisson with mean n/b (see Section 2.2.2). The balls in our case are the nodes and the bins are the arcs. Hence, the density of a given bin (i.e., arc(u)) in G(Z,α) is λZ = #balls/#bins = 2n/(π/α) = 2nα/π. The density of a given bin of G(Y,3α) is λY = n/(π/(3α)) = 3nα/π. To finish the proof, we argue that ℓ and r do exist in G(Z,α); otherwise, u is connected only to u*, which implies that the graph is not connected (or contains only one antipodal pair). Since λZ < λY, we get that w.h.p. in G(Y,3α) there exists a node at b′ℓ and at b′r, which implies a cover.
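The Poisson approximation invoked here is easy to check empirically. The following self-contained sketch (our own illustration, not part of the thesis) throws n balls into b bins and compares the observed fraction of empty bins with the Poisson prediction e^(−n/b):

```python
import math
import random

def empty_bin_fraction(n_balls, n_bins, trials, seed=0):
    # Throw n_balls balls uniformly into n_bins bins and return the
    # empirical fraction of empty bins, averaged over independent trials.
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        counts = [0] * n_bins
        for _ in range(n_balls):
            counts[rng.randrange(n_bins)] += 1
        empty += sum(1 for c in counts if c == 0)
    return empty / (trials * n_bins)
```

With n/b = 2, for example, the Poisson approximation predicts that a fraction e^(−2) ≈ 0.135 of the bins are empty, which the simulation matches closely.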
If G(Y,α/2) is connected ⇒ G(X,α) is connected
Definition 5 Let ψ : ∂D × [0,1] → D be a function defined as follows. For every point ū ∈ ∂D, ψ maps a radius ru to the point u ∈ D located on the line going through ū and o, such that dist(u,ū) = √ru.
Claim 4 Given the set (Y1,R1),...,(Yn,Rn), one can produce the set X1,...,Xn by Xi ≡ (Yi,Ri) ∘ ψ⁻¹, which implies that Xi is independently identically uniformly distributed over D.
(The proof is similar to the proof of Claim 1.)
Lemma 4 Given the communication graphs G(Y,α/2) and G(X,α) such that Xi = (Yi,Ri) ∘ ψ⁻¹, and given that G(Y,α/2) is connected, then G(X,α) is connected.
Fig. 2.6 The point ū and its projected point u defined by ψ, illustrated on the left. Note that the distance between the two points is √ru for a uniformly distributed ru. On the right, one can observe that if ū sees v̄, then u sees v.
Proof Let EY and EX denote the edge sets of G(Y,α/2) and G(X,α), respectively; we prove the connectedness dependency by establishing that EY ⊆ EX. Given the nodes ū,v̄ ∈ G(Y,α/2) such that (ū,v̄) ∈ EY, let u,v ∈ D be the nodes produced by applying ψ on ū,v̄, respectively (see Figure 2.6). Due to the geometrical relation between angles and intercepted arcs, we know that |arc(ū)| = α, and α ≤ |arc(u)| ≤ 2α. Since ū lies on bisector(u), we get that arc(ū) ⊆ arc(u). However, v̄ lies in arc(ū) (because (ū,v̄) ∈ EY); hence, u sees v̄. Since v and v̄ share the same bisector, u sees v. Symmetric considerations imply that v sees u; hence, the edge (u,v) is in EX.
2.3.1 Finding the Connectivity Threshold
Given the set Y of n uniformly distributed points on ∂D, such that the angle α of every node is directed to the center, we show that the threshold for G(Y,α) is αˇ(Y) = Θ(log n/n), as presented in the following two theorems.
Theorem 2 (Sufficient condition for connectivity) Given the set (Y,α), there exists a constant c > 0 such that when α ≥ c·(log n)/n, the communication graph G(Y,α) is connected w.h.p.
Proof The proof provides a cover of the disk’s boundary; then by Lemma 1 we achieve the expected connectivity. Given a node u ∈ G(Y,α), we partition arc(u) into three equal-length bins denoted (from left to right) by bℓ, bmid and br. Since |arc(u)| = 2α, the length of each bin is 2α/3. Let ℓ be a node at bℓ, and uℓ be a node that lies at arc(ℓ). The angles are directed to the center; hence, the nodes u, ℓ and uℓ are connected. In the same way, we expand the cover to the left (with relation to arc(u)); see Figure 2.8. The same considerations are valid for the right side of the
Fig. 2.7 The node u with the angle α and its intercepted arc of length 2α. Note that every node v ∈ arc(u) is connected to u, i.e., (u,v) ∈ E.
arc and for u, r and ur, respectively. Note that the existence of ℓ, r, uℓ and ur is an outcome of the connectedness of G(Y,α). By the coupon collector principle (see Section 2.2.2), we know that the number of balls that need to be thrown until all bins have at least one ball with high probability is b·log b, where b is the number of bins. Similarly, when placing n balls, n/log n bins suffice to guarantee that each bin contains at least one ball. Dividing the circumference into cells of length 2α/3, we have 2πr/(2α/3) = 3π/α bins (note that r = 1). Setting 3π/α = n/log n bins, we get that α = O(log n/n), as expected.
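The coupon collector bound used here can be observed directly in simulation. A minimal sketch (our own illustration, not from the thesis):

```python
import random

def coupon_collector_draws(b, rng):
    # Number of uniform draws needed until each of the b bins is hit.
    seen, draws = set(), 0
    while len(seen) < b:
        seen.add(rng.randrange(b))
        draws += 1
    return draws
```

For b = 50 bins the expected number of draws is b·H_b ≈ 225, which is of order b·log b ≈ 196, as the proof requires.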
Fig. 2.8 The left figure demonstrates that the existence of a node at the left side of the intercepted arc (such as ℓ) expands the cover to the right (with respect to u). We partition each cell of length 2α into 3 segments to ensure that the left node ℓ and the right node r exist, expanding the cover to the right and the left, respectively.
Theorem 3 (Necessary condition for connectivity) Given the set (Y,α), there exists a constant c > 0 such that if α < c·(log n)/n, then w.h.p. the induced communication graph G(Y,α) is not connected.
Proof When α < c·(log n)/n, every node induces an intercepted arc of length ≤ log n/n. We will show that there exists an arc that does not contain any node; hence, its antipodal arc is not covered, which yields by Lemma 1 that G(Y,α) is not connected. Let I = {Ii}, i = 1,...,n/(c log n), be a set of n/(c log n) disjoint arc intervals induced by n/(c log n) nodes of G(Y,α). The existence of I is proven in Claim 9 in the Appendix.
The load in every bin i is an independent Poisson random variable Xi with mean λ = n/|I| (see the explanation in Section 2.2.2). Let χi be an indicator random variable that is 1 when the ith bin is empty and 0 otherwise. The density parameter of Xi is λ = n/|I| = n/(n/(c log n)) = c log n. Thus, the probability that a bin i is empty is Pr(χi = 1) = Pr(Xi = 0) = e^(−c log n) = (1/n)^c. Taking the complement, the probability that a bin i is not empty is Pr(χi = 0) = 1 − (1/n)^c. Using the independence property of the Poisson variables, the probability that all the bins contain at least one ball is
Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ|I| = 0) = (1 − (1/n)^c)^(n/(c log n))    (2.1)
When setting 0 < c < 1, we get that lim(n→∞) (1 − (1/n)^c)^(n/(c log n)) = 0, which implies that w.h.p. there exists an empty bin, i.e., an arc fragment that is not covered.
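The limit behind this threshold argument (and the complementary c > 1 regime used in Claim 9) can be checked numerically. A small sketch, our own illustration, evaluated in log-space for stability:

```python
import math

def cover_failure_bound(n, c):
    # Evaluates (1 - n**(-c)) ** (n / (c * log n)), i.e., the probability
    # in Eq. (2.1) that no bin is empty, without floating-point underflow.
    exponent = n / (c * math.log(n))
    return math.exp(exponent * math.log1p(-n ** (-c)))
```

For c < 1 the quantity vanishes as n grows (an empty, uncovered arc exists w.h.p.), while for c > 1 it tends to 1: exactly the dichotomy exploited here and in the Appendix.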
2.4 Random Angle Direction
Given the set P of n uniformly distributed points over the unit disk, we now assume that the direction of the antenna is a uniform random variable θi. Hence, each point can be represented by three parameters (x,y,θ), where x and y indicate the point location, distributed over D, and θ, distributed over [0,π], is the direction of the antenna. Since the problem has three dimensions, it makes sense to transform it from two-dimensional to three-dimensional space. Observe that the set P lies over a torus T in R³, such that the unit disk is swept around an axis of length 2π (all the possible directions¹); see Figure 2.9. In this setting, the points are distributed over the volume of T, which is VT = πr²·2πR = {R = r = 1} = 2π². To achieve a probability space, we normalize this number to be equal to one.
¹Note that in this setting, θ is distributed over [0,2π] instead of [0,π]. However, these are equal in O(·) notation.
Note that the transformation from the planar disk to the 3D torus does not refer to the center of the disk, i.e., the center is not transformed to any place in the torus. This does not affect the correctness of the transformation, since the probability that a random point falls exactly on the center is 0.
Fig. 2.9 The torus T ⊂ R³. The blue circle (with the “minor radius”) is swept around the axis, defining the red circle. The radius R of the red circle (the “major radius”) is the distance from the center of the tube to the center of the torus.
Our goal is to find the minimal angle that suffices to guarantee that the induced communication graph is connected; hence, we would like to find the set of points that induces the minimal coverage area and ensure that these points induce a cover (which in turn yields a connected graph, see Lemma 1). Observing that when a node is located on the boundary and the node’s direction is close to the tangent direction, the coverage area is minimal, we focus on the set B of points defined as follows: B = {(xi,yi,θi) ∈ Tex ∩ P : θi is the tangent direction}, where Tex is the external ring (a.k.a. annulus) of T (see Figure 2.10). Namely, B is the set of points whose location is close to the disk’s boundary and whose direction is close to the tangent direction.
Claim 5 For any constant c ≥ 4 there exists a natural n0 such that for all n ≥ n0, the tangent angle induces a coverage area of size ≤ c(α(n))³.
Proof Calculating the area of the circle’s sector and subtracting the unnecessary area (namely the triangle ∆uot in Figure 2.10) results in (1/2)(2α − sin 2α)r². Since the radius is r = 1 and sin x ≥ x − x³/3! (by Taylor expansion), for any sufficiently large n we get (1/2)(2α − (2α − (2α)³/3!)) ≤ cα³.
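This bound is easy to sanity-check numerically. The sketch below (our own illustration, not part of the thesis) evaluates the segment area (1/2)(2α − sin 2α) and verifies that it stays below (2α)³/12 = (2/3)α³, which is in turn below cα³ for c ≥ 4:

```python
import math

def segment_area(alpha):
    # Area of the circular segment cut from the unit disk by a chord
    # subtending a central angle of 2*alpha: (1/2)(2a - sin 2a) * r^2, r = 1.
    return 0.5 * (2 * alpha - math.sin(2 * alpha))
```

Since sin x ≥ x − x³/6, the area is at most (2α)³/12 = (2/3)α³ for every α > 0, with the gap shrinking like α⁵.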
Claim 6 For any constant c < 2 there exists a natural n0 such that for all n ≥ n0, B lies over a volume VB ≥ c(α(n))².
Fig. 2.10 When the node is located on the boundary and the node’s direction is close to the tangent direction, the coverage area is minimal. We calculate the volume of such points by eliminating the interior torus with r = 1 − α from VT.
Proof We are interested in the points located close to the boundary whose direction is close to the tangent direction. Namely, in the three-dimensional representation, these are the points lying on the external ring of T. Denote the interior torus with radius 1 − α by Tin and the external ring (a.k.a. annulus) by Tex, such that T = Tex ∪ Tin. The volume of Tex is VTex = VT − VTin = 1 − π(1 − α)²·2πR/(2π²) = 1 − (1 − 2α + α²) = 2α − α². We multiply this by α to insert the direction constraint (see Claim 8 below) and achieve VB = 2α² − α³ = Ω((α(n))²). (Note that α is a function of n, such that when n → ∞ then α → 0.)
Each node i = (xi,yi,θi) ∈ B defines a ball (spherical cap) Hi of the set of nodes that can communicate with i: Hi = {(xj,yj,θj) ∈ P ∩ Tex : θj ∈ [θi − α, θi + α]} (i.e., Hi is the 3D shape symmetric to the 2D sector defined by wedge(i)).
Claim 7 For any constant c ≥ 4 there exists a natural n0 such that for all n ≥ n0, the volume of the ball Hi is VHi ≤ c(α(n))⁴.
Proof From Claim 5, the area of wedge(i) is ≤ cα³ (for c ≥ 4). We multiply this by α to insert the direction constraint (see Claim 8) and achieve the expected volume.
For the directed communication settings, we can avoid multiplying by α, since the nodes in the ball Hi may not be directed to the nodes that reside in B. Hence, the volume VHi is ≤ α³, and the connectivity threshold becomes αˇ = Θ((log n/n)^(1/3)).
Claim 8 Given the nodes u,v, each with a communication angle of size α, the range of possible directions that induces adjacency between v and u is α.
Fig. 2.11 The range of possible directions that induces adjacency between v and u is α.
Proof Fix u’s direction and turn v around. We can observe that the range of v’s directions that preserves the adjacency of v and u (i.e., u and v see each other) is [0,α].
Theorem 4 (Sufficient condition for connectivity) Given n nodes that are uniformly distributed on the unit disk, there exists a constant c ≥ 4 such that if α ≥ c·(log n/n)^(1/4), then w.h.p. the induced communication graph is connected.
Proof The set B represents the nodes that induce the minimal communication volume (VHi). To achieve the space cover, we can partition the space into disjoint cells of size VHi and use probabilistic considerations to guarantee that w.h.p. there exists at least one ball in every cell. Claim 7 implies that for a constant c ≥ 4, VHi ≤ cα⁴; hence, we partition the whole volume into 1/(cα⁴) cells. By the coupon collector principle (see Section 2.2.2), since for any α ≥ c·(log n/n)^(1/4) the number of nodes n is bigger than (1/(cα⁴))·log(1/(cα⁴)), it follows that with high probability there is no empty cell.
Theorem 5 (Necessary condition for connectivity) Given n nodes that are uniformly distributed on the unit disk, there exists a constant c > 0 such that if α < c·(log n/n)^(1/4), then w.h.p. the induced communication graph is not connected.
Proof We show that when α = c·(log n/n)^(1/4), the graph G is not connected, since B induces a volume that is not covered w.h.p. From Claim 7 we get that each node induces a ball Hi with volume ≤ α⁴. Claims 10 and 11 argue that there exist √n such disjoint balls. Let H = ⋃ Hi, where Hi ∩ Hj = ∅, be the set of these disjoint balls. To complete the proof, we show that w.h.p. there exists a ball Hi ∈ H that does not contain any node j, i.e., the node i ∈ B is not connected. When throwing n balls (nodes) at random into b bins (the balls of the set H), the distribution of the number of balls in a given bin is approximately Poisson with λ = n/b (see Section 2.2.2).
Let χi denote an indicator random variable that is 1 when the ball Hi is empty and 0 otherwise. Let Xi be a Poisson random variable over the number of nodes in Hi. The density parameter 4 n c of Xi is λ = n/VHi = n/α = clogn/n = logn . Therefore, the probability that a given ball Hi is empty is c c −λ − 2 Pr(χi = 1) = Pr(Xi = 0) = e = e logn = (1/n) log n
Using the independence property of the Poisson variables, we get that the probability that all of the balls in H are not empty is
Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ√n = 0) = Pr(χ1 = 0)·Pr(χ2 = 0)···Pr(χ√n = 0)    (2.2)
= (1 − (1/n)^(c/log² n))^√n    (2.3)
Since lim(n→∞) (1 − (1/n)^(c/log² n))^√n = 0, with high probability there exists an empty ball. Hence, the graph is not connected.
2.5 Discussion
In this study, we have analyzed the connectivity threshold for directional antennas. Our results show that if one can adjust the direction of the antennas, then in order to guarantee network connectivity, the angle should be Θ(log(n)/n). In contrast, if the direction of the antenna is a random variable, then the angle should be Θ((log(n)/n)^(1/4)). This gives a polynomial gap between the two models. One of the simplest ways to increase the capacity of the network is by directional antennas. Our work defines theoretical bounds on how small the angle of the directional antennas can be in order to maintain connectivity. If we compare the classical results on connectivity in the unit disk graph model to our results, we find two main differences. The first difference is that in the unit disk graph model, the minimal graph that maintains the connectivity of the network is the Euclidean minimal spanning tree. With directional antennas, the minimal graph is closer in nature to a Hamiltonian cycle; more accurately, if our points are located on the boundary of a disk, then it is a Hamiltonian cycle. Moreover, our analysis indicates that near the critical angle the network is totally different from the usual network graph implied by the unit disk: while in the unit disk graph model the communication moves over short distances, in the directional antenna model the communication prefers long distances, i.e., from a point p to the antipodal point p*. Throughout the study, we have assumed that the wireless antennas are scattered on a unit disk. We believe that the disk assumption can be relaxed to accommodate a more general and realistic setting; namely, our results hold for any convex fat object with curvature > 0, denoted by S. The fatness constraint means that there are concentric balls Bin ⊂ S ⊂ Bout such that diam(Bin)/diam(Bout) ≥ β, where diam(A) denotes the diameter of A; see Figure 2.12.
Given that the antenna nodes are uniformly distributed on S, the fatness property provides that the number nin of nodes in Bin is proportional to the number nout of points that lie in Bout; namely, there exists a constant c > 0 such that nin = c·nout. The convexity property of S implies that every two nodes u,v that lie in S see each other if each lies in the wedge of the other. We believe that these three properties yield that the connectivity threshold we presented for the unit disk D suffices for S.
Fig. 2.12 The concentric balls Bin ⊂ S ⊂ Bout. If S is fat, then diam(Bin)/diam(Bout) ≥ β.
2.6 Appendix
Claim 9 Let I = {Ii}, i = 1,...,n, be a set of arc intervals of length |Ii| ≤ log n/n. There exist a positive constant c and a natural n0 such that for all n ≥ n0 and for all I, there exists a subset I′ = {Ii_j}, j = 1,...,n/(c log n), of I consisting of n/(c log n) disjoint intervals induced by the nodes of G(Y,α).
Proof Given the set of intervals I, we partition the set into 1/(c|Ii|) disjoint cells. Let I′ ⊆ I denote the set of arcs that fall in the non-neighboring cells of I. By the construction, it holds that I′’s arcs are disjoint. We argue that I′ contains n/(c log n) disjoint fragments. After throwing n balls at random into b bins, the distribution of the number of balls in a given bin is approximately Poisson with mean n/b (see Section 2.2.2). Let Xi be the Poisson random variable over the number of balls in bin i, and let χi denote an indicator random variable that is 1 when the ith bin is empty and 0 otherwise. The density parameter of Xi is λ = n/(1/(c|Ii|)) = c·n·|Ii| = c log n; hence, the probability that a given bin i is empty is Pr(χi = 1) = Pr(Xi = 0) = e^(−λ)·λ⁰/0! = e^(−λ) = (1/n)^c. Taking the complement, the probability that the ith bin contains at least one ball is Pr(χi = 0) = 1 − (1/n)^c. Using the independence property of the Poisson variables, the probability that there is at least one ball in all the bins is
Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ|I′| = 0) = (1 − (1/n)^c)^|I′| = (1 − (1/n)^c)^(n/(c log n))    (2.4)
For every c > 1, lim(n→∞) (1 − (1/n)^c)^(n/(c log n)) = 1, meaning that w.h.p. there exists at least one ball (node of G) in each of the n/(c log n) disjoint arc-interval bins.
Claim 10 There exist a constant c > 0 and a natural n0 such that for all n ≥ n0, there are at least c·n^(1/2) nodes in the set B.
Proof Claim 6 implies that B lies at a volume of VB = Ω(α²). The assumption of the Poisson distribution implies independence between parts of the volume; hence, we can multiply this volume by n, achieving that there exists a constant c > 0 such that the number of points in this volume is n·c·α² = n·c·(log n/n)^(1/2) = c·n^(1/2)·√(log n) = Ω(n^(1/2)).
Claim 11 Given the volume VB = α² = √(log n/n), partitioned into √n balls of volume VHi = α⁴ = log n/n, there exist a constant c > 0 and a natural n0 such that for all n ≥ n0, with high probability there is a subset of at least c·√n disjoint balls induced by the graph nodes.
Proof After throwing n balls at random into b bins, the distribution of the number of balls in a given bin is approximately Poisson with mean n/b (see Section 2.2.2). Let Xi be a Poisson random variable with parameter λ = c√n·VHi = c√n·(log n/n) = c·log n/√n, and let χi be an indicator that is 1 when the ith bin is empty and 0 otherwise. The probability that a given bin i is empty is Pr(χi = 1) = Pr(Xi = 0) = e^(−λ) ≤ (1/n)^(c/√n).
Taking the complement, the probability that the ith bin is not empty is Pr(χi = 0) = 1 − (1/n)^(c/√n). The probability that all bins are not empty is
Pr(χ1 = 0 ∩ χ2 = 0 ∩ ... ∩ χ√n = 0) = (1 − (1/n)^(c/√n))^√n = (1 − n^(−c/√n))^√n    (2.5)
Since for every c, lim(n→∞) (1 − n^(−c/√n))^√n = 0, w.h.p. our chosen O(√n) cells are not empty, i.e., there exist nodes in the graph that induce these O(√n) disjoint balls.
Chapter 3
Big Data Interpolation using Functional Representation
3.1 Introduction
Consider the task of representing information in an error-tolerant way, such that it can be reconstructed even if it contains noise or even if the data is partially corrupted and destroyed. In this chapter we present the concept of data interpolation in the scope of data aggregation and representation, as well as the new big data challenge, where abstraction of the data is essential in order to understand the semantics and usefulness of the data. For example, a sensor network may interact with the physical environment, while each node in the network may sense the surrounding environment (e.g., temperature, humidity, etc.). The environmentally measured values should be transmitted to a remote repository or remote server. The concept of ‘smart dust’ [28], where plenty of tiny devices are scattered around, is a motive for representing information in a concise manner. Note that the environmental values usually contain noise, and there can be malicious inputs, i.e., part of the data may be corrupted. In contrast to distributed data aggregation, where the resulting computation is a function such as count, sum and average (e.g., [20, 27, 43]), in distributed data interpolation our goal is to represent every value of the data by an abstracting function. Our computational model consists of sampling the data and estimating the missing information using polynomial manipulations. Our motivation comes from big data systems management, where big data abstraction becomes one of the most important tasks in the presence of the enormous amount of data currently produced. The popular abbreviation ‘VVV’ summarizes the main challenges that arise while handling big data: Volume, Velocity and Variety (see e.g. [8]). Volume emphasizes the big mass of information that is clearly the major challenge, and is the one that is most easily recognized.
Velocity refers to both the rate at which data arrives and the speed at which it must be processed. By variety, we mean heterogeneity of data types and their representation, as well as noise and outliers. The idea of a concise function that captures the essence of the information addresses these three aspects as well; there is no need to maintain the whole data set, as the correct information is succinctly represented by the function. This function can serve as a description of the essential semantics of the data. Most of all, we address the problem of noisy, corrupted, and adversarial data. An extended abstract of this work was published at the International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics [13].
Polynomial Fitting to Noisy and Adversarial Data.
Generally, this study proposes the use of polynomials to represent large amounts of data. The process works by sampling the data, then using this sample to construct a polynomial whose distance (according to the ℓ∞ metric) from the polynomial constructed using the whole data set is small. Our model assumes that we are given a set of points {(x1i,...,xki)}, i = 1,...,N, in a finite domain [a,b]^k ⊆ R^k. Additionally, for every point we get a corresponding real value of an unknown continuous function, yi = f(x1i,...,xki). Let S denote the set of pairs (x̄i,yi), where x̄ is the vector (x1,...,xk). Formally, we examine the following problem in our study:
Definition 6 (Polynomial Fitting to Noisy and Byzantine Data Problem) Given the set S as defined above, a noise parameter δ > 0 and a threshold ρ > 0, we would like to find a polynomial p of total degree d satisfying
p(x̄i) ∈ [yi − δ, yi + δ] for at least a ρ fraction of S    (3.1)
Such a polynomial will be called a (ρ,δ)-fitting polynomial.
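Definition 6 translates directly into a membership test. The following sketch (our own illustration; the function name is hypothetical, not from the thesis) checks whether a candidate polynomial is (ρ,δ)-fitting for a sample set:

```python
def is_rho_delta_fitting(p, samples, rho, delta):
    # p: a callable candidate polynomial; samples: pairs (x_tuple, y).
    # Checks the condition of Eq. (3.1): p is within delta of y on at
    # least a rho fraction of the sample set.
    hits = sum(1 for x, y in samples if y - delta <= p(*x) <= y + delta)
    return hits >= rho * len(samples)
```

For example, p(x) = 2x + 1 with δ = 0.2 fits 3 of 4 samples containing one outlier, so it is (0.75, 0.2)-fitting but not (1, 0.2)-fitting.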
Given that the function f is real-valued and continuous on the finite domain, by the Weierstrass approximation theorem [41] we know that for any given ε > 0 there exists a polynomial p such that ∥f − p∥∞ < ε for all x ∈ [a,b]. This tells us that our desired polynomial p exists (i.e., ε = δ, ρ = 1, satisfying Eq. 3.1). One obvious candidate for constructing an approximating polynomial is interpolation at equidistant points. However, the sequence of interpolation polynomials does not converge uniformly to f, due to Runge’s phenomenon [17]. Chebyshev interpolation (i.e., interpolating f using the points defined by the Chebyshev polynomial) minimizes Runge’s oscillation, but it forces us to know the value at a specific location, which is not reasonable under the underlying model assumptions. Taylor polynomials are also not appropriate; even setting aside questions of convergence, they are applicable only to functions that are infinitely differentiable, and not to all continuous functions. Considering other classical curve-fitting and approximation theories [45], most research has used the ℓ2 norm of noise, as in the least squares error method. These approaches do not suffice for the adversarial noise we have assumed here. To our knowledge, only Arora and
Khot [4] addressed the ℓ∞ noise that fits our considered problem. In Section 3.3 we introduce the polynomial fit generalization, where we provide a polynomial-time algorithm dealing with multivariate data, based on Arora and Khot’s results. The polynomial fitting problem as defined above can also be approached via error-correcting code theory. From that point of view, extensive literature exists dealing with outlier appearance but noise-free data (i.e., δ = 0 and ρ < 1). In Section 3.2 we present an algorithm that handles Byzantine noisy data based on the Welch-Berlekamp [56] error-elimination method, where we suggest a means to deal with corrupted data appearing in multi-dimensional inputs.
Our contributions.
The polynomial fitting to noisy and Byzantine data problem, as defined above, lies between error-correcting codes and the classical theory of approximation and curve fitting. As such, studying the polynomial fitting problem’s properties and solutions intersects with aspects of both fields, especially in the scope of big data research. The primary results of the chapter can be summarized in the following theorems, while the precise definitions and the detailed proofs are presented in the sequel.
Theorem 6 There is an algorithm that reconstructs (ρ,δ)-fitting k-dimensional polynomials for discrete-valued noise. The algorithm reconstructs the unknown total-degree-d polynomial within ℓ∞ error in time polynomial in δ, d and k.
Theorem 7 There is an algorithm that reconstructs (ρ,δ)-fitting k-dimensional polynomials where ρ is bounded away from 1/2. The algorithm reconstructs the unknown total-degree-d polynomial within ℓ∞ error with high probability over the sample choice, in time polynomial in δ, d and k.
Basic Definitions.
• For two multivariate polynomials p : [a,b]^k → R and q : [a,b]^k → R, their ℓ∞ distance is defined to be ∥p − q∥∞ = sup_{x̄∈[a,b]^k} |p(x̄) − q(x̄)|.
• A monomial in a collection of variables x1,...,xn is a product x1^α1 · x2^α2 ··· xn^αn, where the αi are non-negative integers.
• The total degree of a multivariate polynomial p is the maximum degree of any term in p, where the degree of a particular term is the sum of the variable exponents.
• A polynomial q is a δ-approximation to p if ∥p − q∥∞ ≤ δ.
• Let ∆ : R² → {0,1} denote the difference indicator between two values, defined as follows: ∆(x,y) = 1 if x ≠ y, and 0 otherwise.
3.2 Discrete Finite Noise
In this section, we study one aspect of the polynomial fitting problem posed in Definition 6, where the data contains a bounded number of Byzantine values (denoted by ρ) and we relax the noise constraint (denoted by δ) to be discrete. First, we treat two-dimensional data using the Welch-Berlekamp (WB) algorithm [56] and extend it to suffice for (discrete) noisy data. In the second part of this section, we generalize the WB algorithm to handle multidimensional data.
3.2.1 Handling the Discrete Noise
Welch and Berlekamp (WB) addressed the problem of polynomial reconstruction in their decoding algorithm for Reed-Solomon codes [56]. The main idea of their algorithm is to describe the received (partially corrupted) data as a product of polynomials. Their solution holds for the noise-free case and a limited fraction of corrupted data (δ = 0, ρ > 1/2). Almost 30 years later, Sudan’s list decoding algorithm [52] relaxed the Byzantine constraint (δ = 0, ρ can be less than 1/2) by using bivariate polynomial interpolation. Those concepts do not hold up well in the noisy case, since they use the roots of the polynomial and the divisibility of one polynomial by another, methods that are problematic for noisy data (as shown in [4], Section 1.2). Here, we use the WB algorithm [56] as a “black box” to obtain an algorithm that handles the discrete-noise notation of the polynomial fitting problem. Given a data set (x,y) = {(xi,yi)}, i = 1,...,N, that is within a distance of t = (1 − ρ)N from some polynomial p(x) of degree < d (i.e., ∑_{i=1}^{N} ∆(yi, p(xi)) ≤ t), the WB approach to eliminating the irrelevant data is to use the roots of an object called the error-locating polynomial, e. In other words, we want e(xi) = 0 whenever p(xi) ≠ yi. This is done by defining another polynomial q(x) ≡ p(x)e(x). To resolve these polynomials we need to solve the linear system
q(x_i) = y_i e(x_i) for all i. (3.2)
Welch and Berlekamp show that e(x) | q(x), and that p(x) can be recovered as the ratio p(x) = q(x)/e(x) in O((d + 2t + 1)^ω) running time (where ω is the matrix-multiplication exponent). In Algorithm 1, we use the WB method as a subroutine to handle the noisy, corrupted data.
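To make the WB subroutine concrete, the following is a minimal sketch over a prime field GF(p), where the linear system of Equation (3.2) can be solved exactly (avoiding the numerical issues of real arithmetic). The function names `solve_mod` and `welch_berlekamp`, the choice of field, and the sample values are our own illustration, not the thesis' notation; the system solved is q(x_i) = y_i·e(x_i) with e monic of degree t. It uses Python 3.8+ (three-argument `pow` with exponent −1 for modular inverses).

```python
def solve_mod(A, b, p):
    """Gauss-Jordan elimination over GF(p); returns one solution
    (free variables set to 0) or None if the system is inconsistent."""
    n, m = len(A), len(A[0])
    M = [row[:] + [bi % p] for row, bi in zip(A, b)]
    pivots, r = [], 0
    for c in range(m):
        piv = next((i for i in range(r, n) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], -1, p)          # modular inverse (Python 3.8+)
        M[r] = [v * inv % p for v in M[r]]
        for i in range(n):
            if i != r and M[i][c]:
                f = M[i][c]
                M[i] = [(vi - f * vr) % p for vi, vr in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    if any(M[i][m] for i in range(r, n)):  # inconsistent system
        return None
    x = [0] * m
    for i, c in enumerate(pivots):
        x[c] = M[i][m]
    return x

def welch_berlekamp(points, d, t, p):
    """Recover the degree-<=d polynomial (coefficients low -> high) behind
    N = d + 2t + 1 samples, at most t of them corrupted: solve the linear
    system q(x_i) = y_i * e(x_i) with e monic of degree t, then return
    q / e when the division is exact."""
    nq = d + t + 1
    A, b = [], []
    for x, y in points:
        row = [pow(x, j, p) for j in range(nq)]           # q's coefficients
        row += [-y * pow(x, k, p) % p for k in range(t)]  # e_0 .. e_{t-1}
        A.append(row)
        b.append(y * pow(x, t, p) % p)                    # monic term e_t = 1
    sol = solve_mod(A, b, p)
    if sol is None:
        return None
    q, e = sol[:nq], sol[nq:] + [1]
    rem, quot = q[:], [0] * (d + 1)
    for i in range(d, -1, -1):                            # long division q / e
        c = rem[i + t] % p
        quot[i] = c
        for k in range(t + 1):
            rem[i + k] = (rem[i + k] - c * e[k]) % p
    if any(rem):                                          # e does not divide q
        return None
    errs = sum(1 for x, y in points
               if sum(cj * pow(x, j, p) for j, cj in enumerate(quot)) % p != y % p)
    return quot if errs <= t else None

# hypothetical example: f(x) = 3x^2 + 2x + 1 over GF(101), N = 7 samples,
# two Byzantine values (t = 2)
prime = 101
pts = [(x, (3 * x * x + 2 * x + 1) % prime) for x in range(7)]
pts[1] = (1, (pts[1][1] + 5) % prime)   # corrupt the value at x = 1
pts[4] = (4, (pts[4][1] + 7) % prime)   # corrupt the value at x = 4
recovered = welch_berlekamp(pts, d=2, t=2, p=prime)
# recovered == [1, 2, 3], i.e., 1 + 2x + 3x^2
```

The error-locating polynomial found for this instance is e(x) = (x − 1)(x − 4), whose roots are exactly the two corrupted positions, matching the description above.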
Algorithm 1 Reconstruct a (ρ,δ)-fitting polynomial where δ is discrete
Input: d, S, t, δ, {∆_i}
Output: a polynomial p_i(x) which is the (ρ,δ)-fitting polynomial
  i ← 0
  repeat
    i ← i + 1
    S_i ← S + ∆_i
    p_i(x) ← WB(S_i, d, t)
  until p_i(x) ∈ [y − δ, y + δ] for at least a ρ fraction of S
  Return p_i(x)
The algorithm receives the degree d > 0 of the goal polynomial; a set S of N = d + 2t + 1 datapoints (x_i, y_i)_{i=1}^N; the Byzantine appearance bound 0 < t = (1 − ρ)N; the noise bound δ; and the set {∆_i} = {(a_1, ..., a_{d+2t+1}) : a_j ∈ [−δ, δ]} of all possible (discrete) noise vectors.
At every step i, we add to S a different noise vector ∆_i and reconstruct the polynomial p_i using the WB algorithm. The resulting polynomial p_i is then tested, where the criterion is that p_i is within δ of all datapoints except the Byzantine ones (according to the maximal number of Byzantine values as defined by ρ). Note that if the desired polynomial's degree d is not given, we can search for the minimal degree of a polynomial that fits δ and ρ in a binary-search fashion.
In addition, note that Ar et al. [3] also suggest a way to reconstruct a (ρ,δ)-fitting polynomial where δ is discrete. They also insert all possible (discrete) noise vectors sequentially, but use a different 'black box' procedure and do not deal with multivariate polynomials as we present in the sequel.
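The search loop of Algorithm 1 can be sketched compactly in the special case ρ = 1 (no Byzantine values), where the WB call with t = 0 degenerates to plain interpolation through d + 1 points. The sketch below enumerates every discrete noise vector ∆_i ∈ [−δ, δ]^N exactly as the repeat-until loop does; the names `lagrange` and `fit_discrete_noise`, and the sample data, are our own illustrative choices. Exact rational arithmetic (`Fraction`) keeps the within-δ test free of floating-point error.

```python
from fractions import Fraction
from itertools import product

def lagrange(points):
    """Exact interpolation through d+1 points: the degenerate WB(., d, t=0)."""
    def poly(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total
    return poly

def fit_discrete_noise(S, d, delta):
    """Algorithm 1 with rho = 1: add each candidate noise vector Delta_i to S,
    fit a degree-d polynomial, and stop once it is within delta of all of S."""
    for noise in product(range(-delta, delta + 1), repeat=len(S)):
        shifted = [(x, y + n) for (x, y), n in zip(S, noise)]
        cand = lagrange(shifted[: d + 1])
        if all(abs(cand(x) - y) <= delta for x, y in S):
            return cand
    return None

# hypothetical example: the true polynomial x^2 + 1 observed with
# discrete noise in {-1, 0, 1} at each of the N = d + 1 = 3 points
S = [(0, 2), (1, 1), (2, 5)]
g = fit_discrete_noise(S, d=2, delta=1)
# g is a degree-2 polynomial within delta = 1 of every datapoint of S
```

Termination is guaranteed because the negated true noise vector is among the candidates; an earlier candidate may be returned instead, but any returned polynomial has passed the (ρ,δ)-fitting test, which is all the definition requires.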
3.2.2 Multidimensional Data.
To generalize the WB algorithm to handle multidimensional data, we need to reformulate the algorithm to deal with multivariate polynomials. This is a challenging task due to the infinitely many roots such polynomials may have (for example, P(x_1,x_2) = x_1·x_2 has infinitely many roots in R²), and, as previously mentioned, the WB method is strictly based on the polynomials' roots.
A suggested method to handle three-dimensional data is to assume that the values of the datapoints in one direction (e.g., the x-direction) are distinct. This assumption helps us to define the error-locating polynomial e in the x-direction only (or, symmetrically, over the y-axis). The assumption can be satisfied by using i.i.d. samples or by enlarging the sample size from N to N², which ensures (by the pigeonhole principle) that we have at least N distinct values. The 3-dimensional polynomial reconstruction is described in Algorithm 2.
Algorithm 2 Reconstruct the polynomial p(x,y) representing the true data
Input: d, t, (x_i, y_i, z_i)_{i=1}^N
Output: a polynomial p(x,y) of total degree at most d, or fail.
Step 1:
  Compute a non-zero univariate polynomial e(x) of degree exactly t and a bivariate polynomial q(x,y) of total degree d + t such that:
    z_i e(x_i) = q(x_i, y_i),  1 ≤ i ≤ N.
  if e(x) and q(x,y) do not exist then
    output fail.
  end if
End
Step 2:
  if e(x) does not divide q(x,y) then
    output fail.
  else
    compute p(x,y) = q(x,y) / e(x).
  end if
  if ∑_i ∆(z_i, p(x_i, y_i)) > t then
    output fail.
  else
    output p(x,y).
  end if
End
The algorithm receives the total degree d > 0 of the goal polynomial, the Byzantine appearance bound 0 < t = (1 − ρ)N, and a set of N triples (x_i, y_i, z_i) with distinct x_i's. In the first step we define the error-locating polynomial e and the product polynomial q. In the next step we check whether e divides q and whether the distance between the resulting polynomial and the z_i's exceeds the bound t. Next we argue that this algorithm fulfills its goal, and illustrate its behavior with an example.
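As a concrete illustration of Algorithm 2, the sketch below again works over a prime field GF(p) so that Step 1's linear system can be solved exactly (the Gauss-Jordan solver is repeated to keep the block self-contained). The name `wb_bivariate` and the sample points are our own hypothetical choices. Step 2's division treats q(x,y) as a polynomial in x whose coefficients are polynomials in y, which is well defined because e(x) is univariate and monic.

```python
def solve_mod(A, b, p):
    """Gauss-Jordan elimination over GF(p); one solution or None."""
    n, m = len(A), len(A[0])
    M = [row[:] + [bi % p] for row, bi in zip(A, b)]
    pivots, r = [], 0
    for c in range(m):
        piv = next((i for i in range(r, n) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], -1, p)
        M[r] = [v * inv % p for v in M[r]]
        for i in range(n):
            if i != r and M[i][c]:
                f = M[i][c]
                M[i] = [(vi - f * vr) % p for vi, vr in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    if any(M[i][m] for i in range(r, n)):
        return None
    x = [0] * m
    for i, c in enumerate(pivots):
        x[c] = M[i][m]
    return x

def wb_bivariate(points, d, t, p):
    """Algorithm 2 over GF(p): find e(x) monic of degree t and q(x,y) of total
    degree <= d + t with z_i*e(x_i) = q(x_i,y_i); return q/e as a coefficient
    dict {(i, j): c} for x^i y^j, or None (fail)."""
    D = d + t
    mons = [(i, j) for i in range(D + 1) for j in range(D + 1 - i)]
    A, b = [], []
    for x, y, z in points:
        row = [pow(x, i, p) * pow(y, j, p) % p for i, j in mons]
        row += [-z * pow(x, k, p) % p for k in range(t)]   # e_0 .. e_{t-1}
        A.append(row)
        b.append(z * pow(x, t, p) % p)                     # monic e_t = 1
    sol = solve_mod(A, b, p)
    if sol is None:
        return None                                        # Step 1 fails
    q = {m: c for m, c in zip(mons, sol[:len(mons)]) if c}
    e = sol[len(mons):] + [1]                              # monic error locator
    # Step 2: divide q(x,y) by e(x), viewing q as a polynomial in x whose
    # coefficients are polynomials in y
    rem, quot = dict(q), {}
    for i in range(D, t - 1, -1):
        for j in [jj for (ii, jj) in list(rem) if ii == i]:
            c = rem.pop((i, j)) % p
            if not c:
                continue
            quot[(i - t, j)] = c
            for k in range(t):
                key = (i - t + k, j)
                rem[key] = (rem.get(key, 0) - c * e[k]) % p
    if any(v % p for v in rem.values()):
        return None                                        # e does not divide q
    ev = lambda x, y: sum(c * pow(x, i, p) * pow(y, j, p)
                          for (i, j), c in quot.items()) % p
    if sum(1 for x, y, z in points if ev(x, y) != z % p) > t:
        return None
    return quot

# hypothetical example: f(x,y) = 2x^2 + xy + 1 over GF(101), 14 samples on
# the curve y = x^4 (so the x_i are distinct), one Byzantine value (t = 1)
prime = 101
samples = [(x, pow(x, 4, prime), (2 * x * x + x * pow(x, 4, prime) + 1) % prime)
           for x in range(1, 15)]
x5, y5, z5 = samples[4]
samples[4] = (x5, y5, (z5 + 17) % prime)   # corrupt the value at x = 5
recovered = wb_bivariate(samples, d=2, t=1, p=prime)
# recovered == {(0, 0): 1, (1, 1): 1, (2, 0): 2}, i.e., 1 + xy + 2x^2
```

The samples are placed on the curve y = x⁴ only to keep the point set in general position for degree-3 bivariate polynomials while still having distinct x-values, as the algorithm's assumption requires.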
Theorem 8 Let p be an unknown polynomial of total degree d in two variables. Given a threshold t > 0 and a sample S ⊂ R³ of N = (d+t+2 choose 2) + t distinct points (x_i, y_i, z_i) such that z_i ≠ p(x_i, y_i) for at most t values, the algorithm above reconstructs p in O(N^ω) running time (where ω is the matrix-multiplication exponent).
Proof The proof of the Theorem above follows from the subsequent claims.
Claim 12 (Correctness) There exists a pair of polynomials e(x) and q(x,y) that satisfies Step 1 such that q(x,y) = p(x,y)e(x).
Take the error-locating polynomial e(x) and q(x,y) = p(x,y)e(x), where deg(q) ≤ deg(p) + deg(e) ≤ d + t. By definition, e(x) is a degree-t polynomial with the following property:
e(x_i) = 0 whenever z_i ≠ p(x_i, y_i).
We now argue that e(x) and q(x,y) satisfy the equation z_i e(x_i) = q(x_i, y_i). Note that if e(x_i) = 0, then q(x_i, y_i) = z_i e(x_i) = 0. When e(x_i) ≠ 0, we know p(x_i, y_i) = z_i, and so we still have q(x_i, y_i) = p(x_i, y_i)e(x_i) = z_i e(x_i), as desired.
Claim 13 (Uniqueness) If any two distinct solutions (q_1(x,y), e_1(x)) ≠ (q_2(x,y), e_2(x)) satisfy Step 1, then they satisfy q_1(x,y)/e_1(x) = q_2(x,y)/e_2(x).
It suffices to prove that q_1(x,y)e_2(x) = q_2(x,y)e_1(x). We know that for all i ∈ [N], q_1(x_i, y_i) = e_1(x_i)z_i and q_2(x_i, y_i) = e_2(x_i)z_i. Substituting these relations gives q_1(x_i, y_i)e_2(x_i) = e_1(x_i)e_2(x_i)z_i = q_2(x_i, y_i)e_1(x_i) for every i. Since these polynomials agree on more points than their degree, they are identical. Finally, as e_1(x) ≠ 0 and e_2(x) ≠ 0, this implies that q_1(x,y)/e_1(x) = q_2(x,y)/e_2(x), as desired.