Exploration, Interpolation and Extrapolation: Each Approached in a Different Manner
I dedicate this thesis to my husband, Amiel, for his constant support and unconditional love.

Acknowledgements

I would like to express my special appreciation and thanks to my PhD advisors, Professors Shlomi Dolev and Zvi Lotker, for supporting me during these past five years. You have been tremendous mentors for me. I would like to thank you for encouraging my research and for allowing me to grow as a research scientist. Your scientific advice and knowledge, and the many insightful discussions and suggestions, have been priceless. I would also like to thank my committee members, Professor Eitan Bachmat, Professor Chen Avin, and Professor Amnon Ta-Shma, for their helpful comments and suggestions. A heartfelt thanks to the supportive and active BGU community here in Beer-Sheva and to all my friends who made the research experience something special, in particular Ariel, Dan, Nisha, Shantanu, Martin, Nova, Eyal, Guy and Marina. Special thanks to Daniel and Elisa for proofreading my final draft. A special thanks to my family. Words cannot express how grateful I am to my mother-in-law, father-in-law, my mother, and father for all of the sacrifices that you've made on my behalf. Finally, I would like to acknowledge the most important person in my life, my husband Amiel. He has been a constant source of strength and inspiration. There were times during the past five years when everything seemed hopeless and I had lost all hope. I can honestly say that it was only his determination and constant encouragement (and sometimes a kick on my backside when I needed one) that ultimately made it possible for me to see this project through to the end.

Abstract

The abundance of data is forcing us to redefine many scientific and technological fields, as almost any environment is now a potential source of Big Data. The advent of Big Data introduces important changes: the availability of additional external data sources, previously unknown dimensions, and questionable consistency pose new challenges to computer scientists, demanding a general reconsideration of tools, software, methodologies and organizations.

This thesis investigates the problem of big data abstraction in the scope of exploration, interpolation and extrapolation. The driving vision of data abstraction is to turn the information overload into an opportunity: the goal of the abstraction is to make our way of processing data and information transparent for analytic discourse, as well as to provide a tool for completing missing information, predicting unknown features, and filtering noise and outliers.

We confront three aspects of the abstraction problem with gradually increasing levels of generalization. First, we focus on a specific data type and propose a novel solution for exploring the connectivity threshold of wireless sensor data when the number of sensors approaches infinity. Second, we consider how to use polynomials to effectively and succinctly interpolate general data functions while tolerating noise as well as a bounded number of maliciously corrupted outliers. Third, we show how to represent a high-dimensional data set with incomplete information in a way that fulfills the demands of predictive modeling.

Our main contribution lies in rethinking these problems in the context of massive amounts of data, which dictate large volumes and high dimensionality. Information extraction, exploration and extrapolation have a major impact on our society.
We believe that the topics investigated in this thesis have the potential for great practical influence.

Table of contents

List of figures
1 Introduction
1.1 The Information Age
1.2 Big Data Abstraction
1.2.1 Big Data Exploration
1.2.2 Big Data Interpolation
1.2.3 Big Data Extrapolation
2 Probabilistic Connectivity Threshold for Directional Antenna Widths
2.1 Introduction
2.2 Preliminaries
2.2.1 Notations
2.2.2 Probability and the relation between Uniform and Poisson distributions
2.2.3 Covering and Connectivity
2.3 Centered Angles
2.3.1 Finding the Connectivity Threshold
2.4 Random Angle Direction
2.5 Discussion
2.6 Appendix
3 Big Data Interpolation using Functional Representation
3.1 Introduction
3.2 Discrete Finite Noise
3.2.1 Handle the discrete noise
3.2.2 Multidimensional Data
3.3 Random Sample with Unrestricted Noise
3.3.1 Polynomial fitting to noisy data
3.3.2 Byzantine Elimination
3.4 Discussion
4 Mending Missing Information in Big-Data
4.1 Introduction
4.2 Preliminaries
4.3 k-flats Clustering
4.4 Algorithm
4.5 Experimental Studies of k-Flat Clustering
4.6 Clustering with different group sizes
4.7 Sublinear and distributed algorithms
4.8 Discussion
4.9 Conclusion
4.10 Appendix: The probability of flats intersection
5 Conclusions
Nomenclature
References

List of figures

2.1 Directional antenna model.
2.2 The communication graph over the disk and the disk's boundary.
2.3 Covering vs. connectivity problems.
2.4 Projecting nodes from the disk onto an antipodal pair on the boundary.
2.5 Transforming the antipodal pair to a node on the boundary.
2.6 Projecting a node from the boundary to a node on the disk.
2.7 A node and its intercepted arc.
2.8 The disk's cover expansion.
2.9 Representing the three-dimensional variable of the graph using a torus.
2.10 Representing the minimal coverage area by an annulus.
2.11 The possible directions that induce adjacency.
2.12 Generalization to convex fat objects with curvature > 0.
4.1 Two-dimensional pair of flats intersecting a disk.
4.2 The distance between the midpoint and the ball's center.
4.3 Eliminating the irrelevant midpoints.
4.4 Almost orthogonal flats' pairwise intersection.

Chapter 1
Introduction

1.1 The Information Age

Almost 35 years ago, Alvin Toffler [54] published his book "The Third Wave", in which he described three phases of human society's development based on the concept of 'waves', with each wave pushing the older societies and cultures aside. According to Toffler, civilization can be divided into three major phases. The First Wave is the settled agricultural society, which replaced the first hunter-gatherer cultures. The symbol of this age is the hoe, and the profile of the wealthy person is the land owner. Battles were typically carried out with swords. The Second Wave is the industrial-age society, symbolized by the machine, beginning with the Industrial Revolution. At this time, the wealthy were the factory owners, and machines were used during times of war - tanks, aircraft, etc. The Third Wave is the post-industrial society. Toffler says that since the late 1950s, most countries have been transitioning into the Information Age. The symbol now is obviously the computer.
The wealthy and powerful people are those who develop or collect the data and sell others the privilege to use it, and one of the main threats in this modern age is the cyber attack. At the beginning of the 1980s, no one could have imagined the significance and power that data would come to hold in everyday life. Today, big data - a large pool of data that can be captured, communicated, aggregated, stored and analyzed - is part of every sector and function of the global economy [36]. The use of big data can create significant value for the world economy, enhancing the productivity and competitiveness of companies and the public sector, and creating a substantial economic surplus for consumers.

Big data challenges.

There are many different definitions for the term 'Big Data'. Generally, the term refers to a massive amount of data, the size of which is beyond the ability of typical database software tools to capture, store, manage, and analyze. A popular characterization is the three V's: Volume, Variety and Velocity. By Volume, we usually mean the sheer size of the data, which is of course the major challenge and the most easily recognized. By Variety, we mean the heterogeneity of data types, representations, and semantic interpretations. Velocity refers both to the rate at which data arrives and to the speed at which it needs to be processed - for example, to perform fraud detection at a point of sale.

Another important feature of big data is not only the huge number of items but also their 'wideness', i.e., each item maintains many fields. Hence, it is common to describe these items as objects in a high-dimensional space. High-dimensional objects have a number of unintuitive properties that are sometimes referred to as the 'curse of dimensionality' [6]. Multiple dimensions are hard to reason about and impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable as dimensionality increases. One manifestation of the 'curse' is that in high dimensions almost all pairs of points are equally far away from one another, and almost any two vectors are nearly orthogonal (the sketch below illustrates this numerically). Another manifestation is that high-dimensional functions tend to have more complex features than low-dimensional functions, and are hence harder to estimate. Moreover, in order to obtain a statistically sound and reliable result, e.g., to estimate multivariate functions with the same accuracy as functions in low dimensions, the sample size must grow exponentially with the dimension. The emerging field of data science addresses these aspects of big data and provides solutions that are fundamentally multidisciplinary.
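The two manifestations of the 'curse' mentioned above can be checked numerically. The following minimal sketch (in Python with numpy, added here for illustration and not a method developed in this thesis) samples random points and vectors at increasing dimension and reports how the relative spread of pairwise distances shrinks and how the average cosine between random vectors approaches zero.

```python
# Minimal illustrative sketch of the 'curse of dimensionality':
# (1) pairwise distances between random points concentrate around a common value,
# (2) random vectors become nearly orthogonal (cosine similarity tends to zero).
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n_points=500):
    """Std/mean of pairwise Euclidean distances for points uniform in [0,1]^dim."""
    pts = rng.random((n_points, dim))
    sq = (pts ** 2).sum(axis=1)
    # Squared distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    iu = np.triu_indices(n_points, k=1)  # each unordered pair once
    d = np.sqrt(np.maximum(d2[iu], 0.0))
    return d.std() / d.mean()

def mean_abs_cosine(dim, n_pairs=500):
    """Average |cos(angle)| between independent pairs of random Gaussian vectors."""
    u = rng.standard_normal((n_pairs, dim))
    v = rng.standard_normal((n_pairs, dim))
    cos = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return np.abs(cos).mean()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative distance spread={distance_spread(dim):.3f}  "
          f"mean |cos|={mean_abs_cosine(dim):.3f}")
```

As the dimension grows, both printed quantities shrink toward zero, which is precisely the "all points roughly equidistant, all vectors nearly orthogonal" behavior described above.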