Analysis-Aware Approach to Improving Social Data Quality
UC Irvine
UC Irvine Electronic Theses and Dissertations

Title: Analysis-Aware Approach to Improving Social Data Quality
Permalink: https://escholarship.org/uc/item/6k08w8js
Author: Sadri, Mehdi
Publication Date: 2017
Peer reviewed|Thesis/dissertation

eScholarship.org
Powered by the California Digital Library
University of California

UNIVERSITY OF CALIFORNIA, IRVINE

Analysis-Aware Approach to Improving Social Data Quality

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY in Computer Science

by

Mehdi Sadri

Dissertation Committee:
Professor Sharad Mehrotra, Chair
Professor Chen Li
Professor Nalini Venkatasubramanian
Professor Yaming Yu

2017

© 2017 Mehdi Sadri

DEDICATION

To my beloved parents, Monir and Mohammad.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION
1 Introduction
2 Preliminaries and Related Work
  2.1 Data Quality
    2.1.1 Social Data Quality
  2.2 Data Acquisition
    2.2.1 Social Data Acquisition
  2.3 Data Cleaning
    2.3.1 Social Data Cleaning
  2.4 Analysis-Aware Approach
3 Social Data Acquisition
  3.1 Introduction
  3.2 Motivating Example
  3.3 Notation and Problem Definition
  3.4 Query Generation
    3.4.1 Probabilistic Query Coverage
    3.4.2 Query Generation
    3.4.3 Statistics Maintenance
    3.4.4 Combinatorial MAB Framework
    3.4.5 Greedy Approximation Bound
    3.4.6 Greedy Algorithm
  3.5 Relevance Check
    3.5.1 Phrase Based Relevance (Rt)
    3.5.2 Clue Relevance (Rc)
    3.5.3 User History (Ru)
  3.6 Topic Maintenance
  3.7 Experimental Evaluation
    3.7.1 Experimental Setup
    3.7.2 Evaluation Criteria
    3.7.3 Experimental Results
  3.8 TAPP (Twitter Follow-up Application)
    3.8.1 System Overview
  3.9 Summary
4 Social Entity Linking
  4.1 Introduction
  4.2 Motivating Example
  4.3 Preliminaries
    4.3.1 Window-based Stream
    4.3.2 Data Cleaning Functionalities
    4.3.3 Entity Blocks
    4.3.4 Mention Probabilities
    4.3.5 Continuous Top-k Query
  4.4 Deterministic Top-k
  4.5 Probabilistic Top-k
    4.5.1 Factor Graph
    4.5.2 Entity Probabilistic Model
    4.5.3 Entity Dominance Graph (EDG)
    4.5.4 Selection Criteria
    4.5.5 Stopping Criteria
    4.5.6 Finding Top-K
    4.5.7 Scalability of EDG
  4.6 Architecture of TkET
    4.6.1 Sliding Window Stream Processing
  4.7 Experimental Evaluation
    4.7.1 Experimental Setup
    4.7.2 Knowledge Base
    4.7.3 Factor Graphs and Dimple
    4.7.4 Synthetic Dataset Generation
    4.7.5 Experimental Results
    4.7.6 Real Tweet Dataset Experimental Results
    4.7.7 Discussion
  4.8 Related Work
    4.8.1 Social Entity Linking
    4.8.2 Top-k Query Answering
  4.9 Summary
5 SoDAS: Social Data Analytics System
  5.1 System Overview
6 Conclusions and Future Work
Bibliography

LIST OF FIGURES

2.1 Common Steps in Data Processing Pipelines
3.1 TAS Architecture
3.2 Phrase Weight
3.3 TAS Iterations
3.4 Phrase Maintenance
3.5 Approximate Relative Recall
3.6 TAS vs. BaseM: Number of Tweets
3.7 TAS over Simulation: Number of Tweets
3.8 TAS vs. BaseM: Approximate Relative Recall
3.9 TAS with Different Phrase Budgets
3.10 TAS with Different Inner Iteration Sizes
3.11 Topic Maintenance Module On vs. Off: Number of Tweets
3.12 TAS: Number of Phrases
3.13 TAS vs. ATM: Topic 75
3.14 TAS vs. ATM: Topic 85
3.15 TAAP Application
4.1 NER and NEL Black Box Interfaces
4.2 Factor Graph for the “Catfish” Entity Block
4.3 Entity Dominance Graph Example
4.4 In-Degree vs. Out-Degree based Stopping Criteria
4.5 Out-Degree based Stopping Criteria Example
4.6 In-Degree based Stopping Criteria Example
4.7 EDG 1-2 steps of TkET top-2 algorithm on Motivating Example
4.8 EDG 3-5 steps of TkET top-2 algorithm on Motivating Example
4.9 EDG 6-7 steps of TkET top-2 algorithm on Motivating Example
4.10 Transitivity of Pairwise Dominance
4.11 Overview of TkET
4.12 Architecture of TkET
4.13 Sliding Window, Stream Processing
4.14 Motivating Example's Identified Entity Blocks
4.15 Selected Synthetic Datasets Block Size Distribution
4.16 SDS4: Latency vs. Parameters (k, th)
4.17 SDS4: Accuracy vs. Parameters (k, th)
5.1 SoDAS General Architecture

LIST OF TABLES

3.1 Example Phrases of an Interest
3.2 Fixed Corpus Topics of Interest
3.3 ARR Calculation, Sample Sizes
3.4 Streaming Topics of Interest
3.5 Streaming Experiment Summary
4.1 Example Raw Tweets
4.2 Selected Dataset Parameters
4.3 Efficiency for Out-Degree based Stopping Criteria
4.4 Efficiency for In-Degree based Stopping Criteria
4.5 Efficiency over the Real Tweet Dataset

ACKNOWLEDGMENTS

I would first like to express my deepest, sincere gratitude to my advisor, Prof. Sharad Mehrotra, for his unwavering guidance, support, and encouragement. Prof. Sharad has patiently taught me how to identify new important research problems, how to solve problems in principle, and how to write research papers. I am glad to have had the opportunity to work with him, and for that I am very grateful.

Additional special gratitude is due to Prof. Yaming Yu and Prof. Charless Fowlkes for their insightful support and suggestions throughout this research, especially on the third and fourth chapters of this thesis. The time and effort they spent with me were instrumental in my progress.

I would also like to extend my appreciation to the members of my doctoral committee, Prof. Chen Li, Prof. Nalini Venkatasubramanian, and Prof. Yaming Yu, for their useful feedback and for finding the time to serve on my committee.
I would like to thank everyone in the ISG group, especially my colleagues in the Data Quality and Privacy Group at UCI: Yasser Altowim, Hotham Altwaijry, Stylianos Doudalis, Kerim Oktay, Jie Xu, Liyan Zhang, Abdulrahman Alsaudi, and Jamshid Esmaelnezhad.

The work reported in this thesis was also supported in part by NSF grants CNS-1527536, CNS-1545071, CNS-1450768, CNS-1059436, and CNS-1118114. Foremost, I would.