The Pennsylvania State University
The Graduate School
College of Engineering

SOCIAL-ENRICHED DATA ANALYSIS AND PROCESSING TOOLS

A Dissertation in
Computer Science and Engineering
by
Xingjie Liu

© 2013 Xingjie Liu

Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

May 2013

The dissertation of Xingjie Liu was reviewed and approved∗ by the following:

Wang-Chien Lee
Associate Professor of Computer Science and Engineering
Dissertation Advisor, Chair of Committee

Piotr Berman
Associate Professor of Computer Science and Engineering

Daniel Kifer
Assistant Professor of Computer Science and Engineering

Jia Li
Professor of Statistics

Lee Coraor
Professor of Computer Science and Engineering
Graduate Officer for Computer Science and Engineering

∗Signatures are on file in the Graduate School.

Abstract

In recent years, the rapid development of online social services, such as Facebook, Twitter, LinkedIn and Foursquare, has posed new opportunities and challenges to researchers. On the one hand, with a huge amount of comprehensive social network data and various types of user-generated content made available for analysis, we are able to conduct in-depth studies at a scale that was never possible before. The data will help us better understand people's opinions and activities, capture trends in our society and improve social services. On the other hand, such data require novel techniques for modeling, extraction and processing to reveal their real value, because many existing solutions cannot handle new issues such as heterogeneous data types, scalability and efficiency requirements.

In this thesis, we introduce the concept of social-enriched data, defined as the social connection graphs together with the user-created content distributed on those graphs, to represent the data collected in the aforementioned online social services.
We identified several issues in handling social-enriched data and proposed a set of novel solutions to tackle them.

First, as people tend to interact with others both through online services and in their offline lives, capturing the properties of heterogeneous social networks with both online and offline components becomes critical. Hence, we investigated a new type of social network, Event-based Social Networks (EBSNs), as a typical example of a heterogeneous graph. EBSNs contain online social interactions, as in conventional online social networks, as well as offline social interactions captured in offline activities. Based on real data collected from Meetup, a social event organizing service, we analyzed EBSN properties and discovered many unique and interesting characteristics, such as heavy-tailed degree distributions and strong locality of social interactions. We subsequently studied the heterogeneous nature of EBSNs (the co-existence of online and offline social interactions) in the context of two challenging problems: community detection and information flow. We found that communities detected in EBSNs are more cohesive than those in other types of social networks (e.g., location-based social networks). In the context of information flow, we studied the event recommendation problem and significantly improved the recommendation with a community-based diffusion model that incorporates both online and offline interactions.

Second, as user-created content constitutes an essential ingredient of many online social services, we chose to study it in a widely applied practice, namely recommendation. In particular, we focused on the problem of recommending content to a group of users by utilizing the social context. To extract group preference information from social-enriched data, we analyzed the decision-making process in user groups and proposed a personal impact topic (PIT) model, a type of probabilistic generative model. The PIT model identifies the group preference profile for a given group by mining the individual preferences and personal impacts of group members from the group recommendation history. Further, we integrated friend connection information to obtain an extended personal impact topic (E-PIT) model. Through comprehensive data analysis and evaluations conducted on three real datasets, we demonstrate that the social-based PIT and E-PIT approaches achieved good performance.

Finally, to support efficient data analysis and combat scalability issues, we proposed two data analysis tools for social-enriched data, namely distributed graph summary and uncertain skyline query. The distributed graph summary algorithms summarize a large-scale graph into an abstract graph in which the topology of the original graph is preserved. As online social networks can become extremely large and complex, graph summarization is crucial in uncovering useful insights about the patterns hidden in the underlying graphs. In our study, we introduce three distributed algorithms that enable parallel processing of graph summarization, produce good-quality summaries, and scale well with increasing data sizes. The uncertain skyline operator is a data filtering operator that identifies a set of data items that are not dominated by any other items, where each item is represented as a multidimensional data tuple with probabilistic attribute values. The operator is particularly useful for multi-criteria data analysis and filtering of user-created content. Specifically, the U-Skyline query searches for the set of tuples that has the highest probability (aggregated over all possible scenarios) of being the skyline answer.
To answer U-Skyline queries efficiently, we propose a series of optimization techniques for query processing. Our performance evaluation shows that our algorithm is 10 to 100 times faster than the state-of-the-art solutions.

Social-enriched data analysis is attracting growing research interest. This thesis presents pioneering work on several challenging topics in this area, and we believe that our solutions will provide real value to the utilization of social-enriched data in practice.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
  1.1 System Model
  1.2 Challenges and Our Proposals
  1.3 The Organization of the Thesis

Chapter 2 Literature Review
  2.1 Online Social Services
  2.2 Social Network Data
    2.2.1 Before and After Social Service Boom
    2.2.2 Social Network Properties
    2.2.3 Online and Offline Social Networks
  2.3 User-created Contents
    2.3.1 Recommendations of User-created Contents
  2.4 Social-enriched Data Analyzing Tools
    2.4.1 Graph Summary
    2.4.2 U-Skyline

Chapter 3 Event-based Social Networks
  3.1 Overview
  3.2 Related Works
  3.3 Event-based Social Networks
    3.3.1 Event-based social services
    3.3.2 Event-based Social Networks Definition
    3.3.3 Representative Datasets Description
  3.4 EBSNs Properties
    3.4.1 Social Events
    3.4.2 Event and Group Participation
    3.4.3 Network Properties
    3.4.4 Locality of Social Interactions
  3.5 EBSNs Community Structure
    3.5.1 Clustering on Homogeneous Networks
    3.5.2 Clustering on Heterogeneous EBSNs
      3.5.2.1 Baseline 1: Linear Combination
      3.5.2.2 Baseline 2: Generalized SVD
      3.5.2.3 Extended Fiedler Method
    3.5.3 Community Structure Evaluation
      3.5.3.1 Evaluation Settings
      3.5.3.2 Results
  3.6 EBSNs Information Flow
    3.6.1 Event-Centric Diffusion
      3.6.1.1 Basic Event-Centric Diffusion
      3.6.1.2 Diffusion over EBSNs
      3.6.1.3 Community-Based Diffusion
    3.6.2 Information Flow Evaluation
      3.6.2.1 Experimental Settings
      3.6.2.2 Compare Diffusion Models
      3.6.2.3 Compare Various Diffusion Patterns on EBSNs
      3.6.2.4 Examine the Effect of Cold-Start

Chapter 4 Group Recommendation
  4.1 Overview
  4.2 Related Works
  4.3 Preliminary
    4.3.1 Problem Definition
    4.3.2 Extended Author Topic Model
  4.4 Personal Impact Topic Model
    4.4.1 Personal Impact
    4.4.2 Personal Impact Topic Model
    4.4.3 PIT Model Parameter Inference
    4.4.4 PIT Model Extension
  4.5 Evaluation
    4.5.1 Dataset Description
    4.5.2 Evaluated Recommendation Methods
    4.5.3 Evaluation Results
    4.5.4 Personal Impact Distribution and Significance
    4.5.5 Topic Analysis

Chapter 5 Data Analyzing Tools: Distributed Graph Summary
  5.1 Overview
  5.2 Preliminaries
    5.2.1 Definition of Graph Summarization
    5.2.2 Comparison to Graph Clustering
    5.2.3 NP-hardness and Super-edge Assignment
    5.2.4 Centralized Graph Summary Algorithms
  5.3 Giraph/Pregel Overview
  5.4 Algorithm Overview
  5.5 Finding Merge Candidates
    5.5.1 DistGreedy
    5.5.2 DistRandom
    5.5.3 DistLSH
  5.6 Merging Super-nodes
    5.6.1 Optimization for Hub Nodes
  5.7 Experimental Evaluation
    5.7.1 Experiment Setup
    5.7.2 Effectiveness and Efficiency Balance
    5.7.3 Number of Stripes on DistLSH
    5.7.4 Scalability of Our Distributed Techniques

Chapter 6 Data Analyzing Tools: Uncertain Skyline
  6.1 Overview
  6.2 Preliminary
    6.2.1 Skyline Query
    6.2.2 Uncertain Data Model
    6.2.3 U-Skyline Query
    6.2.4 Related Works
  6.3 U-Skyline Probability and Integer Programming Model
    6.3.1 U-Skyline Probability
    6.3.2 Integer Programming Formulation
  6.4 U-Skyline Processing Algorithms
    6.4.1 Dynamic Construction of Candidate Skyline
    6.4.2 Recursive Computation of U-Skyline Probability