Identifying Interests and Expertise in the Twitter Social Network
DEPARTMENT OF INFORMATICS
MSc IN COMPUTER SCIENCE

M.Sc. Thesis
«Identifying Interests and Expertise in the Twitter Social Network»

Evita Zoi Bakopoulou
Student ID: 1418
Advisor: Prof. Vana Kalogeraki

ATHENS, FEBRUARY, 2016

“Social media is the ultimate equalizer. It gives a voice and a platform to anyone willing to engage.”
Amy Jo Martin

ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS

Abstract

Department of Informatics
Master of Science in Computer Science
Identifying Interests and Expertise in the Twitter Social Network
by Evita Zoi Bakopoulou

Social media have become very popular in recent years and constitute a powerful source of information. Twitter users have various interests and expertise levels, and we believe that their profile data can be explored to better tailor crowdsourcing systems to the needs of specific tasks. This thesis aims to infer the interests and expertise of users in a Twitter sample data set, and to use these user profiles when assigning tasks to the users in a crowdsourcing system. For example, user expertise can provide an important benefit when solving complex tasks or avoiding non-legitimate answers. Our work is intended to ensure higher validity of user answers, as there will be a better match between user interests and the issued tasks, so that tasks requiring expertise are assigned to users who have it. Various Twitter features can be used in such a system. For example, Twitter Lists can indicate a user's interests when the user subscribes to a list, and a user's expertise when other users add him or her to their lists. In this thesis, text mining techniques are used to extract the users' topics and to match them with a query describing a crowdsourcing task. We present experimental results to illustrate the operation and benefits of our approach.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my advisor, Prof. Vana Kalogeraki, for her continuous guidance and incredible encouragement throughout this thesis and related research. I also thank my family, especially my husband Kostas, for providing me with unfailing support and understanding during the challenges of graduate school and life.

Contents

Abstract
Acknowledgements
1 Introduction
2 Background and Related Work
  2.1 Twitter social network
  2.2 Comparison with existing approaches
3 Identifying topical expertise and interests
  3.1 Data Collection Analysis
  3.2 Data Preprocessing
  3.3 Proposed Approach
    3.3.1 LDA Model
    3.3.2 LSI Model
    3.3.3 Naive Approach
4 Experimental Results
  4.1 LDA results
  4.2 Comparing LDA and LSI results with different features
  4.3 Comparing with a new dataset
5 Conclusions and Future Work
Bibliography

Chapter 1
Introduction

Approximately 2 billion Internet users use social networks daily, and this number increases every year. There are various social platforms, each focusing on a different use: Facebook is focused on exchanges (photos, status sharing) between family and friends, while Twitter is more about communication (posting micro messages). Twitter, a micro-blogging website, is one of the most popular social networks today, having surpassed 320 million monthly active users. Because users provide information about themselves voluntarily and publicly, it is an ideal social network for discovering the topics of users' interests and expertise.

A key challenge in identifying topical expertise and interests is how to choose the appropriate source of information. The emergence of social media has introduced many distinct sources of data, and as more information becomes available it becomes more difficult to use it efficiently to find what we need.
Many prior studies attempt to infer topics from Twitter by using the content of tweets or users' social relationships. The authors in [1] first suggested that some features, such as Twitter Lists, provide superior results for mining topical expertise and interests. They claim that tweets often consist of mundane conversations that do not accurately indicate users' interests and expertise. Moreover, interests are passive traits of users, so inferring them from user activity (e.g., posted tweets or "follow" relationships) can produce misleading results.

Results extracted via Twitter Lists can benefit different domains, such as personalized advertising, surveys and crowdsourcing systems. For instance, a List-based approach can be used in crowdsourcing systems to find suitable workers for a task and thereby increase the validity of answers.

Current crowdsourcing systems do not profile their workers or classify them by expertise and interests. Crowdsourcing systems use human workers for tasks that cannot be completed automatically and require human perception. As human workers execute tasks, they collect a reward for their effort, which is usually monetary. Examples of popular crowdsourcing systems are Amazon's Mechanical Turk [2] and CrowdFlower [3]. Tasks are generated by requesters and assigned to human workers. Tasks have different levels of difficulty, and some of them require a level of expertise. An example could be a crowdsourcing task that shows two images of cars and asks the worker to choose the more expensive one. If the worker does not have sufficient knowledge of cars, his/her answer will not be valid; it is useless for the purposes of crowdsourcing and should be discarded. Even if this task is assigned to multiple workers with the goal of obtaining quality results by majority voting, the results will be of poor quality if the workers lack the required expertise.
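To make the majority-voting point concrete, the following minimal Python sketch aggregates worker answers by majority vote. The function name `majority_vote` and the answer lists are invented for illustration and are not data or code from this thesis. The sketch only shows that voting cannot recover the correct answer when most workers are guessing.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among the workers' answers."""
    if not answers:
        raise ValueError("no answers to aggregate")
    counts = Counter(answers)
    # most_common(1) returns a one-element list: [(answer, count)].
    return counts.most_common(1)[0][0]


# Hypothetical task: which of two car images ("A" or "B") shows the more
# expensive car? Assume, for this illustration, that the correct answer is "A".
expert_answers = ["A", "A", "B", "A", "A"]
non_expert_answers = ["B", "A", "B", "B", "A"]  # workers who are guessing

print(majority_vote(expert_answers))      # the vote agrees with the assumed truth
print(majority_vote(non_expert_answers))  # the vote settles on the wrong image
```

Under these assumptions, adding more workers of the second kind does not help, which is why the choice of workers matters more than the number of votes.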
The method proposed in this thesis aims to extract the expertise and interests of workers through their Twitter profiles, in order to assign tasks effectively to suitable workers and thus increase the validity of their answers. The biggest challenge in crowdsourcing systems is how to assign tasks to workers optimally so as to obtain the most reliable answers from them, since the risk of malicious spam workers is always present. Spammers execute tasks randomly in order to collect the reward, and their answers should be discarded, since they could negatively influence the results of a task. To reduce such risks and to ensure quality answers, a crowdsourcing system should use the workers' social media profiles, which are publicly available and contain a lot of personal information. In addition to a user's expertise, his or her interests can be inferred as well. This will hopefully also reduce the number of spam workers, since the topic of the assigned tasks will be similar to the inferred interests of that user, who will then probably provide legitimate answers. For instance, if the task requires knowledge about cars, the word "car" can be given as a query.

In this work, we propose a List-based methodology to infer a user's expertise and interests from his/her Twitter profile, in order to find suitable users who are experts or interested in a topic/query. There are two types of Lists: a user can be a member of a Membership List ("member of"), which means that the user was added to a List by another user, or the user can subscribe to a Subscription List ("subscribed to"). The given query can contain keywords from a task's description or, in general, any topic for which we would like to identify expert users. Our results contain the screen names of the Twitter users who are probably experts or interested in this topic. Moreover, we compare the value of each feature used (Membership/Subscription Lists, Bio) for inferring topical expertise and interests.
Our findings indicate that Membership Lists are superior to the Bio feature, as the latter contains many verbs and thus produces noisier results. Also, Membership Lists are more broadly used than Subscription Lists. Another important advantage of this work is that it can be used in crowdsourcing systems to increase the quality of task answers. We suggest that crowdsourcing systems adopt the proposed methodology to profile their workers, which will hopefully reduce spammers and/or random answers.

We analyze user expertise and interests in a dataset containing 1197 active Twitter users. Then we compare the results to a snapshot of user profiles captured later, to discover and analyze trends in expertise/interests. Finally, we compare the quality of the topics extracted via two popular topic modeling algorithms and a naive approach, as well as the contribution of several Twitter features.

The rest of this thesis is organized as follows. Chapter 2 provides background and a summary of related work on mining expertise and interests in social media. Chapter 3 describes the dataset and the proposed methodology for data modeling. The experimental results are presented in Chapter 4. Finally, conclusions and future work are given in Chapter 5.

Chapter 2
Background and Related Work

2.1 Twitter social network

Twitter is a popular micro-blogging site, where users can write short messages, the so-called tweets, and share them publicly. Users in Twitter are encouraged to participate by posting tweets (messages limited to 140 characters, a limit that is reportedly to be expanded to 10,000 characters in the near future), following other users, and liking or re-tweeting other tweets. A Twitter account contains a short description (the so-called Bio section), the number of followers, likes and followees, and the user's Lists. A Twitter List is a curated group of users, assembled by another user who wishes to receive updates from them.
There are two types of Lists in Twitter: a user can be a member of a Membership List ("member of"), which means that he/she was added to a List by another user, or he/she can subscribe to a Subscription List ("subscribed to").
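As a rough illustration of how the two List types could be represented and queried, consider the minimal Python sketch below. The `UserProfile` class, its field names, the sample users and the whole-word keyword match are all assumptions made for this example. It is not the method of this thesis, which relies on text mining techniques rather than literal string matching.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    screen_name: str
    # Names of Lists that other users added this user to ("member of").
    membership_lists: list = field(default_factory=list)
    # Names of Lists that this user chose to follow ("subscribed to").
    subscription_lists: list = field(default_factory=list)


def _mentions(list_names, query):
    """True if any List name contains the query as a whole word (case-insensitive)."""
    query = query.lower()
    return any(query in name.lower().split() for name in list_names)


def match_query(users, query):
    """Return (likely experts, likely interested users) for a one-word query."""
    experts = [u.screen_name for u in users
               if _mentions(u.membership_lists, query)]
    interested = [u.screen_name for u in users
                  if _mentions(u.subscription_lists, query)]
    return experts, interested


# Hypothetical users, invented for this example.
users = [
    UserProfile("alice", membership_lists=["Car reviewers"],
                subscription_lists=["Cooking"]),
    UserProfile("bob", membership_lists=["Tech news"],
                subscription_lists=["Car fans"]),
    UserProfile("carol", subscription_lists=["Travel"]),
]

experts, interested = match_query(users, "car")
print(experts)     # ['alice']
print(interested)  # ['bob']
```

Membership Lists are read here as a signal of expertise because other users assign them, while Subscription Lists are read as a signal of interest because users choose them for themselves. A literal match of this kind misses variants such as "cars", so it should be taken only as a picture of the data involved.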