Analysis and Detection of Low Quality Information in Social Networks
Total Page:16
File Type:pdf, Size:1020Kb
ANALYSIS AND DETECTION OF LOW QUALITY INFORMATION IN SOCIAL NETWORKS A Thesis Presented to The Academic Faculty by De Wang In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science at the College of Computing Georgia Institute of Technology August 2014 Copyright c 2014 by De Wang ANALYSIS AND DETECTION OF LOW QUALITY INFORMATION IN SOCIAL NETWORKS Approved by: Professor Dr. Calton Pu, Advisor Professor Dr. Edward R. Omiecinski School of Computer Science School of Computer Science at the College of Computing at the College of Computing Georgia Institute of Technology Georgia Institute of Technology Professor Dr. Ling Liu Professor Dr. Kang Li School of Computer Science Department of Computer Science at the College of Computing University of Georgia Georgia Institute of Technology Professor Dr. Shamkant B. Navathe Date Approved: April, 21 2014 School of Computer Science at the College of Computing Georgia Institute of Technology To my family and all those who have supported me. iii ACKNOWLEDGEMENTS When I was a little boy, my parents always told me to study hard and learn more in school. But I never thought that I would study abroad and earn a Ph.D. degree at United States at that time. The journey to obtain a Ph.D. is quite long and full of peaks and valleys. Fortunately, I have received great helps and guidances from many people through the journey, which is also the source of motivation for me to move forward. First and foremost I should give thanks to my advisor, Prof. Calton Pu. He is not only a great research mentor guiding and inspiring me to diving deeply in the area of cyber security, but also an awesome friend who is willing to share his rich experience in life and study. He encouraged and trained me to be a great researcher by pushing me to exceed my own boundaries and excel in all aspects of what I did during the Ph.D. I really appreciate his patience on my research progress, as well as his amazing comments along my exploration within the area of cyber security. His underlying gentleness and dedication to research will continue to have a significant impact on my future. Thanks to members of my thesis committee: Prof. Ling Liu, Prof. Shamkant B. Navathe, Prof. Edward R. Omiecinski, and Prof. Kang Li for reading and comment- ing on my dissertation. I really appreciate all the questions and challenges that led to an improved dissertation. To Dr. Danesh Irani, your drive and dedication inspire me, and I will not forget your guidance and patience with a junior Ph.D. student entering the area of cyber security. I thank you for the early direction in research and constant guidances. To Dr. Qinyi Wu, you are a great friend, and I will not forget your helps on my life and iv study. The friendship, companionship, and support of my colleagues at Georgia Tech would be hard to replace. Special thanks to Rocky Dunlap, Qingyang Wang, Aibek Musaev, Yi Yang, Binh Han, Yuzhe Tang, Kisung Lee, Jack Li, Junhee Park, and Kunal Malhotra for being around in the lab. Rocky is a great lab manager. Qingyang is such a good friend to hang around and discuss about research. Aibek and Yi are valuable colleagues to work with. Binh, Yuzhe, and Kisung are my study buddies. Jack, Junhee and Kunal are good labmates. More importantly, I would like to acknowledge the unconditional love and support from my family along the way. To my father, Chuanli Wang, I want to be as good a father as you. To my mom, Xiaoling Wu, thanks for giving me birth. To my wife, Jing Chen, I cannot go this far without you. To my little son, Eric J. Wang, I hope you will obtain a Ph.D. as well. I love you all dearly and this is for you! v TABLE OF CONTENTS DEDICATION .................................. iii ACKNOWLEDGEMENTS .......................... iv LIST OF TABLES ............................... xi LIST OF FIGURES .............................. xiii LIST OF SYMBOLS OR ABBREVIATIONS .............. xvi SUMMARY .................................... xvi I INTRODUCTION ............................. 1 1.1 Dissertation Statement and Dissertation Contributions . .4 1.2 Organization of the Dissertation . .5 II SPADE: A SOCIAL-SPAM ANALYTICS AND DETECTION FRAME- WORK .................................... 10 2.1 Related Work . 12 2.2 Motivation . 14 2.3 Social-Spam Analytics and Detection Framework . 15 2.3.1 Mapping and Assembly . 15 2.3.2 Pre-filtering . 18 2.3.3 Features . 18 2.3.4 Classification . 19 2.4 Experimental Plan . 21 2.4.1 Datasets . 21 2.4.2 Experiment Implementation . 23 2.4.3 Evaluation . 26 2.5 Experimental Results . 27 2.5.1 Baseline . 27 2.5.2 Cross Social-Corpora Classification . 28 2.5.3 Associative classification . 36 vi 2.5.4 System Performance Evaluation . 39 2.5.5 Discussion . 41 2.5.6 Countermeasures for Spam Evolution . 43 2.6 Conclusion . 45 III EVOLUTIONARY STUDY OF WEB SPAM ............ 47 3.1 Motivation . 49 3.2 Webb Spam Corpus 2011 . 50 3.2.1 Data Collection . 50 3.2.2 Data Cleansing . 53 3.2.3 Data Statistics . 55 3.3 Comparison between Two Datasets . 56 3.3.1 Redirections . 56 3.3.2 HTTP Session Information . 61 3.3.3 Content . 65 3.4 Classification Comparison . 69 3.4.1 Feature Generation and Selection . 69 3.4.2 Classifiers . 71 3.4.3 Classification Setup and Cross validation . 71 3.4.4 Result Analysis . 72 3.4.5 Computational Costs . 74 3.4.6 Discussion . 75 3.5 Related Work . 77 3.6 Conclusion . 79 IV EVOLUTIONARY STUDY OF EMAIL SPAM ........... 81 4.1 Motivation . 83 4.2 Data Collection . 83 4.3 Data Analysis . 85 4.3.1 Content Analysis . 85 vii 4.3.2 Topic Modeling . 89 4.3.3 Network Analysis . 92 4.4 Discussion . 98 4.5 Related Work . 99 4.6 Conclusion . 101 V CLICK TRAFFIC ANALYSIS OF SHORT URL FILTERING ON TWITTER .................................. 103 5.1 Motivation . 104 5.2 Data Collection . 105 5.2.1 Collection Approach . 105 5.2.2 Datasets . 106 5.2.3 Data Labeling . 108 5.3 Data Analysis . 109 5.3.1 Creator Analysis . 109 5.3.2 Click Source Analysis . 112 5.3.3 Country Source Analysis . 112 5.3.4 Referrer Source Analysis . 113 5.4 Classification . 115 5.4.1 Classification Features . 115 5.4.2 Machine Learning Classifiers . 116 5.4.3 Classification Setup and Cross Validation . 116 5.5 Evaluation . 117 5.5.1 Evaluation Results . 117 5.6 Discussion . 120 5.6.1 Limitations . 120 5.6.2 Challenges and Possible Countermeasures . 121 5.7 Related Work . 122 5.8 Conclusion . 124 viii VI BEAN: A BEHAVIOR ANALYSIS APPROACH OF URL SPAM FILTERING IN TWITTER ....................... 126 6.1 Preliminaries . 127 6.1.1 Objects in Twitter Trending Topics . 127 6.1.2 Relationships among Objects . 128 6.2 Data Collection and Analysis . 129 6.2.1 Statistics of URLs . 129 6.2.2 Data Labeling . 130 6.2.3 Analysis of URL Spam . 130 6.3 Overview . 133 6.3.1 Problem Definition . 133 6.3.2 Overall Procedure . 136 6.4 Detailed Methodology . 137 6.4.1 Markov Chain Model . 137 6.4.2 Converting to Classifier from Markov Chain Model . 138 6.4.3 Common Functions . 139 6.5 Experimental Evaluation . 141 6.5.1 Tuning Parameters . 141 6.5.2 Dataset and Experimental Setup . 142 6.5.3 Performance Comparison . 143 6.5.4 Anomalous patterns . 144 6.6 Related Work . 145 6.7 Conclusion . 147 VII INFORMATION DIFFUSION ANALYSIS OF RUMOR DYNAM- ICS OVER A SOCIAL-INTERACTION BASED MODEL .... 148 7.1 Related Work . 149 7.2 Social-interaction based Model . 151 7.2.1 Beyond Simple Graph . 151 7.2.2 FAST Model . 152 ix 7.3 Influential Spreaders in Information Diffusion . 155 7.3.1 Existing Metrics . 155 7.3.2 FD-PCI: Fractional and Directed PCI . 156 7.3.3 Spreading Capability . 157 7.3.4 Experimental Evaluation . 158 7.4 Influential Features in Rumor Detection . 160 7.4.1 Rumor Collection . ..