Detecting Vandalism on Wikipedia Across Multiple Languages
Total Page:16
File Type:pdf, Size:1020Kb
Detecting Vandalism on Wikipedia across Multiple Languages Khoi-Nguyen Dao Tran A thesis submitted for the degree of Doctor of Philosophy The Australian National University May 2015 c Khoi-Nguyen Dao Tran 2015 Except where otherwise indicated, this thesis is my own original work. Khoi-Nguyen Dao Tran May 5, 2015 To my family for your enduring love and support. To my friends for sharing your life and experiences with me. To my supervisors for your inspiration, guidance, patience, and feedback. To the virtual online worlds and gaming communities that are always welcoming with new friendships, experiences, and knowledge. And to Wikipedia and its supporters for providing the conditions and foundations that have made this thesis possible. Thank you :-) Acknowledgments I thank my family for their enduring love and support throughout my life. There are no words that can express my gratitude for Dat Tran (Dad), Phuong Dao (Mum), Thao Tran (Sister), and our loving dogs Wiggy and Pepper. I thank my friends for sharing their life, experiences, and time with me; and for helping me through my hardest times while studying for my bachelor and postgrad- uate degrees at our university. I have spent many hours interacting with friends in online communities and spent countless more with real life friends. Many friends have left parts of their personality with me, and have helped me grow and appreciate life for all its ups and downs. In particular, I thank Lachlan Horne, Jimmy Thomson, and Mayank Daswani for the uncountable number of hours we have spent talking and hanging out, and for the life lessons that I have learnt from each of them. I thank my main supervisor, Peter Christen, for his inspiration, guidance, pa- tience, and prompt feedback for all aspects of my research. I have been extremely lucky and grateful to have a supervisor who is passionate about his work and the work of his students. I cannot thank him enough for the time he has taken out of his busy schedule to meet, review, and discuss my written work, and to attend my presentations and provide feedback. My research work, skills, and confidence have improved immensely under his supervision. I also thank my co-supervisors Scott Sanner and Lexing Xie for their continued support and providing guidance when I felt lost in pursuing a solution. I thank them for attending my presentations, and making time for meetings and providing feedback on my written work despite their busy schedule. In addition, I thank my supervisors, Mamoun Alazab and Roderick Broadhurst, at the cybercrime laboratory for providing me with an opportunity to extend my research work and skills to cybercrime detection, making Chapter 8 possible. The re- search in Chapter 8 is funded by an ARC Discovery Grant on the Evolution of Cyber- crime (DP 1096833), the Australian Institute of Criminology (Grant CRG 13/12-13), and the support of the ANU Research School of Asia and the Pacific. We also thank the Australian Communications and Media Authority (ACMA) and the Computer Emergency Response Team (CERT) Australia for their assistance in the provision of data and support. I have been part of an amazing research group with many interesting research topics. I thank them for their friendship, encouragements, support, and participation in my presentations and research work. Finally, I thank the Australian National University (ANU) and the Research School of Computer Science for providing a welcoming environment for study and pursu- ing research. I have made many friends and have had many interesting, fun, and memorable experiences at the ANU. vii Abstract Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13 year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller manually inspected sets of revisions for research show vandalism may appear in 7% to 11% of all revisions of English Wikipedia articles. The persistent threat of vandalism has led to the development of automated programs (bots) and editing assistance programs to help editors detect and repair vandalism. Research into improving vandalism detection through application of machine learning techniques have shown significant improvements to detection rates of a wider variety of vandalism. However, the focus of research is often only on the English Wikipedia, which has led us to develop a novel research area of cross-language vandalism detection (CLVD). CLVD provides a solution to detecting vandalism across several languages through the development of language-independent machine learning models. These models can identify undetected vandalism cases across languages that may have insufficient identified cases to build learning models. The two main challenges of CLVD are (1) identifying language-independent features of vandalism that are common to mul- tiple languages, and (2) extensibility of vandalism detection models trained in one language to other languages without significant loss in detection rate. In addition, other important challenges of vandalism detection are (3) high detection rate of a va- riety of known vandalism types, (4) scalability to the size of Wikipedia in the number of revisions, and (5) ability to incorporate and generate multiple types of data that characterise vandalism. In this thesis, we present our research into CLVD on Wikipedia, where we identify gaps and problems in existing vandalism detection techniques. To begin our thesis, we introduce the problem of vandalism on Wikipedia with motivating examples, and then present a review of the literature. From this review, we identify and address the following research gaps. First, we propose techniques for summarising the user ac- tivity of articles and comparing the knowledge coverage of articles across languages. Second, we investigate CLVD using the metadata of article revisions together with article views to learn vandalism models and classify incoming revisions. Third, we propose new text features that are more suitable for CLVD than text features from the literature. Fourth, we propose a novel context-aware vandalism detection tech- nique for sneaky types of vandalism that may not be detectable through constructing features. Finally, to show that our techniques of detecting malicious activities are not limited to Wikipedia, we apply our feature sets to detecting malicious attachments and URLs in spam emails. Overall, our ultimate aim is to build the next genera- tion of vandalism detection bots that can learn and detect vandalism from multiple languages and extend their usefulness to other language editions of Wikipedia. ix x List of Publications Below is our list of publications and their relevance to this thesis. We include ad- ditional information on the quality of the conferences and journals (or transactions) from their Web sites, email communications, and the Computing Research and Edu- cation Association (CORE) of Australasia1. CORE provides a ranking of publication venues based on feedback from research peers2. The rankings are A* (flagship conference), A (excellent conference), B (good conference), and C (other). 1. Khoi-Nguyen Tran and Peter Christen. Identifying Multilingual Wikipedia Arti- cles based on Cross Language Similarity and Activity. 2013. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Manage- ment (CIKM). CORE ranking of A3. Poster paper (acceptance rate of 37.7%). This paper describes measures that summarise cross-language knowledge cov- erage and within-language user activity of Wikipedia articles in different lan- guages and over time. This work is the basis of Chapter 4 with extensions for this thesis of additional measures, visualisations, and analyses. 2. Khoi-Nguyen Tran and Peter Christen. Cross-Language Prediction of Vandalism on Wikipedia Using Article Views and Revisions. 2013. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). CORE ranking of A4. Full paper (acceptance rate of 11.3%). This paper investigates cross-language vandalism detection using features de- rived from metadata, and compares performance of multiple classifiers. This work is the basis of Chapter 5. 3. Khoi-Nguyen Tran and Peter Christen. Cross-Language Learning from Bots and Users to detect Vandalism on Wikipedia. 2015. In IEEE Transactions of Knowledge and Data Engineering (TKDE). CORE ranking of A5. Full paper (acceptance rate in the first 10 months of 2014 is 13.2%; impact factor in April 2015 of 1.81). This paper investigates cross-language vandalism detection using features de- rived from text data of Wikipedia articles, and provides an analysis of the con- tributions of bots in vandalism detection, which is often ignored by related work. This work is the basis of Chapter 6. 1http://www.core.edu.au 2http://www.core.edu.au/index.php/conference-rankings 3http://103.1.187.206/core/25/ 4http://103.1.187.206/core/1667/ 5http://103.1.187.206/core/2064/ xi xii 4. Khoi-Nguyen Tran, Peter Christen, Scott Sanner, and Lexing Xie. Context-Aware Detection of Sneaky Vandalism on Wikipedia across Multiple Languages. 2015. In Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). CORE ranking of A6. Awarded “Best Student Pa- per” at the PAKDD conference. This paper proposes a novel context-aware detection technique for sneaky vandalism, which are more difficult or ambigu- ous types of vandalism. This work is the basis of Chapter 7. 5. Khoi-Nguyen Tran, Mamoun Alazab, and Roderic Broadhurst. Towards a Fea- ture Rich Model for Predicting Spam Emails containing Malicious Attachments and URLs. 2013. In Proceedings of the 11th Australasian Data Mining Conference (AusDM). CORE ranking of B7. Full paper (acceptance rate of 37%). This paper shows the extension of vandalism detection – the detection of ma- licious activity on Wikipedia – to the detection of malicious content in spam emails from only the email text.