Kukulage, Charika Samangi Perera and Attestog, Targeir: Thesis
Mapping Extremist Forums using Text Mining

Targeir Attestog, Samangi Perera

Supervisor
Professor Ole-Christoffer Granmo

This Master’s Thesis is carried out as a part of the education at the University of Agder and is therefore approved as a part of this education. However, this does not imply that the University answers for the methods that are used or the conclusions that are drawn.

University of Agder, 2013
Faculty of Engineering and Science
Department of Information and Communication Technology

Abstract

Political opinions far from what is considered normal, a distorted view of reality, and hatred of certain other groups are widespread among political extremists such as Islamists and White Supremacists. Demonstrations and violence carried out by some members of these groups are well known from the mass media and receive a lot of attention. Islamists and right-wing extremists exploit this attention to spread their message to ordinary people. In online forums, young, curious people can read detailed information (or propaganda) from extremists. Which words, then, do extremists use to convince each other, and other curious readers, that what they stand for is right?

The goal of this thesis is first to find algorithms or techniques for discovering the characteristic vocabulary of online extremist forums, as well as words that are frequently used together in the same forum message. We then analyse the results to find patterns in the typical vocabulary of the different forums. Mapping extremists’ vocabulary habits can help us better understand how extremists write in online extremist forums, and possibly also help us recognize them when they write on other websites.

In this thesis, we find frequent and characteristic words by means of Global Term Frequency (GTF), and pairs of co-occurring words by means of the odds ratio, in different extremist forums. We compare the normalized GTF (NGTF) of words in two forums to find out in which forum they are used most.
Words used in only one of two forums are found as well. We find the GTFs for words written by five of the ten most active authors in each forum, and we find words that one of these authors uses while another of the ten most active authors does not.

The results show that Islamists write mostly about religion, but also to some extent about politics. Some popular words are “allah”, “prophet”, “fasting”, and “hajj”. The right-wing extremist websites Stormfront and Vigrid discuss politics and argue for their own ideology and against mainstream politics. Frequent words in Stormfront are “white”, “jews”, and “race”; in the Norwegian Vigrid website, “jødene” (the Jews), “tyskland” (Germany), and “krigen” (the war). In the German right-wing extremist website Deutsche Stimme, “npd”, “Deutschland” (Germany), “partei” (party), and “volk” (people) are frequent words. Both Islamists and right-wing extremists are preoccupied with family values. Our results are useful for discovering topics that extremists write about in their online forums, topics that other people either do not write about at all or write about from a different point of view.

Preface

This thesis is the result of a master’s thesis project at the Faculty of Engineering and Science at the University of Agder in Grimstad, Norway. After we wrote a pre-project report in late autumn 2012, all the research and thesis writing was done during the spring semester of 2013. The project has the topic code IKT590 and is worth 30 ECTS credit points. It is the final project of a two-year master’s programme in Information and Communications Technology (ICT) with the study profile System Development. The purpose of a master’s thesis project is to immerse oneself in a specific subject area, and it is supposed to be independent research.

The thesis was carried out under the supervision of Professor Ole-Christoffer Granmo. We would like to thank him for his valuable guidance and advice throughout the project. Completing this thesis would not have been possible without his support.
We would also like to thank Associate Professor Morten Goodwin for helping us with the pre-project report in the autumn. We also thank our families and friends for their moral support while we were working on this research.

Targeir Attestog
Samangi Perera
University of Agder
Grimstad, Norway
2 June 2013

Contents

List of Figures
List of Tables
Chapter 1: Introduction
1.1 Problem Discussion
1.2 Hypotheses
1.3 Contributions
1.4 Key Components of our Project
1.5 Structure of Report
Chapter 2: Former work
2.1 Relationship Discovery in Large Text Collections Using Latent Semantic Indexing
2.2 Names: A New Frontier in Text Mining
2.3 Anomaly Detection in Extremist Web Forums Using a Dynamical Systems Approach
2.4 US Domestic Extremist Groups on the Web: Link and Content Analysis
2.5 Applying Authorship Analysis to Extremist-Group Web Forum Messages
2.6 Exploring the Dark Side of the Web: Collection and Analysis of U.S. Extremist Online Forums
2.7 Authorship Attribution of Web Forum Posts
2.8 Combining Entity Matching Techniques for Detecting Extremist Behavior on Discussion Board
2.9 Contributions to our Project
Chapter 3: Method
3.1 Online Forums
3.2 Getting Data
3.2.1 Download of Webpages
3.2.2 Extraction of Data from Webpages
3.3 Forum Text Mining
3.3.1 TF and IDF
3.3.2 GTF and NGTF
3.3.3 Forum Word Count Comparison
3.3.4 Word Colocation Analysis
3.5 Author Text Mining
3.6 Manual Analysis
Chapter 4: Results
4.1 Word Frequency Analysis of Ummah
4.2 Word Frequency Analysis of Islamic Web-Community
4.3 Word Frequency Analysis of Stormfront
4.4 Word Frequency Analysis of Swiss English Forum
4.5 Word Frequency Analysis of the CS:GO Forum in Steam
4.6 NGTF Comparison of Swiss English Forum and Islamic Web-Community
4.7 NGTF Comparison of Islamic Web-Community with Steam
4.8 NGTF Comparison of Swiss English Forum and Ummah
4.9 NGTF Comparison of Steam and Ummah
4.10 NGTF Comparison of Swiss English Forum and Stormfront
4.11 NGTF Comparison of Steam and Stormfront
4.12 NGTF Comparison of Stormfront and Ummah
4.13 Word Frequency Analysis of Vigrid