
MisInfoWars: A Linguistic Analysis of Deceptive and Credible News

by
Emilie Francis
B.A., University of Victoria, 2013

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Arts
in the Department of Linguistics
Faculty of Arts and Social Sciences

© Emilie Francis 2018
SIMON FRASER UNIVERSITY
Summer 2018

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Emilie Francis
Degree: Master of Arts (Linguistics)
Title: MisInfoWars: A Linguistic Analysis of Deceptive and Credible News
Examining Committee:
    Chair: Ashley Farris-Trimble, Assistant Professor
    Maite Taboada, Senior Supervisor, Professor
    Trude Heift, Supervisor, Professor
    Fred Popowich, Professor, School of Computing Science, Simon Fraser University
Date Defended: July 31, 2018

Abstract

Misinformation, bias, and deceit, clandestine or not, are a pervasive and continual problem in media. Real-time mass communication through online media such as news outlets, Twitter, and Facebook, has extended the reach of deceptive information, and increased its impact. The concept of fake news has existed since before print, but has acquired renewed attention due to its perceived influence in the 2016 U.S. Presidential election. Previous studies of fake news have revealed much about why it is produced, how it spreads, and what measures can be taken to combat its rising influence. Despite the continued interest in fake news, current research on the language of deceptive media has been largely superficial. This thesis serves to provide a profound understanding of the stylistic and linguistic features of fake news by comparing it to its credible counterpart. In doing so, it will advocate for differentiation between disingenuous and respectable media based on linguistic variation.
With a dataset of approximately 80,000 articles from known fake and legitimate news sources, specific stylistic differences will be examined for saliency and significance. Using multidimensional analysis for discourse variation established by Biber (1988), this thesis will confirm that there exist sufficient textual differences between the articles of fake news and credible news to consider them distinct varieties. Detecting misinformation has not proven simple, nor has minimizing its reach. As the ambition of fake news articles is to appear authentic, acquiring knowledge of the subtleties which serve to discriminate realism from fabrication is crucial. A better understanding of the linguistic composition of deception and fabrication in comparison to credibility and veracity will facilitate future attempts at both manual and automatic detection.

Keywords: Fake News, News Text, Corpus Linguistics, Multidimensional Analysis

Table of Contents

Approval
Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
    1.1 Past and Present Research in Deceptive News
    1.2 The Economics of Misinformation
    1.3 What is Fake News?
    1.4 A Text-based Approach to Understanding Deceptive News
2 Disinformation: The Language of Lies, Spread of Misinformation, and Deception Detection
    2.1 Deception in Speech and Writing
    2.2 How Misinformation Spreads
    2.3 Deceptive News and Machine Learning
        2.3.1 A Soft Introduction to Machine Learning Approaches Used in Deception Classification
        2.3.2 Automatic Deception Detection
    2.4 Human Deception Detection
    2.5 Harmonizing Previous Research and the Current Research
3 Methodology and Data
    3.1 Multidimensional Analysis and Varieties of Text
        3.1.1 Register, Genre, Style, and Text Type
        3.1.2 Using Linguistic Variables to Determine Dimensions of Text
        3.1.3 Dimensions and Text Types
        3.1.4 MDA in Application
    3.2 Data
    3.3 Analysis
4 Results and Discussion
    4.1 General Findings
        4.1.1 Statistical Significance of the Differences between Dimension Scores and Z-Scores
    4.2 Explanation of Scoring
        4.2.1 Dimension Scores
        4.2.2 Z-Scores
    4.3 Dimension Score Results
        4.3.1 Dimension One: Involved vs. Informational
        4.3.2 Dimension Two: Narrative vs. Non-narrative
        4.3.3 Dimension Three: Context Dependency
        4.3.4 Dimension Four: Overt Expression of Persuasion
        4.3.5 Dimension Five: Abstractness
    4.4 Correlations Amongst Dimension Scores
        4.4.1 Dimension Correlations within the Deceptive News Corpus
        4.4.2 Dimension Correlations within the Credible News Corpus
    4.5 Z-Scores Results
        4.5.1 Salient Linguistic Variables
        4.5.2 Small Numbers Big Differences
    4.6 A Note on Correlation
        4.6.1 Correlations Attributed to Grammatical Patterns
        4.6.2 Correlations within the Deceptive News Corpus
        4.6.3 Correlations within the Credible News Corpus
        4.6.4 Summary of Correlations
    4.7 Comparing Corpora with Established Varieties of Text
        4.7.1 Deceptive News, Credible News, and Common Text Types
        4.7.2 Comparing News Subtypes
5 General Discussion and Conclusion
    5.1 Summary of Results
    5.2 Future Work
    5.3 Concluding Remarks
Bibliography
Appendix A Z-Score Variables
Appendix B Text Type Comparisons

List of Tables

Table 3.1 Biber’s eight text types, their typical dimension scores, and examples.
Table 3.2 List of sources used in the credible news corpus, the number of articles per source, and total.
Table 3.3 List of sources used in the deceptive news corpus, the number of articles per source, and total.
Table 4.1 Mean dimension scores for dimension one through five. These numbers represent the average dimension score across all articles within the corpus.
Table 4.2 Mean dimension score per corpus and mean z-scores per corpus of features associated with dimension one.
Table 4.3 Mean dimension score per corpus and the mean z-scores of features associated with dimension two.
Table 4.4 Mean dimension score per corpus and the mean z-score of the feature most closely associated with dimension three.
Table 4.5 Mean dimension score per corpus and the mean z-scores of the features most closely associated with dimension four.
Table 4.6 Mean dimension score per corpus and the mean z-scores of the features most closely associated with dimension five.
Table 4.7 The average scores per salient linguistic variable for each corpus.
Table 4.8 Scores of variables with an LSD threshold of ≥ 0.5 not flagged by MAT.

List of Figures

Figure 1.1 An image from a Buzzfeed quiz on misinformation (Lytvynenko, 2017).
Figure 1.2 The tweet from Twitter channel of CNN Reporter Oliver Darcy (A. B. Wang, 2017) on the ABC News Scandal.
Figure 2.1 Guidelines created by the International Federation of Library Associations and Institutions (IFLA) based on a FactCheck.org article (Kiely & Robertson, 2016).
Figure 4.1 The results for the deceptive news corpus from the r correlation measure displayed by dimension.
Figure 4.2 The results for the credible news corpus from the r correlation measure displayed by dimension.
Figure 4.3 Scatterplot matrix showing correlations within the deceptive news corpus. The legend for the variables: AWL - average word length, JJ - attributive adjectives, NOMZ - nominalization, PIN - prepositional phrases, VPRT - present tense verbs.
Figure 4.4 Scatterplot matrix showing the same variables in Figure 4.3 for the credible news corpus in contrast.
Figure 4.5 Scatterplot matrix showing correlations within the credible news corpus. The legend for the variables: NN - total nouns, TPP3 - third person pronouns, VBD - past tense verbs, XX0 - analytic negation, [CONT] - contractions.
Figure 4.6 Scatterplot matrix showing the same variables as in 4.5 within the deceptive news corpus as contrast.
Figure 4.7 Error bar plot showing the maximum and minimum scores for dimension one for subtypes of news and editorial.
Figure 4.8 Error bar plot showing the maximum and minimum scores for dimension two for subtypes of news and editorial.
Figure 4.9 Error bar plot showing the maximum and minimum scores for dimension three.
Figure 4.10 Error bar plot showing the maximum, minimum, and mean scores for dimension four.
Figure 4.11 An error bar plot showing the maximum, minimum, and mean scores for dimension five for ten subtypes of news and editorial.

Chapter 1
Introduction

Recently there has been growing concern around the validity of sources of information and the influence such information has on society. After the 2016 United States presidential election, the effect of fictitious and biased information on public opinion has been called into question. The bulk of this criticism has focused on social media, considering it a major vehicle for malicious misinformation and influencing public opinion (Spinney, 2017). The social media site Facebook has found itself facing the majority of the blame for the prevalence and influence of misinformation during the 2016 electoral cycle. The website has been accused of abetting the spread of misinformation (Isaac, 2016), possibly contributing to the election of Donald Trump (Parkinson, 2016; Read, 2016). In 2016, 62% of American adults got their news from social media, specifically from Reddit, Facebook, and Twitter (Gottfried & Shearer, 2016). Facebook is reported to reach 67% of American adults, equating the users who consume news through the site to 44% of the population (Gottfried & Shearer, 2016). As these surveys were conducted in 2016, these numbers are likely to be even higher at present.