Harvey Mudd College at SemEval-2019 Task 4: The D.X. Beaumont Hyperpartisan News Detector

Evan Amason, Jake Palanker, Mary Clare Shen, Julie Medero
Harvey Mudd College
301 Platt Boulevard, Claremont, CA 91711
[email protected]  [email protected]  [email protected]  [email protected]

Abstract

We use the 600 hand-labelled articles from SemEval Task 4 (Kiesel et al., 2019) to hand-tune a classifier with 3000 features for the Hyperpartisan News Detection task. Our final system uses features based on bag-of-words (BoW), analysis of the article title, language complexity, and simple sentiment analysis in a naive Bayes classifier. We trained our final system on the 600,000 articles labelled by publisher. Our final system has an accuracy of 0.653 on the hand-labeled test set. The most effective features are the Automated Readability Index and the presence of certain words in the title. This suggests that hyperpartisan writing uses a distinct writing style, especially in the title.

1 Introduction

Hyperpartisan news is becoming more mainstream as online sources gain popularity. Hyperpartisan news is news written from an extremely partisan perspective, such that the goal is reinforcing existing belief structures in the party's ideology rather than conveying facts. Such hyperpartisan writing tends to amplify political divisions and increase animosity between opposing political ideologies. Hyperpartisan news sources also output fake news at startling rates (Silverman et al., 2016). Automatic detection of fake news is difficult, but detecting hyperpartisan news can help, and it can also expose biases in journalism. This task is challenging to automate because it is difficult even for humans: fake and biased news articles get shared on social media at high rates, and even labels that were hand-generated by professionals have errors (Silverman et al., 2016).

We attempt to use various features of political news articles to train a multinomial naive Bayes classifier to complete this task. We use a set of bag-of-words (BoW) features for words appearing in the title of each article, and for words appearing in the article text. With these features, we identified a set of words that characterize hyperpartisan writing. We also considered complexity features such as type-to-token ratio and the Automated Readability Index. Based on the performance of these features, we attempt to answer the question of whether hyperpartisan writing is more or less complex than non-hyperpartisan writing.

A successful classifier could be very useful in today's society. For example, it could be used to create a browser plug-in that checks online articles for political bias in real time as the user reads. People on social media could use it to verify the legitimacy of a political article before sharing it with their followers. Encouraging people to share factual news rather than inflammatory hyperpartisan articles would hopefully improve communication between opposing parties and create a more informed population.

The rest of this paper begins with a description of previous work on the related task of fake news detection in Section 2. We then describe our model and features in Section 3, and our results in Section 4. Section 5 discusses some lessons learned with respect to which features are most useful in identifying hyperpartisan news, and Section 6 closes with a brief description of our system's namesake, fictional magazine editor D.X. Beaumont.

2 Previous Work

Since the 2016 election, there has been a great deal of interest in fake news, which is closely related to the hyperpartisan news we focus on. Our approach to the hyperpartisan news task leverages lessons learned in prior work on fake news detection, and explores the extent to which that work is successful in a different but related task. Fake news detection has been widely studied (e.g., the survey paper by Fuhr et al. (2018)), and we base many of our classifier's features on previous studies of fake news.

The content of fake and real news articles differs substantially. Fake news articles have been found to require a lower reading level than real news articles, to be less technical, and to use more personal pronouns. Further, their titles tend to be longer, use more proper nouns, and use more words that are all capitalized (Horne and Adali, 2017). Our work differs in that we were trying to determine whether an article is hyperpartisan, which is similar to but not the same as identifying fake news articles. In particular, a hyperpartisan news article may be factually correct (i.e., not contain any mistruths) but still be written with a hyperpartisan slant. We hypothesize, nonetheless, that the stylistic features that distinguish between real and fake news may be useful in identifying hyperpartisan news articles. Potthast et al. also showed that there are significant stylistic differences between hyperpartisan and mainstream news articles (Potthast et al., 2017). Consequently, we include reading level and features of each article's title as features in our model. The success of these features in identifying fake news motivates our decision to focus on article titles as a differentiating feature, and to include reading level in the set of features available to our model.

Pérez-Rosas et al. also examine fake news articles to create a classifier for them (Pérez-Rosas et al., 2018). Their results identify additional features related to text readability, with fake news articles tending to be written at a lower reading level than real news articles. We incorporate features from their work, including Average Word Length, Type-Token-Ratio, and the SMOG Readability Formula.

3 Methodology

Each article's content and title was tokenized using spaCy's default English model (AI, 2016–). We use a multinomial naive Bayes classifier from scikit-learn, extracting a large number of features and then using feature selection to reduce the number of features available to our classifier.
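To make this setup concrete, the sketch below shows one way the tokenization and classifier could be wired together with spaCy and scikit-learn. The model name "en_core_web_sm", the helper function, and the variable names are assumptions for illustration only; the paper does not publish its code.

    # A minimal sketch of the tokenization and classifier setup described in
    # Section 3. The model name and helper names are assumptions; the paper
    # only says it uses spaCy's default English model and scikit-learn.
    import spacy
    from sklearn.naive_bayes import MultinomialNB

    nlp = spacy.load("en_core_web_sm")

    def tokenize(text):
        """Return the spaCy tokens for an article body or title."""
        return [token.text for token in nlp(text)]

    # X is an (n_articles, n_features) matrix of the features from Section 3.1;
    # y holds the binary hyperpartisan labels.
    classifier = MultinomialNB()
    # classifier.fit(X_train, y_train)
    # predictions = classifier.predict(X_test)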
3.1 Features

We make use of features related to the words in the article as a whole, the title of the article, sentiment, and text complexity.

Bag of Words Features: Using a vocabulary of 30,000 words, we count the number of times each vocabulary word occurs in the full article text. We then drop a fixed number of stop words, selected automatically by frequency. We experimented with both 50 and 100 stop words, and the run of our system that was submitted to the SemEval task used 50 stop words.

Title Bag of Words: Next, using the same vocabulary but without excluding stop words, we add word counts for the title of the article. We also count the number of words in the title that are entirely capitalized, generally a feature of hyperpartisan titles (Horne and Adali, 2017).

Sentiment Analyzer: We use two sentiment lexicons (Hu and Liu, 2004). The first contains 2000 words with positive sentiment, and the second contains 4000 words with negative sentiment. We count the occurrences of words from each list, hypothesizing that hyperpartisan articles will likely have many more words with polarized sentiment than non-hyperpartisan articles.

Complexity Features: Finally, we include features designed to capture the articles' complexity. This category includes features such as Average Word Length, Type-Token-Ratio, and the SMOG Readability Formula. Each of these is designed to capture the complexity of a given text: Average Word Length gives us insight into vocabulary choices and the use of "advanced" words, Type-Token-Ratio measures the proportion of "novel" words in the text, and the SMOG Readability Formula is based on the number of polysyllabic words per sentence (which is influenced both by vocabulary choice and by sentence length). Since prior work shows that hyperpartisan articles are often written at an easier reading level, with more repeated words and simpler sentence structure, we expect these complexity features to be useful in identifying hyperpartisan articles.
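The count-based features above reduce to simple lexicon lookups. The following sketch illustrates them under the assumption that the 30,000-word vocabulary, the frequency-based stop-word list, and the Hu and Liu lexicons have already been loaded; none of the names come from the authors' code.

    # A rough sketch of the count-based features in Section 3.1. `vocabulary`
    # is assumed to be a fixed-order list of 30,000 words; all helper names
    # are illustrative, not the authors' released code.
    from collections import Counter

    def body_bow_features(tokens, vocabulary, stop_words):
        """Per-word counts over the article text, skipping frequent stop words."""
        counts = Counter(t.lower() for t in tokens
                         if t.lower() in vocabulary and t.lower() not in stop_words)
        return [counts[word] for word in vocabulary]

    def title_features(title_tokens, vocabulary):
        """Title word counts plus the number of fully capitalized title words."""
        counts = Counter(t.lower() for t in title_tokens if t.lower() in vocabulary)
        all_caps = sum(1 for t in title_tokens if t.isalpha() and t.isupper())
        return [counts[word] for word in vocabulary] + [all_caps]

    def sentiment_features(tokens, positive_words, negative_words):
        """Counts of positive and negative lexicon words (Hu and Liu, 2004)."""
        lowered = [t.lower() for t in tokens]
        return [sum(1 for t in lowered if t in positive_words),
                sum(1 for t in lowered if t in negative_words)]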
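The complexity features have standard closed-form definitions. The sketch below computes Average Word Length, Type-Token-Ratio, the SMOG grade, and the Automated Readability Index highlighted in the abstract; the syllable counter is a crude vowel-group heuristic used only for illustration, since the paper does not specify its implementation.

    # Illustrative complexity features; the syllable heuristic and helper names
    # are assumptions, not the authors' implementation.
    import math
    import re

    def count_syllables(word):
        """Very rough syllable estimate: the number of vowel groups."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def complexity_features(sentences):
        """`sentences` is a list of token lists for one article."""
        words = [w for sentence in sentences for w in sentence if w.isalpha()]
        n_words = max(1, len(words))
        n_sentences = max(1, len(sentences))
        n_chars = sum(len(w) for w in words)

        average_word_length = n_chars / n_words
        type_token_ratio = len(set(w.lower() for w in words)) / n_words

        # SMOG grade: based on polysyllabic (3+ syllable) words per sentence.
        polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
        smog = 1.0430 * math.sqrt(polysyllables * (30 / n_sentences)) + 3.1291

        # Automated Readability Index: characters per word and words per sentence.
        ari = 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43

        return [average_word_length, type_token_ratio, smog, ari]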
3.2 Feature Selection

The above feature space was very large compared to the number of available articles, so we implemented two different methods of feature selection: one using variance, and one using a χ² test. In each case, we compute statistics on the training set that describe which features are the most distinguishing. Given these statistics, we score each feature and select a subset of the total feature set using either a threshold score or a target feature count. By experimenting on the smaller hand-labeled data set, we found that reducing to the best 3000 features maximized our performance under 10-fold cross-validation. This modification was made after the evaluation, however; our results on the SemEval task represent the performance of our system without feature selection.

Feature       Category     χ²      p-value
"trump"       Polarity     416     1.77e-92
A.R.I.        Complexity   377     3.67e-84
"*"           Title        208     2.95e-47
"class"       Title        179     9.45e-41
"american"    Title        170     7.41e-39
"most"        Title        143     5.04e-33
"political"   Title        137     1.08e-31
"israel"      Title        133     1.16e-30
"like"        Polarity     128.7   7.85e-30
"these"       Title        126     2.92e-29

Table 2: Highest ranked features from our hand-labeled data set.
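One compact way to reproduce this kind of selection is scikit-learn's SelectKBest with the chi2 scoring function, which also exposes per-feature scores and p-values of the kind reported in Table 2. The sketch below is an assumption about how such a pipeline could be assembled, not the authors' actual code.

    # A sketch of chi-squared feature selection with 10-fold cross-validation,
    # using scikit-learn. The choice of SelectKBest/VarianceThreshold and all
    # variable names are assumptions; the paper does not name specific classes.
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # X: (n_articles, n_features) non-negative feature matrix; y: binary labels.
    pipeline = make_pipeline(
        SelectKBest(chi2, k=3000),   # keep the 3000 highest-scoring features
        MultinomialNB(),
    )
    # scores = cross_val_score(pipeline, X, y, cv=10)   # 10-fold cross-validation

    # chi2(X, y) by itself returns per-feature scores and p-values, the kind of
    # statistics shown in Table 2. A variance-based alternative could use, e.g.,
    # VarianceThreshold(threshold=0.01).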