A Stylometric Inquiry into Hyperpartisan and Fake News
Martin Potthast∗, Johannes Kiesel†, Kevin Reinartz†, Janek Bevendorff†, Benno Stein† ∗Leipzig University, †Bauhaus-Universität Weimar webis.de ACL, July 16th, 2018
1 @KieselJohannes 2 @KieselJohannes 3 @KieselJohannes What are Fake News?
Disinformation displayed as news articles
4 @KieselJohannes What are Fake News?
Disinformation displayed as news articles
Image: Claire Wardle, First Draft
5 @KieselJohannes What are Fake News?
Disinformation displayed as news articles
Image: Claire Wardle, First Draft
6 @KieselJohannes A Stylometric Inquiry into Hyperpartisan “News” and “News” in False Context and/or with Content that is Impostered, Manipulated, and/or Fabricated
Martin Potthast∗, Johannes Kiesel†, Kevin Reinartz†, Janek Bevendorff†, Benno Stein† ∗Leipzig University, †Bauhaus-Universität Weimar webis.de ACL, July 16th, 2018
7 @KieselJohannes The Political Spectrum
The left-right political spectrum is a system of classifying political positions, ideologies and parties. Left-wing politics and right-wing politics are often presented as opposed, although either may adopt stances from the other side. [Wikipedia]
Alt-left Left Center Right Alt-right
8 @KieselJohannes The Political Spectrum
The left-right political spectrum is a system of classifying political positions, ideologies and parties. Left-wing politics and right-wing politics are often presented as opposed, although either may adopt stances from the other side. [Wikipedia]
Alt-left Left Center Right Alt-right
Liberal Conservative
9 @KieselJohannes The Political Spectrum
The left-right political spectrum is a system of classifying political positions, ideologies and parties. Left-wing politics and right-wing politics are often presented as opposed, although either may adopt stances from the other side. [Wikipedia]
Alt-left Left Center Right Alt-right
Liberal Conservative
Hyperpartisan Partisan Partisan Hyperpartisan
Partisan: someone with a psychological identification with one major party. [Wikipedia]
10 @KieselJohannes The Political Spectrum
The left-right political spectrum is a system of classifying political positions, ideologies and parties. Left-wing politics and right-wing politics are often presented as opposed, although either may adopt stances from the other side. [Wikipedia]
Alt-left Left Center Right Alt-right
Liberal Conservative
Hyperpartisan Partisan Partisan Hyperpartisan
Partisan: someone with a psychological identification with one major party. [Wikipedia]
News media reporting on politics can be aligned on this spectrum as well.
We are observing an increasing number of hyperpartisan news publishers.
11 @KieselJohannes Fake News and Hyperpartisan News
12 @KieselJohannes Why are Fake News Published by Hyperpartisan Pages?
Image: Claire Wardle, First Draft
13 @KieselJohannes Why are Fake News Published by Hyperpartisan Pages?
Image: Claire Wardle, First Draft
14 @KieselJohannes Fake News Detection Taxonomy of Approaches
Knowledge-based Fake news detection Knowledge-based (also called fact checking) K Requires political knowledge base Etzioni et al., 2018 K Information retrieval Magdy and Wanas, 2010 Unavailable ahead of time Ginsca et al., 2015 K We cannot trust the web Wu et al., 2014 Semantic web / LOD Ciampaglia et al, 2015 Shi and Weninger, 2016 Long et al., 2017 Mocanu et al., 2015 Context-based Context-based Acemoglu et al., 2010 Kwon et al., 2013 K Ma et al., 2017 Limited to social media platforms Social network analysis Volkova et al., 2017 K Part of damage already done Budak et al., 2011 Nguyen et al. 2012 Derczynski et al., 2017 Style-based Tambuscio et al., 2015 Style-based Wei et al., 2013 Chen et al., 2015 Deception detection Rubin et al., 2015 K Allows for pre-posting check Wang et al., 2017 Bourgonje et al., 2017 K Real-time reaction possible Afroz et al., 2012 Badaskar et al., 2008 K Hard to mask Rubin et al., 2016 Text categorization Yang et al., 2017 K But are style differences sufficient? Rashkin et al., 2017 Horne and Adali, 2017 Pérez-Rosas et al., 2017
15 @KieselJohannes Fake News and Hyperpartisan News Corpus Construction
16 @KieselJohannes Fake News and Hyperpartisan News Corpus Construction
Orientation Fact-checking results Publisher true mix false n/a Σ Center 80680 12 826 ABC News 90203 95 CNN 295408 307 Politico 421201 424 Left-wing 182 51 158 256 Addicting Info 95 2587 135 Occupy Democrats 59 2570 91 The Other 98% 28101 30 Right-wing 276 153 72 44 545 Eagle Rising 106 47 25 36 214 Freedom Daily 49 24 224 99 Right Wing News 121 82 254 232 Σ 1264 212 87 64 1627
Annotations provided by journalists at BuzzFeed
17 @KieselJohannes Fake News and Hyperpartisan News Selected Results
Orientation Fact-checking results Publisher true mix false n/a Σ Center 80680 12 826 ABC News 90203 95 CNN 295408 307 Politico Fake421 News201 Detection424 Left-wing 182 51 158 256 Precision ≈ 42% Addicting Info 95 2587 135 Occupy Democrats 59 Recall25≈70 41% 91 The Other 98% 28101 30 Right-wing 276 153 72 44 545 Eagle Rising 106 47 25 36 214 Freedom Daily 49 24 224 99 Right Wing News 121 82 254 232 Σ 1264 212 87 64 1627
Annotations provided by journalists at BuzzFeed
18 @KieselJohannes Fake News and Hyperpartisan News Selected Results
Orientation Fact-checking results Publisher true mix false n/a Σ Center 80680 12 826 ABC News 90203 95 CNN 295408 307 Politico Orientation421201 Detection424 Left-wing 182 51 158 256 Precision ≈ 21% Precision ≈ 56% Addicting Info 95 2587 135 Occupy DemocratsRecall ≈ 5920% 2570Recall ≈9159% The Other 98% 28101 30 Right-wing 276 153 72 44 545 Eagle Rising 106 47 25 36 214 Freedom Daily 49 24 224 99 Right Wing News 121 82 254 232 Σ 1264 212 87 64 1627
Annotations provided by journalists at BuzzFeed
19 @KieselJohannes Fake News and Hyperpartisan News Selected Results
Orientation Fact-checking results Publisher true mix false n/a Σ Center 80680 12 826 ABC News 90203 95 CNN 295408 307 Politico Hyperpartisanship421201 Detection424 Left-wing 182 51 158 256 Precision ≈ 69% Addicting Info 95 2587 135 Occupy Democrats 59 Recall2570 ≈ 89%91 The Other 98% 28101 30 Right-wing 276 153 72 44 545 Eagle Rising 106 47 25 36 214 Freedom Daily 49 24 224 99 Right Wing News 121 82 254 232 Σ 1264 212 87 64 1627
Annotations provided by journalists at BuzzFeed
20 @KieselJohannes Fake News and Hyperpartisan News How can it be that the alt left and the alt right cannot be distinguished from the mainstream, when both together (hyperpartisan news) can be?
Alt-left Left Center Right Alt-right
Liberal Conservative
Hyperpartisan Partisan Partisan Hyperpartisan
21 @KieselJohannes Fake News and Hyperpartisan News How can it be that the alt left and the alt right cannot be distinguished from the mainstream, when both together (hyperpartisan news) can be?
Center
Left Right
Partisan
Alt-left Hyperpartisan Alt-right
22 @KieselJohannes Fake News and Hyperpartisan News How can it be that the alt left and the alt right cannot be distinguished from the mainstream, when both together (hyperpartisan news) can be?
Center
Left Right
Partisan
Alt-left Hyperpartisan Alt-right
The horseshoe theory asserts that the alt left and the alt right, rather than being at opposite and opposing ends of a linear political continuum, in fact closely resemble one another, much like the ends of a horseshoe. [Wikipedia] 23 @KieselJohannes Horseshoe Validation Experiment I Leave-out Classification
left-wing center right-wing
24 @KieselJohannes Horseshoe Validation Experiment I Leave-out Classification
left-wing center right-wing
K Classifier is trained to distinguish left-wing and center articles K Right-wing articles are used for testing K Majority of right-wing articles are classified as left-wing rather than center
25 @KieselJohannes Horseshoe Validation Experiment I Leave-out Classification
left-wing center right-wing 74% | 26%
K Classifier is trained to distinguish left-wing and center articles K Right-wing articles are used for testing K Majority of right-wing articles are classified as left-wing rather than center
26 @KieselJohannes Horseshoe Validation Experiment I Leave-out Classification
left-wing center right-wing 74% | 26%
K Classifier is trained to distinguish left-wing and center articles K Right-wing articles are used for testing K Majority of right-wing articles are classified as left-wing rather than center
left-wing center right-wing 34% | 66%
27 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
28 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
A B
29 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
0.3 0.1 0.1 0.2 0.0 0.2 0.5 0.3 0.2 0.2 0.4 0.1 0.1 0.3 0.1 0.4 0.0 0.4 0.1 0.2 0.2 0.3 0.0 0.1 0.1 0.0 0.2 0.3 0.2 0.1 0.4 0.5 0.2 0.2 0.1 0.6 0.3 0.2 0.2 0.2 0.0 0.0 0.3 0.1 0.6 0.0 0.4 0.2 0.3 0.3 0.4 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.1 0.0 0.1 0.2 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.6 0.1 0.1 0.2 0.4 0.6 0.4 0.4 0.2 0.3 0.5 0.1 0.4 0.2 0.5 0.1 0.3 0.2 0.5 0.2 0.2 0.5 0.2 0.2 0.1 0.1 0.0 0.3 0.1 0.2 A B 0.1 0.3 0.2 0.4 0.2 0.1 0.0 0.3 0.0 0.2 0.0 0.1 0.2 0.1 0.0 0.0 A B
30 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
0.3 0.1 0.1 0.2 0.0 0.2 0.5 0.3 0.2 0.2 0.4 0.1 0.1 0.3 0.1 0.4 0.0 0.4 0.1 0.2 0.2 0.3 0.0 0.1 0.1 0.0 0.2 0.3 0.2 0.1 0.4 0.5 0.2 0.2 0.1 0.6 0.3 0.2 0.2 0.2 0.0 0.0 0.3 0.1 0.6 0.0 0.4 0.2 0.3 0.3 0.4 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.1 0.0 0.1 0.2 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.6 0.1 0.1 0.2 0.4 0.6 0.4 0.4 0.2 0.3 0.5 0.1 0.4 0.2 0.5 0.1 0.3 0.2 0.5 0.2 0.2 0.5 0.2 0.2 0.1 0.1 0.0 0.3 0.1 0.2 A B 0.1 0.3 0.2 0.4 0.2 0.1 0.0 0.3 0.0 0.2 0.0 0.1 0.2 0.1 0.0 0.0 A B a a a a a b b b b b
100
90
80
70
60
50 0 6 12 18 24 30
31 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
0.3 0.1 0.1 0.2 0.0 0.2 0.5 0.3 0.2 0.2 0.4 0.1 0.1 0.3 0.1 0.4 0.0 0.4 0.1 0.2 0.2 0.3 0.0 0.1 0.1 0.0 0.2 0.3 0.2 0.1 0.4 0.5 0.2 0.2 0.1 0.6 0.3 0.2 0.2 0.2 0.0 0.0 0.3 0.1 0.6 0.0 0.4 0.2 0.3 0.3 0.4 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.1 0.0 0.1 0.2 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.6 0.1 0.1 0.2 0.4 0.6 0.4 0.4 0.2 0.3 0.5 0.1 0.4 0.2 0.5 0.1 0.3 0.2 0.5 0.2 0.2 0.5 0.2 0.2 0.1 0.1 0.0 0.3 0.1 0.2 A B 0.1 0.3 0.2 0.4 0.2 0.1 0.0 0.3 0.0 0.2 0.0 0.1 0.2 0.1 0.0 0.0 A B a a a a a b b b b b
100
90
80
70
60
50 0 6 12 18 24 30
32 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
0.3 0.1 0.1 0.2 0.0 0.2 0.5 0.3 0.2 0.2 0.4 0.1 0.1 0.3 0.1 0.4 0.0 0.4 0.1 0.2 0.2 0.3 0.0 0.1 0.1 0.0 0.2 0.3 0.2 0.1 0.4 0.5 0.2 0.2 0.1 0.6 0.3 0.2 0.2 0.2 0.0 0.0 0.3 0.1 0.6 0.0 0.4 0.2 0.3 0.3 0.4 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.1 0.0 0.1 0.2 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.6 0.1 0.1 0.2 0.4 0.6 0.4 0.4 0.2 0.3 0.5 0.1 0.4 0.2 0.5 0.1 0.3 0.2 0.5 0.2 0.2 0.5 0.2 0.2 0.1 0.1 0.0 0.3 0.1 0.2 A B 0.1 0.3 0.2 0.4 0.2 0.1 0.0 0.3 0.0 0.2 0.0 0.1 0.2 0.1 0.0 0.0 A B a a a a a b b b b b
100
90
80
70
60
50 0 6 12 18 24 30
33 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
0.3 0.1 0.1 0.2 0.0 0.2 0.5 0.3 0.2 0.2 0.4 0.1 0.1 0.3 0.1 0.4 0.0 0.4 0.1 0.2 0.2 0.3 0.0 0.1 0.1 0.0 0.2 0.3 0.2 0.1 0.4 0.5 0.2 0.2 0.1 0.6 0.3 0.2 0.2 0.2 0.0 0.0 0.3 0.1 0.6 0.0 0.4 0.2 0.3 0.3 0.4 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.1 0.0 0.1 0.2 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.6 0.1 0.1 0.2 0.4 0.6 0.4 0.4 0.2 0.3 0.5 0.1 0.4 0.2 0.5 0.1 0.3 0.2 0.5 0.2 0.2 0.5 0.2 0.2 0.1 0.1 0.0 0.3 0.1 0.2 A B 0.1 0.3 0.2 0.4 0.2 0.1 0.0 0.3 0.0 0.2 0.0 0.1 0.2 0.1 0.0 0.0 A B a a a a a b b b b b
100
90
80
70
60
50 0 6 12 18 24 30
34 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
? = A B
0.3 0.1 0.1 0.2 0.0 0.2 0.5 0.3 0.2 0.2 0.4 0.1 0.1 0.3 0.1 0.4 0.0 0.4 0.1 0.2 0.2 0.3 0.0 0.1 0.1 0.0 0.2 0.3 0.2 0.1 0.4 0.5 0.2 0.2 0.1 0.6 0.3 0.2 0.2 0.2 0.0 0.0 0.3 0.1 0.6 0.0 0.4 0.2 0.3 0.3 0.4 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.1 0.0 0.1 0.2 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.6 0.1 0.1 0.2 0.4 0.6 0.4 0.4 0.2 0.3 0.5 0.1 0.4 0.2 0.5 0.1 0.3 0.2 0.5 0.2 0.2 0.5 0.2 0.2 0.1 0.1 0.0 0.3 0.1 0.2 A B 0.1 0.3 0.2 0.4 0.2 0.1 0.0 0.3 0.0 0.2 0.0 0.1 0.2 0.1 0.0 0.0 A B a a a a a b b b b b
100
90
80
70
60
50 0 6 12 18 24 30
35 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
Typical learning characteristic for . . .
100
90
80
70 different authors (A 6= B) same author (A = B)
% c o r e t l a s i f n 60
50 0 6 12 18 24 30 # eliminated features
36 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
Typical learning characteristic for . . .
100
90
80
70 different authors (A 6= B) same author (A = B)
% c o r e t l a s i f n 60
50 0 6 12 18 24 30 # eliminated features
37 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
Typical learning characteristic for . . .
100
90
Decision: "different" 80
70 different authors (A 6= B) Decision: "same" same author (A = B)
% c o r e t l a s i f n 60
50 0 6 12 18 24 30 # eliminated features
The typical learning characteristic can be learned. § “Meta Learning”
We apply Unmasking to distinguish style genres.
38 @KieselJohannes Horseshoe Validation Experiment II Unmasking [Koppel/Schler 2004]
mainstream vs left 0.6 mainstream vs right left vs right
0.4
0.2 Nomralized accuracy Nomralized 0.0 0 3 6 9 12 15 Iterations
39 @KieselJohannes Summary and Outlook
K Hyperpartisan news pages produce relatively many fake news articles
K Hyperpartisan news can be distinguished quiet well based on style
K Style-based detection allows for real-time detection § Political extremism in news can be ousted or at least flagged
K The style of alt left and alt right news is very similar
K Linguistic evidence for the horseshoe theory of the political spectrum? § Large-scale analysis required
40 @KieselJohannes 41 @KieselJohannes webis.de/events/semeval-19
42 @KieselJohannes Style Model Features
K n-grams with n ∈ [1, 3] of characters, stop words, parts-of-speech K 10 readability scores K Dictionary features based on General Inquirer K Ratios of quoted words, external links, number of paragraphs, and their average length
Feature selection
K Discard word features (n-gram features) occurring in less than 2.5% (10%) of documents
Training set
K Balancing using oversampling K Publishers are not represented in both training and test set
Learning algorithm
K WEKA’s random forest with default parameters
43 @KieselJohannes