``TL;DR:'' Out-Of-Context Adversarial Text Summarization and Hashtag Recommendation
Total Page:16
File Type:pdf, Size:1020Kb
“TL;DR:” Out-of-Context Adversarial Text Summarization and Hashtag Recommendation Peter Jachim Filipo Sharevski Emma Pieroni DePaul University DePaul University DePaul University Chicago, IL Chicago, IL Chicago, IL [email protected] [email protected] [email protected] ABSTRACT format. The simplicity associated with text summarization is prefer- This paper presents Out-of-Context Summarizer, a tool that takes able to lengthy readings, and therefore leaves the technique ripe arbitrary public news articles out of context by summarizing them for adversarial misuse. to coherently fit either a liberal- or conservative-leaning agenda. Social media has become an outlet for news, and users on sites The Out-of-Context Summarizer also suggests hashtag keywords like Twitter or Parler rely on short pieces of text in order to un- to bolster the polarization of the summary, in case one is inclined derstand polarizing topics or narratives. Text summaries of news to take it to Twitter, Parler or other platforms for trolling. Out-of- articles are continuous feeds of fresh content that can be used by Context Summarizer achieved 79% precision and 99% recall when adversaries in a variety of different ways, including as newsletters, summarizing COVID-19 articles, 93% precision and 93% recall when or continuous Twitter or Parler content. If an adversary were to summarizing politically-centered articles, and 87% precision and upload a summary to Twitter, it would need to fall within the char- 88% recall when taking liberally-biased articles out of context. Sum- acter limit or use a conversational thread with few consecutive marizing valid sources instead of synthesizing fake text, the Out- tweets [57]. Similarly, Parler deploys a “Read More..." functional- of-Context Summarizer could fairly pass the “adversarial disclosure” ity which deters the use of messages longer than 1000 characters test, but we didn’t take this easy route in our paper. Instead, we [40]. Studies suggest that when users are scrolling social media, used the Out-of-Context Summarizer to push the debate of potential particularly looking for political content, they are conditioned to misuse of automated text generation beyond the boilerplate text of respond well to relatively short messages or threats that evoke a responsible disclosure of adversarial language models. feeling of being informed [2, 18]. Even more, people tend to take content from trusted sources like legitimate news articles at face CCS CONCEPTS value as a heuristic that saves time [64]. Text summarization is able to satiate the need for speedy news consumption, by giving users • Security and privacy → Social aspects of security and pri- the “tl;dr" critical information from an article, so that a user can vacy; • Computing methodologies → Information extraction; keep scrolling. KEYWORDS Summaries are usually accompanied with embedded features like links or hashtags in order to facilitate content diffusion to reach automated text summarization, adversarial language modeling a broad audience in a crowded social media landscape [35]. Under- 1 INTRODUCTION standing that this increases audiences’ exposure, adversaries were quick to diffuse alternative narratives on social media including The goal of online trolling is often assimilation into the public polarizing hashtags and links to posts on less regulated platforms conscience by maintaining fake narratives with improbable inter- like 4chan [25, 57]. The addition of links is to avoid hard moder- pretations while remaining undetected [63]. Yet this paper presents ation by mainstream platforms for openly disseminating trolling an alternative means of trolling, or trolling a content feeder, by content. The particular addition of hashtags, which help to organize extracting legitimate parts of public discourse from news articles, content and track discussions based on keywords, is not to avoid selectively isolating polarizing parts, and taking them out of context moderation but to yield inferences about the summaries that rein- arXiv:2104.00782v1 [cs.CL] 1 Apr 2021 in an effort to advance an adversarial agenda. Text summarization force the polarizing positions [64]. In anticipation of an adversary by Out-of-Context Summarizer, the adversarial language model pre- utilizing the approach to bolster a position, we have also designed sented in this paper, is the natural means to this end, as readers Out-of-Context Summarizer to predict potential hashtag keywords are often short on time and prefer the soundbite of issues in “tl;dr” to accompany the suggested out-of-context summary. Permission to make digital or hard copies of all or part of this work for personal or Often, in trying to understand trolling online or detect it, re- classroom use is granted without fee provided that copies are not made or distributed searchers utilize complex automated language models [4, 14]. While for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM these models certainly have the potential for adversarial misuse, must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, the complexity associated provides little incentive for trolls. Yet the to post on servers or to redistribute to lists, requires prior specific permission and/or a text summarization method used in Out-Of Context Summarizer fee. Request permissions from [email protected]. Conference’17, July 2017, Washington, DC, USA provides a relatively quick and efficient way to generate adversarial © 2021 Association for Computing Machinery. narratives from already existing pieces of public discourse like news ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00 articles. Using machine learning, Out-of-Context Summarizer “bor- https://doi.org/10.1145/nnnnnnn.nnnnnnn rows” the experience of journalists who understand their audiences in order to weaponize that relationship, by swaying the summariza- tion to seem either left or right-leaning, without raising alarm from users who trust the new sources they read. After generating alter- example incorporating elements from medical documents or jour- native narratives under the guise of legitimate journalism, trolls nal papers to supplement the COVID-19 summaries. We also focus can upload summaries quickly to social media sites like Twitter on creating alternative narratives out of context from news arti- or Parler, to change the context of the narrative. By manipulating cles, but one might use Out-of-Context Summarizer for example to the summarization to favor either conservative or liberal sentences, summarize a legislative bill for the same purposes. the perspective of the polarizing topic is manipulated, but remains undetectable, as the selected sentences are actually present in pub- 3 ADVERSARIAL TEXT SUMMARIZATION licly available news articles. Traditional trolling may be more easily The scope of our adversarial text summarization mechanism in- detectable, as the alternative narratives are often known and also cludes: the gathering and structuring of data, data modelling, and widely dispelled, but by advancing adversarial agendas through text summarization aided by analysis of the fitted model. Figure legitimate pieces of public discourse, trolling becomes much harder 1 shows the high-level overview of the language model behind to identify and poses a security risk worth disclosing. the Out-of-Context Summarizer. The out-of-context summarization starts when a website of interest publishes an article. The text from 2 AUTOMATIC TEXT SUMMARIZATION the article is then scraped and formatted into a cohesive, labelled Automatically condensing large texts to contain few important dataset. The labelled dataset is passed to a script that fits a ma- sentences has been, and remains, an imperative of many language chine learning pipeline to the labelled dataset, and saves the fitted models in a sincere effort to maximize the entropy of informa- pipeline. Finally, the text summarization process takes the fitted tion delivered to a user for consumption in minimum required model, analyzes the prediction probability to decide the out-of- time [17, 26]. The sheer volume of online text content in particular context weights, and creates text summaries that can be passed makes it hard for users to sift through textual data that is often along to the delivery mechanism. The sources used can be adjusted, redundant, unverified, and incorrectly formatted. Robust automatic and new scrapers can be created to enable the adversary to choose text summarization systems, therefore, received a widespread atten- from a wide range of online news outlets. tion from the natural language processing and artificial intelligence For the purposes of this paper, we trained the classification model communities. These systems usually are classified based on the used in the Out-of-Context Summarizer on articles from Vox and language, input size, summarization approach, nature of the output Breitbart initially scraped in February of 2021. We selected these two summary, summarization algorithm, summary content, summariza- websites to establish liberal and conservative baselines, respectively tion domain, language, and type of summary [11]. [58], but future adaptations could select text from any number The Out-of-Context Summarizer manipulates a single neutral of news sources. To demonstrate the Out-of-Context Summarizer input article, based on a machine learning model trained on a cor-