Analyzing the Perceived Severity of Cybersecurity Threats Reported on Social Media

Shi Zong (1), Alan Ritter (1), Graham Mueller (2) and Evan Wright (3)
(1) The Ohio State University, OH, USA  (2) Leidos Inc., VA, USA  (3) FireEye LLC, CA, USA

Proceedings of NAACL-HLT 2019, pages 1380–1390, Minneapolis, Minnesota, June 2–June 7, 2019. © 2019 Association for Computational Linguistics.

Abstract

Breaking cybersecurity events are shared across a range of websites, including security blogs (FireEye, Kaspersky, etc.), in addition to social media platforms such as Facebook and Twitter. In this paper, we investigate methods to analyze the severity of cybersecurity threats based on the language that is used to describe them online. A corpus of 6,000 tweets describing software vulnerabilities is annotated with authors' opinions toward their severity. We show that our corpus supports the development of automatic classifiers with high precision for this task. Furthermore, we demonstrate the value of analyzing users' opinions about the severity of threats reported online as an early indicator of important software vulnerabilities. We present a simple, yet effective method for linking software vulnerabilities reported in tweets to Common Vulnerabilities and Exposures (CVEs) in the National Vulnerability Database (NVD). Using our predicted severity scores, we show that it is possible to achieve a Precision@50 of 0.86 when forecasting high-severity vulnerabilities, significantly outperforming a baseline based on tweet volume. Finally, we show how reports of severe vulnerabilities online are predictive of real-world exploits. Our code and data are available at https://github.com/viczong/cybersecurity_threat_severity_analysis.

1 Introduction

Software vulnerabilities are flaws in computer systems that leave users open to attack; vulnerabilities are generally unknown at the time a piece of software is first published, but are gradually identified over time. As new vulnerabilities are discovered and verified, they are assigned CVE numbers (unique identifiers) and entered into the National Vulnerability Database (NVD) (https://nvd.nist.gov/). To help prioritize response efforts, vulnerabilities in the NVD are assigned severity scores using the Common Vulnerability Scoring System (CVSS). As the rate of discovered vulnerabilities has increased in recent years (https://www.cvedetails.com/browse-by-date.php), the need for efficient identification and prioritization has become more crucial. However, it is well known that a large time delay exists between the time a vulnerability is first publicly disclosed and the time it is published in the NVD; a recent study found that the median delay between a vulnerability first being reported online and its publication in the NVD is seven days, and that 75% of threats are first disclosed online, giving attackers time to exploit the vulnerability (https://www.recordedfuture.com/vulnerability-disclosure-delay/).

Figure 1: Example tweet discussing the dirty copy-on-write (COW) security vulnerability in the Linux kernel.

In this paper we present the first study of whether natural language processing techniques can be used to analyze users' opinions about the severity of software vulnerabilities reported online. We present a corpus of 6,000 tweets annotated with opinions toward threat severity, and empirically demonstrate that this dataset supports automatic classification. Furthermore, we propose a simple, yet effective method for linking software vulnerabilities reported on Twitter to entries in the NVD, using CVEs found in linked URLs. We then use our threat severity analyzer to conduct a large-scale study validating the accuracy of users' opinions online against experts' severity ratings (CVSS scores) found in the NVD. Finally, we show that our approach can provide an early indication of vulnerabilities that result in real exploits in the wild, as measured by the existence of Symantec virus signatures associated with CVEs; we also show how our approach can be used to retrospectively identify Twitter accounts that provide reliable warnings about severe vulnerabilities.

Recently there has been increasing interest in developing NLP tools to identify cybersecurity events reported online, including denial-of-service attacks, data breaches and more (Ritter et al., 2015; Chang et al., 2016; Chambers et al., 2018). Our approach builds on this line of work by evaluating users' opinions toward the severity of cybersecurity threats.

Prior work has also explored forecasting software vulnerabilities that will be exploited in the wild (Sabottke et al., 2015). Features included structured data sources (e.g., the NVD), in addition to the volume of tweets mentioning a list of 31 keywords. Rather than relying on a fixed set of keywords, we analyze message content to determine whether the author believes a vulnerability is severe. As discussed by Sabottke et al. (2015), methods that rely on tracking keywords and message volume are vulnerable to adversarial attacks from Twitter bots or sockpuppet accounts (Solorio et al., 2013). In contrast, our method is somewhat less prone to such attacks: by extracting users' opinions expressed in individual tweets, we can track the provenance of information associated with our forecasts for display to an analyst, who can then determine whether or not they trust the source of information.

2 Analyzing Users' Opinions Toward the Severity of Cybersecurity Threats

Given a tweet t and named entity e, our goal is to predict whether or not there is a serious cybersecurity threat towards the entity based on context. For example, given the context in Figure 2, we aim to predict the severity level towards "adobe flash player". We define an author's perceived severity toward a threat using three criteria: (1) does the author believe that their followers should be worried about the threat? (2) is the vulnerability easily exploitable? and (3) could the threat affect a large number of users? If one or more of these criteria are met, then we consider the threat to be severe.

2.1 Data Collection

To collect tweets describing cybersecurity events for annotation, we tracked the keywords "ddos" and "vulnerability" from Dec 2017 to July 2018 using the Twitter API. We then used the Twitter tagging tool described by Ritter et al. (2011) (https://github.com/aritter/twitter_nlp) to extract named entities, retaining tweets that contain at least one named entity. To cover as many linguistic variations as possible, we used Jaccard similarity with a threshold of 0.7 to identify and remove duplicated tweets with the same date. We sampled a dataset of 6,000 tweets to annotate.

2.2 Mechanical Turk Annotation

We paid crowd workers on Amazon Mechanical Turk to annotate our dataset. The annotation was performed in two phases: during the first phase, we asked workers to determine whether or not a tweet describes a cybersecurity threat toward a target entity; in the second phase, the task was to determine whether the author of the tweet believes the threat is severe. Only tweets that were judged to express a threat were annotated in the second phase. Each HIT contained 10 tweets to be annotated; workers were paid $0.20 per HIT. In pilot studies we tried combining these two annotations into a single task, but found low inter-rater agreement, especially for the threat severity judgments, motivating the separation of the annotation procedure into two tasks.

Figure 2 shows a portion of the annotation interface presented to workers during the second phase of annotation. Details of each phase are described below and summarized in Table 1.

Figure 2: A portion of the annotation interface shown to MTurk workers during the threat severity annotation.

Table 1: Number of annotated tweets with the breakdown percentage for each category. In the 1st annotation, a tweet contains a threat if more than 3 workers vote for it; in the 2nd annotation, a threat is severe if more than 6 workers agree on it. The worker-count cut-offs were determined by comparison against our gold annotations in pilot studies.

    1st Annotation (5 workers per tweet), 6,000 tweets total:
        With threat       2,543  (42.4%)   [1,966 used for the 2nd annotation]
        Without threat    3,457  (57.6%)
    2nd Annotation (10 workers per tweet), 1,966 tweets:
        Severe threat       506  (25.7%)
        Moderate threat   1,460  (74.3%)

Threat existence annotation: Not all tweets in our dataset describe cybersecurity threats; for example, many tweets discuss different senses of the word "vulnerability" (e.g., "It's OK to show vulnerability"). During the first phase of our annotation process, workers judged whether or not there appears to be a cybersecurity threat towards the target entity based on the content of the corresponding tweet. We provide workers with three options: the tweet indicates (a) a cybersecurity threat towards the given entity, (b) a threat, but not towards the target entity, or (c) no cybersecurity threat. Each tweet is annotated by 5 workers.

Threat severity annotation: In the second phase, we collected all tweets judged to contain threats by more than 3 workers in the first phase and annotated them for severity; these tweets are marked as containing either severe or moderate threats.

For threat existence annotation, we observe a Cohen's κ (Artstein and Poesio, 2008) of 0.66 between the expert judgments and the majority vote of 5 crowd workers. Although our threat severity annotation task may require some cybersecurity knowledge for accurate judgment, we still achieve a Cohen's κ of 0.52 when comparing the majority vote of 10 workers with expert annotations.

2.3 Analyzing Perceived Threat Severity

Using the annotated corpus described in Sec-
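The same-day deduplication step in Section 2.1 (Jaccard similarity over tweets, threshold 0.7) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the whitespace tokenization, function names, and (date, text) input layout are our assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedup_same_day(tweets, threshold=0.7):
    """Drop any tweet whose token-set Jaccard similarity with an earlier
    tweet from the same date exceeds the threshold (0.7 in the paper).
    `tweets` is a list of (date, text) pairs."""
    kept = []  # (date, token_set, text) triples for retained tweets
    for date, text in tweets:
        tokens = set(text.lower().split())
        if any(d == date and jaccard(tokens, t) > threshold
               for d, t, _ in kept):
            continue  # near-duplicate of a same-day tweet already kept
        kept.append((date, tokens, text))
    return [(d, txt) for d, _, txt in kept]
```

Near-duplicates from different dates are deliberately retained, matching the paper's "same date" restriction.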
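The paper links tweets to NVD entries via CVE identifiers found in the tweets' linked URLs. A minimal sketch of that extraction step; the regex and helper function are our own illustration (CVE IDs have the form CVE-YYYY-NNNN, with four or more digits in the sequence number):

```python
import re

# CVE identifiers: 4-digit year, then a sequence number of 4+ digits,
# e.g. CVE-2016-5195 (the "dirty COW" vulnerability from Figure 1).
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def extract_cves(urls):
    """Return the set of CVE IDs mentioned in a tweet's expanded URLs."""
    found = set()
    for url in urls:
        for match in CVE_RE.findall(url):
            found.add(match.upper())
    return found
```

Each extracted ID can then be looked up directly in the NVD to retrieve its expert-assigned CVSS score.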
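The vote cut-offs in Table 1 (a threat exists if more than 3 of 5 workers say so; a threat is severe if more than 6 of 10 workers agree) can be written as a small aggregation helper. The function name and boolean-vote input format are ours:

```python
def aggregate_labels(existence_votes, severity_votes=None):
    """Apply the paper's worker-vote cut-offs (Table 1).

    existence_votes: 5 booleans from the threat-existence phase.
    severity_votes:  10 booleans from the severity phase, if collected.
    Returns (has_threat, is_severe); is_severe is None when the tweet
    was not judged to contain a threat or was not severity-annotated."""
    has_threat = sum(existence_votes) > 3   # strict majority of 5
    if not has_threat or severity_votes is None:
        return has_threat, None
    return True, sum(severity_votes) > 6    # more than 6 of 10 workers
```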
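Inter-annotator agreement is reported as Cohen's κ between expert labels and the crowd majority vote. For reference, a self-contained implementation of the standard formula, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: dot product of the two annotators' label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    chance = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    if chance == 1.0:
        return 1.0
    return (observed - chance) / (1 - chance)
```

κ of 0.66 (threat existence) and 0.52 (severity) fall in the ranges conventionally read as substantial and moderate agreement, respectively.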
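The abstract reports Precision@50 of 0.86 when ranking vulnerabilities by predicted severity. The metric itself is simple; a sketch assuming per-item real-valued scores and binary ground-truth labels (names are ours):

```python
def precision_at_k(scores, labels, k=50):
    """Fraction of the k highest-scored items whose gold label is positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])[:k]
    return sum(label for _, label in ranked) / k
```

Under this metric, a Precision@50 of 0.86 means 43 of the 50 vulnerabilities ranked most severe by the model were in fact high-severity.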