810942HFSXXX10.1177/0018720818810942Human FactorsEmbedded Antiphishing Trainingresearch-article2018

Embedding Training Within Warnings Improves Skills of Identifying Webpages

Aiping Xiong, Robert W. Proctor, Weining Yang, and Ninghui Li, Purdue University, Lafayette, Indiana, USA

Objective: Evaluate the effectiveness of training Introduction embedded within security warnings to identify phishing webpages. Phishing is a social engineering attack that Background: More than 20 million malware and uses e-mail, social network webpages, and other phishing warnings are shown to users of Google Safe media to communicate messages intended to Browsing every week. Substantial click-through rate is persuade potential victims to perform certain still evident, and a common issue reported is that users actions or divulge confidential information for lack understanding of the warnings. Nevertheless, each the attacker’s benefit in the context of cyber- warning provides an opportunity to train users about phishing and how to avoid phishing attacks. security (Khonji, Iraqi, & Jones, 2013; Orgill, Method: To test use of phishing-warning instances Romney, Bailey, & Orgill, 2004). as opportunities to train users’ phishing webpage Because the mimics that of a reputa- detection skills, we conducted an online experiment ble organization, victims are tricked into enter- contrasting the effectiveness of the current Chrome ing personal information and credentials, which phishing warning with two training-embedded warning are then stolen by the attackers. Damages from interfaces. The experiment consisted of three phases. In Phase 1, participants made login decisions on 10 phishing attacks include financial losses, expo- webpages with the aid of warning. After a distracting sure of privacy information, and reputational task, participants made legitimacy judgments for 10 dif- harm to companies. Phishing is estimated to ferent login webpages without warnings in Phase 2. To have resulted in about $30 million in damages to test the long-term effect of the training, participants U.S. consumers and businesses in 2017 (FBI, were invited back a week later to participate in Phase 2018). Beyond financial loss, users reported 3, which was conducted similarly as Phase 2. Results: Participants differentiated legitimate and reduced trust in people and the technology as a fraudulent webpages better than chance. Performance consequence of phishing attacks (Kelley, Hong, was similar for all interfaces in Phase 1 for which the Mayhorn, & Murphy-Hill, 2012). warning aid was present. However, training-embedded Because of the negative consequences of interfaces provided better protection than the Chrome phishing attacks, considerable effort has been phishing warning on both subsequent phases. devoted to devising methods to protect users Conclusion: Embedded training is a complemen- tary strategy to compensate for lack of phishing web- from them. Detection and prevention of phish- page detection skill when phishing warning is absent. ing scams is the first line of protection to stop Application: Potential applications include devel- attacks from reaching people. Computer scien- opment of training-embedded warnings to enable secu- tists have developed several automated tools for rity training at scale. phishing detection: (1) e-mail classification at server and client levels to filter phishing e-mails Keywords: cybersecurity, phishing, training, action on (e.g., Fette, Sadeh, & Tomasic, 2007); (2) web- cybersecurity, procedural knowledge site blacklists consisting of phishing and IP addresses detected in the past (e.g., Google Address correspondence to Aiping Xiong, College of Safe Browsing; Whittaker, Ryner, & Nazif, Information Sciences and Technology, the Pennsylvania State University, E373 Westgate Building, University Park, 2010) or almost all possible variants of a URL PA 16802, USA; e-mail: [email protected] (e.g., Prakash, Kumar, Kompella, & Gupta, 2010); (3) heuristic solutions based on sets of HUMAN FACTORS Vol. 61, No. 4, June 2019, pp. 577­–595 rules from previous real-time phishing attacks to DOI: 10.1177/0018720818810942 detect zero-day (i.e., previously unknown) phish- Article reuse guidelines: sagepub.com/journals-permissions ing attacks (e.g., Zhang, Hong, & Cranor, 2007); Copyright © 2018, Human Factors and Ergonomics Society. (4) webpage visual-similarity assessments to 578 June 2019 - Human Factors block phishing (e.g., Fu, Liu, & Deng, webpages. We conducted an experiment to 2006). However, those tools and services do not address three research questions: protect against all phishing due to evolution of phishing attacks and the difficulty computers 1. What are the short- and long-term effects of train- have in accurately extracting the meaning of the ing that is embedded within a phishing warning? natural language messages in e-mails (Stone, 2. Which is the most effective way to present train- 2007). ing to help users learn skills of how to identify the When automatic detection fails, the user legitimacy of a webpage? makes the final decision on a webpage’s legiti- 3. Does presenting training-embedded warnings as macy (Proctor & Chen, 2015). Thus, researchers feedback of users’ actions facilitate the effect of developed decision-aid tools to warn users when training? a fraudulent website is detected. The tools include dynamic security skins (Dhamija & Action-Oriented Phishing Tygar, 2005), browser toolbars (Herzberg & Protection Strategies Gbara, 2004), and web browser phishing warn- ings and secure sockets layer (SSL) warnings Phishing Warning (Carpenter, Zhu, & Kolimi, 2014; Felt et al., When warnings were presented to aid users’ 2015). Those tools remind users of potential decisions, users who clicked through the warn- risks passively or actively. Passive warnings ings showed a lack of understanding of the employ principles, such as colored icons or warnings (e.g., Bravo-Lillo, Cranor, Downs, & highlighting, which signal potential dangers to Komanduri, 2011; Dhamija, Tygar, & Hearst, users without interrupting their primary tasks 2006). These findings are somewhat unexpected (Chou, Ledesma, Teraguchi, & Mitchell, 2004; because most of the warning designs followed Herzberg & Gbara, 2004; Lin, Greenberg, Trot- guidelines to improve users’ understanding of ter, Ma, & Aycock, 2011). Active warnings cap- the risks, for example, using direct language ture users’ attention by forcing them to choose and symbols to describe explicit consequences one of the options presented by the warnings of the risk (Felt et al., 2015; Yang et al., 2017). (Egelman, Cranor, & Hong, 2008; Felt et al., Nevertheless, scrutiny of the information pre- 2015; Wu, Miller, & Garfinkel, 2006). sented in those warnings revealed a focus on Yet, these decision-aid tools have evidenced facts about phishing (e.g., the definition and ineffectiveness (e.g., Xiong, Proctor, Yang, & potential costs), also known as declarative Li, 2017) and usability problems (e.g., Sheng knowledge (Anderson, 2013). et al., 2009; Wu et al., 2006). Specifically, people Downs, Barbagallo, and Acquisti (2015) showed a lack of understanding of the decision- investigated differences between declarative aid warnings in general (e.g., Felt et al., 2015; knowledge about phishing and procedural Wu et al., 2006). Training is one promising knowledge of the actions to determine URL approach to address users’ lack of comprehen- legitimacy (Anderson, 2013). In an online role- sion, and a prior study provided evidence that play study, participants chose possible actions knowledge gained from training enhanced the for legitimate and fraudulent e-mails and possi- effectiveness of a phishing warning (Yang, ble actions for webpages following each e-mail’s Xiong, Chen, Proctor, & Li, 2017). Currently, link. Declarative knowledge was closely related there is little work on integrating phishing train- to participants’ self-reported predictions on ing and warning. We conjectured that such awareness, susceptibility and intentions, but research is essential because of (a) the inability procedural knowledge was the only predictor of to require the large population of users the users’ ability to adjust their risk decisions. to take classroom training, and (b) minimal Xiong et al. (2017) conducted a study, in a warning protection for zero-day attacks. laboratory setting with an eye-tracker, investi- Our aim in the current study was to under- gating why a passive warning (domain high- stand the effect of embedded training within lighting) is ineffective at helping users identify phishing warnings in helping users detect phishing phishing webpages. They based their study on Embedded Antiphishing Training 579 the fact that the domain name embedded within goal. The results obtained in the warning-only the URL of a phishing site will always be differ- condition are similar to previous findings (e.g., ent from the legitimate one. Thus, the mismatch Felt et al., 2015), suggesting that security aware- between the real domain name and the imper- ness alone is not sufficient to protect users from sonated webpage serves as a reliable cue to phishing attacks. The power of using a combina- detect phishing attacks (Lin et al., 2011). tion of training and phishing warning to reduce Because users may overlook the domain name the likelihood of being phished provided evi- (Jagatic, Johnson, Jakobsson, & Menczer, 2007), dence that participants should not only be aware the domain of whichever site a user is currently of the risks but also equipped with skills to take viewing is highlighted. Specifically, the domain actions on the risks. Thus, it is critical to figure name portion within the URL in the browser’s out a way to integrate phishing training and address bar is in black, whereas the rest of the warning such that a large population of internet URL is in gray (e.g., Google Chrome, Firefox). users can be trained effectively. In Xiong et al.’s (2017) study, participants evaluated the safety of legitimate and fraudulent Phishing Training webpages in two phases, with instructions to Because phishing threats cannot be elimi- look at the address bar in the second phase but nated entirely through automated tools or users’ not initially. Although safety evaluation results compliance with phishing warnings, users nec- showed some benefit of attending to the address essarily must be trained about phishing attacks bar, domain highlighting did not provide effec- and how to avoid being phished. Despite train- tive protection against phishing attacks. Yet eye- ing being an essential aspect of cybersecurity, it tracking results (e.g., heat map) revealed that is the least popular approach (Hong, 2012). participants’ visual attention was attracted by the The most basic approach to training is to post highlighted domains. Thus, the ineffectiveness of information about phishing online, as done by domain highlighting seems due to participants’ academic organizations, government organiza- lack of knowledge concerning how to use the tions, nonprofit organizations, and companies. domain name to identify webpage’s legitimacy. For example, Anti-Phishing Work Group Equipping users with skills of how to identify (APWG) provides the STOP-THINK-CON- potential phishing webpages (procedural knowl- NECT global cybersecurity education and aware- edge) seems to be critical to improve the effec- ness campaign to improve the public under- tiveness of phishing warnings. Yang et al. (2017) standing of phishing. Although such education investigated the effectiveness of phishing train- and advice can improve users’ ability to avoid ing and its interaction with a phishing warning phishing attacks, most members of the public on the webpage. The training content focused on will not read them. how to evaluate the webpage’s legitimacy by In a classroom setting, Anandpara, Dingman, using the domain name. In a field experiment, Jakobsson, Liu, and Roinestad (2007) examined participants in four groups varying in the pres- the effectiveness of a phishing training (i.e., ence and absence of the phishing training and FTC Consumer Alert) at a test with the portion warning received a simulated phishing e-mail of phishing trials varied from 25% to 100%. attack targeting Amazon. Although, many par- Forty participants identified legitimate and ticipants who received only the training or only phishing e-mails before and after the training. the warning fell prey to the simulated phishing Across two test phases, there was no correlation attack, none of the participants who received between the actual phishing e-mails and the both interventions submitted their genuine number of phishing e-mails that participants account information. identified. Thus, Anandpara et al. claimed that That no advantage was evident for the condi- the traditional forms of education increase the tion with only training indicates the necessity of level of fear or concern among users but not the making users aware of security issues through ability to identify phishing scams. warnings when the issues arise, consistent with Ferguson (2005) evaluated a contextual train- the idea that security typically is a secondary ing approach, sending fake phishing e-mails to 580 June 2019 - Human Factors participants to explore their vulnerability to users with skills for using the knowledge to reg- phishing attacks in the real world. The study ulate security-related behaviors. tested participants’ ability to detect phishing attacks in the first phase. In the second phase, Retention and Transfer of Knowledge participants received phishing training and a Acquisition From Training lecture in a classroom and were then tested. Par- To detect zero-day phishing that has not ticipants’ ability to identify phishing e-mails been blacklisted or was missed by heuristic improved after the training (also see Dodge, techniques, users need to retain the knowledge Carver, & Ferguson, 2007). gained from training and transfer it to other Based on contextual training, Kumaraguru situations. Retention is the ability of people to et al. (2007, 2009) designed and evaluated an retrieve the concepts or procedures learned after e-mail embedded-training system called Phish- a period of time. Transfer is the ability to apply Guru to avoid phishing attacks. Participants the knowledge gained from one situation to received simulated phishing e-mails, and a train- another that differs from that of the knowledge ing page appeared whenever participants clicked acquisition (Roediger, Dudai, & Fitzpatrick, on a phishing link in the e-mail. Users’ immedi- 2007). Both abilities are essential to phishing ate and long-term ability to identify phishing detection due to the thousands of new phishing attacks improved after receiving embedded train- URLs that are generated monthly (PhishTank, ing of phishing e-mails in both laboratory and 2018). real-world settings. Most forms of security train- The retention and transfer of knowledge is ing take place in a classroom and give people few closely related to the process of acquisition, opportunities to test what they have learned. In which is largely determined by the nature of the contrast, embedded training teaches people knowledge and how the knowledge is presented within the specific context of use in which they during training. Thus, first of all, one must be would normally be attacked (Caputo, Pfleeger, aware of which type of knowledge is involved Freeman, & Johnson, 2014; Kumaraguru et al., during the training, namely, facts and events 2007, 2009; Kumaraguru, Sheng, Acquisti, (declarative knowledge) or knowing how to do Cranor, & Hong, 2010). Thus, among the alterna- something (procedural knowledge). Also, to tive training methods, embedded training, ensure effective and efficient training, a specific designed to teach users critical information dur- type of knowledge should be presented in line ing their typical online interactions, is the most with its form of function or nature of representa- promising (Al-Daeef, Basir, & Saudi, 2017). tion. That is, declarative knowledge should be Nevertheless, previous work revealed that the presented in a way that is available for recall or potential effectiveness of embedded training is recognition, and procedural knowledge should limited by the requirement that users read the be presented in a way that guides operations or training material. Kumaraguru et al. (2007, actions by specifying what is to be done under 2009) found that the training-embedded material which conditions (Oberauer, 2010). is only effective when users actually read it, Transfer and retention of declarative and pro- which they tend not to do if the training message cedural knowledge are widely accepted as hav- is long (Caputo et al., 2014). Inspection of the ing different properties (Healy & Bourne, 2012; training-embedded material used in prior stud- Lee & Vakoch, 1996). Declarative knowledge ies, in fact, reveals long descriptions that require declines quickly, whereas procedural knowl- much time and effort. However, security is a edge, once acquired, remains at the same level user’s secondary goal in general. Thus, it is criti- when retested after one week or longer. Transfer cal to implement the training information in is typically better for declarative knowledge. such a way that users can acquire and use it eas- Yet, the continuous practice of procedures is ily and quickly. Because warnings are present accompanied by accumulative learning of fac- when users encounter potential phishing web- tual information. Thus, retention and transfer of pages, embedding training within phishing both types of knowledge is expected from train- warning may be a good opportunity to equip ing focused on procedural knowledge. Due to Embedded Antiphishing Training 581

Figure 1. Declarative training-embedded warning interface. the procedural nature of using the domain name Proposed Training-Embedded to identify phishing webpages, presenting the Warning Interfaces training content through step-by-step procedural We developed two new training-embedded instructions was expected to result in better warning interfaces, one we call Declarative acquisition and subsequent retention and/or (Figure 1) and the other Procedural (Figure transfer than presenting it with declarative sen- 2). For both interfaces, a training intervention tences, i.e., descriptions of the legitimate and focusing on domain names is displayed within fraudulent domain names. an active security warning to help users develop the knowledge and skills to detect potential Action Effect phishing webpages. Due to focusing on the In most popular browsers, e.g., Google webpage’s domain name, for both interfaces, Chrome, after a user clicks the link within a a screenshot of the URL part is enlarged and phishing e-mail, the phishing warning blocks linked to the URL by an arrow, indicating that the whole webpage and any potential interac- this warning is specific for the domain name. tions with the webpage (Felt et al., 2015). But For the Declarative interface, the highlighted the anticipated consequence of an action can domain is listed below the URL screenshot, have an effect on the information processing marked in red, and described as not owned by that is required to initiate the action subse- the related brand name with a sentence, such as quently (Hommel, Müsseler, Aschersleben, & “amawazon.com is not owned by Amazon.” Prinz, 2001). Thus, the current warning imple- Because people have difficulty discriminating mentation method may eliminate the possibility the credibility of websites based on domain of users acquiring the knowledge and skills for names (Wogalter & Mayhorn, 2008), the legiti- phishing webpage detection. Instead of using mate domain name is also listed for comparison. the warning as a block to action, we proposed to The pairwise phishing and legitimate domain implement the phishing warning as an immedi- names serve as instances to train users about the ate action effect, or feedback, to provide guid- domain-name spoof methods (the similar and ance toward correct behavior (Schmidt & Bjork, complex methods used in current study, which 1992). we explain in the Method section). 582 June 2019 - Human Factors

Figure 2. Procedural training-embedded warning interface.

The embedded training content within the Pro- Safety” button highlighted in blue. Participants cedural interface is the same as the Declarative had the opportunity to go back to the phishing one except that the two sentences are replaced by webpage by the “Visit this unsafe site” button. actions of how to avoid phishing risks in three steps: (1) Find the domain name highlighted in the URL; (2) Compare it with the legitimate domain The Experiment name; (3) If the two domain names are different, Using a between-subject design, we studied click “Back to Safety.” Note that the first two steps six conditions involving two factors: warn- embed not only the pairwise domain names com- ing interface and time at which the warning parison as in the Declarative interface, but also was presented. The three interfaces were: (a) how to get the information explicitly. Chrome: current Chrome phishing warning; (b) In compliance with the guidelines of efficient Declarative: the declarative training-embedded warning that Laughery and Wogalter (2006) pro- warning interface; (c) Procedural: the proce- posed, for both interfaces the signal words dural training-embedded warning interface. “STOP! Phishing Alert” are included at the top Training alone was not included because we to indicate the level of risk present. Both inter- previously found that a training-only condition faces describe the nature of the phishing risk and was ineffective (Yang et al., 2017). The warning its explicit consequences in a specific, complete, was presented before the webpage (Before con- but not too lengthy text, to enhance the aware- dition), which is the same as the current phish- ness of phishing. We used simple language to ing warning presentation in Chrome, or after the make the training message accessible to as many webpage (After condition), in which case the users as possible. In addition, the interface warning was presented only when participants includes directive action (i.e., opinionated entered any information on phishing webpages. action; Felt et al., 2015), which is the “Back to For the After condition, based on participants’ Embedded Antiphishing Training 583 responses to the two phishing webpages, the fre- saw the warning at least once for each interface. quency with which they would see the warning Participants’ ages ranged from 18 to over 50 was zero, once, or twice. Only participants who years, with 74% between 18 and 40 years. 91% saw the warning at least once were included were college students or professionals who had to examine the effect of warning interface. a bachelor’s or higher degree. 83% of the par- Because the resulting warning frequencies could ticipants claimed that they do not have a degree reflect participants’ prior and acquired knowl- or work experience in computer science or similar edge, we also examined the three frequencies’ fields. The demographic distributions between results of the After condition and their interac- conditions were similar. Each participant was tion with the warning interfaces. compensated $0.50 for Phases 1 and 2. A total of We conducted a three-phase experiment to 639 participants (432 of whom saw warnings evaluate the short- and long-term effectiveness of initially) returned for Phase 3. Return rates and the two training-embedded warning interfaces demographics were similar across conditions, against the control condition (Chrome). In Phase and participants who finished Phase 3 received 1, participants made login decisions on legitimate an extra $0.25. and fraudulent webpages and got warning/train- This experiment complied with the American ing on phishing webpages. After a distraction Psychological Association Code of Ethics and task, participants judged webpages’ legitimacy was approved by the institutional review board without warnings to evaluate the short-term effect at Purdue University. Informed consent was of the embedded training in Phase 2. One week obtained from each participant. The experiment later, we invited each participant to return for data that were stored and analyzed are anony- Phase 3 to evaluate the legitimacy of extra web- mized. pages, again without warnings, to examine the Apparatus and stimuli. The study was per- long-term effect of the embedded training. formed with participants’ own laptop or com- We predicted that the two training-embedded puter. To ensure the training content’s readability, warning interfaces would yield better phishing- we did not allow participants to continue the detection performance than the control condi- study if they were using any mobile device. We tion, particularly in Phases 2 and 3 when there limited data collection to participants from the was no warning and participants could use the United States because the websites used in the knowledge learned from the training to identify study are popular in this country. phishing webpages. We expected this effect of The details of phishing and legitimate web- training to be more evident for the Procedural pages for each phase are listed in Tables 1 and 2, interface than for the Declarative interface respectively. Each webpage was an exact replica because of the former’s stepwise depiction of of the original website except the URL of each using domain names to identify the webpage’s phishing webpage, which was a valid phishing legitimacy. Finally, presenting the warning as an URL listed in PhishTank. We included SSL for action effect rather than as a block to the web- legitimate webpages as in the real world. This page may be more effective for learning, which difference between legitimate and fraudulent would yield better performance when warnings sites was constant across conditions and phases were absent. and should have no differential impact on the comparison among the three warning interfaces. Method The six most-targeted phishing industries (see Participants. We recruited 1,080 participants Table 1), such as bank and e-commerce, were (63% female) through Amazon Mechanical selected. For each phase, phishing trials came Turk (MTurk) in July and August 2016. In the from only two categories and were selected from Before condition, 120 participants received each the most popular websites within each category. of the three interfaces. For the After condition, To evaluate retention of the embedded- because some participants would not see the training, the same two spoof methods were used warning based on the actions they selected, we across phases, which also made the difficulty of doubled the number of participants to 240 for identifying phishing webpages equal. One is the each interface. Approximately 120 participants similar method, in which fraudulent URLs are encoding=UTF8&openid.assoc_handle=usflex&… SAPI.dll?SignIn&ru=http% 3A%2F%2Fwww.ebay. com%2F &passive=true&rm=false&continue… lang=en-US&.done=https%3a//mail.yahoo.com Original URL https://www.amazon.com/ap/signin?_ https://signin.ebay.com/ws/eBayI https://www.bankofamerica.com/sitemap/hub/signin.go https://chaseonline.chase.com/ https://www.wellsfargo.com/ https://www.facebook.com/ https://twitter.com/login?lang=en https://accounts.google.com/ServiceLogin?service=mail https://outlook.office.com/owa/#authRedirect=true https://login.yahoo.com/?.src=ym&.intl=us&. https://appleid.apple.com/#!&page=signin https://www.irs.gov/refunds Ed… OnlineChase/ microsoftexchange1… yinput... html… Spoofed URL http://www.achyro89.com/google/business/google/ http://irs.gov.irs-qus.com http://umpapa.It/account999865… http://www.tulsicomputers.com/system/logs/ http://plaskit.fr/ibraries/wellsfargo/wellsfargo/… http://twiller.org http://365-outlook.com-useronlineereset72. http://www.assomabauru.org.br/Yahoo/Yahoo-2014/ http://www.amawazon.com http://www.arfcorretora.com.br/BofA/signon.php… http://info-setings.usite.pro/facebook-support. http://www.steaksmore.com/files/apple... Fargo America Website Gmail IRS eBay Chase Wells Microsoft Yahoo Amazon Bank of Facebook Apple storage Category E-mail Government E-commerce Bank Social media Cloud URLs (Spoofed, Original) of Phishing Webpages for Each Phase, Category, and Website for Each Phase, Category, a b le 1: URLs (Spoofed, Original) of Phishing Webpages T Phase Phase 1 Phase 2 Phase 3

584 Embedded Antiphishing Training 585

Table 2: URLs of Legitimate Webpages for Each Phase and Website

Experiment Website URL

Phase 1 BestBuy https://www-ssl.bestbuy.com/identity/signin?token=tid%3A792f2c17- 7d57-11e6-a4b4-005056920f07 Economist http://www.economist.com/ Expedia https://www.expedia.com/user/signin?ckoflag=0 Glassdoor https://www.glassdoor.com/profile/login_input.htm Pinterest https://www.pinterest.com/login/ Alamo https://www.alamo.com/en_US/car-rental/reservation/ startReservation.html Dropbox https://www.dropbox.com/login Walmart https://www.walmart.com/account/login?tid=0&returnUrl=%2F Phase 2 TripAdvisor https://rentals.tripadvisor.com/login LinkedIn https://www.linkedin.com/uas/login Skype https://login.skype.com/login?message=signin_continue Budget http://www.budget.com/budgetWeb/home/home.ex Southwest https://www.southwest.com/flight/login Macy’s https://m.macys.com/account/signin Phase 3 Ibis https://www.ibis.com/gb/northamerica/index.shtiml Uber https://login.uber.com/login Comcast https://login.comcast.net/login?r=comcast.net&%=oauth... Fitbit https://www.fitbit.com/login Priceline https://www.priceline.com/dashboard/#/login Hilton https://secure3.hilton.com/en/hh/customer/login/index.htm visually similar to the legitimate URLs. The other Procedure. Participants were allowed to par- is the complex method, in which fraudulent URLs ticipate in only one of the six conditions. Each expand the length of the legitimate ones to make study started with a questionnaire about partici- interpretation of the URL difficult (Lin et al., pants’ daily online browsing experience, such as 2011). Also, we used different webpages across browsing time every day, online time distribution phases to test the transfer effect of the training. of different activities, etc. The questionnaire did Three sets of 10 different webpages were not mention phishing, or any other cybersecurity used as stimuli. Phase 1 included eight legiti- concern. mate trials from Table 2 and two phishing trials After the questionnaire, Phase 1 started, from the bank and e-commerce categories in which was designed based on Dhamija et al.’s Table 1, respectively. Phishing trials of Phase 2 (2006) study of users’ ability to identify phishing were selected from social media and e-mail websites. Participants were told to imagine that categories. The eight legitimate trials included they had an account with one website (e.g., the six listed in Table 2 and legitimate versions Chase), and they just received an e-mail from of the two phishing pages of Phase 1. For Phase the website asking them to click on one link 3, URLs for the two phishing trials are listed in within the e-mail. Then, supposing they clicked Table 1, and the legitimate trials included the six on the link and were directed to a webpage, par- listed in Table 2 and legitimate versions of the ticipants were asked to choose their immediate two phishing trials in Phase 2. In each phase, all action on the webpage. Participants received 10 trials were presented randomly, and the possible different login webpages (8 legitimate, 2 fraudu- combinations of phishing trials were presented lent), making binary decisions for each webpage in approximately equal number. (i.e., Enter e-mail address and password or 586 June 2019 - Human Factors

Leave or close the webpage). For each decision, Results we also asked participants how confident they Over 68% of participants reported that they were in their decision on a scale of 1 to 5 (1 = not spent more than 2 hr online every day. They indi- confident at all; 5 = very confident). Warning cated spending 22% of the time on social media, was presented to help participants make an 20% on work or study, 15% on e-mail, 14% on informed decision on phishing trials. We mea- a search engine, and 10% on online shopping. sured the viewing time of webpage/warning pre- These results and others from the initial ques- sentation and the corresponding decision. tionnaire were similar across conditions. After completing Phase 1, participants per- We measured the selected decision, confidence formed 24 trials of a Stroop color-identification rating, and webpage/warning viewing time of task (MacLeod, 1991) as a cognitively demanding each participant for each webpage. In Phase 1, distraction, in which they responded with a left or decisions were coded as accurate when partici- right keypress to the color (red or green) of a con- pants responded “Enter e-mail address and pass- gruent or incongruent color word (red or green). word” on legitimate trials and “Back to safety” on The distraction task took about 3 min. Then, in warning interfaces for phishing trials. Choices of Phase 2, participants made legitimacy judgments “Leave or close the webpage” on legitimate web- (i.e., Legitimate or Phishing) for 10 different login pages and “Visit this unsafe site” on warning webpages without warning. We changed the task interfaces for phishing webpages were coded as from webpage login decisions to legitimacy judg- inaccurate. For the After condition in Phase 1, ments for two reasons: (a) Our primary interest warnings were presented when participants was to evaluate whether participants had learned to selected “Enter e-mail address and password” for discriminate phishing webpages from the embed- phishing trials, and decisions were measured ded training; (b) Participants should be aware that based on their final decisions. That is, if a partici- this study was about phishing after Phase 1, and pant chose to enter the ID and password on a previous studies showed that informed participants phishing webpage but corrected the decision later were significantly better at discriminating between on the warning, we counted it as a correct deci- phishing and genuine e-mails than uninformed sion. For legitimacy decisions in Phases 2 and 3, participants (Parsons, McCormac, Pattinson, Buta- choices were coded as accurate when participants vicius, & Jerram, 2015). We also measured partici- selected “Legitimate” for legitimate trials and pants’ confidence rating for each decision and “Phishing” for phishing trials. viewing time of each webpage. For each phase, the number of correct deci- After the judgment task, participants com- sions for phishing trials and legitimate trials was pleted a questionnaire that asked for demo- determined for each participant and grouped as a graphic information (e.g., age, gender, educa- function of warning presentation (Before, After) tion, computer science related work experience). × warning interface (Chrome, Declarative, Pro- Additionally, the questionnaire asked partici- cedural). We used signal detection theory meth- pants to select a potential outcome of phishing ods that allow assessment of sensitivity (d′) to from a list of four options, to check their com- phishing and response bias (c) (e.g., Canfield, prehension of the warning. Participants also esti- Fischhoff, & Davis, 2016; Xiong et al., 2017) mated their possibility of falling for a phishing based on correct responses to phishing trials attack before and after the study on a 5-point (hits) and incorrect responses to legitimate trials scale (1 definitely will not be phished; 5 will fall (false alarms). To accommodate hit rates and for phishing attack for sure). false-alarm rates of 0 or 1, a log-linear correction Phase 3 was conducted a week after Phases 1 added 0.5 to the number of hits and 0.5 to the and 2. Each participant received an e-mail mes- number of false alarms and 1 to the number of sage inviting him/her to evaluate another 10 signals (phishing webpages) or noise (legitimate webpages’ legitimacy as in Phase 2. After com- webpages; Canfield et al., 2016; Hautus, 1995). pleting their legitimacy decisions for those The d′ values of log-linear corrected data under- webpages, participants were tested by choosing estimate the true d′ values (Hautus, 1995), but the legitimate URL from among another five differences across the warning conditions should spoofed phishing URLs. reflect differences apparent in the raw accuracy Embedded Antiphishing Training 587 data (see Table 3). The d′ and c measures were second phishing trial), and some saw it twice. submitted to analysis of variance (ANOVA) with For participants who did not see the warning, warning presentation × warning interface, as was their decision accuracy was 100% for phishing viewing times. Participants’ confidence ratings trials. We conducted ANOVAs on d′ and c across were generally high and did not vary much across the three frequencies (zero, once, twice) and the conditions, so we do not report the statistical test three warning interfaces. We did the same analy- results in the text but list mean values of each sis for average viewing times. condition in Table 4. Participants’ sensitivities were similar across

Phase 1: Effect of warning interface. Table the three warning frequencies (d′zero = 1.18, d′once 2 3 includes correct decision rates of phishing and = 1.14, d′twice = 1.30), F(2,711) = 1.64, p = .194, hp legitimate trials of each condition collapsed = 005, and did not differ across the interfaces, 2 cross participants, as well as means of signal- F(4,711) = 1.18, p = .315, hp = .007. Participants detection parameters for each condition. Table 4 who did not see the warning showed similar sen- provides the means of webpage viewing time sitivity as those who saw the warning, indicating and confidence rating. their awareness and knowledge of phishing Signal-detection parameters. The three inter- scams without any aid. Bias toward judging faces showed similar sensitivity (d′chrome = 1.22, webpages as fraudulent differed across frequen- d′declarative = 1.24, d′procedural = 1.24) and bias toward cies (czero = −0.38, conce = −0.35, ctwice = −0.17), 2 judging webpages as fraudulent (cchrome = −0.28, F(2,711) = 13.57, p < .001, hp = .037. Post-hoc cdeclarative = −0.28, cprocedural = −0.24), Fs < 1.02. comparisons showed that participants who saw Whether the warning was presented Before (d′ = the warning twice had less bias than those who 1.27, c = −0.25) or After (d′ = 1.20, c = −0.28) the saw it once and those who did not see the warn- webpages’ presentation showed no influence on ing (ps < .001), which did not differ (p = .715). participants’ sensitivity or bias, Fs < 1.21. When The bias was similar across interfaces and the making login decisions with the aid of warnings, difference across frequencies was similar among participants demonstrated moderate detection the three interfaces, Fs < 1.02. ability, along with a bias toward judging webpages Viewing times differed across the three warn- 2 as fraudulent. The signal-detection parameters ing frequencies, F(2, 711) = 48.29, p < .001, hp = revealed that the two training-embedded inter- .120. Post-hoc pairwise comparisons were all sig- faces were comparable to the Chrome warning. nificant, ps < .027, being longest for participants Viewing times. The viewing-time measures for who saw the warning once (15.5 s), intermediate phishing trials differed across the three interfaces for those who saw the warnings twice (11.4 s), 2 (see Table 4), F(2,708) = 34.61, p < .001, hp = .089. shortest for those who did not see the warning Post-hoc Bonferroni analysis showed that all (9.5 s). There was an interaction of frequency × 2 pairwise tests were significant (ps < .001). View- warning interface, F(4, 711) = 9.50, p < .001, hp = ing time was longest with the Procedural inter- .051. Participants who did not see the warning face (14.8 s), intermediate with the Declarative spent similar time across the three interfaces, but interface (12.0 s), and shortest with the Chrome participants who saw the warnings spent longer interface (9.5 s). The longer viewing times for the time on the two training-embedded interfaces. new interfaces imply that participants processed Participants who saw the warnings twice the extra embedded-training messages. Viewing spent less time and showed less bias to judge the time was longer for the After condition (13.9 s) legitimate trials as phishing than participants than the Before condition (10.3 s), F(1,708) = 48.37, who saw the warning once. This outcome sug- 2 p < .001, hp = .064, but this difference did not gests that participants who saw the warnings vary across the interfaces, F < 1.0. twice might not have processed the content of Warning frequencies in the After condition. the warning as much as participants who saw the In the After condition, participants saw a warn- warning once. ing only when they chose to enter information For legitimate webpages, the viewing times on a phishing webpage. Thus, for each interface, differed across frequencies, F(2, 711) = 3.84, p = 2 some participants never saw the warning, some .022, hp = .011. Pairwise comparisons showed saw the warning once (on either the first or that participants who saw the warning once spent c 0.31 0.25 0.19 0.29 0.21 0.31 0.36 0.14 0.20 0.10 0.10 0.13 d ′ 1.28 1.54 1.65 1.27 1.15 1.25 1.37 1.56 1.59 1.69 1.65 1.63 Trials 83.9% 85.8% 85.4% 83.0% 78.1% 84.0% 87.1% 83.2% 84.8% 83.6% 82.2% 83.5% Legitimate Trials 66.9% 76.9% 82.6% 67.9% 68.8% 66.0% 67.1% 83.0% 80.9% 88.4% 87.5% 85.6% Phase 3 (1-week later) Phishing 77 80 72 28 24 25 35 44 47 73 68 66 No. Subjects Returned c 0.53 0.29 0.33 0.42 0.54 0.41 0.37 0.28 0.27 0.19 0.16 0.19 d ′ 1.40 1.65 1.64 1.47 1.31 1.48 1.41 1.58 1.80 1.84 1.94 1.81 Trials Phase 2 91.0% 88.9% 86.7% 88.1% 88.9% 89.9% 88.8% 86.4% 88.9% 87.3% 88.0% 86.9% Legitimate Trials 58.8% 66.3% 67.4% 77.9% 75.4% 56.0% 67.1% 76.3% 82.7% 87.7% 91.7% 86.7% Phishing c –0.27 –0.18 –0.40 –0.26 –0.22 –0.14 –0.19 –0.38 –0.29 –0.35 –0.39 –0.40 d ′ 1.28 1.28 1.06 1.23 1.29 1.47 1.15 1.15 1.21 1.23 1.16 1.14 Trials 64.0% 66.1% 55.3% 63.2% 65.7% 70.5% 64.1% 57.1% 61.9% 60.9% 58.1% 57.2% Legitimate Phase 1 Trials 100% 100% 100% 97.1% 92.3% 97.7% 95.4% 94.6% 95.2% 89.5% 99.4% 96.2% Phishing 52 66 42 38 78 78 No. 120 122 120 120 120 124 Subjects Warning Warning Interface Chrome Chrome Chrome Chrome Declarative Procedural Declarative Procedural Declarative Procedural Declarative Procedural 2 2 1 0 Warning Warning Frequency a b le 3: Mean Decision Results for Each Condition T Warning Presentation Before After (Before, decisions of phishing and legitimate trials, signal-detection parameters ( d ′ c ) by warning presentation of correct Note. Subjects number (No.), percentage for each phase. Declarative, Procedural) and warning interface (Chrome, After), warning frequency,

588 4.1 4.2 4.0 4.3 4.0 4.2 4.0 4.1 4.1 4.4 4.5 4.4 4.1 4.1 4.2 4.2 4.2 4.1 4.0 4.2 4.3 4.2 4.1 4.3 Rating Confidence 14.6 15.2 13.6 12.3 17.8 11.8 11.3 10.7 15.0 11.2 11.7 11.6 14.7 13.1 13.1 14.4 14.2 11.1 12.2 11.9 14.4 14.5 12.5 10.6 Time (s) Viewing Phase 3 (1-week later) 73 68 66 35 44 47 28 24 25 73 68 66 77 80 72 35 44 47 28 24 25 77 80 72 No. Subjects Returned 4.3 4.3 4.2 4.3 4.2 4.3 4.2 4.2 4.3 4.5 4.4 4.5 4.3 4.3 4.4 4.3 4.3 4.4 4.2 4.1 4.2 4.3 4.4 4.5 Rating Confidence Phase 2 9.7 9.6 9.9 9.6 8.4 8.6 8.6 7.7 8.0 7.8 7.7 8.0 8.7 9.8 9.7 8.4 8.6 8.3 8.8 8.8 10.0 10.0 11.6 10.7 Time (s) Viewing 4.2 4.4 4.4 4.4 4.3 4.4 4.3 4.3 4.3 4.7 4.8 4.7 4.3 4.3 4.4 4.6 4.5 4.5 4.4 4.5 4.3 4.7 4.5 4.6 Rating Confidence 9.8 8.9 9.1 8.6 8.0 7.9 9.3 8.3 9.3 8.9 7.9 10.7 11.0 11.1 10.7 10.4 12.0 15.7 18.2 10.0 10.7 14.2 10.1 12.9 Time (s) Viewing Phase 1 66 78 78 52 42 38 78 78 66 42 38 52 122 120 124 122 120 124 120 120 120 120 120 120 No. Subjects Warning Warning Interface Chrome Declarative Procedural Chrome Declarative Procedural Chrome Declarative Procedural Chrome Declarative Procedural Chrome Declarative Procedural Declarative Procedural Chrome Declarative Procedural Chrome Chrome Declarative Procedural 0 1 2 0 2 1 2 2 Warning Warning Frequency After After Before Before Warning Warning Presentation a b le 4: Mean Viewing Time and Confidence Rating of Each Condition Legitimate Trial Type Trial After), warning frequency, (Before, Note. Subjects number (No.), mean viewing time, and confidence rating by trial type (Phishing, Legitimate), warning presentation for each phase. Declarative, Procedural) and warning interface (Chrome, T Phishing

589 590 June 2019 - Human Factors less time (8.8 s) than participants who did not see regardless of interfaces or when the warning was the warning (10.5 s), p = .037, suggesting partici- presented, Fs < 1.0. However, the 2-way interac- pants who did not see the warning may develop tion of interface × Before/After warning presen- the habit of checking a webpage’s legitimacy. tation was significant, F(2,708) = 4.14, p = .016, 2 hp = .012. Viewing times did not differ in the Phase 2: Short-term effect of embedded After condition across Chrome, Declarative, and training. Participants judged the legitimacy of Procedural conditions (9.3 s, 9.0 s, 8.5 s), F < another 10 webpages without warnings being 1.0, but increased in the Before condition (8.0 s, presented for the two phishing webpages. See 2 8.7 s, 10.0 s), F(2,357) = 3.55, p = .030, hp = .010. results in Tables 3 and 4. Warning frequencies in the After condition. Signal-detection parameters. Participants’ Participants’ sensitivity differed across the three sensitivity to phishing webpages differed across frequencies (d′zero = 1.86, d′once = 1.61, d′twice = 2 2 the three interfaces, F(2,708) = 3.87, p = .021, hp = 1.42), F(2,711) = 12.19, p < .001, hp = .033. Post- .011. Post-hoc comparisons indicated that the hoc analysis revealed that participants who did sensitivity for the Procedural condition (d′ = not see warnings showed greater sensitivity than 1.67) was larger than that of the Chrome condi- those who saw the warning once (p = .003) and tion (d′ = 1.42), p = .016, but not significantly twice (p < .001), which did not differ, p < .209. different from that of the Declarative condition The bias to judge webpages as legitimate dif- (d′ = 1.57), p = .526. The difference between the fered across frequencies as well, F(2,711) = 23.19, 2 Chrome and the Declarative conditions was not p < .001, hp = .061. There were differences significant, p = .210. Whether the warning was between each pair, ps < .002, with bias being presented Before or After the webpages’ presen- smallest for participants who never saw a warn- tation did not influence participants’ sensitivity ing (c = 0.18), intermediate for those who saw

(d′before = 1.57, d′after = 1.54), F < 1.0. the warning once (c = 0.30), and largest for those Positive c values indicate that participants had who saw it twice (c = 0.45). Warning interface a bias to identify webpages as safe when there did not show a main effect nor interact with was no warning present. Although participants warning frequency for both measures, Fs < 1.35. showed similar bias regardless of whether the After the distraction task, the average view- warning was presented Before or After phishing ing time for phishing webpages differed across webpages (cbefore = 0.38, cafter = 0.36), F < 1.0, the the three frequencies, F(2,711) = 5.14, p = .006, 2 bias was smaller for the two training-embedded hp = .014. Post-hoc comparisons showed that conditions (cdeclarative = 0.33, cprocedural = 0.32) than participants who saw the warning once in Phase for the Chrome condition (cchrome = 0.46), F(2,708) 1 spent longer viewing time (10.3 s) than par- 2 = 7.43, p = .001, hp = 021. Moreover, the benefit ticipants who did not see it (7.8 s), p = .005, but for the two training-embedded interfaces was neither differed significantly from that for par- evident in the Before condition (cchrome = 0.53, ticipants who saw the warning twice (9.2 s; ps > cdeclarative = 0.29, cprocedural = 0.33) but not the After .462). Viewing times for legitimate webpages condition (cchrome = 0.39, cdeclarative = 0.37, cprocedural approached significance across frequencies, 2 2 = 0.31), F(2,708) = 3.85, p = .022, hp = .011. F(2,711) = 2.81, p = .061, hp = .008 (zero = 9.8 s; Larger sensitivity but less bias provides evi- once = 9.3 s, twice = 8.3 s). dence for the short-term effect of the two train- Post-session questionnaire. For their esti- ing-embedded warnings. The benefit for the mates of falling for a phishing attack, partici- Procedural interface was suggested by the best pants’ ratings of 1 (definitely not falling for sensitivity across the three interfaces. phishing) increased by 12.4% after the first two Viewing times. Participants spent a similar phases, mainly due to a change of the ratings of amount of time on phishing webpages regard- 2 and 3 to rating of 1. The increases of not-fall- less of which warning interface was presented ing-for-phishing were similar for the Before and initially or when the warning was presented (see After conditions (12% vs. 13%), across the three Table 4), Fs < 2.78. For legitimate webpages, interfaces (Chrome vs. Declarative vs. Proce- participants spent a similar amount of time dural: 10% vs. 14% vs. 13%), and, in the After Embedded Antiphishing Training 591 condition, between the two warning frequencies the warning once (d′ = 1.52) was similar to those (once vs. twice: 14% vs. 12%). With regard to a who did not see warnings in Phase 1 (d′ = 1.66), potential outcome of a phishing attack, 84% of p = .337, indicating an effect of embedded train- participants chose the correct answer (i.e., ing. However, participants who saw the warning Someone may steal your number and twice continued to show less sensitivity discrim- make bad charges). The results were similar inating noise and signal (d′ = 1.23) than partici- regardless of when the warning was presented pants in the other two conditions, ps < .055. (Before vs. After: 84% vs. 84%), the interface After one week, participants’ bias to judge web- seen (Chrome vs. Declarative vs. Procedural: pages as safe differed across frequencies as well, 2 82% vs. 86% vs. 84%), and the warning fre- F(2,401) = 7.37, p < .001, hp = .034. Participants quencies for the After condition (once vs. twice: who did not see warnings still showed less bias 81% vs. 86%). (c = 0.11) than those who saw warnings once (c = 0.22, p = .014) and twice (c = 0.27, p = Phase 3: Long-term effect of embedded .002), which did not differ (p = .638). The effect training. The results of Phase 3 (see Tables of frequency did not interact with interface for 3 and 4) were analyzed similarly as Phase 2. both measures, Fs < 1.32. Only 59% of people from Phase 1 participated. Participants spent similar time on phishing We compared the results of Phases 1 and 2 webpages across the three frequencies F(2,401) = 2 for participants who returned with those who 1.93, p = .165, hp = .005. For legitimate web- did not, and the pattern of results was similar. pages, viewing time did not show any difference Signal-detection parameters. After one across frequencies either, Fs < 2.49. week, participants’ detectability still differed Post-experiment questionnaire. About 43% of across interfaces (d′chrome = 1.30, d′declarative = the participants chose the correct answer www. 1.49, d′procedural = 1.56), F(2,426) = 3.03, p = .049, pages.ebay.com/community/index.html among 2 hp = .015. Post-hoc analysis revealed that only the other five spoofed phishing URLs. The ­correct the difference between the Chrome and the decision rates were similar for the Before (43%) Procedural conditions was significant (p = and After (36%) conditions. Correct decision rate .048), but not the other two pairs (ps > .215), tended to be larger for the Procedural condition suggesting a benefit of the Procedural inter- (46%) than for the Declarative (32%) or Chrome 2 face. Participants’ bias to identify webpages as (37%) condition, χ (2) = 5.29, p = .070. For the safe also differed across interfaces, F(2,426) = After condition, correct decision rates were the 2 3.41, p = .034, hp = .016. Bias of the Chrome same 36% for the two warning frequencies. condition (c = 0.32) was larger than that of The top two incorrect answers were www the Declarative (c = 0.21, p = .054) and the .ebay.com.ebay-billing.us/login (43%) and www. ­Procedural (c = 0.22, p = .068) conditions, goecities.com/www.paypal.com (10%), both of which did not differ (p = .997). Whether the which used a spoof method that is different from warning was presented before or after the those used in the current study. For the most- ­webpages showed no impact on participants’ selected wrong answer, the selection rates were sensitivity (d′before = 1.49, d′after = 1.41) or bias similar for the Before and After conditions (46% (cbefore = 0.25, cafter = 0.24), Fs < 1.0. It did not vs. 42%). Across the three interfaces, the Proce- interact with interface either, Fs < 1.10. dural condition showed the smallest error rate Viewing times. One week after training, for (36%), followed by Chrome (45%) and the both phishing and legitimate trials, viewing Declarative (50%) conditions. For the After con- times were similar regardless of interfaces or dition, error rates were 42% for both warning fre- when the warning was presented, Fs < 1.84. quencies. Only 4% of participants chose one of Warning frequencies in the After condition. the spoofed phishing URLs that were similar to The main effect of frequency was also signifi- the methods used in training (www.account-veri- 2 cant after one week, F(2,401) = 6.87, p = .001, hp fyication.com/ebay/verify, www.147.46.236.66/ = .033. The sensitivity of participants who saw paypal/login.html, or www.paypa1.com). 592 June 2019 - Human Factors

Discussion different from those used in the current study. We proposed two training-embedded warn- This result suggests that the transfer benefit evi- ing interfaces and evaluated their effects in help- dent in Phases 2 and 3 may be restricted to the ing users to identify phishing webpages. The specific spoof methods, also termed near-trans- signal-detection analyses of Phase 1 showed fer effects (Perkins & Salomon, 1992). There- that the two new interfaces were comparable to fore, in practice, different spoof methods need to the current Chrome warning. When no warn- be implemented in training-embedded warnings ings were displayed in Phases 2 and 3, larger to give users more varied training opportunities. sensitivity and smaller bias were obtained for the training-embedded interfaces, most clearly Action Effect the Procedural interface. Together with the lon- When warnings were presented as a conse- ger viewing time of the two warning-embedded quence of users’ action selection, the effect of the interfaces at Phase 1, these results suggest that embedded-training was evident for participants participants processed the training messages. who saw the warning once but not those who Participants improved their ability to identify saw it twice. In the After condition, participants fraudulent webpages from viewing the training- who were knowledgeable about phishing scams embedded warning interfaces in both short-term never saw the warning and showed better per- and long-term (see also Kumaraguru et al., 2007, formance than participants who saw the warning 2009), indicating that participants can retain and once or twice in Phase 1. However, one week transfer the knowledge gained from the embed- later, participants who saw the warning once ded training despite a limited opportunity for showed a similar detectability of phishing web- training. Furthermore, participants’ improved pages as the knowledgeable participants, without performance was obtained without the expense spending longer time. Also, their performance of extra time to identify the phishing webpages, was better than that of the participants who saw suggesting that it is efficient to use the domain the warning twice. Therefore, participants who name to identify phishing webpages. saw the warning once acquired knowledge from the training message, whereas participants who Procedural vs. Declarative Interfaces saw the warning twice did not learn much. By using different website categories in Compared with the After condition, a benefit of evaluating the short- and long-term effects of the Before condition is that all users see the warning interfaces, the retention and transfer of embedded training and get a training opportunity. knowledge gained from both embedded-training Also, presenting the warning ahead has been interfaces was evident. Moreover, the Proce- shown to capture users’ attention initially (Wogal- dural interface showed better sensitivity at iden- ter et al., 1987), which should increase the likeli- tifying phishing webpages in both short term hood of users processing the training message. and long term compared with the Chrome inter- Presenting the warning before or after the web- face. However, different from our expectations, page did not have a significant impact on the cor- the Procedural interface showed only numeri- rect decision rate on phishing webpages in gen- cally better sensitivity than the Declarative eral. Thus, even when all participants were interface, which may be due to there being only exposed to two training opportunities in the Before two critical steps in identifying a webpage’s condition, some participants may have learned legitimacy based on the domain name. A benefit from the interfaces and some may have not. Future of the Procedural interface was also implied work is needed to investigate what factors contrib- by the test of identifying phishing URLs one ute to the difference between participants. week later. Participants who saw the Procedural interface tended to select more correct answers Limitations and fewer incorrect answers, indicating better We used an experimental research method, transfer of the declarative knowledge (domain manipulating warning interfaces and when the names) by using the stepwise instructions. warning was presented, to obtain webpage legiti- Over 40% of participants mistakenly selected macy decisions in different phases. We are aware URLs as correct that employed a spoof method of studies using more ecologically valid methods Embedded Antiphishing Training 593

(e.g., Felt et al., 2015), but we decided to present Lablet through North Carolina State University and screenshots to exclude extraneous variables that by the National Science Foundation under Grant No. may have an effect on the outcomes. By doing 1314688. this, we are confident about the internal validity of the obtained results. Because the question of Key Points how far a study’s results can be generalized to the •• Embedding training within phishing warning real world is important, it is essential to evaluate addresses simultaneously two key factors impact- training-embedded effectiveness in more natu- ing users’ decisions of webpage legitimacy, aware- ralistic settings. Another possible confound was ness and action. that the two new interfaces are novel, whereas •• A short and simple embedded training that focuses participants may have experienced the Chrome on how to use the domain names of phishing and interface previously. Although the novelty effect the legitimate webpages to identify a phishing may play a role in Phase 1, better performance webpage can be retained and transferred to other of the two new interfaces was evident without webpages even one week later. warnings afterwards. Finally, our participants •• Using browser warnings as a medium enables were highly educated and young, so generalizing cybersecurity training at scale to reach the general the findings to other user populations needs to be user population. further examined. References Practical Implications Al-Daeef, M. M., Basir, N., & Saudi, M. M. (2017). Security This study extended prior research about awareness training: A review. In Proceedings of the World embedded training and showed a proof-of-con- Congress on Engineering (pp. 446–451). London, U.K. Anandpara, V., Dingman, A., Jakobsson, M., Liu, D., & Roines- cept for training-embedded warning. First, we tad, H. (2007). Phishing IQ tests measure fear, not ability. In showed the effectiveness of including training S. Dietrich & R. Dhamija (Eds.), Financial cryptography and within a phishing warning, which provides a data security. (Lecture Notes in Computer Science, vol. 4886, solution for implementing security training and pp. 362–366). Berlin: Springer. Anderson, J. R. (2013). The architecture of cognition. New York, reaching the general user population at scale. NY: Psychology Press. Second, instead of using lots of intensive training Bravo-Lillo, C., Cranor, L. F., Downs, J., & Komanduri, S. (2011). content, our embedded trainings are simple and Bridging the gap in warnings: A mental short, focusing on the domain name, the most model approach. IEEE Security & Privacy, 9, 18–26. Canfield, C. I., Fischhoff, B., & Davis, A. (2016). Quantifying reliable cue for phishing detection. Such light- phishing susceptibility for detection and behavior decisions. weighted training is easy for implementation and Human Factors, 58, 1158–1172. does not cost users much time and effort. Third, Caputo, D. D., Pfleeger, S. L., Freeman, J. D., & Johnson, M. E. displaying training-embedded warnings before (2014). Going spear phishing: Exploring embedded training and awareness. IEEE Security & Privacy, 12, 28–38. webpage presentation addresses the issue of users Carpenter, S., Zhu, F., & Kolimi, S. (2014). Reducing online failing to attend to the training message. Fourth, identity disclosure using warnings. Applied Ergonomics, 45, both short- and long-term benefits of the Proce- 1337–1342. Chou, N., Ledesma, R., Teraguchi, Y., & Mitchell, J. C. (2004). dural interface suggest an advantage of the com- Client-side defense against web-based . In Pro- patibility between training format and training ceedings of the 11th Annual Network and Distributed System content. Finally, our training-embedded warning Security Symposium. http://crypto.stanford.edu/SpoofGuard/ interfaces validate the idea of combining differ- webspoof.pdf Dhamija, R., & Tygar, J. D. (2005). The battle against phishing: ent strategies to protect people from phishing Dynamic security skins. In Proceedings of the 2005 Sympo- attacks. The hybrid of warning/training suggests sium on Usable Privacy and Security (pp. 77–88). New York, that security researchers and practitioners should NY: ACM. consider combining different strategies to solve Dhamija, R., Tygar, J. D., & Hearst, M. A. (2006). Why phishing works. In Proceedings of the SIGCHI Conference on Human phishing and other cybersecurity issues. Factors in Computing Systems (pp. 581–590). New York, NY: ACM. Acknowledgments Dodge, R. C., Carver, C., & Ferguson, A. J. (2007). Phishing for user security awareness. Computers & Security, 26, 73–80. This research was supported by a National Secu- Downs, J. S., Barbagallo, D., & Acquisti, A. (2015). Predictors rity Agency Grant as part of a Science of Security of risky decisions: Improving judgment and decision making 594 June 2019 - Human Factors

based on evidence from phishing attacks. In E. A. Wilhelms & Laughery, K. R., & Wogalter, M. S. (2006). Designing effective warn- V. F. Reyna (Eds.), Neuroeconomics, judgment, and decision ings. Reviews of Human Factors and Ergonomics, 2, 241-271. making (pp. 239–253). New York, NY: Psychology Press. Lee, Y. S., & Vakoch, D. A. (1996). Transfer and retention of Egelman, S., Cranor, L. F., & Hong, J. (2008). You’ve been warned: implicit and explicit learning. British Journal of Psychology, An empirical study of the effectiveness of web browser phish- 87, 637–651. ing warnings. In Proceedings of the SIGCHI Conference on Lin, E., Greenberg, S., Trotter, E., Ma, D., & Aycock, J. (2011). Human Factors in Computing Systems (pp. 1065–1074). New Does domain highlighting help people identify phishing sites? York., NY: ACM. In Proceedings of the SIGCHI Conference on Human Factors FBI (2018). 2017 Internet report. Retrieved from https://pdf in Computing Systems (pp. 2075–2084). New York, NY: ACM. .ic3.gov/2017_IC3Report.pdf MacLeod, C. M. (1991). Half a century of research on the Stroop effect: Felt, A. P., Ainslie, A., Reeder, R. W., Consolvo, S., Thyagaraja, An integrative review. Psychological Bulletin, 109, 163–203. S., Bettes, A., & Grimes, J. (2015). Improving SSL warnings: Oberauer, K. (2010). Declarative and procedural working memory: Comprehension and adherence. In Proceedings of the 33rd Common principles, common capacity limits? Psychologica Annual ACM Conference on Human Factors in Computing Belgica, 50, 3–4. Systems (pp. 2893–2902). New York, NY: ACM. Orgill, G. L., Romney, G. W., Bailey, M. G., & Orgill, P. M. (2004). Ferguson, A. J. (2005). Fostering e-mail security awareness: The The urgency for effective user privacy-education to counter West Point carronade. Educause Quarterly, 28, 54–57. social engineering attacks on secure computer systems. In Fette, I., Sadeh, N., & Tomasic, A. (2007). Learning to detect Proceedings of the 5th Conference on Information Technology phishing . In Proceedings of the 16th International Con- Education (pp. 177–181). New York, NY: ACM. ference on World Wide Web (pp. 649–656). New York, NY: Parsons, K., McCormac, A., Pattinson, M., Butavicius, M., & Jer- ACM. ram, C. (2015). The design of phishing studies: Challenges for Fu, A. Y., Liu, W. Y., & Deng, X. (2006). Detecting phishing web researchers, Computers & Security, 52, 194–206. pages with visual similarity assessment based on earth mov- Perkins, D. N., & Salomon, G. (1992). Transfer of learning. In er’s distance (EMD). IEEE Transactions on Dependable and T. N. Postlethwaite & T. Husen (Eds.), International encyclo- Secure Computing, 3, 301–311. pedia of education (2nd ed.; pp. 6452–6457). Oxford, England: Hautus, M. J. (1995). Corrections for extreme proportions and their Pergamon Press. biasing effects on estimated values of d′. Behavior Research PhishTank. (2018). Stats. Retrieved from https://www.phishtank Methods, Instruments, & Computers, 27, 46–51. .com/stats.php Healy, A. F., & Bourne, L. E., Jr. (Eds.) (2012). Training cognition: Prakash, P., Kumar, M., Kompella, R. R., & Gupta, M. (2010). Phish- Optimizing efficiency, durability, and generalizability. New net: Predictive blacklisting to detect phishing attacks. In Proceed- York, NY: Psychology Press. ings of INFOCOM, IEEE (pp. 1–5). Piscataway, NJ: IEEE. Herzberg, A., & Gbara, A. (2004). Trustbar: Protecting (even Proctor, R. W., & Chen, J. (2015). The role of human factors/ergo- naive) web users from spoofing and phishing attacks. Cryp- nomics in the science of security: Decision making and action tology ePrint Archive, Report 2004/155. http://eprint.iacr. selection in cyberspace. Human Factors, 57, 721–727. org/2004/155. Roediger III, H. L., Dudai, Y., & Fitzpatrick, S. M. (Eds.) (2007). Hommel, B., Müsseler, J., Aschersleben, G., & Prinz, W. (2001). Science of memory: Concepts. New York, NY: Oxford Univer- The theory of event coding (TEC). A framework for perception sity Press. and action. Behavioral & Brain Sciences, 24, 849–937. Schmidt, R. A., & Bjork, R. A. (1992). New conceptualizations of Hong, J. (2012). The state of phishing attacks. Communications of practice: Common principles in three paradigms suggest new the ACM, 55, 74–81. concepts for training. Psychological Science, 3, 207–217. Jagatic, T. N., Johnson, N. A., Jakobsson, M., & Menczer, F. (2007). Sheng, S., Wardman, B., Warner, G., Cranor, L. F., Hong, J., & Social phishing. Communications of the ACM, 50, 94–100. Zhang, C (2009, July). An empirical analysis of phishing Kelley, C. M., Hong, K. W., Mayhorn, C. B., & Murphy-Hill, E. blacklists. In Proceedings of the 6th Conference on and (2012). Something smells phishy: Exploring definitions, con- Anti-Spam, CEAS’09. Mountain View, CA. sequences, and reactions to phishing. In Proceedings of the Stone, A. (2007). Natural-language processing for intrusion detec- 56th Human Factors and Ergonomics Society Annual Meeting tion. Computer, 40, 103–105. (pp. 2108–2112). Santa Monica, CA: Human Factors and Whittaker, C., Ryner, B., & Nazif, M. (2010). Large-scale auto- Ergonomics Society. matic classification of phishing pages. In Proceedings of the Khonji, M., Iraqi, Y., & Jones, A. (2013). Phishing detection: A Network and Distributed System Security Symposium, NDSS literature survey. IEEE Communications Surveys & Tutorials, 2010, San Diego, CA. 15, 2091–2121. Wogalter, M. S., Godfrey, S. S., Fontenelle, G. A., Desaulniers, D. Kumaraguru, P., Cranshaw, J., Acquisti, R., Cranor, L., Hong, J., R., Rothstein, P. R., & Laughery, K. R. (1987). Effectiveness of Blair, M. A., & Pham, T. (2009). A real-word evaluation of warnings. Human Factors, 29, 599–612. anti-phishing training (Technical report) Pittsburgh, PA: Carn- Wogalter, M. S., & Mayhorn, C. B. (2008). Trusting the internet: egie Mellon University. Cues affecting perceived credibility. International Journal of Kumaraguru, P., Rhee, Y., Acquisti, A., Cranor, L. F., Hong, J., Technology and Human Interaction, 4, 75–93. & Nunge, E. (2007). Protecting people from phishing: The Wu, M., Miller, R. C., & Garfinkel, S. L. (2006). Do security tool- design and evaluation of an embedded training email system. bars actually prevent phishing attacks? In Proceedings of the In Proceedings of the SIGCHI Conference on Human Factors SIGCHI Conference on Human Factors in Computing Systems in Computing Systems (pp. 905–914). New York, NY: ACM. (pp. 601–610). New York, NY: ACM. Kumaraguru, P., Sheng, S., Acquisti, A., Cranor, L. F., & Hong, J. Xiong, A., Proctor, R. W., Yang, W., & Li, N. (2017). Is domain (2010). Teaching Johnny not to fall for phish. ACM Transac- highlighting actually helpful in identifying phishing web tions on Internet Technology, 10 (2), Article 7. pages? Human Factors, 59, 640–660. Embedded Antiphishing Training 595

Yang, W., Xiong, A., Chen, J., Proctor, R. W., & Li, N. (2017). University, West Lafayette, Indiana. He received his Use of phishing training to improve security warning compli- PhD in experimental psychology from the Univer- ance: Evidence from a field experiment. In Proceedings of the Hot Topics in Science of Security: Symposium and Bootcamp sity of Texas at Arlington in 1975. (pp. 52–61). New York, NY: ACM. Zhang, Y., Hong, J. I., & Cranor, L. F. (2007). Cantina: A content- Weining Yang works at Google, Inc. He received his based approach to detecting phishing web sites. In Proceed- PhD in computer science from Purdue University in ings of the 16th International Conference on World Wide Web (pp. 639–648). New York, NY: ACM. August, 2016.

Aiping Xiong is an assistant professor in the College of Ninghui Li is a professor in the Computer Science Information Sciences and Technology at the Pennsyl- Department at Purdue University. He received his vania State University in University Park. She earned PhD in computer science from New York University her MS in industrial engineering in 2014 and PhD in in 2000. Cognitive Psychology in 2017 from Purdue University.

Robert W. Proctor is a distinguished professor in the Date received: June 2, 2017 Department of Psychological Sciences at Purdue Date accepted: October 7, 2018