
PhishFarm: A Scalable Framework for Measuring the Effectiveness of Evasion Techniques Against Browser Blacklists

Adam Oest*, Yeganeh Safaei*, Adam Doupé*, Gail-Joon Ahn*§, Brad Wardman†, Kevin Tyers†
*Arizona State University, §Samsung Research, †PayPal, Inc.
{aoest, ysafaeis, doupe, gahn}@asu.edu, {bwardman, ktyers}@paypal.com

Abstract—Phishing attacks have reached record volumes in recent years. Simultaneously, modern phishing attacks are growing in sophistication by employing diverse techniques to avoid detection by security infrastructure. In this paper, we present PhishFarm: a scalable framework for methodically testing the resilience of anti-phishing entities and browser blacklists to attackers' evasion efforts. We use PhishFarm to deploy 2,380 live phishing sites (on new, unique, and previously-unseen .com domains), each using one of six different HTTP request filters based on real phishing kits. We reported subsets of these sites to 10 distinct anti-phishing entities and measured both the occurrence and timeliness of native blacklisting in major web browsers to gauge the effectiveness of the protection ultimately extended to victim users and organizations. Our experiments revealed shortcomings in current infrastructure, which allow some phishing sites to go unnoticed by the security community while remaining accessible to victims. We found that simple cloaking techniques representative of real-world attacks (including those based on geolocation, device type, or JavaScript) were effective in reducing the likelihood of blacklisting by over 55% on average. We also discovered that blacklisting did not function as intended in popular mobile browsers (Chrome, Safari, and Firefox), which left users of these browsers particularly vulnerable to phishing attacks. Following disclosure of our findings, anti-phishing entities are now better able to detect and mitigate several cloaking techniques (including those that target mobile users), and blacklisting has also become more consistent between desktop and mobile platforms, but work remains to be done by anti-phishing entities to ensure users are adequately protected. Our PhishFarm framework is designed for continuous monitoring of the ecosystem and can be extended to test future state-of-the-art evasion techniques used by malicious websites.

I. INTRODUCTION

Phishing has maintained record-shattering levels of volume in recent years [1] and continues to be a major threat to today's Internet users. In 2018, as many as 113,000 unique monthly phishing attacks were reported to the APWG [2]. Beyond damaging well-known brands and compromising victims' identities, financials, and accounts, cybercriminals annually inflict millions of dollars of indirect damage due to the necessity of an expansive anti-abuse ecosystem which serves to protect the targeted companies and consumers [3]. With an ever-increasing number of Internet users and services, in particular on mobile devices [4], the feasibility of social engineering on a large scale is also increasing. Given the potential for lucrative data, phishers are engaged in a tireless cat-and-mouse game with the ecosystem and seek to stay a step ahead of mitigation efforts to maximize the effectiveness of their attacks. Although new phishing attack vectors are emerging (e.g. via social media as a distribution channel [5]), malicious actors still primarily deploy "classic" phishing websites [2]. These malicious sites are ultimately accessed by victim users who are tricked into revealing sensitive information.

Today's major web browsers, both on desktop and mobile platforms, natively incorporate anti-phishing blacklists and display prominent warnings when a user attempts to visit a known malicious site. Due to their ubiquity, blacklists are a user's main, and at times only, technical line of defense against phishing. Unfortunately, blacklists suffer from a key weakness: they are inherently reactive [6]. Thus, a malicious site will generally not be blocked until its nature is verified by the blacklist operator. Phishing sites actively exploit this weakness by leveraging cloaking techniques [7] to avoid or delay detection by blacklist crawlers. Cloaking has only recently been scrutinized in the context of phishing [8]; to date, there have been no formal studies of the impact of cloaking on blacklisting effectiveness (despite numerous empirical analyses of blacklists in general). This shortcoming is important to address, as cybercriminals could potentially be causing ongoing damage without the ecosystem's knowledge.

In this paper, we carry out a carefully-controlled experiment to evaluate how 10 different anti-phishing entities respond to reports of phishing sites that employ cloaking techniques representative of real-world attacks. We measure how this cloaking impacts the effectiveness (i.e. site coverage and timeliness) of native blacklisting across major desktop and mobile browsers. We performed preliminary tests in mid-2017, disclosed our findings to key entities (including Google Safe Browsing, browser vendors, and the APWG), and conducted a full-scale retest in mid-2018. Uniquely and unlike prior work, we created our own (innocuous) PayPal-branded phishing websites (with permission) to minimize confounding effects and allow for an unprecedented degree of control.

Our work reveals several shortcomings within the anti-phishing ecosystem and underscores the importance of robust, ever-evolving anti-phishing defenses with good data sharing. Through our experiments, we found that cloaking can prevent browser blacklists from adequately protecting users by significantly decreasing the likelihood that a phishing site will be blacklisted, or substantially delaying blacklisting, in particular when geolocation- or device-based request filtering techniques are applied. Moreover, we identified a gaping hole in the protection of top mobile browsers: shockingly, mobile Chrome, Safari, and Firefox failed to show any blacklist warnings between mid-2017 and late 2018 despite the presence of security settings that implied blacklist protection. As a result of our disclosures, users of the aforementioned mobile browsers now receive comparable protection to desktop users, and anti-phishing entities now better protect against some of the cloaking techniques we tested. We propose a number of additional improvements which could further strengthen the ecosystem, and we will freely release the PhishFarm framework¹ to interested researchers and security organizations for continued testing of anti-phishing systems and potential adaptation for measuring variables beyond just cloaking. Thus, the contributions of this work are as follows:

• A controlled empirical study of the effects of server-side request filtering on blacklisting coverage and timeliness in modern desktop and mobile browsers.
• A reusable, automated, scalable, and extensible framework for carrying out our experimental design.
• Identification of actionable real-world limitations in the current anti-phishing ecosystem.
• Enhancements to blacklisting infrastructure after disclosure to anti-phishing entities, including phishing protection in mobile versions of Chrome, Safari, and Firefox.

II. BACKGROUND

Phishing is a type of social engineering attack that seeks to trick victims into disclosing sensitive information, often via a fraudulent website which impersonates a real organization. Attackers (phishers) then use the stolen data for their own monetary gain [9], [10], [11]. Cybercriminals are vehement in their attacks and the scale of credential theft cannot be overstated. Between March 2016 and 2017, malware, phishing, and data breaches led to 1.9 billion usernames and passwords being offered for sale on black market communities [12].

A. Phishing Attacks

An online phishing attack consists of three stages: preparation, distribution, and data exfiltration, respectively. First, prior to involving any victims, an attacker deploys a spoofed version of a legitimate website (by copying its look and feel) such that it is difficult for an average user to discern that it is fake. This deployment can be done using a phishing kit, as discussed in Section II-B. Second, the attacker sends messages (such as spam e-mails) to the user (leveraging social engineering to insist that action is needed [13]) and lures the user to click on a link to the phishing site. If the victim is successfully fooled, he or she then visits the site and submits sensitive information such as account credentials or credit card numbers. Finally, the phishing site transmits the victim's information back to the phisher, who will attempt to fraudulently use it for monetary gain either directly or indirectly [14].

Phishing attacks are ever-evolving in response to ecosystem standards and may include innovative components that seek to circumvent existing mitigations. A current trend (mimicking the wider web) is the adoption of HTTPS by phishing sites, which helps them avoid negative security indicators in browsers and may give visitors a false sense of security [15], [16]. At the end of 2017, over 31% of all phishing sites reported to the APWG used HTTPS, up from less than 5% a year prior [2]. Another recent trend is the adoption of redirection, which allows attackers to distribute a link that differs from the actual phishing landing page. Redirection chains commonly consist of multiple hops [17], [18], each potentially leveraging a different URL shortening service or open redirection vulnerability [19]. Notably, redirection allows the number of unique phishing links being distributed to grow well beyond the number of unique phishing sites, and such links might thus better slip past spam filters or volume-based phishing detection [20]. Furthermore, redirection through well-known services such as bit.ly may better fool victims [5], though it also allows the intermediaries to enact mitigations. Ultimately, phishers cleverly attempt to circumvent existing controls in an effort to maximize the effectiveness of their attacks. The anti-phishing community should seek to predict such actions and develop new defenses while ensuring that existing controls remain resilient.

B. Phishing Kits

A phishing kit is a unified collection of tools used to deploy a phishing site on a web server [21]. Kits are generally designed to be easy to deploy and configure; when combined with mass-distribution tools [9], they greatly lower the barrier to entry and enable phishing on a large scale [22]. They are part of the cybercrime-as-a-service economy [23] and are often customized to facilitate a specific attack [24]. In the wild, they are deployed on compromised infrastructure or infrastructure managed directly by malicious actors.

1) Cloaking: A recent study found that phishing kits commonly use server-side directives to filter (i.e. block or turn away) unwanted (i.e. non-victim) traffic, such as bots, anti-phishing crawlers, researchers, or users in locations that are incompatible with the phishing kit [8]. Attributes such as the visitor's IP address, hostname, user agent, or referring URL are leveraged to implement these filtering techniques.

Similar approaches, known as cloaking, have historically been used by malicious actors to influence search engine rankings by displaying different web content to bots than human visitors [7]. Users follow a misleading search result and are thus tricked into visiting a site which could contain malware or adware. Relatively little research has focused on cloaking outside of search engines; our experiment in Section III is the first controlled study, to the best of our knowledge, that measures the impact of cloaking within phishing attacks.

¹Framework details are available at https://phishfarm-project.com
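To make the server-side request filtering described above concrete, the following is a minimal Python sketch of the kind of cloaking logic a kit might apply. Every blocklist entry, token list, and function name here is an illustrative assumption for exposition, not taken from any real kit or from our experiments.

```python
# Sketch of kit-style cloaking: inspect request metadata (IP, reverse
# hostname, user agent, referrer) and turn away visitors who look like
# security infrastructure. All blocklist entries are hypothetical.

CRAWLER_UA_TOKENS = {"bot", "crawler", "spider", "curl", "wget"}
BLOCKED_HOSTNAME_SUFFIXES = (".googlebot.com", ".example-scanner.net")
BLOCKED_IP_PREFIXES = ("192.0.2.",)  # TEST-NET range used as a placeholder

def is_unwanted_visitor(ip: str, hostname: str,
                        user_agent: str, referrer: str) -> bool:
    """Return True if the request should be turned away (cloaked)."""
    ua = user_agent.lower()
    if any(token in ua for token in CRAWLER_UA_TOKENS):
        return True
    if hostname.endswith(BLOCKED_HOSTNAME_SUFFIXES):
        return True
    if ip.startswith(BLOCKED_IP_PREFIXES):
        return True
    if "phishtank" in referrer.lower():  # hypothetical referrer check
        return True
    return False
```

Real-world kits typically express equivalent rules declaratively in .htaccess deny directives rather than in application code [8]; the effect is the same.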
TABLE I: Overview of market share and blacklist providers of major web browsers (* denotes browsers we tested).

                           Phishing Blacklist           Est. Market Share (Worldwide) [28], [4]
Browser Name               Provider                     7/2017      7/2018
Desktop Web Browsers
Google Chrome*             Google Safe Browsing (GSB)   63.48%      67.60%
Mozilla Firefox*           GSB                          13.82%      11.23%
Safari*                    GSB                           5.04%       5.01%
Internet Explorer (IE)*    Microsoft SmartScreen         9.03%       6.97%
Microsoft Edge*            SmartScreen                   3.95%       4.19%
Opera*                     Opera                         2.25%       2.48%
Others                     Varies                        2.43%       2.52%
Mobile Web Browsers
Google Chrome*             GSB                          50.07%      55.98%
Safari*                    GSB                          17.19%      17.70%
UC Browser                 None                         15.55%      12.49%
Samsung Internet           None                          6.54%       5.12%
Opera* (non-mini)          Opera                         5.58%       4.26%
Android Browser            None                          3.41%       1.95%
Mozilla Firefox*           GSB                           0.06%       0.31%
Others                     Varies                        1.60%       2.19%

C. Anti-phishing Ecosystem

Even though phishing attacks only directly involve the attacker, the victim, and the impersonated organization, a large amount of collateral damage is caused due to the abuse of a plethora of independent systems which ultimately facilitate the attack [25]. Moreover, credential re-use, fueled by the sale of credentials in underground economies [9], [26], causes phishing threats aimed at one organization to potentially affect others. Over time, an anti-phishing ecosystem has matured; it consists of commercial security vendors, consumer antivirus organizations, web hosts, domain registrars, e-mail and messaging platforms, and dedicated anti-phishing entities [27], [8]. These entities consist of enterprise firms that operate on behalf of victim organizations, clearinghouses that aggregate and share abuse data, and the blacklists which directly protect web browsers and other software.

D. Detecting Phishing

The distribution phase of phishing attacks is inevitably noisy, as links to each phishing site are generally sent to a large set of potential victims. Thus, the thousands of phishing attacks reported each month easily translate into messages with millions of recipients [29]. Detection can also occur during the preparation and exfiltration stages. As artifact trails propagate through the security community, they can be used to detect and respond to attacks. Detection methods include classification of e-mails [30], [31], analyzing underground tools and drop zones [10], identifying malicious hostnames through passive DNS [32], URL and content classification [33], [20], [34], [35], malware scanning by web hosts [36], monitoring domain registrations [26] and certificate issuance [37], and receiving direct reports. All of these potential detection methods can result in reports which may be forwarded to anti-phishing entities that power blacklists [6].

E. Browser Blacklists

The final piece of the puzzle in defending against phishing is the conversion of intelligence about attacks into the protection of users. Native browser blacklists are a key line of defense against phishing and malware sites, as they are enabled by default in modern web browsers and thus automatically protect even unaware users. In the absence of third-party security software, once a phishing message reaches a potential victim, browser blacklists are the only technical control that stands between the victim and the display of the phishing content. Studies have shown that blacklists are highly effective at stopping a phishing attack whenever a warning is shown [6]. Such warnings are prominent, but typically only appear after the blacklist operator's web crawling infrastructure verifies the attack; some are also based on proactive heuristics [6].

Today's major web browsers are protected by one of three different blacklist providers, as shown in Table I. While Google Safe Browsing (GSB) and SmartScreen are well-documented standalone blacklists, Opera does not publicly disclose the sources for its blacklist. Prior work by NSS Labs suggests that PhishTank and Netcraft are among Opera's current third-party partners [38]; this is supported by our experimental results. We know that blacklists are effective when they function as intended [39], but is this always the case? Do blacklists offer adequate protection against phishers' evasion efforts such as cloaking, or are phishers launching attacks with impunity? We focus on answering these questions in the rest of this paper.

III. EXPERIMENTAL METHODOLOGY

Our primary research goal is to measure how cloaking affects the occurrence and timeliness of blacklisting of phishing sites within browsers. On a technical level, cloaking relies on filtering logic that restricts access to phishing content based on metadata from an HTTP request. Filtering is widely used in phishing kits [8]; attackers aim to maximize their return on investment by only showing the phishing content to victims rather than anti-abuse infrastructure. If a phishing kit suspects that the visitor is not a potential victim, a 404 "not found", 403 "forbidden", or 30x redirection response code [40] may be returned by the server in lieu of the phishing page. Attackers could also choose to display benign content instead.

Prior studies of browser blacklists have involved observation or honeypotting of live phishing attacks [25], [6], [39], but these tests did not consider cloaking. It is also difficult to definitively identify cloaking techniques simply by observing live sites. Our experimental design addresses this limitation.

A. Overview

At a high level, we deploy our own (sterilized) phishing websites on a large scale, report their URLs to anti-phishing entities, and make direct observations of blacklisting times across major web browsers. We divide the phishing sites into multiple batches, each of which targets a different entity. We further sub-divide these batches into smaller groups with different types of cloaking. Once the sites are live, we report their URLs to the anti-phishing entity being tested, such that the entity sees a sufficient sample of each cloaking technique. We then monitor the entity's response, with our primary metric being the time of blacklisting relative to the time we submitted each report. We collect secondary metrics from web traffic logs of each phishing site. Our approach is thus empirical in nature; however, it is distinct from prior work because we fully control the phishing sites in question and, therefore, have ground truth on their deployment times and cloaking techniques.
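The response-substitution behavior discussed above (returning a 404, 403, or 30x response in lieu of the phishing page) can be sketched as follows. The function name, the "mode" values, and the redirect target are hypothetical simplifications, not the behavior of any specific kit.

```python
# Sketch of response substitution for cloaked requests: an allowed
# visitor receives the landing page; a cloaked visitor receives a 404,
# 403, or a 30x redirect to unrelated benign content instead.

def cloaked_response(allow: bool, mode: str = "not_found"):
    """Return an (HTTP status, headers, body) triple for one request."""
    if allow:
        return 200, {}, "<html><!-- phishing landing page --></html>"
    if mode == "forbidden":
        return 403, {}, "Forbidden"
    if mode == "redirect":
        # Send the visitor to benign content instead of an error page.
        return 302, {"Location": "https://www.example.com/"}, ""
    return 404, {}, "Not Found"
```

From a crawler's perspective, the 404 and 403 variants make the site look dead or inaccessible, while the redirect variant makes it look benign; all three deny the evidence needed for blacklisting.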
TABLE II: Entities targeted by our experiments.

Entity (Report Location)                        Report Type   URLs
Full + Preliminary Tests
  APWG ([email protected])         E-mail        40 (prelim.) +
  Google Safe Browsing ([42])                   Web           396 (full)
  Microsoft SmartScreen ([43])                  Web           per entity
  PayPal ([email protected])                    E-mail
  PhishTank (phish-{username}@phishtank.com)    E-mail
Preliminary Tests Only
  ESET ([44])                                   Web           40
  Netcraft ([45])                               Web           per entity
  McAfee ([46])                                 Web
  US CERT ([email protected])                E-mail
  WebSense ([email protected])                E-mail
10 Entities Total                                             2,380 URLs

All the phishing sites that we used for all of our experiments spoofed the PayPal.com login page (with permission from PayPal, Inc.) as it appeared in mid-2017 (we discuss the hosting approach in Section IV-C2). Our phishing sites each used new, unique, and previously-unseen .com domain names, and were hosted across a diverse set of IP addresses spanning three continents. As part of our effort to reliably measure the time between our report and browser blacklisting for each site, we never reported the same phishing site to more than one entity, nor did we re-use any domains. To summarize, each experiment proceeded as follows:
1) Selecting a specific anti-phishing entity to test.
2) Deploying a large set of new, previously-unseen PayPal phishing sites with desired cloaking techniques.
3) Reporting the sites to the entity.
4) Measuring if and when each site becomes blacklisted across major web browsers.

We split our experiments into two halves: preliminary testing of 10 anti-phishing entities (mid-2017) and full follow-up testing of five of the best-performing entities (mid-2018). The latter test involved a large sample size designed to support statistically significant inferences. In between the two tests, we disclosed our preliminary findings to key entities to afford them the opportunity to evaluate any shortcomings we identified. We discuss our approach in more detail in the following sub-sections and present the results in Section V.

1) Targeted Web Browsers: We strove to maximize total market share while selecting the browsers to be tested. In addition to traditional desktop platforms, we wanted to measure the extent of blacklisting on mobile devices, a market which has grown and evolved tremendously in recent years [41]. We thus considered all the major desktop and mobile web browsers with native anti-phishing blacklists, as listed in Table I. Although we also identified a handful of other web browsers with blacklist protection, such as CM Browser, we did not test them due to their low market share [4].

2) Filter Types: We chose a set of request filtering techniques based on high-level cloaking strategies found in a recent study of .htaccess files from phishing kits [8]. Exhaustively measuring every possible combination of request filters was not feasible with a large sample size; we therefore chose a manageable set of filtering strategies which we felt would be effective in limiting traffic to broad yet representative groups of potential victims while remaining simple (for criminals) to implement, drawing inspiration from techniques found in the wild [7]. Table III summarizes our filter selections.

TABLE III: Cloaking techniques used by our phishing sites.

Filter   Cloaking Filter Name           HTTP Request Criteria
A        Control                        Allow all
B        Mobile Devices                 Allow if user agent indicates: Android or iOS
C        US Desktop GSB Browsers        Allow if IP country (is / is not, respectively) US and
D        Non-US Desktop GSB Browsers    user agent indicates: Chrome, Firefox, or Safari; and
                                        Windows, Macintosh, or Linux; and not Opera, IE, or
                                        Edge; and not iOS or Android
E        Block Known Security Entities  Allow if user agent, referrer, hostname, and IP: not
                                        known to belong to a security entity or bot
F        "Real" Web Browsers            Allow all; content retrieved asynchronously during
                                        JavaScript onload event

It would not be responsible of us to disclose the exact conditions required of each filter, but we can discuss them at a high level. Filter A served as our control group; our expectation was for every site in this group to be blacklisted at least as quickly as other sites. Filter B sought to study how well mobile-only phishing sites are blacklisted, coinciding with the recent uptick in mobile users and phishing victims [28], [2]. Filters C and D focus specifically on desktop browsers protected by GSB, which span the majority of desktop users today. We also included geolocation, which, while not as frequent in the wild as other cloaking types [8], is highly effective due to low detectability and characteristics of spearphishing. A secondary motivation was to see how well entities other than GSB protect this group. Filter E is the most elaborate, but it also directly emulates typical real-world phishing kits. It is based on top .htaccess filters from a large dataset of recent phishing kits [8]. This filter seeks to block anti-phishing entities by hundreds of IP addresses, hostnames, referrers, and user agents. Finally, Filter F only displays phishing content if the client browser can execute JavaScript, which may defeat simple script-based web crawlers. This filter is common in modern search engine cloaking [7] and further motivated by an ongoing study we are conducting of JavaScript use in real-world phishing sites. Although today's phishing kits tend to favor straightforward filtering techniques (namely approaches such as Filter E or geolocation), growing sophistication and adoption of cloaking (and other types of evasion) is technically feasible and to be expected as a risk to the anti-phishing ecosystem.

3) Tested Entities: In addition to the blacklist operators themselves, major clearinghouses, and PayPal's internal anti-phishing system, we wanted to test as many of the other types of anti-phishing entities discussed in Section II-C as possible. We started with a recently-published list of entities commonly targeted for evasion by phishers [8]. We then made selections from this list, giving priority to the more common (and thus potentially more impactful) entities in today's ecosystem. We had to exclude some entities of interest, as not all accept direct external phishing reports and thus do not fit our experimental design. Also, we could not be exhaustive, as we had to consider the domain registration costs associated with conducting each experiment. Table II summarizes our entity selections.

TABLE IV: URL and filter distribution for each experiment.

Phishing URL Content (Type [8], [49])   Qty.   Filters Used              Sample URL
Full Tests
Non-deceptive (random) (V)              396    A (66), B (66), C (66),   http://www.florence-central.com/logician/retch/
                                               D (66), E (66), F (66)
Preliminary Tests
Non-deceptive (random) (V)              10     A (1), B (2), C (2),      http://receptorpeachtreesharp.com/cultivable/
                                               D (2), E (2), F (1)
Brand in Domain (IVa)                   6      A (1), B (1), C (1),      http://www.https-official-verifpaypal.com/signin
Deceptive Domain (IVb)                  6      D (1), E (1), F (1)       http://services-signin.com/login/services/account/
Brand in Subdomain (IIIa)               6      (per deceptive type)      http://paypal1.com.835anastasiatriable.com/signin
Deceptive Subdomain (IIIb)              6                                http://services.account.secure.lopezben.com/signin
Deceptive Path (II)                     6                                http://simpsonclassman.com/paypa1.com/signin

4) Reporting: When the time came for us to report our phishing sites to each entity being tested, we would submit the site URLs either via e-mail or the entity's web interface (if available and preferred by the entity). In all cases, we used publicly-available submission channels; we had no special agreements with the entities, nor did they know they were being tested. In the case of the web submission channel, we simply used a browser to submit each phishing site's exact URL to the entity.
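As a rough illustration of the request criteria summarized in Table III, the sketch below approximates Filters B through D using user-agent substrings and an IP-derived country code. Since the exact filter conditions are intentionally not disclosed, every token list and check here is a simplified assumption rather than our actual implementation.

```python
# Simplified approximations of Filters B-D from Table III, based only on
# the high-level criteria disclosed in the text. Token lists are assumed.

MOBILE_TOKENS = ("android", "iphone", "ipad")
GSB_DESKTOP_TOKENS = ("chrome", "firefox", "safari")
EXCLUDED_TOKENS = ("opera", "opr/", "edge", "trident", "msie")

def filter_b(user_agent: str) -> bool:
    """Filter B: allow only mobile (Android/iOS) user agents."""
    ua = user_agent.lower()
    return any(t in ua for t in MOBILE_TOKENS)

def _desktop_gsb(user_agent: str) -> bool:
    """Shared check: desktop browser protected by GSB."""
    ua = user_agent.lower()
    return (any(t in ua for t in GSB_DESKTOP_TOKENS)
            and not any(t in ua for t in EXCLUDED_TOKENS)
            and not any(t in ua for t in MOBILE_TOKENS))

def filter_c(user_agent: str, ip_country: str) -> bool:
    """Filter C: allow US-based desktop GSB browsers."""
    return ip_country == "US" and _desktop_gsb(user_agent)

def filter_d(user_agent: str, ip_country: str) -> bool:
    """Filter D: same as C, but for non-US visitors."""
    return ip_country != "US" and _desktop_gsb(user_agent)
```

A real kit would additionally resolve the country from the client IP via a geolocation database; that lookup is omitted here and the country code is passed in directly.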
E-mail reports were slightly more involved, http://simpsonclassman.com/paypa1.com/signin as the industry prefers to receive attachments with entire phishing e-mails in lieu of a bare URL. We thus used PayPal- registrar, which is among the most-abused registrars by real branded HTML e-mail templates to create lookalike phishing phishers [1]. Furthermore, to prevent crawlers from landing messages. Each message contained a random customer name on our phishing sites by chance (i.e. by requesting the bare and e-mail address (i.e. of a hypothetical victim) within one of hostname), paths were non-empty across all of our URLs. many body templates. We sent e-mails from unique accounts Using URLs that appear deceptive is a double-edged sword: across five security-branded domains under our control. while it allows us to gain insight into how various URL types Reports of real-world phishing sites might be submitted in are treated by entities, it is also a factor which may skew the exact same manner by agents on behalf of victim brands. blacklisting speed. However, we decided to proceed with this Our reporting approach is thus realistic, though many larger in mind as the purpose of the preliminary tests was to observe victim organizations contract the services of enterprise security more than measure, given the use of a relatively small sample firms, which in turn have private communication channels to size per batch. Table IV shows the distribution of URL types streamline the reporting process. and filters per batch of sites. We registered the required domain names and finalized B. Preliminary Tests configuration of our infrastructure in May 2017. In July, Our preliminary testing took place in mid-2017 and carried we started our experiments by systematically deploying and out the full experimental approach from Section III-A on a reporting our phishing sites. For each day over the course small scale. We will now detail the execution of each test. 
of a 10-day period, we picked one untested entity at random One of our key goals was to minimize confounding effects (from those in Table II), fully reported a single batch of URLs on our experimental results. In other words, we did not want (over the course of several minutes), and started monitoring any factors other than our report submission and cloaking blacklist status across each targeted . Monitoring strategy to influence the blacklisting of our phishing sites. As of each URL continued every 10 minutes for a total of 72 part of this, we wanted to secure a large set of IP addresses, hours after deployment. Over this period and for several days such that no anti-phishing entity would see two of our sites afterward, we also logged web traffic information to each hosted on the same IP address. With the resources available of our phishing sites in an effort to study crawler activity to us, we were able to provision 40 web servers powered related to each entity. Prior empirical tests found blacklists to by Digital Ocean, a large cloud hosting provider [47] (we show their weakness in early hours of a phishing attack [6]. informed Digital Ocean about our research to ensure the Our 72-hour observation window allows us to study blacklist servers would not be taken down for abuse [48]). Each server effectiveness during this critical period while also observing had a unique IP address hosted in one of the host’s data centers slower entities and potentially uncovering new trends. in Los Angeles, New York, Toronto, Frankfurt, Amsterdam, or Singapore. Our batch size was thus 40 phishing sites per C. Responsible Disclosure preliminary test, for a total of 400 sites across the 10 entities Our analysis of the preliminary tests yielded several security being tested. Within each batch, six sites used filter types A recommendations (discussed in Section VI). We immediately and F, while seven used filter types B through E. 
proceeded to disclose our findings to the directly impacted We also wanted to mimic actual phishing attacks as closely entities (i.e. browser vendors, the brand itself, and major as possible. We studied the classification of phishing URL blacklist operators) which we also intended to re-test. We held types submitted to the Anti-phishing Working Group’s eCrime our first disclosure meeting with PayPal in August 2017. Fol- Exchange in early 2017 and crafted our URLs for each of the lowing PayPal’s legal approval, we also disclosed to Google, 40 phishing sites while following this distribution as closely Microsoft, Apple, Mozilla, and the APWG in February 2018. as possible [8], [49]. We registered only .com domains for Each meeting consisted of a detailed review of the entity’s our URLs, as the .com TLD accounts for the majority of real- performance in our study, positive findings, specific actionable world phishing attacks. In addition, we chose GoDaddy as our findings, and high-level comparisons to other entities. Our disclosures were generally positively received and resulted in close follow-up collaboration with Google, Mozilla, and the APWG; this ultimately resulted in the implementation of effective blacklisting within mobile GSB browsers and general mitigations against certain types of cloaking. We clearly stated that we would repeat our experiments in 4-6 months and thereafter publish our findings. We did not immediately disclose to the blacklists powering Opera as we had not originally expected to have the resources to re-test a fifth entity. Additionally, given lesser short-term urgency with respect to the remaining entities (and in the absence of close relationships with them), at the time we felt it would be more impactful to disclose to them the prelimi- nary findings alongside the full test findings. This approach ultimately allowed us to better guide our recommendations by sharing deeper insight into the vulnerabilities within the broader ecosystem. 
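Returning to the reporting mechanics of Section III-A4, assembling a lookalike report message amounts to simple template filling. A minimal sketch, with purely illustrative template text and victim names (not our actual templates or domains), might look like:

```python
import random

# Illustrative body template and hypothetical victim names; the real
# experiments used many PayPal-branded HTML templates instead.
BODY_TEMPLATES = [
    "Dear {name},\n\nWe noticed unusual activity on your account. "
    "Please verify your information at {url} to avoid suspension.\n",
]
FAKE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Carter"]

def build_report_email(phish_url: str, rng: random.Random) -> str:
    """Assemble a lookalike phishing message embedding the reported URL."""
    template = rng.choice(BODY_TEMPLATES)
    name = rng.choice(FAKE_NAMES)
    return template.format(name=name, url=phish_url)
```

The resulting message body would then be attached to a report e-mail sent from one of the security-branded accounts, mirroring how agents report real phishing e-mails on behalf of victim brands.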
After the completion of the full tests, we reached out to all remaining entities via e-mail; all but PhishTank and Opera responded and acknowledged receipt of a textual report containing our findings. US CERT and ESET additionally followed up for clarifications once thereafter.

Fig. 1: PhishFarm framework architecture.

D. Full-scale Tests

We believed that key ecosystem changes resulting from our disclosures would still be captured by our original experimental design. Thus, we did not alter our retesting approach beyond increasing the scale to enable statistically significant observations. In addition, rather than using the URL distribution from the preliminary experiments, we solely used non-deceptive paths and hostnames (i.e. with randomly-chosen English words) in order to remove URL classification as a possible confounding factor in our cloaking evaluation.

We registered the required domains in May 2018 and initiated sequential deployment of our full-scale tests in early July. We re-tested all entities to which we disclosed. As our budget ultimately allowed for an additional entity, we decided to also include PhishTank for its promising performance in the preliminary test. Each of the resulting five experiment batches consisted of 396 phishing sites, evenly split into six groups for each cloaking technique. Our reporting method did not change, though we throttled e-mailing such that the reports were spread over a one-hour period. Reporting through Google and Microsoft's web interfaces spanned a slightly longer period of up to two hours due to the unavoidable manual work involved in solving required CAPTCHA challenges.

E. Sample Size Selection

For our full tests, we chose a sample size of 384 phishing sites for each entity. Our goal was to obtain a power of 0.95 at the significance level of 0.05 in a one-way independent ANOVA test, which, for each entity, could identify the presence of a statistically significant difference in mean blacklisting speed between the six cloaking filters. Based on Cohen's recommendation [50] and our preliminary test results, we assumed a medium effect size (f) of 0.25. We added 12 sites (2 per filter) to each experiment to serve as backups in case of unforeseen technical difficulties. All sites ultimately delivered 100% uptime during deployment, thus we ended up with an effective sample size of 396 per experiment. While the assumption of f=0.25 introduced some risk into our experimental design, we accepted it given the high cost of registering a new .com domain for each site.

IV. PHISHFARM TESTBED FRAMEWORK

In order to execute our experimental design at scale, we designed PhishFarm: a comprehensive framework for deploying phishing sites, reporting them to anti-abuse entities, and measuring blacklist response. The framework operates as a web service and satisfies three important requirements: automation, scalability, and reliability. We eliminated as many manual actions as technically feasible to ensure that the difference between launching hundreds and thousands of sites was only a matter of minutes. Actual site deployment can thus happen instantly and on-demand. Apart from up-front bulk domain registration (IV-A) and reporting phishing through a web interface, all framework components feature end-to-end automation.

The framework consists of five interconnected components, as shown in Figure 1: bulk domain registration and DNS setup, stateless cloud web servers to display the phishing sites, an API and database that manages configuration, a client that issues commands through the API, and a monitoring system that regularly checks in which browsers each phishing site is blacklisted. In total, we wrote over 11,000 lines of PHP, Java, and Python code for these backend components. We extensively tested each component, in particular the hosting and monitoring infrastructure, to verify correct operation.

A. Domain and DNS Configuration

A core component of any phishing site is its URL, which consists of a hostname and path [49]. Our framework includes a script to automatically generate hostnames and paths per the experimental design. We then manually register the required domain names in bulk and point them to our cloud hosting provider's nameservers. Finally, we automatically create DNS records such that the domains for each experiment are evenly spread across the IP addresses under our control. We set up wildcard CNAME records [51] such that we could programmatically configure the subdomain of each phishing site on the server side. In total, registration of the 400 preliminary domains took about 15 minutes, while registration of the 1,980 domains for the full tests took 30 minutes; apart from the domain registration itself, no manual intervention is needed.

B. Research Client

We implemented a cross-platform client application to control the server-side components of the framework and execute our research. This client enables bulk configuration

[...] on each server to avoid confounding detection through external web requests to these files. In the wild, we have observed that sophisticated phishing kits follow a similar strategy to reduce detectability, though today's typical kits merely embed resources from the legitimate website.

3) Monitoring Infrastructure: The purpose of the monitoring system is to identify, at a fine granularity, how much time elapses before a reported phishing site is blacklisted (if at all). To obtain a clear picture of the early hours of blacklist response, we configured the monitoring system to access each live phishing URL at least once every 10 minutes in each target desktop browser (the shortest interval feasible given our resources).
We check the blacklist status of each of phishing sites and cloaking techniques, bulk deployment of URL by analyzing a screenshot of the respective browser experiments, automated e-mail reporting, semi-automated web window; this is similar to the approach taken by Sheng et reporting, monitoring of status, and data analysis. al [6]. We chose this approach for its universal applicability C. Server-Side components and lack of dependencies on browser automation libraries, The server-side components in our framework display the which commonly force-disable phishing protection. For our actual phishing sites based on dynamic configuration. They purposes, it sufficed to check if the dominant color in each are also responsible for logging request data for later analysis image [52] was similar to red (used in the warning messages and monitoring blacklist status. of the browsers we considered). Although this image-based 1) Central Database and API: At the heart of our frame- approach proved reliable and can be fully automated, it does work is a central API that serves as an interface to a database not scale well because each browser requires exclusive use of with our phishing site and system state. For each site, the a single system’s screen. database also maintains attributes such as date activated, To satisfy the scalability requirement, our monitoring sys- reported, and disabled; blacklisting status; site and e-mail tem follows a distributed architecture with multiple au- HTML templates; server-side and JavaScript request filtering tonomous nodes that communicate with the API; each node code; and access logs. We interact with this API via the client is a virtual machine (VM) that runs on a smaller set of whenever we define new sites or deploy sites as part of an host systems. Software on each VM points a browser to the experiment. All traffic to and from the API is encrypted. 
desired URL via command line and sends virtual keystrokes, 2) Hosting Infrastructure: Our hosting infrastructure con- as needed, to close stale tabs and allow for quick page sists of Ubuntu cloud servers which run custom intermediate loads. During our experiments, we used three types of nodes: software on top of Apache. The intermediate software captures Mac OS X 10.12 VMs with Chrome, Safari, Opera, and all requests and enables dynamic cloaking configuration and Firefox; VMs with Edge and Internet Explorer enhanced traffic logging. Each server obtains all of its config- (IE) 11; and .1 VMs with IE 11. In total, the uration from the API, which means that the number of servers full experiments required 31 desktop nodes across four host can flexibly be chosen based on testing requirements (40 in our machines (collectively with 18 Intel Core i7 CPU cores and case). When an HTTP request is received by any server, the 96 Gb of RAM) to deliver the required level of monitoring server checks the request URL against the list of live phishing performance and redundancy. Each node informs the API of sites from the API, processes any cloaking rules, responds with the browsers it supports; it then awaits a command consisting the intended content, and logs the request. In order to quickly of a set of URLs for a target browser, and in real time, reports handle incoming requests, system state is cached locally on the monitoring results back to the API. We freshly installed each server to minimize API queries. the latest stable version of each browser at the time of each We served all our phishing sites over HTTP rather than test and kept default security settings (or, in the case of IE HTTPS, as this was typical among real-world phishing sites and Edge, recommended settings when prompted). at the time we designed the preliminary tests [1]. 
While our We were unable to obtain such a large number of mobile framework supports HTTPS, we did not wish to alter the devices, and official emulators at the time would force-disable experimental design between tests. Both approaches generate blacklist warnings. Thus, we only tested mobile browsers potential artifacts that might benefit the anti-phishing ecosys- hourly and relied on our observation that their behavior was tem: unencrypted sites allow network-level packet sniffing, tied to the behavior their desktop counterparts. We used a while encrypted sites leave an evidence trail at the time a physical Samsung Galaxy S8 and Google Pixel phone to certificate is issued. We mitigated the former risk to a large test mobile Chrome, Firefox, and Opera; and an iPhone 7 extent through the design of our monitoring system. (preliminary) or 8 (full) to test mobile Safari. A future version Our framework supports arbitrary, brand-agnostic web page of the framework could be improved to leverage Android VMs, content to be displayed for each phishing site. For our exper- in lieu of physical or emulated devices, to perform monitoring iments, we hosted all resources (e.g. images and CSS) locally similar to that of desktop browsers. TABLE V: Overview of crawler and blacklisting activity across all experiments.

Phishing Sites Crawler(s) Attempted Successful Crawler Crawled Sites Mean Time Before Mean Page Loads Deployed Retrieval Retrievals Blacklisted 1st Blacklisting per Site w/ Cloaking w/o w/ w/o w/ w/o w/ w/o w/ w/o w/ w/o 294 51 156 49 124 42 173 79 Prelim. 340 60 80 464 (86.5%) (85.0%) (53.1%) (96.1%) (42.2%) (82.4%) min. min 1333 271 818 264 306 134 238 126 Full 1650 330 162 334 (80.8%) (82.7%) (61.4%) (97.4%) (23.0%) (49.4%) min. min.

Lastly, we wanted to ensure that the large number of re- breakdown of crawler and blacklisting activity on our phishing quests from our monitoring system did not affect the blacklist sites. Overall, we found that sites with cloaking were both status of our phishing sites (i.e. by triggering heuristics or slower and less likely to be blacklisted than sites without network-level analysis [6]). Therefore, rather than displaying cloaking: in the full tests, just 23.0% of our sites with cloaking the phishing content in the monitoring system’s browsers, we (which were crawled) ended up being blacklisted in at least simply displayed a blank page with a 200 status code. While one browser— far fewer than the 49.4% sites without cloaking we had the option of querying the GSB API directly as a which were blacklisted. Cloaking also slowed the average time supplement to empirically monitoring the browsers it protects, to blacklisting from 126 minutes (for sites without cloaking) we chose not to do so for the same reason. Finally, our to 238 minutes. monitors used an anonymous VPN service (NordVPN) in an However, closer scrutiny is required to make insightful effort to bypass any institution- and ISP-level sniffing. conclusions about the conditions under which cloaking is effective, as we found that each anti-phishing entity exhib- D. Ethical and Security Concerns ited distinctive blacklisting of different cloaking techniques Over the course of our experiments, we were careful in alongside varying overall speed. For example, the mobile-only ensuring that no actual users would visit (let alone submit Filter B showed 100% effectiveness against blacklisting across credentials to) our phishing sites: we only ever shared the site all entities. 
On the other hand, the JavaScript-based Filter F URLs directly with anti-phishing entities, and we sterilized was 100% effective for some entities, delayed blacklisting the login forms such that passwords would not be transmitted for others, but was in fact more likely to be blacklisted in the event of form submission. In the hands of malicious than Filter A by others still. In many cases, there was also actors with access to the necessary hosting and message a lack of blacklisting of Filter A sites (i.e. those without distribution infrastructure, and with malicious modifications, cloaking). Entities— in particular, the clearinghouses— at our framework could potentially be used to carry out real and times failed to ultimately blacklist a reported site despite evasive phishing attacks on a large scale. We will thus not extensive crawling activity. Although it is possible that our release the framework as open source software; however, we direct reporting methodology led to some sites being ignored will share it privately with vetted security researchers. altogether, the exceptional performance of GSB with respect Another potential concern is that testing such as ours could to Filter A shows that a very high standard is realistic. We degrade the response time of live anti-phishing systems or detail each entity’s behavior in Section V-F. negatively impact data samples used by their classifiers. Given To provide meaningful measurements of entity performance the volume of live phishing attacks today [2], we do not with respect to different browsers and cloaking filters, we believe that our experiments carried any adverse side-effects; propose a scoring system in Section V-B which seeks to the entities to which we disclosed did not raise any concerns capture both the response time and number of sites blacklisted regarding this. With respect to classifiers based on machine by each anti-phishing entity for each browser and/or filter. 
In learning, effective methodology has already been proposed to addition, we summarize our findings visually in Figures 2 and ensure that high-frequency phishing sites do not skew training 3 by plotting the cumulative percentage of sites blacklisted samples [20]. Nevertheless, it would be prudent for any party over time, segmented by each browser or by each cloaking engaged in long-term or high-volume testing of this nature to technique, respectively. Lastly, we analyze web traffic logs to first consult with any targeted entities. make observations about the distinctive behavior of each anti- phishing entity. Although much of our analysis focuses on the V. EVALUATION results of the large-scale full tests, we also make comparisons PhishFarm proved to deliver reliable performance over the to the preliminary test data when appropriate. course of our tests and executed our experimental methodology A. Crawling Behavior as designed. Following the completion of all tests, we had collected timestamps of when blacklisting occurred, relative Of all sites that we launched during the preliminary and full to the time we reported the site, for each of our phishing tests, most (81.9%) saw requests from a , and a sites in each of the desktop and mobile browsers tested. This majority of sites therein (66.0%) was successfully retrieved at totaled over 20,000 data points across the preliminary and least once (i.e. bypassing cloaking if in use). In a handful of full tests, and had added dimensionality due to the various cases, our reports were ignored by the entities and thus resulted entities and cloaking techniques tested. Table V shows the in no crawling activity, possibly due to volume or similarity TABLE VI: Aggregate entity blacklisting performance scores TABLE VII: Formulas for aggregate scores (per test). 
in the full tests (colors denote subjective assessment: green— good, yellow— lacking, red— negligible blacklisting) @ b, URL P B U, SURLb “ (1) T ´ T URL blacklistedb URLą reported GSB Filter A Filter B Filter C Filter D Filter E Filter F Sb 1 ´ blacklisted in b Twindow GSB 0.942 0 0.030 0.899 0.104 0.692 0.533 0 otherwise IE 0 0 0 0 0 0 0 # Sbf Edge 0 0 0 0 0 0 0 SURL @ b, f P B F,S “ b (2) Opera 0 0 0 0 0 0 0 bf n URL, F “f PBf 0.970 0 0.031 0.953 0.106 0.712 0.457 S ą ÿURL TBf (min.) 112 N/A 50 100 81 107 0.947 C Sbf @ b P B,S “ (3) b n f, accessible b, f SmartScreen Filter A Filter B Filter C Filter D Filter E Filter F Sb ÿ p q GSB 0.005 0 0.005 0 0 0.009 0.004 MSb IE 0.176 0 0.003 0 0.301 0.411 0.296 S “ ¨ Sb (4) Sbf MS Edge 0.183 0 0.003 0 0.329 0.421 0.311 b B Opera 0 0 0.005 0 0 0 0 ÿ PBf 0.212 0 0.016 0 0.364 0.455 0.038 S TBf 548 N/A 2889 N/A 391 298 0.649 C (e.g. a site blacklisted after 36 hours would have SURLb “ APWG Filter A Filter B Filter C Filter D Filter E Filter F Sb GSB 0.563 0 0.356 0 0.777 0 0.339 0.5). We take the average of all these URL-browser scores for IE 0.113 0 0 0 0.626 0 0.246 S bf Edge 0.129 0 0 0 0.761 0 0.297 each browser-filter combination, as Sbf , per Formula 2. Opera 0.242 0 0.185 0 0.545 0 0.262 PBf 0.576 0 0.344 0 0.803 0 0.328 S We further aggregate the Sbf scores to meaningfully sum- TBf 194 N/A 243 N/A 125 N/A 1 C marize the protection of each browser by each entity: the PhishTank Filter A Filter B Filter C Filter D Filter E Filter F Sb GSB 0.077 0 0 0 0 0.026 0.024 browser score Sb, as per Formula 3, is the average of all Sbf IE 0.096 0 0 0 0 0 0.026 S for a given browser b, but limited to filters f accessible in bf Edge 0.085 0 0 0 0 0 0.028 Opera 0.074 0 0 0 0 0.024 0.032 b. To gauge the blacklisting of each cloaking filter, we report PBf 0.106 0 0 0 0 0.136 0.025 S TBf 386 N/A N/A N/A N/A 2827 0.467 C PBf as the raw proportion of sites blacklisted in at least one

PayPal Filter A Filter B Filter C Filter D Filter E Filter F Sb browser for each respective filter. Note that PBf will always GSB 0.133 0 0.149 0.052 0.198 0.167 0.104 IE 0.102 0 0.040 0 0.074 0.137 0.140 be greater than or equal to any Sbf because the former is not S bf Edge 0.123 0 0.056 0.046 0.193 0.163 0.160 Opera 0.119 0 0.029 0 0.191 0.120 0.143 reduced by poor timeliness; we thus additionally report TBf , PBf 0.167 0 0.172 0.078 0.288 0.182 0.138 S the mean time to blacklisting for each filter f (in minutes). TBf 675 N/A 440 1331 1077 338 0.995 C To capture the overall real-world performance of each entity, to previous reports; we mitigated this risk through the large we average all Sb, weighted by the market share of each sample size and discuss it in more detail in Section V-G. The browser, to produce S, as per Formula 4. The scores S allow distribution of crawler hits was skewed left, characterized by us to efficiently compare the performance between entities a small number of early requests from the entity to which we and would be useful in modeling long-term trends in future reported, followed by a large stream of traffic from it as well deployments of PhishFarm. We also include and the proportion as other entities. Importantly, different cloaking techniques of all sites crawled, C, to illustrate the entity’s response effort. showed no significant effect on the time of the first crawling We present the aforementioned aggregate scores in Table VI attempt; the median response time ranged from 50 to 53 for all entities in the full tests. Indeed, the large number minutes, from the time of reporting, for all filter types. of 0 or near-0 scores was disappointing and representative During our full experiments, only sites which were crawled of multiple ecosystem weaknesses which we discuss in the were ultimately blacklisted. Generally, the crawling also had following sections. 
Scores for the preliminary tests are found to result in successful retrieval of the phishing content for in Table VIII in Appendix II. Note that because Chrome, blacklisting to occur, though in 10 cases (all in the GSB Firefox, and Safari showed nearly identical scores across experiment with Filter D), a site would be blacklisted despite all experiments, we simplify the table to report the highest a failed retrieval attempt; possible factors for this positive respective score under the GSB heading. We make a similar phishing classification are described in prior work [20]. simplification for IE 11 on Windows 10 and 8.1. We do not separately report the performance of mo- B. Entity Scoring bile browsers because we observed the behavior of mobile For each entity tested, let U be the set of phishing URLs browsers to be directly related to their desktop counterparts. reported to the entity, let B be the set of browsers monitored, During the preliminary tests, mobile Firefox and Opera mir- let T be the set of observed blacklisting times, let F be the rored the blacklisting— and thus also the scores— of their set of cloaking filters used, and let MS denote the worldwide desktop versions. Mobile Chrome and Mobile Safari showed browser market share. Additionally, we define the value of the no blacklist warnings whatsoever for any of our phishing function accessible(b, f) to be true if and only if a phishing sites and thus receive Sb scores of 0. During the full tests, site with filter f is designed to be accessible in browser b. the behavior of mobile Chrome, Safari, and Opera remained For each URL-browser combination in B U, per Formula unchanged. Firefox stopped showing blacklist warnings, and

1, we define a normalized performance score SURLb on the its scores thus dropped to 0 (the 0 scores of mobile browsers range r0, 1s, where 0 represents no blacklistingŚ and 1 repre- represented a serious issue; this was corrected after we con- sents immediate blacklisting relative to the time reported. The tacted Google and Mozilla after the full tests, as discussed in score decays linearly over our 72-hour observation window Section VI-A1). We did not test mobile versions of Microsoft Fig. 2: Blacklisting over time in each browser (full tests). Fig. 3: Effect of cloaking on blacklisting over time (full tests). browsers because mobile IE is no longer supported; Edge for absence of deceptive phishing URLs, Edge and IE simply lost Android and iOS was released after we began testing. their edge in the full tests; Figure 5c in Appendix II shows their superior performance in the preliminary tests. C. Preliminary vs. Full Tests 1) Consistency Between Browsers: Chrome, Firefox, and We observed many core similarities when comparing the Safari are all protected by the GSB blacklist; our tests performance same entity between the preliminary and full confirmed these browsers do consistently display blacklist tests. We also saw improvements related to the recommenda- warnings for the same set of websites. However, during our full tions we shared during our disclosure meetings, in particular tests, Chrome displayed warnings up to 30 minutes earlier than with respect to the APWG’s treatment of Filters C and E. Safari, and up to one hour earlier than Firefox; the warnings Notably, during the full tests, crawler traffic to sites with became consistent after the five-hour mark. Upon further cloaking increased by 44.7% relative to sites without cloaking, investigation, we believe that this disparity was caused by while the overall traffic volume also increased by 89.7%. 
We different caching implementations of the GSB Update API (v4) discuss all entity-specific improvements in section V-F. in each browser (this API ensures privacy and quick lookups The comparison also revealed some surprises, however. The but requires periodic refreshes of the entire blacklist [53]). main experimental difference between the two sets of tests, Latency is a notable shortcoming of blacklists [6] and it apart from sample size, was our exclusive use of random URLs appears that browser blacklist caching can still be improved in the full tests. On the other hand, the preliminary tests in- from a security perspective. cluded a sampling of deceptive URLs. In the preliminary tests, Edge and IE 11 (both protected by SmartScreen) proved we observed that Edge and IE were quick to display blacklist far less consistent. In the full tests, Edge displayed warnings warnings for sites with certain deceptive URLs. In fact, many up to two hours earlier and covered more sites than IE until Type IV URLs (with domain names containing either the the 24-hour mark. However, in the preliminary tests— which PayPal brand or deceptive keywords) saw proactive zero-hour involved deceptive URLs detectable by heuristics— IE would warnings in these browsers without any prior crawler activity. often display preemptive warnings for sites which would not Figure 5b in Appendix II shows the effect of URL type on be blocked in Edge for several hours, if at all. These deviations blacklisting in the preliminary tests. During the full tests, no were evident in the preliminary APWG, SmartScreen, and phishing site was blacklisted unless it had previously been PayPal tests as per Table VIII in Appendix II. visited by a crawler. In the absence of deceptive URLs, we thus observed all the blacklists to be purely reactive; this, in E. 
Filter Performance turn, led to lower Sb scores of Edge, IE, and sites with Filter Figure 3 shows the percentage of our phishing sites black- B, and a lower overall score S for SmartScreen. listed over the course of all the full experiments, grouped by cloaking filter type. Because this summary view masks some D. Browser Performance of the distinctive per-entity behavior we observed in Table VI, Figure 2 shows the percentage of our phishing sites black- Figure 4 in Appendix I should be consulted as a supplement. listed over the course of all the full experiments, grouped by Note that because these figures consider all browser-filter browser, but limited to sites intended to be accessible in each combinations, their Y-axes differ from Figure 2 and the Sbf respective desktop browser (i.e. Filter B was excluded for all scores in Table VI. Nevertheless, they all convey the finding and Filters C and D were excluded for IE, Edge, and Opera). that there exist considerable gaps in today’s blacklists. Chrome, Safari, and Firefox consistently exhibited the high- The overall earliest blacklisting we observed occurred after est overall blacklisting speed and coverage, but were still far approximately 40 minutes, in Chrome. Significant growth took from covering all cloaked sites. Opera generally outperformed place between 1.5 and 6.5 hours and continued at a slower the Microsoft browsers during the first four hours, but was later rate thereafter for the full 72-hour period. Compared to sites overtaken for the remainder of our experimental period. In the without cloaking, our cloaking techniques showed moderate to high effectiveness throughout the life of each phishing site. 3) APWG: The APWG was the second-highest scoring Filter B saw no blacklisting whatsoever across any desktop entity in our full tests and showed consistent protection of or we tested. Filters E and F proved most all browsers. 
Its score S increased substantially compared to effective in the early hours of deployment, while the geo- the preliminary tests due to improvements to bypass Filters C specific Filters C and D saw the lowest amount of blacklisting and E, which allowed APWG reports to result in blacklisting in the long term. Between 48 and 72 hours after deployment, of such sites in GSB browsers— something not achieved when sites with Filter E overtook Filter A sites by a small margin; we reported directly to GSB. The APWG also generated the upon further analysis, we found that this was due to a high highest level of crawler traffic of any entity we tested. level of interest in such sites following reporting to the APWG. Unfortunately, the APWG failed to blacklist any sites with All other types of sites with cloaking were on average less Filter D or F in the full tests; its preliminary success proved likely to be blacklisted than sites without. to have been related solely to the detection of deceptive URLs by IE and Edge. Interestingly, we saw a large increase in the F. Entity Performance blacklisting of sites with Filter E after the 24-hour mark; after Although no single entity was able to overcome all of analyzing the traffic logs we believe that this is due to data the cloaking techniques on its own, collectively the entities sharing with PayPal (this trend is also reflected in Figure 4e). would be successful in doing so, with the exception of Filter 4) PhishTank: PhishTank is a community-driven clearing- B (this has since been corrected as per the discussion in house that allows human volunteers to identify phishing con- Section VI-A1). tent [54]; it also leverages a network of crawlers and partners 1) Google Safe Browsing: Due to the high market share to aid in this effort. It was the second-highest performer in of the browsers it protects, GSB is the most impactful anti- our preliminary tests thanks to its expeditious blacklisting in phishing blacklist today. 
It commanded the highest score S GSB browsers. In the full tests, we were surprised to see that in both the preliminary and full tests. GSB’s key strength lies only 46.7% of sites reported were crawled, and very few sites in its speed and coverage: we observed that a crawler would were ultimately blacklisted. Despite this, PhishTank generated normally visit one of our sites just seconds after we reported it. the second-highest volume of total crawler traffic. We do not 94.7% of all the sites we reported in the full tests were in fact know the reasons for its shortcomings and PhishTank did not crawled, and 97% of sites without cloaking (Filter A) ended reply to our disclosure; we suspect that the manual nature of up being blacklisted. Blacklisting of Filter D was comparable, classification by PhishTank may be a limiting factor. and Filter F improved over the preliminary tests. 5) PayPal: During the preliminary tests, PayPal’s own However, GSB struggled with Filter E, which blocked abuse reporting service struggled to bypass Filters D and F, but hostnames specific to Google crawlers. It also struggled with overcame the latter in the full experiments while maintaining Filter C, which denied non-US traffic. The reason the re- a moderate degree of blacklisting. Its protection of Opera also spective PBf scores are low is that if the initial crawler improved between the two tests. Despite crawling all but two hit on the phishing site failed, GSB would abandon further of the sites we reported in the full tests, the average response crawling attempts; the initial hit almost always originated from time and browser protection ended up being poor overall. We a predictable non-US IP address. Another weakness appears cannot disclose the reasons for this but expect to see a future to be data sharing, as none of the sites we reported to GSB improvement as a result of our findings. ended up being blacklisted in Edge, IE, or Opera. 
2) Microsoft SmartScreen: SmartScreen proved to be the only native anti-phishing blacklist to leverage proactive URL heuristics to blacklist phishing sites, which allowed Microsoft browsers to achieve high scores during the preliminary tests. These heuristics were mainly triggered by URLs with a deceptive domain. In the preliminary tests, Edge proved to be exceptionally well-protected, achieving a top S_b score of 0.87— the highest of any browser.

In the full tests, the performance of IE improved over the preliminary tests and became more consistent with that of Edge. Surprisingly, SmartScreen was more likely to blacklist sites with cloaking than those without, possibly due to the use of cloaking as a classification signal (which would be commendable) alongside low trust of our phishing reports (see Limitations). Reporting to SmartScreen did not seem to significantly affect any non-Microsoft browsers; the entity thus shares a similar shortcoming with GSB. SmartScreen was also among the slowest entities to reactively respond to phishing reports, and its overall coverage was poor; these are its key weaknesses.

6) Remaining Entities: Performance scores for entities only included in the preliminary tests are found in Table VIII within Appendix II. High-level descriptions of each follow. All sites we reported to ESET ended up being crawled, but only a fraction of those– all with Filter A– were actually blacklisted. Overall timeliness was poor, though the blacklisting did span multiple browsers. Netcraft yielded the best protection in Opera in the preliminary tests, but overall it struggled with most of our cloaking techniques and did not deliver timely blacklisting. In retrospect, given the unexpectedly poor performance of PhishTank in the full tests, we would have been interested in re-testing Netcraft. Reports to the US CERT led to minimal crawler activity and blacklisting; disregarding heuristics from the Microsoft browsers, only a single site was blacklisted. Phishing reports we sent to McAfee effectively bypassed Filter A and Filter E but only appeared to lead to timely blacklisting in Microsoft Edge. Reports to WebSense had no effect beyond crawler traffic; the blacklisting we did observe related to the URL heuristics used by Microsoft. While we were hopeful that the e-mail reporting channel we used would prove fruitful, this is understandable given that the company focuses on protection of its enterprise customers.

G. Limitations

Our findings are based on a controlled (to the extent possible without sending out real spam e-mails) empirical experiment and observations from a large set of supporting metadata and a high volume of anti-abuse crawler traffic. Our study focuses exclusively on native phishing blacklist protection that is available by default in the browsers and platforms tested. Systems with third-party security software may see enhanced protection [6], though cloaking can also have an impact on systems powering such software. Within its scope, our analysis should still be considered with certain limitations in mind.

We suspect that real-world blacklisting of phishing attacks may be more timely than what our results otherwise suggest, as our efforts to isolate cloaking as a variable in our experimental design (i.e. by using randomized domain names, never rendering the actual phishing content in browsers being monitored, and using .com domains months after registration) also eliminated many of the methods that the ecosystem can use to detect or classify phishing (e.g. URL-, network-, or DNS-based analysis). However, this reduced detectability is offset, to an extent, by the possibility that malicious actors likewise evade such forms of detection. We observed in the preliminary tests that only URLs containing brand names were blacklisted more quickly than others; in the wild, there is also a shifting tendency to abuse compromised infrastructure and distribute random phishing URLs in lieu of more deceptive alternatives [8], [49].

In terms of factors under our control, it was not financially feasible to achieve a one-to-one mapping between IP addresses and all of our domains; this is a skewing factor which may have acted in favor of blacklists in the full tests, such as with the 10 Filter D sites which were blacklisted despite not being successfully retrieved during the GSB experiment.

Finally, we only submitted a single and direct report for each phishing site deployed. Although real-world phishing sites might see a much higher volume of automated reports (e.g. from sources such as spam filters), the volume of per-URL phishing reports in the wild can in fact be reduced by attackers (e.g. through the use of redirection links). More importantly, direct reports such as ours (in particular to the blacklist operators) might be subject to more suspicion because anti-phishing entities must account for adversaries who willingly submit false reports or seek to profile crawling infrastructure. Although the blacklist operators to which we disclosed did not express concern with our reporting methodology, we learned that crawling infrastructure used to respond to direct reports is indeed designed to be unpredictable to mitigate adversarial submissions. SmartScreen's lower crawl rate may be explained by this; GSB, on the other hand, consistently responded quickly and seemed to give our direct reports a high level of priority. It is therefore possible that either classification of reporting channel abuse works very accurately, or that reporting channels are more vulnerable to adversarial submissions than what the entities otherwise believe; regardless, to improve the future effectiveness of reporting, we propose an alternative approach in Section VI-B2.

Ultimately, if each report represents a chance that a phishing site will be blacklisted, we believe that our experimental design still captures trends therein; moreover, our findings with respect to cloaking effectiveness are consistent with internal PayPal e-mail and web traffic data pertaining to actual victims of phishing. To address its current limitations, our framework could be adapted to follow a different (possibly collaboratively arranged) reporting methodology, consider a broader range of cloaking techniques, or even be applied to proactively-detected live phishing URLs for which cloaking can be profiled.

1) Statistical Tests: In the full tests, we were surprised to find the blacklisting performance of each entity with respect to the different filters to have far more clear-cut gaps than in the preliminary tests; 11 of the 30 per-entity filter groups saw no blacklisting whatsoever (per Table VI). Although ANOVA as originally planned could still be meaningfully applied to the subset of entities which had three or more filter groups that satisfied homogeneity of variance [50] (i.e. had some blacklisting activity), we chose not to perform such tests, as the resulting power would be below our target, and instead relied on direct observations supported by crawler metadata. If our experiment were to be repeated to validate improvements in blacklisting or to continuously test the ecosystem, we believe that statistical tests could be highly useful in assessing whether per-entity blacklisting performance improved significantly for each respective cloaking filter.

VI. SECURITY RECOMMENDATIONS

Based on analysis of our experimental findings, we propose several possible improvements to the anti-phishing ecosystem.

A. Cloaking

Our first set of recommendations focuses specifically on the cloaking techniques we tested.

1) Mobile Users: We believe that the highest priority within the current ecosystem should be effective phishing protection for mobile users. Such users are not only inherently more vulnerable to phishing attacks [41], but now also comprise the majority of web traffic [4].

Our work has been impactful in better securing mobile users by enhancing existing anti-phishing systems. For over a year between mid-2017 and late 2018, GSB blacklists (with nearly 76% global market share) simply did not function properly on mobile devices: none of our phishing sites with Filter A, E, or F (targeting both desktop and mobile devices) showed any warnings in mobile Chrome, Safari, or Firefox despite being blacklisted on desktop. We confirmed the disparity between desktop and mobile protection through periodic small-scale testing and by analyzing an undisclosed dataset of traffic to real phishing sites. Following our disclosure, we learned that the inconsistency in mobile GSB blacklisting was due to the transition to a new mobile API designed to optimize data usage, which ultimately did not function as intended. Because blacklisting was not rectified after our full tests, we contacted the entities anew. As a result, in mid-September 2018 Mozilla patched Firefox (from version 63) such that all desktop warnings were also shown on mobile. Google followed suit days thereafter with a GSB API fix that covered mobile GSB browsers retroactively; mobile Chrome and Safari now mirror desktop listings, albeit with a shorter-lived duration to lower bandwidth usage. Chrome, Safari, and Firefox thus again join Opera as mobile browsers with effective blacklists, though some popular mobile browsers still lack such protection [41].

Upon close inspection of the preliminary test results, we found that Filter B sites were solely blacklisted due to their suspicious URLs rather than our reports. During our full tests, not a single site with Filter B was blacklisted in any browser— interestingly, despite the fact that crawlers did successfully retrieve many of these sites. GSB addressed this vulnerability in mid-September 2018, together with the aforementioned API fix. Through a subsequent final redeployment of PhishFarm, we verified that sites with Filter B were being blacklisted following reports to GSB, the APWG, and PayPal. Other entities— including ones we did not test— should ensure that sites targeted at mobile users are being effectively detected.

2) Geolocation: Although our experiments only considered two simple geolocation filters (US and non-US), our findings are indicative of exploitable weaknesses in this area. Given the overwhelming amount of crawler traffic from the US (79%, per Table IX in Appendix III), Filter C should not have been as effective as it proved to be. We hypothesize that other geo-specific filters would have similarly low blacklisting rates, in part due to the crawler characteristics discussed in Section V-F1. Country- or region-specific filtering paired with localized page content is not an unusual sight in real-world PayPal phishing kits that we have analyzed.

3) JavaScript: It is trivial to implement JavaScript-based cloaking such as Filter F. This technique proved to be effective in slowing blacklisting by three of the five entities in our full tests. Fortunately, SmartScreen bypasses this technique well, and PayPal started doing so following our disclosure. The broader ecosystem should better adapt to client-side cloaking, in particular if its sophistication increases over time.

B. Anti-phishing Entities

We also offer a number of more general recommendations for anti-phishing systems of tomorrow.

1) Continuous Testing: The failure of mobile blacklisting that we observed during our experiments is by itself sufficient to demonstrate the need for continuous testing and validation of blacklist performance. Periodic deployment of PhishFarm could be used for such validation. In addition, continuous testing could help ensure that future cloaking techniques— which may grow in sophistication— can effectively be mitigated without compromising existing defenses. As an added benefit, trends in the relative performance of different entities and the overall timeliness and coverage of blacklisting could be modeled.

2) Trusted Reporting Channels: The phishing reporting channels we tested merely capture a suspected URL or a malicious e-mail. While the latter is useful in identifying spam origin, we believe a better solution would be the implementation of standardized trusted phishing reporting systems that allow the submission of specific metadata (such as victim geolocation or device). Trusted channels could allow detection efforts to more precisely target high-probability threats while minimizing abuse from deliberate false-negative submissions; they could also simplify and expedite collaboration efforts between anti-phishing entities and abused brands, which may hold valuable intelligence about imminent threats.

3) Blacklist Timeliness: The gap between the detection of a phishing website and its blacklisting across browsers represents the prime window for phishers to successfully carry out their attacks. At the level of an individual entity, cloaking has a stronger effect on the occurrence of blacklisting than on its timeliness. However, if we look at the ecosystem as a whole in Figure 3, cloaking clearly delays blacklisting overall. Our test results show that blacklisting now typically occurs in a matter of hours— a stark improvement over the day- or week-long response times observed years ago [6], [55]. However, given the tremendous increase in total phishing attacks since then (on the order of well over 100 attacks per hour in 2018 [2]), we believe that even today's 40-minute best-case blacklist response time is too slow to deter phishers and effectively protect users. The gap needs to be narrowed, especially by slower entities (i.e. those not directly in control of blacklists). Future work should investigate the real-world impact of delays in blacklisting on users and organizations victimized by phishing attacks in order to accurately establish an appropriate response threshold.

4) Volume: GSB, the APWG, and PayPal crawled nearly all of the phishing sites we reported. In particular, GSB proved to deliver a consistently agile response time despite the high number of reports we submitted. Other entities fell short of this level of performance. With increasing volumes of phishing attacks, it is essential that all players in the ecosystem remain robust and capable of delivering a consistent response.

5) Data Sharing: Data sharing has long been a concern within the ecosystem [56]. We found that the two main blacklist operators (GSB and SmartScreen) did not appear to effectively share data with each other, as per Table VI. However, clearinghouse entities (APWG and PhishTank) and PayPal itself showed consistent protection across all browsers. Unfortunately, the timeliness and overall coverage of clearinghouses appear to be inferior to those of the blacklist operators in their respective browsers. Closer cooperation could thus not only speed up blacklisting, but also ensure that malicious sites are blocked universally. Strengthening this argument, perhaps a breakdown in communication between infrastructure used by different entities accounted for those of our sites which were successfully crawled but not ultimately blacklisted.

VII. RELATED WORK

To the best of our knowledge, our work is the first controlled effort to measure the effects of cloaking in the context of phishing. Several prior studies measured the general effectiveness of anti-phishing blacklists and the behavior of phishing kits; none of the prior work we identified considered cloaking, which may have had a skewing effect on the datasets and ecosystem findings previously reported. Cloaking itself has previously been studied with respect to malicious search engine results; Invernizzi et al. [7] proposed a system to detect such cloaking with high accuracy. Oest et al. [8] later studied the nature of server-side cloaking techniques within phishing kits and proposed approaches for defeating each.

The work most similar to ours is NSS Labs' [39] recent use of a proprietary distributed testbed [57] to study the timeliness of native phishing blacklisting in Chrome, Firefox, and Edge. The main limitation of NSS Labs' approach is the reliance on feeds of known phishing attacks; any delay in the appearance of a site in each source feed can affect the accuracy of blacklisting time measurements. Furthermore, phishing sites could be overlooked in the case of successful cloaking against the feeds. We address these limitations by having full control over the deployment and reporting time of our phishing sites.

Sheng et al. [6] took an empirical approach to measure the effectiveness of eight anti-phishing toolbars and browsers powered by five anti-phishing blacklists. The authors found that heuristics by Microsoft and Symantec proved effective in offering zero-hour protection against a small fraction of phishing attacks, and that full propagation across phishing blacklists spanned several hours. This work also found that false positive rates in blacklists are near-zero; we thus did not pursue such tests in our experiments. While Sheng et al.'s work was based on phishing sites only 30 minutes old, and was thus better controlled than earlier blacklist tests [58], the datasets studied were of limited size and heterogeneous in terms of victim brands; the anti-phishing tools evaluated are also now dated. In addition, Sheng et al. checked blacklist status with a granularity of one hour— longer than our 10 minutes.

Han et al. [25] analyzed the lifecycle of phishing sites by monitoring cybercriminals' behavior on a honeypot web server. The authors timed the blacklisting of the uploaded sites across Google Safe Browsing and PhishTank. Uniquely, this work sheds light on the time between the creation and deployment of real phishing sites. A key difference in our work is our ability to customize and test different configurations of phishing kits instead of waiting for one to be uploaded. For instance, we could target specific brands or configure our own cloaking techniques to directly observe the ecosystem's response. Our experiments suggest a significantly faster blacklist response time by the ecosystem than what Han et al. found; our test sample size was also nearly five times larger.

Virvilis et al. [41] carried out an evaluation of mobile web browser phishing protection in 2014 and found that major mobile web browsers at the time included no phishing protection. Like that of Sheng et al., this evaluation was empirical and based on a set of known phishing URLs. In our work, we found that mobile Chrome, Safari, and Firefox now natively offer blacklist protection, but that this protection was not functioning as advertised during our tests.

The differences between today's phishing trends and those seen in prior work show that the ecosystem is evolving quickly. This warrants regular testing of defenses and re-evaluation of criminals' circumvention techniques and attack vectors; it also underscores the importance of scalable and automatable solutions. Our testbed shares some similarities with previous work [39], [6], but it is the only such testbed to offer full automation without the need for intervention during the execution of experiments, and the only one to actively deploy sites and directly send reports to entities.

VIII. CONCLUSION

By launching and monitoring a large set of phishing sites, we carried out the first controlled evaluation of how cloaking techniques can hamper the timeliness and occurrence of phishing blacklist warnings in modern web browsers. As a result of our disclosure to anti-phishing entities, mobile blacklisting is now more consistent, and some of the cloaking techniques we tested are no longer as effective; others represent ongoing vulnerabilities which could be addressed through tweaks to existing detection systems. Such tweaks should also seek to improve the overall timeliness of blacklists to better counter the modern onslaught of phishing attacks. Blacklist protection should not be taken for granted; continuous testing— such as that supported by our framework— is key to ensuring that browsers are secured sufficiently, and as intended.

Cloaking carries a detrimental effect on the occurrence of blacklisting due to the fundamentally-reactive nature of the main detection approach currently used by blacklists; it thus has the potential to cause continuous damage to the organizations and users targeted by phishers. Although no single entity we tested was able to individually defeat all of our cloaking techniques, collectively the requisite infrastructure is already in place. In the short term, collaboration among existing entities could help address their individual shortcomings. We observed that proactive defenses (such as the URL heuristics of SmartScreen) proved to deliver superior protection— but only under the right circumstances. In the long term, the ecosystem should move to more broadly implement general-purpose proactive countermeasures to more reliably negate cloaking. It is important for the ecosystem to be able to effectively bypass cloaking because it is merely one way in which phishing sites can be evasive. For instance, with cloaking alongside redirection chains or bulletproof hosting, phishing sites might otherwise avoid existing mitigations far more successfully than what we have observed.

Phishing has proven to be a difficult problem to solve due to attackers' unyielding persistence, the cross-organizational nature of the infrastructure abused to facilitate phishing, and the reality that technical controls cannot always compensate for the human weakness exploited by social engineers. We believe that continuous and close collaboration between all anti-abuse entities, which can lead to a deep understanding of current threats and the development of intelligent defenses, is the crux of optimizing controls and delivering the best possible long-term protection for phishing victims.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their insightful feedback and suggestions. This work was partially supported by PayPal, Inc. and a grant from the Center for Cybersecurity and Digital Forensics at Arizona State University.
REFERENCES

[1] "Anti-Phishing Working Group: APWG Trends Report Q4 2016," (date last accessed 23-August-2017). [Online]. Available: https://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf
[2] "Anti-Phishing Working Group: APWG Trends Report Q1 2018," (date last accessed 31-August-2018). [Online]. Available: https://docs.apwg.org/reports/apwg_trends_report_q1_2018.pdf
[3] J. Hong, "The state of phishing attacks," Communications of the ACM, vol. 55, no. 1, pp. 74–81, 2012.
[4] "Desktop vs mobile vs tablet market share worldwide," http://gs.statcounter.com/platform-market-share/desktop-mobile-tablet, 2018, [Online; accessed 31-Aug-2018].
[5] S. Chhabra, A. Aggarwal, F. Benevenuto, and P. Kumaraguru, "Phi.sh/$oCiaL: The phishing landscape through short URLs," in Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference. ACM, 2011, pp. 92–101.
[6] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, "An empirical analysis of phishing blacklists," 2009.
[7] L. Invernizzi, K. Thomas, A. Kapravelos, O. Comanescu, J.-M. Picod, and E. Bursztein, "Cloak of visibility: Detecting when machines browse a different web," in Proceedings of the 37th IEEE Symposium on Security and Privacy, 2016.
[8] A. Oest, Y. Safaei, A. Doupé, G. Ahn, B. Wardman, and G. Warner, "Inside a phisher's mind: Understanding the anti-phishing ecosystem through phishing kit analysis," in 2018 APWG Symposium on Electronic Crime Research (eCrime), May 2018, pp. 1–12.
[9] K. Thomas, D. McCoy, C. Grier, A. Kolcz, and V. Paxson, "Trafficking fraudulent accounts: The role of the underground market in Twitter spam and abuse," in USENIX Security Symposium, 2013, pp. 195–210.
[10] T. Holz, M. Engelberth, and F. Freiling, "Learning more about the underground economy: A case-study of keyloggers and dropzones," Computer Security–ESORICS 2009, pp. 1–18, 2009.
[11] D. K. McGrath and M. Gupta, "Behind phishing: An examination of phisher modi operandi," LEET, vol. 8, p. 4, 2008.
[12] K. Thomas, F. Li, A. Zand, J. Barrett, J. Ranieri, L. Invernizzi, Y. Markov, O. Comanescu, V. Eranti, A. Moscicki, D. Margolis, V. Paxson, and E. Bursztein, Eds., Data breaches, phishing, or malware? Understanding the risks of stolen credentials, 2017.
[13] R. Dhamija, J. D. Tygar, and M. Hearst, "Why phishing works," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM, 2006, pp. 581–590.
[14] A. Emigh, "ITTC report on online identity theft technology and countermeasures 1: Phishing technology, chokepoints and countermeasures," Radix Labs, Oct 2005.
[15] C. Jackson, D. Simon, D. Tan, and A. Barth, "An evaluation of extended validation and picture-in-picture phishing attacks," Microsoft Research, January 2007.
[16] M. Wu, R. C. Miller, and S. L. Garfinkel, "Do security toolbars actually prevent phishing attacks?" in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2006, pp. 601–610.
[17] G. Stringhini, C. Kruegel, and G. Vigna, "Shady paths: Leveraging surfing crowds to detect malicious web pages," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, ser. CCS '13, 2013, pp. 133–144.
[18] Y. Takata, S. Goto, and T. Mori, "Analysis of redirection caused by web-based malware," Proceedings of the Asia-Pacific Advanced Network, vol. 32, pp. 53–62, 2011.
[19] "CWE-601: URL Redirection to Untrusted Site ('Open Redirect')." [Online]. Available: http://cwe.mitre.org/data/definitions/601.html
[20] C. Whittaker, B. Ryner, and M. Nazif, "Large-scale automatic classification of phishing pages," in NDSS '10, 2010. [Online]. Available: http://www.isoc.org/isoc/conferences/ndss/10/pdf/08.pdf
[21] H. McCalley, B. Wardman, and G. Warner, "Analysis of back-doored phishing kits," in IFIP Int. Conf. Digital Forensics, vol. 361. Springer, 2011, pp. 155–168.
[22] M. Cova, C. Kruegel, and G. Vigna, "There is no free phish: An analysis of 'free' and live phishing kits," in Proceedings of the 2nd Conference on USENIX Workshop on Offensive Technologies, ser. WOOT'08. Berkeley, CA, USA: USENIX Association, 2008, pp. 4:1–4:8.
[23] D. Manky, "Cybercrime as a service: A very modern business," Computer Fraud & Security, vol. 2013, no. 6, pp. 9–13, 2013.
[24] D. Birk, S. Gajek, F. Grobert, and A. R. Sadeghi, "Phishing phishers - observing and tracing organized cybercrime," in Second International Conference on Internet Monitoring and Protection, July 2007, p. 3.
[25] X. Han, N. Kheir, and D. Balzarotti, "Phisheye: Live monitoring of sandboxed phishing kits," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1402–1413.
[26] S. Hao, M. Thomas, V. Paxson, N. Feamster, C. Kreibich, C. Grier, and S. Hollenbeck, "Understanding the domain registration behavior of spammers," in Proceedings of the 2013 Conference on Internet Measurement Conference. ACM, 2013, pp. 63–76.
[27] Z. Ramzan, "Phishing attacks and countermeasures," in Handbook of Information and Communication Security. Springer, 2010, pp. 433–448.
[28] "Statcounter: Desktop browser market share worldwide," http://gs.statcounter.com/browser-market-share/, [Online; accessed 20-Aug-2017].
[29] M. Matsuoka, N. Yamai, K. Okayama, K. Kawano, M. Nakamura, and M. Minda, "Domain registration date retrieval system of URLs in e-mail messages for improving spam discrimination," in Computer Software and Applications Conference Workshops (COMPSACW), 2013 IEEE 37th Annual. IEEE, 2013, pp. 587–592.
[30] M. Khonji, Y. Iraqi, and A. Jones, "Enhancing phishing e-mail classifiers: A lexical analysis approach," International Journal for Information Security Research (IJISR), vol. 2, no. 1/2, p. 40, 2012.
[31] S. Duman, K. Kalkan-Cakmakci, M. Egele, W. Robertson, and E. Kirda, "Emailprofiler: Spearphishing filtering with header and stylometric features of e-mails," in Computer Software and Applications Conference (COMPSAC), vol. 1. IEEE, 2016, pp. 408–416.
[32] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi, "Exposure: Finding malicious domains using passive DNS analysis," in NDSS, 2011.
[33] B. Liang, M. Su, W. You, W. Shi, and G. Yang, "Cracking classifiers for evasion: A case study on Google's phishing pages filter," in Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016, pp. 345–356.
[34] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, "Cantina+: A feature-rich machine learning framework for detecting phishing web sites," ACM Trans. Inf. Syst. Secur., vol. 14, no. 2, pp. 21:1–21:28, Sep. 2011.
[35] Y. Zhang, J. I. Hong, and L. F. Cranor, "Cantina: A content-based approach to detecting phishing web sites," in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 639–648.
[36] D. Canali, D. Balzarotti, and A. Francillon, "The role of web hosting providers in detecting compromised websites," in Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 177–188.
[37] A. Lukovenko, "Let's Automate Let's Encrypt," Linux Journal, no. 266, Jun. 2016.
[38] R. Abrams, O. Barrera, and J. Pathak, "Browser Security Comparative Analysis - Phishing Protection," NSS Labs, 2013, https://www.helpnetsecurity.com/images/articles/Browser%20Security%20CAR%202013%20-%20Phishing%20Protection.pdf.
[39] NSS Labs, "NSS Labs Conducts First Cross-Platform Test of Leading Web Browsers," Oct 2017. [Online]. Available: https://www.nsslabs.com/company/news/press-releases/nss-labs-conducts-first-cross-platform-test-of-leading-web-browsers/
[40] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext transfer protocol – HTTP/1.1," United States, 1999.
[41] N. Virvilis, N. Tsalis, A. Mylonas, and D. Gritzalis, "Mobile devices: A phisher's paradise," 2014 11th International Conference on Security and Cryptography (SECRYPT), pp. 1–9, 2014.
[42] "Google Safe Browsing: Report Phishing Page." [Online]. Available: https://safebrowsing.google.com/safebrowsing/report_phish/
[43] "SmartScreen: Report a Website." [Online]. Available: https://feedback.smartscreen.microsoft.com/feedback.aspx?t=0url=
[44] "ESET: Report a Phishing Page." [Online]. Available: http://phishing.eset.com/report/
[45] "NetCraft: Report a Phishing URL." [Online]. Available: http://toolbar.netcraft.com/report_url
[46] "McAfee: Custom URL Ticketing System." [Online]. Available: https://www.trustedsource.org/en/feedback/url?action=checksingle
[47] "DigitalOcean: Simplicity at scale." [Online]. Available: https://www.digitalocean.com/
[48] T. Moore and R. Clayton, "Examining the impact of website take-down on phishing," in Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit. ACM, 2007, pp. 1–13.
[49] S. Garera, N. Provos, M. Chew, and A. D. Rubin, "A framework for detection and measurement of phishing attacks," in Proceedings of the 2007 ACM Workshop on Recurring Malcode, ser. WORM '07, pp. 1–8.
[50] J. Cohen, "Statistical power analysis for the behavioral sciences," 1988.
[51] E. Lewis, "The role of wildcards in the domain name system," United States, 2006.
[52] K. Ravishankar, B. Prasad, S. Gupta, and K. K. Biswas, "Dominant color region based indexing for CBIR," in Image Analysis and Processing, 1999. Proceedings. International Conference on. IEEE, 1999, pp. 887–892.
[53] "Overview — Safe Browsing APIs (v4) — Google Developers." [Online]. Available: https://developers.google.com/safe-browsing/v4/
[54] D. G. Dobolyi and A. Abbasi, "Phishmonger: A free and open source public archive of real-world phishing websites," in Intelligence and Security Informatics, 2016 IEEE Conference on, pp. 31–36.
[55] T. Moore, R. Clayton, and H. Stern, "Temporal correlations between spam and phishing websites," in LEET, 2009.
[56] T. Moore and R. Clayton, "How hard can it be to measure phishing?" Mapping and Measuring Cybercrime, 2010.
[57] NSS Labs, "Web browser security: Phishing protection test methodology v3.0," Jul 2016. [Online]. Available: https://research.nsslabs.com/reports/free-90/files/TestMethodology WebB/Page4
[58] C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, "On the effectiveness of techniques to detect phishing sites," in Detection of Intrusions and Malware, and Vulnerability Assessment, B. M. Hämmerli and R. Sommer, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 20–39.

APPENDIX I: FULL TEST PER-ENTITY BLACKLISTING

Each entity we tested exhibited a characteristic blacklisting behavior with respect to the different cloaking techniques; the aggregate view of filter performance in Figure 3 masks some of these characteristics. Figure 4 includes similar charts for each individual entity as a supplement to Figure 3 and the entity scores in Table VI. Each chart shows detailed blacklisting performance over time, aggregated for all browsers, for each filter type.

[Fig. 4: Blacklisting over time by filter (full tests). Panels: (a) GSB, (b) SmartScreen, (c) APWG, (d) PhishTank, (e) PayPal.]

APPENDIX II: PRELIMINARY TEST DATA

Table VIII includes detailed performance scores for all entities in the preliminary tests. These scores are based on the formulas in Section V-B and are the basis of our comparative discussion in Section V-C. We visualize the preliminary test performance of only the subsequently re-tested entities (GSB, SmartScreen, APWG, PhishTank, and PayPal) in Figure 5a. Figure 5b illustrates the increased likelihood of URLs with a deceptive domain (Type IV [8], [49]) to be blacklisted during the preliminary tests. As previously discussed, this increase is linked to heuristics used by SmartScreen browsers; the positive effect that this had on IE and Edge blacklisting can be seen in the browser performance breakdown in Figure 5c. The latter two charts are based on data from all 10 preliminary tests.
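The blacklisting-over-time curves shown in Figures 4 and 5 can be derived from per-site blacklisting timestamps. The following minimal sketch is illustrative only (not the paper's actual tooling; the data layout and function name are hypothetical): for one entity/filter group, it computes the cumulative fraction of sites blacklisted by each hour after deployment.

```python
from typing import List, Optional

def blacklist_curve(minutes_to_blacklist: List[Optional[int]],
                    total_hours: int = 72) -> List[float]:
    """Cumulative fraction of deployed sites blacklisted by each hour.

    Each element of `minutes_to_blacklist` is the observed delay (in
    minutes) between deployment and blacklisting for one site, or None
    if the site was never blacklisted during the 72-hour deployment.
    """
    n = len(minutes_to_blacklist)
    curve = []
    for hour in range(1, total_hours + 1):
        listed = sum(1 for m in minutes_to_blacklist
                     if m is not None and m <= hour * 60)
        curve.append(listed / n)
    return curve

# Hypothetical observations for one filter group (minutes; None = never)
times = [38, 95, 121, None, 43, None]
curve = blacklist_curve(times)
print(curve[0], curve[-1])  # fraction listed within 1 hour, within 72 hours
```

Plotting such a curve per filter type reproduces the shape of the charts in Figures 4 and 5.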

TABLE VIII: Aggregate entity blacklisting performance scores in the preliminary tests.

GSB            Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0.988     0         0         0.846     0         0.493   | S_b = 0.466
      IE         0         0.142     0         0         0         0.148   | S_b = 0.049
      Edge       0.950     0.130     0         0.709     0         0.703   | S_b = 0.551
      Opera      0.138     0         0         0.236     0         0       | S_b = 0.046
P_Bf             1.000     1.000     0         0.857     0         0.833   | S = 0.421
T_Bf (min)       38        10        N/A       43        N/A       151     | C = 0.900

SmartScreen    Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0         0         0         0         0         0       | S_b = 0
      IE         0         0.142     0.284     0.142     0.135     0.166   | S_b = 0.100
      Edge       0.956     0         0         0         0.952     0.703   | S_b = 0.870
      Opera      0         0         0         0         0         0       | S_b = 0
P_Bf             1.000     0.143     0.286     0.143     1.000     0.833   | S = 0.045
T_Bf (min)       10        10        14        18        162       7       | C = 1

APWG           Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0.648     0         0         0.141     0         0       | S_b = 0.158
      IE         0.255     0.432     0.306     0.306     0.524     0.178   | S_b = 0.319
      Edge       0.958     0.142     0.137     0         0.821     0.632   | S_b = 0.804
      Opera      0.345     0         0         0         0         0       | S_b = 0.115
P_Bf             1.000     1.000     1.000     0.857     1.000     0.333   | S = 0.198
T_Bf (min)       73        286       199       154       276       188     | C = 1

PhishTank      Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0.971     0.141     0.261     0         0.399     0.303   | S_b = 0.387
      IE         0.314     0         0         0         0.286     0.167   | S_b = 0.255
      Edge       0.771     0         0         0.124     0.522     0.295   | S_b = 0.529
      Opera      0.112     0.123     0         0         0         0       | S_b = 0.037
P_Bf             1.000     0.286     0.286     0.125     0.714     0.333   | S = 0.372
T_Bf (min)       95        309       344       3784      162       363     | C = 0.975

PayPal         Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0.637     0         0.276     0         0.408     0       | S_b = 0.264
      IE         0.345     0.517     0.448     0.306     0.525     0       | S_b = 0.290
      Edge       0.213     0         0         0.173     0         0.145   | S_b = 0.119
      Opera      0.102     0         0.253     0         0         0       | S_b = 0.034
P_Bf             1.000     1.000     1.000     1.000     1.000     0.167   | S = 0.255
T_Bf (min)       121       181       128       179       154       3694    | C = 0.925

ESET           Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0.297     0         0         0         0         0       | S_b = 0.059
      IE         0         0         0         0         0         0       | S_b = 0
      Edge       0.256     0         0         0         0         0       | S_b = 0.085
      Opera      0.137     0         0         0         0         0       | S_b = 0.046
P_Bf             0.333     0         0         0         0         0       | S = 0.055
T_Bf (min)       444       N/A       N/A       N/A       N/A       N/A     | C = 1

[Fig. 5: Blacklisting over time (preliminary tests). Panels: (a) By filter (re-tested entities only), (b) By URL type (all entities), (c) By browser (all entities).]

WebSense       Filter A  Filter B  Filter C  Filter D  Filter E  Filter F
S_bf  GSB        0         0         0         0         0         0       | S_b = 0
      IE         0         0         0.142     0         0         0.331   | S_b = 0.110
      Edge       0         0         0.128     0         0         0.299   | S_b = 0.100
      Opera      0         0         0         0         0         0       | S_b = 0
P_Bf             0         0         0.143     0         0         0.286   | S = 0.014
T_Bf (min)       444       N/A       4         N/A       N/A       6       | C = 0.275

Netcraft (C = 0.975)
              Filter A   Filter B   Filter C   Filter D   Filter E   Filter F   All
S_Bf (GSB)    0          0.135      0          0.538      0          0          0.108
S_Bf (IE)     0.166      0          0.142      0          0.134      0          0.100
S_Bf (Edge)   0.389      0          0.129      0          0          0          0.130
S_Bf (Opera)  0.302      0.260      0.130      0          0.130      0          0.144
P_Bf          0.667      0.429      0.182      0.571      0.182      0          0.109
T_Bf          531        334        206        241        318        N/A        -

US CERT (C = 0.200)
              Filter A   Filter B   Filter C   Filter D   Filter E   Filter F   All
S_Bf (GSB)    0          0          0.139      0          0          0          0.028
S_Bf (IE)     0          0.127      0.142      0          0.134      0          0.045
S_Bf (Edge)   0          0.127      0.127      0          0.127      0          0.042
S_Bf (Opera)  0          0          0          0          0          0          0
P_Bf          0          0.143      0.286      0          0.143      0          0.029
T_Bf          N/A        28         70         N/A        248        N/A        -

McAfee (C = 1.000)
              Filter A   Filter B   Filter C   Filter D   Filter E   Filter F   All
S_Bf (GSB)    0          0          0          0          0          0.160      0.032
S_Bf (IE)     0.167      0.127      0.143      0.143      0          0          0.056
S_Bf (Edge)   0.939      0          0          0          0.955      0.118      0.671
S_Bf (Opera)  0          0          0          0          0          0          0
P_Bf          1.000      0.143      0.143      0.143      1.000      0.167      0.059
T_Bf          134        466        3702       3702       180        172        -

APPENDIX III: CRAWLER TRAFFIC ANALYSIS

Our 2,380 phishing sites logged a total of 2,048,606 HTTP requests originating from 100,959 distinct IP addresses. A substantial proportion of these requests came from crawlers scanning for phishing kit archives (i.e., zip files) or credential dump files; such requests resulted in 404 "not found" errors. It is beneficial for the security community to be able to identify compromised user information and study phishing kits [8], [10], but such crawling is noisy. By monitoring traffic to their phishing sites, cybercriminals could become aware of the ecosystem's knowledge of their tools and adapt accordingly.

Figure 6 aggregates the web traffic to all phishing sites from the full tests over a two-week period relative to initial deployment (along a logarithmic scale). Unsurprisingly, we observed the largest amount of traffic in the hours immediately after reporting each phishing site. Smaller spikes occurred several hours or days thereafter as additional infrastructure started accessing the sites. We automatically disabled each site at the end of its 72-hour deployment; crawlers would thus see 404 errors thereafter. At this point, we observed a spike in traffic with characteristics similar to the initial crawler traffic, followed by an immediate sharp decline (presumably once the offline state of each site was verified). Over the following seven days, we logged a consistent yet slowly declining level of traffic. It is clear that an effort is being made to ensure that offline phishing content does not make a return. After about 10 days, we saw a second sharp decline, after which the traffic reached insignificant levels. We did not study blacklist warning persistence across browsers; this could be an interesting area to explore in the future.

Fig. 6: Traffic to our phishing sites over time (full tests).

1) Geographical Origin: Using the GeoLite2 IP database, we found that traffic to our phishing sites originated from 113 different countries across the majority of North America, Europe, and Asia, and some of Africa, as shown in Table IX. 79.02% of all unique IP addresses were based in the US; these accounted for a slightly lower 64.73% of all traffic, but still constituted an overwhelming majority overall.

TABLE IX: Geographical distribution of requests to our sites.

Country          Total Traffic   Unique IPs
United States    64.73%          79.02%
United Kingdom   6.66%           2.42%
Germany          4.72%           1.30%
Brazil           1.99%           2.04%
Italy            1.80%           0.34%
Japan            1.73%           0.43%
Netherlands      1.73%           0.76%
India            1.54%           0.76%
Canada           1.36%           1.28%
France           1.21%           0.68%
Belgium          0.85%           0.13%
Singapore        0.65%           0.30%
Ireland          0.65%           0.66%
Norway           0.65%           0.18%
Australia        0.63%           0.34%
Korea            0.50%           0.17%
Denmark          0.50%           0.12%
Estonia          0.48%           0.07%
Austria          0.45%           0.15%
Russia           0.42%           2.23%
Unknown          3.99%           1.96%
93 Others        2.75%           4.65%

2) Entity Crawler Overlap: We provide a summary of IP address overlap between entities in Table X. The data is indicative of collaboration between certain entities, as discussed in Section VI-B5. A per-entity analysis of long-term crawler traffic is outside the scope of this work.

TABLE X: Crawler IP overlap between entities.

Entity      Unique IPs   GSB      MS       APWG     PhishTank   PayPal
GSB         1,788        -        7.94%    31.20%   18.40%      53.52%
MS          475          29.89%   -        29.89%   23.16%      38.11%
APWG        6,165        11.08%   2.30%    -        11.13%      47.96%
PhishTank   2,409        13.66%   4.57%    28.48%   -           47.11%
PayPal      17,708       5.40%    1.02%    16.70%   6.41%       -
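The two columns of Table IX capture different quantities: a country's share of all HTTP requests versus its share of distinct source IPs; a handful of heavy crawlers can skew the former. A minimal sketch of this aggregation, assuming a hypothetical pre-resolved log of (IP, country) pairs (in practice, the country would come from a GeoLite2 lookup):

```python
from collections import Counter

# Hypothetical request log: (source IP, country), with the country already
# resolved, e.g., via the GeoLite2 IP database used in the paper.
reqs = [
    ("10.0.0.1", "US"),
    ("10.0.0.2", "US"),
    ("10.0.0.3", "GB"),
    ("10.0.0.3", "GB"),  # one heavy crawler issuing repeated requests
    ("10.0.0.3", "GB"),
]

# Share of total traffic: every request counts.
traffic = Counter(country for _, country in reqs)
traffic_pct = {c: n / len(reqs) for c, n in traffic.items()}

# Share of unique IPs: each source address counts once
# (assumes each IP maps to a single country, as here).
unique_ips = Counter(country for _, country in set(reqs))
ip_pct = {c: n / sum(unique_ips.values()) for c, n in unique_ips.items()}

print(traffic_pct)  # {'US': 0.4, 'GB': 0.6}: GB dominates raw traffic
print(ip_pct)       # but US holds 2 of 3 unique IPs
```

Table IX shows the same effect at scale: the US accounts for 79.02% of unique IPs but a lower 64.73% of requests.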
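The percentages in Table X are asymmetric because each row is normalized by that entity's own unique-IP count: |A ∩ B| / |A| generally differs from |A ∩ B| / |B|. A short illustration with hypothetical IP sets (the entity names are reused from the table; the addresses are invented):

```python
# Hypothetical sets of crawler IPs observed per reporting entity.
crawlers = {
    "GSB":  {"1.1.1.1", "2.2.2.2", "3.3.3.3"},
    "APWG": {"2.2.2.2", "3.3.3.3", "4.4.4.4", "5.5.5.5", "6.6.6.6"},
}

def overlap_pct(a: str, b: str) -> float:
    """Percentage of entity a's unique crawler IPs also seen for entity b."""
    shared = crawlers[a] & crawlers[b]
    return 100.0 * len(shared) / len(crawlers[a])

print(round(overlap_pct("GSB", "APWG"), 2))  # 66.67: 2 of GSB's 3 IPs
print(round(overlap_pct("APWG", "GSB"), 2))  # 40.0:  2 of APWG's 5 IPs
```

Cross-checking Table X the same way, the pairs are mutually consistent: for example, 53.52% of GSB's 1,788 IPs and 5.40% of PayPal's 17,708 IPs both correspond to roughly 956 shared addresses.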

TABLE XI: Web traffic during and after site deployment.

                        Total HTTP Requests          Unique IP Addresses
                        Valid URL    Invalid URL     Valid URL    Invalid URL
Prelim.  Sites Live     271,943      452,049         6,528        11,869
         Day 3-14       262,141                      7,230
         Total          986,133                      20,874
Full     Sites Live     355,093      545,704         22,929       54,392
         Day 3-14       161,676                      21,991
         Total          1,062,473                    80,085

(Day 3-14 and Total figures span both URL types.)
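The breakdown in Table XI can be reproduced by bucketing each logged request by the time elapsed since its site's deployment (the live 72-hour window vs. days 3-14) and by whether the requested path was one of the valid phishing URLs. A minimal sketch over a hypothetical log (the field layout is invented for illustration):

```python
from collections import defaultdict

LIVE_HOURS = 72  # sites were automatically disabled after 72 hours

# Hypothetical log entries: (hours since site deployment, source IP,
# whether the request targeted a valid phishing URL).
log = [
    (1.5,  "10.0.0.1", True),
    (2.0,  "10.0.0.2", False),  # e.g., probing for a kit .zip -> 404
    (96.0, "10.0.0.1", True),   # re-check on day 4, after takedown
]

request_counts = defaultdict(int)
ip_sets = defaultdict(set)
for hours, ip, valid_url in log:
    period = "live" if hours <= LIVE_HOURS else "day 3-14"
    request_counts[(period, valid_url)] += 1
    ip_sets[(period, valid_url)].add(ip)

print(request_counts[("live", True)], len(ip_sets[("day 3-14", True)]))  # 1 1
```

As in Table XI, overall unique-IP totals are smaller than the sum of per-period counts because the same crawlers return across periods, as the repeated 10.0.0.1 above shows.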