Measuring and Fingerprinting Click-Spam in Ad Networks

Vacha Dave∗ (MSR India and UT Austin) [email protected]
Saikat Guha (Microsoft Research India) [email protected]
Yin Zhang (Univ. of Texas at Austin) [email protected]

ABSTRACT
Advertising plays a vital role in supporting free websites and smartphone apps. Click-spam, i.e., fraudulent or invalid clicks on online ads where the user has no actual interest in the advertiser's site, results in advertising revenue being misappropriated by click-spammers. While ad networks take active measures to block click-spam today, the effectiveness of these measures is largely unknown. Moreover, advertisers and third parties have no way of independently estimating or defending against click-spam.

In this paper, we take the first systematic look at click-spam. We propose the first methodology for advertisers to independently measure click-spam rates on their ads. We also develop an automated methodology for ad networks to proactively detect different simultaneous click-spam attacks. We validate both methodologies using data from major ad networks. We then conduct a large-scale measurement study of click-spam across ten major ad networks and four types of ads. In the process, we identify and perform in-depth analysis on seven ongoing click-spam attacks not blocked by major ad networks at the time of this writing. Our findings highlight the severity of the click-spam problem, especially for mobile ads.

Categories and Subject Descriptors
K.6.5 [Management of Computing and Information Systems]: Security and Protection—Online Advertising

Keywords
Click-Spam, Click-Fraud, Invalid Clicks, Traffic Quality

1. INTRODUCTION
Background and motivation: Click-spam costs online advertisers on the order of hundreds of millions of dollars each year [4]. Instead of supporting free smartphone apps and websites, this money ends up in the pocket of click-spammers.¹ Click-spam subsumes a number of scenarios that all have two things in common: (1) the advertiser is charged for a click, and (2) the user delivered to the ad's target URL has no actual interest in being there. Click-spam can be generated using a variety of approaches, such as (i) botnets (where malware on the user's computer clicks on ads in the background), (ii) tricking or confusing users into clicking ads (e.g., on parked domains), and (iii) directly paying users to click on ads.

Incentives for click-spam are linked directly to the flow of money in online advertising — advertisers pay ad networks for each click on their ad, and ad networks pay publishers (websites or phone apps that show ads) a fraction (typically around 70% [2]) of the revenue for each ad clicked on their site or app. A publisher stands to profit by attracting click-spam to his site/app. An advertiser stands to inflict losses on his competitor(s) by attracting click-spam to his competitors' ads. An advertising network stands to increase revenues (but lose reputation) by not blocking click-spam.

Reputed online ad networks have in-house heuristics to detect click-spam and discount these clicks [39]. No heuristic is perfect. Advertisers pay for false negatives (click-spam missed by the heuristic). None of the ad networks we checked release any specifics about click-spam (e.g., which keywords attract click-spam, which clicks are click-spam, etc.) that would otherwise allow advertisers to optimize their campaigns, or compare ad networks.

Research goals and approach: We have two main research goals in this paper. Our primary goal is to design a methodology that enables advertisers to independently measure and compare click-spam across ad networks. The basic idea behind our approach is simple: since the user associated with click-spam is, by definition, not interested in the ad, he would be less likely to make any extra effort to reach the target website than a user legitimately interested in the ad. The advertiser can measure this difference and use it to estimate the click-spam fraction. Of course some legitimately interested users may not make the extra effort (false positives), or some uninterested users may still make the extra effort (false negatives). We correct for both types of errors by using a Bayesian framework, and by performing experiments relative to control experiments. Section 3 details our methodology.

Validating the correctness of our methodology is challenging because there is no ground truth to compare against. Ad networks do not know the false negative rate of their heuristics, and thus tend to underestimate click-spam on their network [12]. The accuracy of heuristics used by third-party analytics companies (e.g., Adometry) is unknown since their methodology and models are not open to public scrutiny; indeed, ad networks contend they overestimate click-spam [16]. We manually investigate tens of thousands of clicks we received (in Section 5). We present incontrovertible evidence of dubious behavior for around half of the search ad clicks and a third of the mobile ad clicks we suspect to be click-spam, and circumstantial evidence for the rest, thus establishing a tight margin-of-error for our methodology. In the process, we discover seven ongoing click-spam attacks not currently caught by major ad networks, which we reported to the parties concerned.

Our secondary goal is to measure the magnitude of the click-spam problem today. To this end, in Section 4, we apply our methodology to measure click-spam rates across ten major ad networks (including Google, Bing, AdMob, and Facebook) and four ad types. Our work represents the first measurement study of click-spam. We also identify key research problems that can have a measurable impact in tackling click-spam.

Contributions: We make three main contributions in this paper. (i) We devise the first methodology that allows advertisers to independently measure and compare click-spam rates across ad networks. We validate the correctness of the methodology using real-world data. (ii) We report on the sophistication of ongoing click-spam attacks and present strategies for ad networks to mitigate them. (iii) We conduct a large-scale in-depth measurement study of click-spam today.

∗Work done while at Microsoft Research India.
¹Or, equivalently, click-fraud. Since a fraudulent motive is often difficult to conclusively prove, we use the term click-spam instead.

2. ONLINE ADVERTISING PRIMER
Search vs. contextual: Keyword-based advertising is broadly classified into two categories: (i) search advertising, i.e., ads based on search keywords that show up alongside search results, and (ii) contextual advertising, i.e., ads that show up on Web pages or in applications based on keywords extracted from the context. Search ads may be syndicated — i.e., they are shown not only on the search engine operated by the ad network (e.g., on www.google.com), but also on affiliate websites that offer customized search engines (e.g., www.ask.com). The term publisher refers to the party that showed the ad (e.g., a website for contextual ads, a smartphone application for mobile ads, an affiliate for syndicated search ads). We do not consider other kinds of ads (e.g., ads in videos, banner ads, etc.) in this paper.

Mobile vs. non-mobile: Both search and contextual advertising can be further classified as mobile or non-mobile based on what device the search or webpage request originated from. Mobile includes smart-phones and other mobile devices that have "full browser capabilities", as well as feature-phones with limited WAP browsers. The reason we draw a distinction between mobile and non-mobile is that we found ad networks internally seem to have very different systems for serving the same ads to mobile vs. non-mobile users. We do not know the reason for this difference, but speculate it is because ad networks tend to expand through mergers and acquisitions, resulting in multiple technology stacks operating concurrently.

Figure 1: Time-line for serving ads.

Ad delivery: Figure 1 illustrates the time-line for serving online ads. When the user visits a publisher website, the website returns an adbox (e.g., an embedded iframe), which causes the user's browser to contact the ad network. The request to the ad network identifies the referring website through the HTTP Referer (sic) header. The ad network then populates the adbox with contextual ads. In an alternate mechanism (not shown), premium publishers may directly query the ad network for relevant ads and seamlessly integrate them into the website content.

Charging model: While there are multiple charging models for online advertising (e.g., impression-based, action-based), by far the most common is pay-per-click (PPC or CPC), where the advertiser is charged (and the publisher paid) only if the user clicks on the ad.

The publisher gets some fraction (typically 70%) of the revenue that the ad network collects from the advertiser. The accounting is performed as follows. The ad URL points to the ad network's domain with information about which ad was clicked (encoded in the GET parameters). When the user clicks an ad, the browser contacts the ad network, which logs the encoded information for billing purposes and redirects the user to the advertiser's site. This redirection is typically performed through an HTTP 302 response, which preserves the publisher's URL in the HTTP Referer seen by the advertiser's Web server. Black-hat techniques such as Referer-Cloaking [5] by the publisher, ad network policies [14], or bugs in browsers and proxies may result in empty or bogus Referer values being sent to the advertiser.

User engagement: The ad network can track limited user engagement (i.e., ads viewed or clicked) for multiple ads shown across multiple publishers (e.g., using cookies), but cannot, in general, track user engagement after the click. The advertiser, on the other hand, can track detailed user actions (only) on the advertiser's own website, but cannot track user engagement with other ads. Thus while the ad network has a broad-but-shallow view, the advertiser has a narrow-but-deep view into user engagement.

Click-spam discounts: Ad networks internally discount clicks based on in-house heuristics. The user is still redirected to the advertiser, but the advertiser is not charged for the click. Ad networks do not indicate in the advertiser's billing report which clicks were charged and which were not.
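To make the redirect-based accounting described above concrete, here is a minimal sketch (ours, not from the paper), assuming Flask; the endpoint, ad table, and logging stand-ins are all illustrative:

```python
# Minimal sketch of an ad network's click-accounting endpoint.
# Illustrative only; helper names are hypothetical, not from the paper.
from flask import Flask, request, redirect

app = Flask(__name__)

BILLING_LOG = []  # stand-in for the billing pipeline
AD_DESTINATIONS = {"ad42": "http://advertiser.example.com/landing"}

@app.route("/click")
def click():
    ad_id = request.args.get("ad")              # which ad was clicked (GET param)
    publisher = request.headers.get("Referer")  # page/app that showed the ad
    BILLING_LOG.append((ad_id, publisher))      # logged for billing purposes
    # HTTP 302 sends the browser on to the advertiser's site; the publisher's
    # URL is typically preserved in the Referer the advertiser's server sees.
    return redirect(AD_DESTINATIONS.get(ad_id, "/"), code=302)
```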
3. ESTIMATING CLICK-SPAM
In this section we design a method that any party (e.g., an advertiser, an ad agency, or researchers) can use to estimate click-spam rates for a given ad without explicit cooperation from the ad network.

3.1 Challenges
No ground truth: It is not possible to first identify (definitively) which clicks are click-spam, and then compute what fraction of the total traffic click-spam accounts for. A click is click-spam if the user did not intend to click the ad. There is no way to conclusively determine user intent without explicitly asking the user.

No global view: As mentioned, the ad network cannot track user engagement on the advertiser site, and the advertiser has no knowledge of the user's engagement with other advertisers. For example, the ad network does not know if the user never loads the advertiser's site after the click (we saw botnets exhibiting this behavior), and the advertiser does not know if the same user is implicated in click-spam attacks on another advertiser. The obvious solution is for the ad network and advertiser to cooperate. But financial disincentives and legal concerns prevent them from doing so: the advertiser could lie in his favor (claiming fraud where there is none) in order to gain deeper discounts from the ad network; the ad network could be held liable if sharing user history with advertisers has unforeseen privacy consequences.

Granularity: The granularity over which click-spam is estimated is important. An ad network may have low click-spam overall, say, but certain lucrative segments (e.g., mortgage) may be experiencing orders of magnitude more click-spam. Advertisers require fine-grained measurements for their selected set of keywords.

Noise: As in any Internet-scale system, the data is extremely noisy. We encountered users whose Referer headers were inexplicably omitted, browser User-Agents, IP addresses, cookies, etc. that changed inexplicably within the same session (perhaps a buggy browser or proxy), and bad publishers that behave non-deterministically (perhaps to avoid detection). At the same time, because clicking an ad is a rare event, gathering good data is time-consuming. The combination results in very low signal-to-noise ratios.
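As one concrete illustration (ours, not from the paper), a simple log-sanity check can flag the session-level inconsistencies described above; the log-record fields here are hypothetical:

```python
# Sketch: flag sessions whose User-Agent or IP changes mid-session, one of
# the noise sources described above. The record format is hypothetical.
from collections import defaultdict

def inconsistent_sessions(records):
    """records: iterable of dicts with 'cookie', 'user_agent', 'ip' keys."""
    seen = defaultdict(set)  # cookie -> distinct (user_agent, ip) pairs seen
    for r in records:
        seen[r["cookie"]].add((r["user_agent"], r["ip"]))
    return {cookie for cookie, pairs in seen.items() if len(pairs) > 1}
```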

3.2 Our Approach
We apply a Bayesian approach to work around the lack of ground truth. Instead of attempting to conclusively identify which clicks are click-spam, for a given ad we create two scenarios, detailed below, where the (unknown) fractions of click-spam traffic are different. We link the two using a Bayesian formula to effectively cancel out quantities we cannot measure. The remaining quantities are those an advertiser can measure locally without requiring a global view. We control for noisy data (i.e., adjust for false-positives and false-negatives) using a control experiment. Our approach does not (and indeed cannot) report whether a specific click is click-spam or not. The output is a single number representing our estimate of the fraction of clicks for the given ad that click-spam accounts for.

3.2.1 Data Collection
The detailed data collection procedure is as follows.

1. The advertiser (or a researcher signed up as an advertiser with an ad network) creates his ads of interest in the usual way. That is, he enters the text of the ad, selects ad targeting criteria (i.e., search keywords, user demographics, mobile vs. non-mobile, etc.), sets his advertising budget (i.e., auction bid amount, daily budget), and sets the destination URL of the ad to his landing-page — a page on the advertiser's website the user should be sent to.

2. When a user clicks the ad, the advertiser website either loads the intended landing-page for the ad, or with some (small) probability first loads an interstitial page before eventually loading the landing-page.

The point of the interstitial page is to turn away some fraction of users (legitimate clicks or not). To this end, the interstitial page may require some passive or active user engagement before continuing on to the landing-page. An example of passive engagement is to require the user to stay on the page for a few seconds before continuing on to the landing-page (the interstitial page in this case may simply show a "loading..." message; it is known that a few seconds' increase in page loading time can result in a measurable drop in traffic [6]). Examples of active engagement are to require the user to click a link (e.g., "This page has moved. Click here to continue.") or, in the extreme case, to solve a CAPTCHA [15], both of which also result in significant drops in traffic.
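A passive (delay) interstitial can be as simple as a page that shows a loading message and forwards after five seconds. The sketch below is our illustration of the idea, assuming Flask; the routes and page text are hypothetical. The active variant would instead render a "Click here to continue" link pointing at the landing-page.

```python
# Sketch of a delay interstitial: show "loading..." for 5 seconds, then
# forward to the landing page. Illustrative only; routes are hypothetical.
from flask import Flask

app = Flask(__name__)

@app.route("/interstitial")
def interstitial():
    # Meta-refresh forwards the browser after 5 seconds; visitors unwilling
    # to wait (disproportionately click-spam) never reach the landing page.
    return ('<html><head>'
            '<meta http-equiv="refresh" content="5;url=/landing">'
            '</head><body>loading...</body></html>')

@app.route("/landing")
def landing():
    # Arrivals here via the interstitial are what the advertiser counts.
    return "landing page"
```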

Assumption 1. An implicit assumption is that the interstitial page will turn away a (unknown) larger portion of the click-spam traffic than the (unknown) portion of legitimate traffic it turns away. Specifically, we do not assume, for example, that all good users will patiently wait a few seconds; rather we assume that a smaller portion of click-spam traffic will wait the duration than the portion of good users doing so. Indeed, user dwell time and willingness to click have long been considered an implicit measure of user interest [29, 35]. Interstitial pages leverage this knowledge to enhance the quality of the traffic reaching the landing-page by some (unknown) amount as compared to without the interstitial page.

3. Next the advertiser defines some advertiser-specific user engagement that he considers ultimate proof that the user intended to click the ad — for example, the user makes a financial transaction on the advertiser's site, or signs up for some mailing list, etc. We call such users gold-standard users. Certainly not all users are gold-standard. But as long as a handful of users coming directly to the landing-page, and a handful coming to the landing-page via the interstitial page, are gold-standard, we can apply our Bayesian approach below.

Assumption 2. We make an implicit assumption that for users that are not turned away by the interstitial page, their likelihood of becoming gold-standard users is not significantly changed. We do not assume that the interstitial page will not turn away users who may have become gold-standard users. Rather we assume, for example, that some users will not wait on the interstitial page, but if they do, they will not hold a grudge. We empirically find this assumption to hold in practice, as we report later.

4. Lastly, extending the technique in [25], the advertiser creates a second ad that is identical to the ad created in step 1 above with the exception of the text of the ad. That is, the second ad has the same targeting criteria, advertising budget, and destination URL, but the text of the ad is junk (e.g., a random nonsensical combination of words). We use this ad to adjust for false positives.

Assumption 3. We assume that very few users will intentionally click the second ad. While some users may be curious about the random set of words, we find this assumption is borne out in practice. We also assume for now that the click-spam click-through-ratio (i.e., the ratio of click-spam clicks to impressions of the ad) is independent of the text of the ad (we relax this assumption later).

3.2.2 Bayesian Estimation
Let $G_d$ and $G_i$ be the events that the user is a gold-standard user that arrived directly, or via the interstitial page, respectively. Let $I_d$ and $I_i$ be the events that the user intended to click the ad (i.e., not click-spam) out of all users directly reaching the landing-page, or all users reaching it via the interstitial page, respectively.

Bayesian equation for $P(I_d)$: The advertiser is interested in learning $P(I_d)$. $G$ and $I$ are linked using Bayes' theorem as follows: $P(G|I) = P(I|G) \times P(G) / P(I)$. Note that a gold-standard user implies that the click is not click-spam, i.e., $P(I|G) = 1$. As discussed above in Assumption 2, $P(G_d|I_d) \simeq P(G_i|I_i)$; substituting and simplifying yields:

$$P(I_d) = \frac{P(G_d) \times P(I_i)}{P(G_i)}$$

$P(G)$ is computed as the ratio of the number of gold-standard clicks ($g$; known) to the number of clicks ($n$; known). $P(I_i)$ is the ratio of the number of non-click-spam clicks ($i_i$; unknown) to the number of clicks ($n_i$; known) arriving via the interstitial page. The above equation reduces to:

$$P(I_d) = \frac{\frac{g_d}{n_d} \times \frac{i_i}{n_i}}{\frac{g_i}{n_i}} = \frac{g_d \times i_i}{n_d \times g_i} \qquad (1)$$

Only $i_i$ on the right-hand side is unknown.

Estimating $i_i$: As mentioned, the interstitial page enhances the traffic quality. If it did a perfect job — i.e., all unintentional clicks were turned away, and all intentional clicks passed through — computing $i_i$ would be trivial. In practice, the interstitial page has false-negatives (it turns away users intentionally clicking the ad) and false-positives (it does not turn away click-spam). False-negatives do not affect $i_i$, since $i_i$ is the number of non-click-spam clicks actually reaching the landing-page, but false-positives result in an inflated value of $i_i$. We use the control ad to estimate the number of false-positives, and adjust for it.

Let $F_i$ and $T_i$ be the false-positive and true-positive click-through-ratios for the original ad and the interstitial page $i$. Similarly, let $F'_i$ and $T'_i$ be the false- and true-positive click-through-ratios for the control ad (identical ad except with junk ad text, as mentioned above). We have four unknowns, and need four equations to solve. As discussed above in Assumption 3, we assume $T'_i \simeq 0$ and $F'_i = F_i$. The advertiser can measure $F_i + T_i = l_i/d$, where $l_i$ is the number of clicks for the original ad reaching the landing-page through the interstitial page and $d$ is the number of impressions of the original ad as reported by the ad network, and the corresponding equation $F'_i + T'_i = l'_i/d'$ for the control ad. The estimate for $i_i$ is simply $T_i \times d$. Solving the four equations above for $T_i$, we get the value of $i_i$ adjusted downwards to account for false-positives: $i_i = l_i - l'_i \times \frac{d}{d'}$.
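Spelling out the algebra behind the solving step just described (our expansion, shown for completeness):

```latex
% The four relations: two measured click-through ratios plus Assumption 3.
\begin{align*}
F_i + T_i = \frac{l_i}{d}, \qquad
F'_i + T'_i = \frac{l'_i}{d'}, \qquad
T'_i \simeq 0, \qquad
F'_i = F_i .
\end{align*}
% Hence F_i = F'_i = l'_i / d', and therefore
\begin{align*}
T_i = \frac{l_i}{d} - \frac{l'_i}{d'}, \qquad
i_i = T_i \times d = l_i - l'_i \times \frac{d}{d'} .
\end{align*}
```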

Final estimation formula: Combining the above with Eq. (1), we can estimate the click-spam rate for the original ad as:

$$P(I_d) = \frac{g_d \times \left(l_i - l'_i \times \frac{d}{d'}\right)}{n_d \times g_i} \qquad (2)$$

where:

$g_d$, $g_i$: numbers of gold-standard users arriving directly and through the interstitial page, respectively
$d$, $d'$: numbers of impressions of the original ad and control ad, respectively
$l_i$, $l'_i$: numbers of clicks reaching (via the interstitial page) the landing-page for the original ad and control ad, respectively
$n_d$: number of clicks on the original ad directly reaching the landing-page

All these quantities can either be measured directly by the advertiser, or are present in the billing reports ad networks generate today.
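Eq. (2) is directly computable from these quantities. A minimal sketch (the function and argument names are ours, not from the paper):

```python
# Sketch of the Eq. (2) computation. P(I_d) is the estimated fraction of
# clicks directly reaching the landing page that are intentional; the
# click-spam rate is its complement. All names here are illustrative.
def estimate_p_intentional(g_d, g_i, n_d, l_i, l_i_ctrl, d, d_ctrl):
    if g_i == 0 or n_d == 0:
        raise ValueError("no gold-standard users / direct clicks: undefined")
    # i_i adjusted downwards for false positives using the control ad:
    # i_i = l_i - l'_i * (d / d')
    i_i = l_i - l_i_ctrl * (d / d_ctrl)
    return (g_d * i_i) / (n_d * g_i)

# Illustrative invocation only (numbers are not measurements from the paper):
# p = estimate_p_intentional(g_d=12, g_i=9, n_d=300, l_i=80,
#                            l_i_ctrl=5, d=40000, d_ctrl=40000)
# click_spam_rate = 1 - p
```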
3.3 Limitations
A key limitation of our approach is that the advertiser must actively measure click-spam. The advertiser must interpose the interstitial page on live traffic (that he has paid for), create a control ad (that he also needs to pay for) to correct for false-positives, etc. Both the interstitial page and the control ad harm the user experience. It would be far more desirable to passively look at logs and estimate the click-spam rate from them.

One way to minimize the user experience impact is to apply our approach reactively when click-spam is suspected, but that runs into a second limitation — the rarity of data. Any estimation technique requires statistically significant data. The crucial factors in Eq. (2) are $g_i$ and $g_d$, the numbers of gold-standard users. If these are small, the click-spam estimate can swing wildly. Suppose the advertiser manages to identify two gold-standard users, one arriving through the interstitial page and one directly, and computes a click-spam estimate based on them. If one new gold-standard user arrives through the interstitial page (or directly), the new estimate is half (or double) the previous one. For a statistically significant estimate, as we report later, the advertiser must wait for roughly 25 gold-standard users via the two paths. This is especially an issue for small advertisers, who may have to wait a long time to get gold-standard users: low advertising budgets mean their ads don't get shown as much; even if shown, users may not click on poorly ranked ads; even if they click, they may not engage in a financial transaction with the advertiser; etc. The need to gather data over such an extended period is clearly at odds with minimizing the impact on user experience.

A group of small advertisers targeting similar keywords/users (or an ad agency representing them) can apply our approach in the aggregate. Doing so has two benefits. First, due to the aggregation effect, the group accretes statistically significant data more quickly. Second, the user experience impact is amortized across many advertisers. The downside, however, is that advertisers lose the ability to individually define what a gold-standard user means (which our approach otherwise allows) and have to depend on someone other than themselves to estimate click-spam rates.

Finally, our approach is naturally sensitive to the three choices the advertiser needs to make: 1) what his definition of a gold-standard user is, 2) what interstitial page approach he wishes to use, and 3) what the text of the control ad is. We discuss the implication of each design decision in turn. First, if the advertiser sets too high a bar for the gold-standard user, he may not get statistically significant data; if he sets so low a bar that even click-spam users get classified as gold-standard, he will underestimate click-spam rates. Second, if the advertiser picks too easy an interstitial page (everyone gets through), in Eq. (2) $g_d/g_i$ will approach $n_d/l_i$ and the estimate will approach 1 (i.e., all clicks are legitimate) if the advertiser doesn't use a control ad, or 0 (i.e., no clicks are valid) if he does. If the advertiser picks too hard an interstitial page (no one gets through), $g_i$ and $l_i$ will both approach 0, and the click-spam estimate will become undefined. Thus there is clearly some sweet-spot in designing the interstitial page, which we do not discuss. Third, if the control ad is not independent of the original ad (e.g., the random choice of words happens to be related to the original ad), false-positives may be over- or under-corrected for. Making the right design choices is advertiser-specific.

To address the above issues to some degree, we report in the next section our experience with multiple types of interstitial pages and different definitions and numbers of gold-standard users. While our data shows much promise in our approach, we stress that a more thorough evaluation is needed.

4. MEASURING CLICK-SPAM TODAY
In this section we first validate the correctness of our approach from the previous section. We then conduct a large-scale measurement study of ten major ad networks and four types of ads.

Validation strategy: We assume that reputed search ad networks (specifically Google and Bing) are mature enough that their in-house algorithms detect and discount most of the click-spam on their search affiliate networks. Validating our measurement approach then involves computing our click-spam estimate and comparing it to the charged clicks for Google and Bing search ads. Note that our algorithm does not have access to any data (including historical and aggregate data) that the in-house algorithms at Google and Bing have access to, and Google and Bing do not have access to the detailed user-engagement data we collect as advertisers for users clicking our ads (specifically, we do not use any of the analytics products offered by Google or Bing). Given that the datasets are completely different, if the click-spam rates we compute match those computed by the leading ad networks (which they do, as we report below), we have strong reason to believe that our measurement approach is sound.

4.1 Methodology
We sign up with ad networks as three different advertisers (each targeting different keywords) and follow the methodology from the previous section. The first advertiser targets a highly popular keyword (celebrity); the second, a medium-popularity keyword (yoga); and the third, a low-popularity keyword (lawnmower). We pick the keywords from a ranked list of popular keywords that the advertising tools of these ad networks provide.

For each keyword we create a realistic-looking landing page, since the policies of the ad networks require us to use keywords that are relevant to the landing page. We instrument the landing page to track mouse movement, time spent on the page, switching of browser tabs into or away from our page, whether any link on the page was clicked, page scroll, etc. Our instrumentation is through Javascript on the page; we detect browsers that don't support Javascript or have it disabled and exclude those data points. We direct on-path proxies (if any) to not cache any response, so our server logs accurately reflect accesses. Unless otherwise mentioned, we pick a lax definition of gold-standard users based on this telemetry: the user stays on the page for at least five seconds and (for non-mobile browsers) produces at least one mouse/cursor move event.
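In code, this lax definition amounts to a simple predicate over the collected telemetry (a sketch; the record fields are our own naming, not the actual log schema):

```python
# Sketch: classify a click as gold-standard under the lax definition used
# here (>= 5s dwell and, on non-mobile browsers, >= 1 mouse-move event).
# Field names are illustrative, not the paper's actual log schema.
def is_gold_standard(visit, mobile=False):
    """visit: dict with 'dwell_seconds' and 'mouse_events' keys."""
    if visit["dwell_seconds"] < 5:
        return False
    if not mobile and visit["mouse_events"] < 1:
        return False
    return True
```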

We are unable to define gold-standard users based on financial transactions, even though we expect that to be a very strong signal of intent for real advertisers.

Next we create three interstitial pages: the first shows a loading message for five seconds before automatically redirecting to the landing-page; the second asks the user to click a link to continue to the landing-page; and the third asks the user to solve a CAPTCHA. We do not test the CAPTCHA interstitial for Google traffic since their advertiser policies restrict us from doing so.

We then create four ads for each target landing-page. The first ad directly takes the user to the landing page. The second, third, and fourth ads first take the user to the three interstitial pages respectively, before continuing on to the landing page. All ads target the same keyword(s), user demographics, device and platform types, etc. The reason we create four separate ads (instead of a single one and interposing the interstitial page after the click) is so that Google/Bing produce fine-grained billing reports and statistics for each ad, against which we can then validate our design choices. We create four additional (control) ads for each landing-page that correspond to the four original ads, but with junk ad text. The ad text was generated by picking five random words from an English dictionary (e.g., Figure 2).

Figure 2: A control ad; the content is a random set of English words.
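Generating such junk text is straightforward; for instance (a sketch — the wordlist path is illustrative):

```python
# Sketch: generate control-ad text by picking five random words from an
# English dictionary. The wordlist path is illustrative.
import random

def junk_ad_text(wordlist_path="/usr/share/dict/words", n_words=5):
    with open(wordlist_path) as f:
        words = [w.strip() for w in f if w.strip().isalpha()]
    return " ".join(random.sample(words, n_words))
```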
4.2 Scale
We repeat the above for 10 ad networks. For search ads we measure Google Search, Bing Search, and 7Search. For contextual ads we measure Google AdSense and Bing Contextual. For mobile ads we measure Google Mobile, Bing Mobile, AdMob (now owned by Google), and InMobi. And lastly, for social ads, we measure Facebook. Altogether this adds up to 216 ads across all the networks. We run the ads for a period of 50 days as needed to gather enough data. The majority of the ads were flighted in early January 2012. We continually adjust bids (mostly revising them higher) to help the lower-popularity ads quickly attract enough data.

In all, our ads were shown 26M times across all ad networks. They resulted in a total of 85K clicks (17K charged). Our ads were shown at at-least 1811 publisher websites and mobile apps (but the true number is likely much higher, since we cannot determine the publisher for over 65% of our traffic). The landing pages were fetched by a total of 33K unique IP addresses located in 190 countries. We encountered over 7200 browser User-Agent strings (after sanitizing them to remove browser plugin version numbers).

4.3 Data
We log all web requests made to our server. The logs used in this study are standard Apache webserver logs that include the user's IP address, date and time of access, URL accessed (of a page on our webserver) along with any GET parameters, the HTTP Referer value and User-Agent value sent for that request, and a cookie value we set the first time we see a user to identify repeat visits from the same user. The raw logs including user engagement telemetry weigh in at over 3 GB. A sanitized version of our raw logs is available online².

²http://www.cs.utexas.edu/~vacha/sigcomm12-clickspam.tar.gz

4.4 Ethics
Throughout our study we followed the advertiser terms-of-service (current as of when we did the measurement) for each of the ad networks we measured. Whenever our ads were rejected by the ad network (due to policy reasons), we fixed the issue so as to be compliant; if we couldn't fix it, we simply dropped that data-point.

High click-spam is an embarrassment for ad networks. Our goal in this paper is to systematically design a methodology, highlight the severity of the click-spam problem, and give researchers the tools and knowledge to further the state of the art. Our goal is not to embarrass ad networks. As a result, we prefer to report normalized or relative numbers whenever possible, and anonymize ad network names whenever it does not affect the core message of this paper.

Lastly, we expressly try to minimize adversely impacting user experience on these ad networks. For example, in order to get enough clicks on an ad, we have two options: run the campaign for longer, or increase the bid amount. We always choose the latter to minimize the time our ad is active on the network. For ads where, despite increasing the bid, we cannot gather traffic fast enough, we prefer to give up on that data-point and stop running that ad. Minimizing the time our ads are active also minimizes our contribution to the existing auction volatility for the keywords we bid on. As of this writing we have not received any complaints from ad networks, users, or advertisers regarding our study.

4.5 Validation
Figure 3: Normalized estimates for search ads (fraction valid, normalized; ads: celebrity, yoga, lawnmower; networks A, B, C).

Figure 3 compares the normalized complementary click-spam rates computed by our approach (plotted as error bars) and the normalized complementary click-spam rates as reported by Bing and Google for their search ad networks (plotted as bars). We ran another experiment where we explicitly set our ad campaign to exclude syndicated search partners for one of the search ad networks (plotted as C in Figure 3). The Bing and Google estimates are the ratio between the number of clicks we were charged for (from the billing report) and the total number of landing-page fetches (from our logs). For our approach, we calculate two separate estimates based on the delay and click interstitial pages. The spread of the error bar plots the max and the min of the estimates we compute; the center tick plots the average. The figure plots the estimates for all three ads we flighted. In line with our goals, this figure (and all other figures in this section) is normalized so that one of the data points is 1.
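The network-reported values in Figure 3 thus reduce to simple ratios; a sketch (helper names are ours, and normalizing by the maximum is one natural reading of the normalization above):

```python
# Sketch: a network's reported fraction of valid clicks is the ratio of
# charged clicks (billing report) to landing-page fetches (our logs).
# Values are then normalized so one data point is 1; we divide by the
# maximum here, which is one natural choice. Names are illustrative.
def network_fraction_valid(charged_clicks, landing_page_fetches):
    return charged_clicks / landing_page_fetches

def normalize(values):
    peak = max(values)
    return [v / peak for v in values]
```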

As is evident from Figure 3, our estimates for the yoga and lawnmower ads are in the same ballpark as those reported by Google and Bing. We manually investigated the difference between our estimate and theirs for the celebrity ad. We found over 50 clicks from sites associated with well-known search redirection viruses, where browser toolbars hijack normal user searches and funnel them through affiliate search programs (Section 5.3.1 has more details). We were charged for at least 48 of these clicks. Our estimates match the search ad network's estimates recomputed after discounting these clicks. Furthermore, as seen for network C, where syndicated search partners are excluded, our estimates closely match those reported by the network. There were no clicks on the control ads on network C, which further supports our high estimate. Our absolute numbers also agree with public estimates of average click-spam for these networks [3]. Next we drill deeper to validate our design decisions.

Figure 4: Design space exploration. (a) Interstitial page performance (fraction of clicks cleared, original vs. control ad, for the click, delay, and captcha interstitials); (b) Gold-standard user definition (fraction gold-standard, original vs. control ad, for the 5s/1-mouse-event, 15s/5-event, and 30s/15-event definitions); (c) Convergence (delta from final estimate vs. number of gold-standard users).

4.5.1 Interstitial Pages
Figure 4a plots the fraction of clicks for each interstitial page that reach the landing page for the celebrity ad, and for the corresponding control ad. Note that this fraction drops as the interstitial page changes from clicking a link (29%), to waiting 5 seconds (13%), to solving a CAPTCHA (4%), demonstrating the increasingly higher bar set by the interstitial pages. Interestingly, we find users are more likely to click through to the landing-page than to wait 5 seconds. Except for the CAPTCHA interstitial, the fraction reaching the landing page is significantly lower for the control ad than for the original ad; this validates Assumption 1 from the previous section that the interstitial page concentrates non-click-spam traffic (by some unknown amount). Despite the varied interstitial page performance, the estimates computed from the delay and click interstitials converge (in Figure 3) for the experiments where we have a balanced number of converters through the interstitial and direct paths, which supports Assumption 2. The CAPTCHA seems to reduce both normal and control traffic to the same low base level regardless of user intent; as a result, it is unsuitable for use in our framework.

4.5.2 Gold-Standard Users
Figure 4b plots the fraction of gold-standard users for the original celebrity ad and the control ad, for three different definitions of gold-standard users. The first definition is, as before, 5s of dwell time and 1 mouse event. The second definition is 15s of dwell time and 5 mouse events. The third definition is 30s of dwell time and 15 mouse events.
Note that the fraction of gold-standard users for the control ad is zero for the second and third definitions. This validates Assumption 3 that very few users are curious enough to click the control ad. In a real-world setting, we expect advertisers to define gold-standard users based on financial transactions (much tighter than any of our definitions).

We focus next on the sensitivity of the click-spam estimate to the number of gold-standard users. Figure 4c plots the convergence of our click-spam estimate as a function of the number of gold-standard users for various combinations of ad, interstitial page, and definition of gold-standard user. X-values are driven by the periodicity of ad net

4.6 Search Ad Networks
Figure 3 shows that while reputed search ad networks generally have a good handle on click-spam, a single average click-spam metric across the entire network is of little use, since different keywords experience different levels of click-spam. An advertiser cares only about click-spam rates for the keywords he is interested in bidding on. The difference in click-spam rates between the celebrity ad and the lawnmower ad is up to 20% (normalized).

We omit discussion of 7Search since we did not get any gold-standard users through that network to base our estimates on.

Figure 5: Normalized estimates and dwell times for mobile ads. (a) Mobile Ad Network (fraction valid, normalized, networks A–D); (b) Dwell Time (CDF of dwell time, 0–10s).

4.7 Mobile Ad Networks
Figure 5a plots our click-spam estimates and the networks' own estimates for the celebrity ad across the four mobile ad networks we measured. Despite running our ads for over a month, and weakening our definition of gold-standard users to only 5s of dwell time (i.e., the user spent 5s on our landing page; no tap event required), we failed to attract even five gold-standard users for the yoga and lawnmower ads, and attracted fewer than twenty for the celebrity ad. While our estimates are below the convergence threshold, to investigate the huge difference between our interim estimates and the ad network numbers, we plot the CDF of user dwell-time in Figure 5b.

Ad network A charged us for over a third of the clicks (non-normalized), yet as illustrated by the y-intercept in Figure 5b, over 95% of network A users spent under a second on our landing-page! We find evidence of an attack that would result in such a signature in Section 5.4.1. Network D appears to be quite well aware of the poor traffic quality on their network; they charged us for less than 1% of the clicks. Mobile ad network C is a curious case. There is practically no difference between the click-through-rate (CTR) of our original ad and the CTR of our control ad with junk text, suggesting that the content of the ad is irrelevant for users clicking ads on this network. Our Bayesian formula understandably estimates click-spam to be nearly 100% for this network despite the network