Modeling Indirect Effects of Paid Search Advertising: Which Keywords Lead to More Future Visits?
CELEBRATING 30 YEARS Vol. 30, No. 4, July–August 2011, pp. 646–665 10.1287/mksc.1110.0635 issn 0732-2399 eissn 1526-548X 11 3004 0646 © 2011 INFORMS
Modeling Indirect Effects of Paid Search Advertising: Which Keywords Lead to More Future Visits?
Oliver J. Rutz School of Management, Yale University, New Haven, Connecticut 06511, [email protected] Michael Trusov Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, [email protected] Randolph E. Bucklin Anderson School of Management, University of California, Los Angeles, Los Angeles, California 90095, [email protected]
any online shoppers initially acquired through paid search advertising later return to the same website Mdirectly. These so-called “direct type-in” visits can be an important indirect effect of paid search. Because visitors come to sites via different keywords and can vary in their propensity to make return visits, traffic at the keyword level is likely to be heterogeneous with respect to how much direct type-in visitation is generated. Estimating this indirect effect, especially at the keyword level, is difficult. First, standard paid search data are aggregated across consumers. Second, there are typically far more keywords than available observations. Third, data across keywords may be highly correlated. To address these issues, the authors propose a hierarchical Bayesian elastic net model that allows the textual attributes of keywords to be incorporated. The authors apply the model to a keyword-level data set from a major commercial website in the automotive industry. The results show a significant indirect effect of paid search that clearly differs across keywords. The estimated indirect effect is large enough that it could recover a substantial part of the cost of the paid search advertising. Results from textual attribute analysis suggest that branded and broader search terms are associated with higher levels of subsequent direct type-in visitation. Key words: Internet; paid search; Bayesian methods; elastic net History: Received: November 3, 2009; accepted: December 22, 2010; Eric Bradlow served as the editor-in-chief and Carl Mela served as associate editor for this article. Published online in Articles in Advance June 6, 2011.
1. Introduction spent on that click is “lost”; i.e., it has not gener- In paid search advertising, marketers select specific ated any revenue. This line of research has not yet keywords and create text ads to appear in the spon- addressed the subsequent revenues that may be gen- sored results returned by search engines such as erated by consumers returning to the site. Revenues Google or Yahoo!. When a user clicks on the text (e.g., sales, subscriptions, referrals, display ad impres- ad and is taken to the website, the resulting visit sions) generated by site visitors who arrive by means can be attributed directly to paid search. Existing not directly traced to paid search are typically consid- research has focused on modeling traffic and sales ered to be separate. Visitors, however, could decide generated by paid search in this direct fashion (e.g., not to purchase at the first (paid search originated) Ghose and Yang 2009, Goldfarb and Tucker 2011, Rutz session but return to the site later either to view more 1 et al. 2010). In these models, the effectiveness of content or perhaps to complete a transaction. Indeed, paid search is evaluated by attributing any ensuing after initially coming to a site as a result of a paid direct revenues and pay-per-click costs to keywords search, consumers could return again and again, cre- or categories of keywords. These approaches assume ating an ongoing monetization opportunity. The prob- implicitly that if a consumer does not purchase in lem of quantifying these indirect effects of paid search the session following the click-through, the money is similar to the one described by Simester et al. (2006) in the mail-order catalog industry. They point out that 1 A notable exception is a dynamic model developed by Yao and a naïve approach to measuring the effectiveness of Mela (2011). By leveraging bidding history information in a unique mail-order catalogs omits the advertising value of the data set from a specialized search engine in the software space, the authors are able to infer present and future advertiser value arising catalogs and that targeting decisions based on models from both direct and indirect sources. without indirect effects produce suboptimal results.
646 Rutz, Trusov, and Bucklin: Modeling Indirect Effects of Paid Search Advertising Marketing Science 30(4), pp. 646–665, © 2011 INFORMS 647
In the case of paid search advertising, ignoring indi- and one might posit that each keyword has its own rect effects can positively bias the attribution of sales specific indirect effect. From the firm’s perspective, to other sources of site traffic and downwardly bias the indirect effect is important to quantify to effec- keyword-level conversion rates. tively manage paid search advertising. Consumers can access a website by typing the com- There are several challenges in doing this. First, pany’s URL into their browser or using a previously there is typically no readily available consumer-level visited URL (stored either in the browser’s history or data that firms can use to accurately identify a previously saved “bookmarks”). Because these visits “return” visit. Hence, attribution of future direct type- occur without a referral from another source, they fall in traffic to keyword-level paid search traffic needs to into the class of traffic known as nonreferral, and we be done based on aggregate data. Second, the num- label them direct type-in visits.2 Referral visits can be ber of keywords managed by the firm is often much due to paid search (the consumer accesses the site via larger than the number of observations, e.g., days, the link provided in the text ad at the search engine), available. Finally, paid search traffic generated by sub- organic search (the consumer clicks on a non-paid, sets of keywords is frequently highly correlated. To organic link at the search engine), or other Internet handle these problems, we introduce a new method links, such as clicking on a banner advertisement. In to the marketing literature: a flexible regularization this paper, we focus on referrals from a search engine and variable selection approach called the elastic net (both organic and paid) and their impact on future (Zou and Hastie 2005). We adopt a Bayesian ver- nonreferral visits (the so-called direct type-in). Specif- sion of the elastic net (Li and Lin 2010, Kyung et al. ically, we are interested in how past paid search visits 2010) that allows us to estimate keyword-level indi- affect the current number of direct type-in visits to rect effects across thousands of keywords even with a site. a small number of observations. We also extend the Shoppers formulate their search queries in dif- literature on Bayesian elastic nets by incorporating a ferent ways depending on their experience, shop- hierarchy on the model parameters that inform vari- ping objectives, readiness to buy, and other factors. able selection. This allows us to leverage information Indeed, managers at the collaborating firm that pro- the keyword provides by introducing semantic key- vided our data—an automotive website—told us that word characteristics. To extract these semantic char- when they switched off some keywords, a drop in acteristics from keywords, we introduce an original direct type-in visitors occurred “a couple of days structured text-classification procedure that is based later,” whereas switching off others did not affect on managerial knowledge of the business domain and future direct type-in traffic. We posit that self-selection design elements of the website. The proposed proce- into the use of different keywords may reflect the dure is extendable to other data sets and is easy to propensity of a paid search visitor to return to the implement. site later via direct type-in. This proposition is con- We test our model on data from an e-commerce sistent with previous research in paid search (e.g., website in the automotive industry. From our per- Ghose and Yang 2009, Goldfarb and Tucker 2011, Rutz spective, the indirect effects should be especially pro- et al. 2010, Yao and Mela 2011) that showed hetero- nounced in categories such as big-ticket consumer geneity of direct effects across keywords. We seek to durable goods, where the consumer search process is contribute to this stream of research by expanding long (e.g., Hempel 1969, Newman and Staelin 1972). this notion of keyword-level heterogeneity into the Given the nature of the product and evidence for a domain of indirect effects. For example, imagine two prolonged search process (e.g., Klein 1998, Ratchford consumers wanting to buy a DVD player. One of them et al. 2003), we believe that this category is well suited searches for “weekly specials Best Buy DVD player” to study indirect effects in paid search. For example, and the other for “Sony DVD player deal.” An ad Ratchford et al. (2003) find that consumers, on aver- for http://www.bestbuy.com is displayed after both age, spend between 15 and 18 hours searching before searches, and both consumers click on the ad. From purchasing a new car. Best Buy’s perspective, it seems more likely that the Our intended contribution is both methodological consumer who searched for Best Buy’s weekly spe- and substantive. Methodologically, given its robust cials will revisit the site. This leads to the notion that performance in handling very challenging data sets keywords may be proxies for different types of con- (i.e., many predictors, few observations, correlation), sumers and/or search stages. Thus, they may be more we believe that the (Bayesian) elastic net will find or less effective for the advertiser based on potential many useful applications among marketing practi- differences in their indirect effects on future traffic, tioners and academics. To the best of our knowledge, we are the first to introduce this model to the mar- 2 Note that the term “direct type-in” should not be confused with keting field. Moreover, we offer an important exten- the direct effect of paid search. sion to this model by allowing the variable selection Rutz, Trusov, and Bucklin: Modeling Indirect Effects of Paid Search Advertising 648 Marketing Science 30(4), pp. 646–665, © 2011 INFORMS parameters to depend on a set of predictor-specific and corporate policy. First, firms may require each information in a hierarchy. Substantively, we expand entering visitor to provide log-in identification infor- the paid search literature by introducing another mation that can help with tracking across multiple dimension of keyword performance—propensity to visits. Although this approach is indeed used for generate return traffic. We find significant heterogene- existing customers of some sites—often subscription- ity in indirect effects across keywords and demon- type services (e.g., most of the features of Netflix or strate that propensity to return is, in part, revealed Facebook require users to be logged in)—it is unlikely through the textual attributes of the keywords used. to be effective with new customers. Because new Our empirical findings offer new insights from text customer acquisition is one of the primary goals of analysis and should help firms to more accurately paid search advertising, asking for registration at an select and evaluate keywords. early stage is likely to turn many first-time visitors The rest of this paper is structured as follows. First, away. Second, although tracking is typically imple- 3 we give a brief introduction to paid search data and mented by use of persistent cookies, this approach discuss the motivation for our modeling approach. has become less reliable in recent years. Current We then present our model, data set, and results. releases of major Internet browsers (e.g., Internet Next, we discuss the implications of our findings for Explorer, Mozilla Firefox, Google Chrome) support a feature that automatically deletes cookies at the end paid search management. We finish with a conclusion, of each session. In addition, many antivirus and anti- the limitations of our approach, and a discussion of adware programs—now in use on more than 75% of future research in search engine marketing. Internet-enabled devices (PC Pitstop TechTalk 2010)— do the same. Although there seems to be no con- 2. Modeling Approach and sensus on the percentage of consumers who delete Specification cookies regularly, reported numbers range up to 40% (Quinton 2005, McGann 2005). A cookie-based track- We propose to examine indirect effects of paid search ing approach would misclassify any repeat visitor as a by modeling the effect of previous paid search visits new visitor if the cookie was deleted, biasing any esti- on current direct type-in visits. Our goal is to do this mate that is based on the ensuing faulty repeat versus based on the type of readily available, aggregate data new visitor split. Finally, as a corporate policy, many provided to advertisers by search engines, along with companies in the public sector (39% of state and 56% easily tracked visit logs from company servers. of federal government websites) and some companies in the private sector (approximately 10%) simply pro- 2.1. Aggregate vs. Tracking Data hibit tracking across sessions (West and Lu 2009). Ideally, firms would have consumer-level data at In the absence of tracking data, aggregate data are the keyword level tracked over time. Although the available that capture how visitors have reached the “ideal” scenario is attractive, our view is that it is site. Visitors to any given website are either refer- effectively infeasible. There are two places where ral or nonreferral visitors. A nonreferral visit occurs these data can be potentially collected—firms’ web- if a consumer accesses the website by typing in the sites or consumers’ computers. All other intermedi- company’s URL, uses a URL saved in the browser’s ate sources (such as search engines and third-party history, or uses a previously saved (“bookmarked”) referral sites) cannot provide a complete picture of the URL. We call these visits direct type-in.A referral visit interaction between consumers and firms. For exam- is one that comes from another online source. For ple, Google may be able to track all paid and organic example, consumers can use a search engine to find visits to firms’ websites originated from its engine, the site and access it by clicking on an organic or a but it will not have much information about direct paid search result.4 In this paper, we focus on visitors visits or traffic generated through referrals. Needless referred by search engines and call these organic and to say, because of consumer privacy concerns, search paid search visits. We note that the proposed model engines are highly unlikely to share consumer-level can be extended to accommodate other referral traffic data with advertisers (e.g., Steel and Vascellaro 2010). such as banner ads or links provided by listing ser- Alternatively, tracking on the consumers’ end could vices such as Yellow Pages. be implemented by either recruiting a panel or using data from a syndicated supplier who maintains a 3 IP address matching is another alterative, but as Internet service panel. Unfortunately, the size of such a panel needed providers commonly use dynamic addressing and corporate net- to ensure adequate data on the keywords used by a works “hide” the machine IP address from the outside world, it is mostly infeasible. given firm would be impractical—especially for less 4 Companies can also try to “funnel” consumers to their site by popular keywords in the so-called “long tail.” Cross- using online advertising. A company can place a banner ad on the session consumer tracking on firms’ websites is also Internet, and consumers can access the company’s site by clicking problematic for several reasons—practical, technical, on that banner ad. Rutz, Trusov, and Bucklin: Modeling Indirect Effects of Paid Search Advertising Marketing Science 30(4), pp. 646–665, © 2011 INFORMS 649
2.2. Method Motivation regressions (e.g., Bai and Ng 2002) make such infer- From a methodological perspective, the objective of ence difficult. Some regularization techniques, such our study is to demonstrate how a modified version as Ridge regression (e.g., Hoerl and Kennard 1970), of the Bayesian elastic net (BEN; e.g., Li and Lin 2010, do not allow subset selection but instead produce a Kyung et al. 2010) can be used in marketing appli- solution in which all of the predictors are included cations where (1) the number of predictor variables in the model. The second condition eliminates tech- significantly exceeds the number of available obser- niques that become “saturated” when the number of vations, (2) predictors are strongly correlated, and predictors reaches the number of observations. For (3) model structure (but not just a forecasting perfor- this reason models such as LASSO regression (e.g., mance) is of focal interest to the researcher or prac- Tibshirani 1996, Genkin et al. 2007) are not applica- titioner. Zou and Hastie (2005, p. 302), who coined ble in our context. Although LASSO allows exploring the term “elastic net,” describe their method using the the entire space of models, the “best” model cannot analogy of “a stretchable fishing net that retains all have more than n predictors selected. Finally, meth- the big fish.” Indeed, the method is meant to retain ods that address dimensionality reduction perform “important” variables (big fish) in the model while poorly when high correlations exist among subsets of shrinking all predictors toward zero (therefore let- predictors. The multicollinearity problem is particu- ting small fish escape). We offer an extension to the larly important in high dimensions when the pres- Bayesian elastic net model proposed by Kyung et al. ence of related variables is more likely (Daye and Jeng (2010) by incorporating available predictor-specific 2009). For example, regularization through a stan- information into the model in a hierarchical fashion. dard latent class approach (i.e., across predictors) or Before we present the details of the proposed model, with LASSO may fail to pick more than one predictor we offer a brief motivation for our selected statistical from a group of highly correlated variables (Zou and approach. Hastie 2005, Srivastava and Chen 2010). Normally, when the number of observations sig- Another class of methods that can potentially address the above conditions is based on a two-group nificantly exceeds the number of predictors, estima- discrete mixture prior with one mass concentrated tion/evaluation of the effects of individual predictors at zero and another somewhere else—also known as is straightforward. However, when the number of the so-called “spike-and-slab” prior (e.g., Mitchell and predictors is substantially larger than the number Beauchamp 1988). This class of models is generally of observations, estimation of individual parameters estimated by associating each predictor with a latent becomes problematic. The model becomes saturated indicator that determines group assignment (e.g., when the number of predictors reaches the number George and McCulloch 1993, 1997; Brown et al. 2002). of observations. Using terminology proposed by Naik Bayesian methods are commonly used to explore the et al. (2008), the paid search data we use fall into a space of possible models and to determine the best × category of so-called VAST (Variables Alternatives model as the highest frequency model (stochastic × × Subjects Time) matrix arrays where the V dimen- search variable selection, or SSVS). Although this class sion is very large compared with all other dimensions. of methods has been shown to perform well in a The problem of having a large number of predictors “typical” (n > p5 setting (e.g., Gilbride et al. 2006), coupled with a relatively low number of observations the application to a “large p, small n” context is not is referred to as the “large p, small n” problem (West without limitations, and computational issues arise 2003). Naik et al. (2008) review approaches to the for large p (Griffin and Brown 2007).5 “large p, small n” problem and identify the follow- The Bayesian elastic net belongs to the class of reg- ing three broad classes of methods—regularization, ularization approaches and is designed to handle the inverse regression, and factor analysis. The following above-mentioned three conditions. Like Ridge regres- conditions, found in our empirical application, make sion, it allows more than n predictors to be selected these approaches unsatisfactory: (1) the focus of the in the “best” model. Like LASSO, it offers a sparse analysis is on identifying all relevant predictors that solution (not all available predictors are selected) and help to explain the dependent variable, rather than just on forecasting; (2) more than n predictors should be 5 As p becomes large, it becomes hard to identify high-frequency allowed to be selected; and (3) subsets of predictors models—the key outcome of the SSVS approach (Barbieri and may be strongly correlated. The first condition rules out Berger 2004). Consequently, for high-dimensional problems the methods that reduce dimensionality by combining posterior mean or BMA (Bayesian model averaging) estimator is predictors in a manner that does not allow recovery typically used. Because a BMA estimator naturally contains many small coefficient values, it does not offer a sparse solution (Ish- of predictor-level effects. For example, methods such waran and Rao 2010), which is of interest in our case. Though BMA as partial inverse regression (Li et al. 2007), principal improves on variable selection in terms of prediction, its drawback components (e.g., Stock and Watson 2006), or factor is that it does not lead to a reduced set of variables (George 2000). Rutz, Trusov, and Bucklin: Modeling Indirect Effects of Paid Search Advertising 650 Marketing Science 30(4), pp. 646–665, © 2011 INFORMS
allows for grouping across predictors; strongly corre- 000 exp4−l × 5 wl = 1 lated predictors are added to or excluded from the PL000 = exp4−k × 5 model together. In the modeling situation described k 1 by the above conditions and faced by firms in the where DirVt is the number of direct type-in visitors day month paid search domain, the Bayesian elastic net offers a at time t, 8It 1 It 9 is a set of daily and monthly very good match as a state-of-the-art approach to the indicator variables, OrgVt is the number of organic k “large p, small n” problem. visitors at time t, PsVt is the number of paid search visitors from keyword k at time t, K is the number of 2.3. Bayesian Elastic Net 2 keywords the firm is using, t ∼ N 401 5 and 1 = We specify as our dependent variable the number day day month month 0 0 41 ··· 6 1 ··· 4 5 , , , = 41 ··· K 5 , and of direct type-in visitors to the website at a given wDirV , wOrgV , wPsV , LDirV , LOrg, LPsV , , and are time t (DirV). In our data this activity is tracked at parameters (vectors) to be estimated. the daily level. We investigate the effect, if any, of A challenge in estimating Equation (1) is that most prior paid search visits (PsV) on the number of direct firms bid on hundreds or even thousands of key- type-in visitors. Our model allows for keyword-level words. Based on the available daily data provided by effects, which we see as an empirical question.6 We search engines, firms would need to run search cam- also account for seasonal effects, i.e., using daily and paigns for multiple years before one would be able to monthly dummies, and for variations in overall web- estimate the model given in Equation (1). For exam- site traffic patterns by using lagged organic visitors7 ple, our focal company bid on over 15,000 keywords. (OrgV) as a control. Finally, potential feedback effects One way to address this problem is to estimate the are incorporated by including lagged direct type-in model specified in Equation (1) for each individual visitors. We use an exponentially smoothed lag struc- keyword at a time (controlling for all other variables). ture (ESmo), allowing us to estimate only one param- A problem with this approach is that the daily traf- eter per keyword.8 Our model is given by fic generated by different keywords is correlated. The 6 4 bias from omitting the effects of other keywords in P day day P month month ESmo DirVt = + i It + i It + DirVt i=1 i=1 Equation (1) is likely to produce inflated estimates for K (1) the role of any given individual keyword in generat- + ESmo + P ESmo1k + OrgVt kPsVt t1 ing returning direct type-in visitors. k=1 To address this challenge, we propose a Bayesian elastic net (Kyung et al. 2010) that discriminates LDirV LDirV ESmo X DirV X DirV among keywords based on their effectiveness using DirVt = wl DirVt−l1 wl = 11 l=1 l=1 a Laplace–Gaussian prior to address the data defi- DirV DirV DirV 0 ciency. The prior ensures that “weaker” predictors w = w ··· w 1 1 LDirV get shrunk toward zero faster than “stronger” predic- LOrgV LOrgV tors (see also Park and Casella 2008). A BEN allows ESmo X OrgV X OrgV OrgVt = wl OrgVt−l1 wl = 11 us to estimate keyword-level parameters even when l=1 l=1 the number of keywords is significantly larger than 0 OrgV OrgV the number of observations (n p5. The Laplace– wOrgV = w ··· w 1 1 LOrgV Gaussian prior encourages a grouping effect, where
LPsV LPsV strongly correlated predictors are selected or dropped ESmo1k X PsV k X PsV PsVt = wl PsVt−l1 wl = 11 out of the model together. Finally, it allows for the l=1 l=1 best model to include more than n predictors. Our PsV PsV PsV 0 modeling approach also has been designed to directly w = w1 ··· wL 1 PsV address concerns expressed by practitioners. We have been told that “when we turn off sets of keywords, 6 A consistent finding of paid search research (e.g., Ghose and Yang 2009) is heterogeneity in response across keywords. One pos- we see a drop in the number of visitors that come sible reason could be consumer “self-selection” into keywords— to the site directly. That drop has a time delay of a consumers reveal information on preferences through the use of couple of days.” The goal of our model is to iden- keywords. tify which keywords drive this effect, allowing us to 7 Organic visits can have a similar effect as paid visits. A consumer measure the additional indirect value these particu- accesses the site through an organic link and might be coming back lar keywords have above and beyond generating site as a direct visitor going forward. visits via paid search click-through. 8 Alternatively, we could specific the lag structure explicitly. How- Along with data on traffic, firms have detailed key- ever, this would require us to estimate multiple (i.e., the number of lags) parameters per keyword—a steep increase for cases in which word lists. In general, keywords can be decomposed the firm bids on thousands of keywords. To be consistent, we also into certain common semantic characteristics, e.g., a use exponential smoothing on DirV and OrgV. keyword is related to car financing. We propose a Rutz, Trusov, and Bucklin: Modeling Indirect Effects of Paid Search Advertising Marketing Science 30(4), pp. 646–665, © 2011 INFORMS 651 novel text mining approach to extract semantic char- two potential sources: managerial reaction (feedback acteristics from keywords (for details, see the sec- mechanism) to changes in direct traffic and omit- tion on empirical application), allowing us to leverage ted variable bias. We probed extensively for endo- semantics to shed light on why some keywords are geneity, and none of the tests we conducted indicate effective and others are not. To this end, we propose that endogeneity is of concern (for details, please see a hierarchical model that can incorporate predictor- Web Appendix B in the electronic companion). To specific information such as semantics. Note that an address concerns with regard to serial correlation, we elastic net shrinks all predictors to zero—a typical include an AR415 error process in our model (e.g., feature of regularization approaches. Thus, contrary Koop 2003) implemented in a fully Bayesian frame- to the way in which hierarchies on parameters have work (for details, please see Web Appendix A in the been implemented in the marketing literature, we electronic companion). cannot impose a hierarchical mean on the keyword- 2.4. Model Comparisons specific parameter k in the standard fashion. In a BEN, parameters determine the amount of shrink- We compare our proposed model against two bench- age toward zero and thus have a significant influence marks. First, we model direct type-in visits with- on the parameter estimate. Our hierarchy is based out including paid search as an explanatory variable on these parameters and we define the BEN prior as (NO KEYWORDS). This is implemented by estimat- follows: ing Equation (1) without PsV. Second, we model direct type-in visits assuming all keywords have the " K K # 1 X 1 X 2 same (homogeneous) effect, i.e., setting k = 1 ∀ k ∝ exp −√ 1k k − 2kk 0 (2) 2 2 k=1 2 k=1 (ALL KEYWORDS). Our proposed model follows the specification described in Equations (1)–(4) above We define the hierarchy on the keyword-specific and is labeled BEN. Moving from NO KEYWORDS parameters 1k and 2k as follows: to ALL KEYWORDS enables us to assess the pres- ence and magnitude of the indirect effect but at 1k ∼ gamma4c1k1 d5 and 2k ∼ gamma4c2k1 d51 (3) an aggregate (across all keywords) level. Moving from ALL KEYWORDS to BEN enables us to deter- where c and c are parameters to be estimated for 1k 2k mine the keyword-level effect on direct type-in vis- all keywords.9 We use a log-normal model to repre- its and the value of modeling the indirect effect on a sent the effect of the semantic keyword characteristics keyword level. on c1k and c2k: