
HOW DOES MOBILE APP FAILURE AFFECT PURCHASES IN ONLINE AND OFFLINE CHANNELS?

Unnati Narang Venkatesh Shankar Sridhar Narayanan

December 2020

* Unnati Narang ([email protected]) is Assistant Professor, University of Illinois, Urbana-Champaign, Venkatesh Shankar ([email protected]) is Professor of Marketing and Coleman Chair in Marketing and Director of Research, Center for Retailing Studies at the Mays Business School, Texas A&M University, and Sridhar Narayanan ([email protected]) is an Associate Professor of Marketing at the Graduate School of Business, Stanford University. We thank the participants at the ISMS Marketing Science conference, the UTDFORMS conference, and research seminar participants at the University of California, Davis, the University of Toronto, the University of Illinois, Urbana-Champaign, and the University of Texas at Austin for valuable comments.

Abstract

Mobile devices account for a majority of transactions between shoppers and marketers. Branded retailer mobile apps have been shown to significantly increase purchases across channels. However, app service failures can lead to decreases in app usage, making app failure prevention and recovery critical for retailers. Does an app failure influence purchases in general and within the online channel in particular? Does it have any spillover effects on other channels? What potential mechanisms explain, and what factors moderate, these effects? We examine these questions empirically, employing a unique dataset from an omnichannel retailer. We leverage a natural experiment of exogenous systemwide failure shocks in this retailer’s mobile app and related data to examine the causal impact of app failures on purchases in all channels using a difference-in-differences approach. We investigate two potential mechanisms behind these effects, channel substitution and preference dilution. We also analyze shopper heterogeneity in the effects using a theoretically driven moderator approach as well as a data-driven machine learning method. Our analysis reveals that although an app failure has a significant overall negative effect on shoppers’ frequency, quantity, and monetary value of purchases across channels, the effects are heterogeneous across channels and shoppers. Interestingly, the decreases in purchases across channels are driven by purchase reductions in brick-and-mortar stores and not in the online channel. A significant decrease in app engagement post failure explains the overall drop in purchases. Brand preference dilution after app failure explains the fall in store purchases, while channel substitution post failure explains the preservation of purchases in the online channel. Surprisingly, purchases rise for a small group of shoppers who were close to the retailer’s store at the time of app failure. Furthermore, shoppers with a higher monetary value of past purchases and less recent purchases are less sensitive to app failures. The results suggest that app failures lead to an annual revenue loss of about $2.4-$3.4 million for the retailer in our data. About 47% of shoppers account for about 70% of the loss. We outline targeted failure prevention and service recovery strategies that retailers could employ.

Keywords: service failure, mobile marketing, mobile app, retailing, omnichannel, difference-in-differences, natural experiment, causal effects

INTRODUCTION

Mobile commerce has seen tremendous growth with mobile devices accounting for a majority of interactions between shoppers and marketers. This growth has accelerated through the rapid increase of smartphone penetration – about 3.2 billion people (41.5% of the global population) used smartphones in 2019.1 Mobile applications (henceforth, apps) have emerged as an important channel for retailers as they have been found to increase engagement and purchases across channels (e.g., Kim et al. 2015; Narang and Shankar 2019; Xu et al. 2016).

While retailers have widely embraced mobile apps, there is little understanding about how service failures in this channel affect shopper behavior. This issue is important because unlike other channels, the mobile channel is highly vulnerable to failures. The diversity of mobile operating systems (e.g., iOS, Android), devices (e.g., mobile phone and tablet), and versions of hardware and software and their constant use across a variety of mobile networks often result in app failures. Failures in a retailer’s mobile app have the potential to negatively affect shoppers’ engagement with the app and their shopping outcomes within the mobile channel. In addition, app failures may have spillover effects across other channels due to both substitution of purchases across channels and dilution of preference for the retailer brand. Understanding how and why failures impact shoppers’ behavior across channels is important for retailers.

Preventing and recovering from app failures is critical for managers because more than 60% of shoppers abandon an app after experiencing failure(s) (Dimensional Research 2015). In 2016, app crashes were the leading cause of system failures, accounting for 65% of all iOS failures (Blancco 2016). About 2.6% of all app sessions result in a crash, suggesting about 1.5 billion app failures across 60 billion app sessions annually (Computerworld 2014). Given the extent of these app failures and their potential damage to firms’ relationships with customers, determining the impact of app failures is important for formulating preventive and recovery strategies.

1 Source: Statista report on smartphone penetration (https://tinyurl.com/hy2skfk), last accessed 18 November 2020.

Despite the importance of app failures, not much is known about their impact on purchases. While app crashes on a shopper’s mobile device have been shown to negatively influence app engagement (e.g., restart time, browsing duration, and activity level; Shi et al. 2017), the relationship between app failures and subsequent purchases has not been studied. Furthermore, a large proportion of shoppers use both online (desktop website, mobile website, and mobile app) and offline (brick-and-mortar) retail channels. However, we do not know much about the impact of app failures on shopping outcomes across channels (spillover effects).

From a theoretical standpoint, the potential mechanisms behind such effects within and across channels are important. How much of these effects arise due to channel substitution post failure? What portion of the effects can be attributed to dilution of preference for the retailer’s brand? Prior research has not addressed these interesting questions.

The effects of app failure may also differ across shoppers. Shoppers may be more or less negatively impacted by failures depending on factors such as their relationship with the firm (Chandrashekaran et al. 2007; Goodman et al. 1995; Hess et al. 2003; Knox and van Oest 2014; Ma et al. 2015) and their prior use of the firm’s digital channels (Cleeren et al. 2013; Liu and Shankar 2015; Shi et al. 2017). It is important for managers to better understand how the effects of failure vary across shoppers so that they can devise targeted preventive and recovery strategies. Yet not much is known about heterogeneity in the effects of app failure.

Our study fills these crucial gaps in the literature. We quantify and explain the impact of app failures on managerially important outcomes, such as the frequency, quantity, and monetary value of purchases in online and offline channels. We address four research questions:

• What are the effects of a service failure in a retailer’s mobile app on the frequency, quantity, and monetary value of subsequent purchases by shoppers?
• What are the effects of a service failure in an app on purchases in the online and offline channels?
• What potential mechanisms explain the effects of an app service failure on purchases?
• How do these effects vary across shoppers, or what factors moderate these effects?

Estimation of the causal effects of app failures on shopping outcomes is challenging. It is typically hard to do using observational data due to the potential endogeneity of app failures. This endogeneity may stem from an activity bias: shoppers who use the app more frequently are also more likely to experience failures than other shoppers. Therefore, failure-experiencing shoppers may differ systematically from non-experiencers in their shopping behavior, leading to potentially spurious correlations between failures and shopping behavior.

Panel data may not necessarily mitigate this issue because time-varying app usage/shopping activity is potentially correlated with time-varying app failures for the same reason. That is, shoppers are likely to engage more with the app when they are likely to purchase, potentially leading to more failures than in periods when shoppers engage less with the app. Additionally, the nature of activity on the app may be correlated with failures. For instance, a negative correlation between failures and purchases may result from a greater incidence of failures on the app’s purchase page than on other pages. Thus, it is hard to make the case that correlations between app failures and shopping outcomes in observational data have a causal interpretation.

The gold standard among the methods available to uncover the causal impact of service failures is a randomized field experiment. However, such an experiment would be impractical in this context because, for ethical reasons, a retailer is unlikely to deliberately induce failures in an app even for a small subset of its shoppers. Alternatively, we could use an instrumental variable approach to control for endogeneity. However, it is hard to come up with instrumental variables that are valid and exhibit sufficient variation to address the endogeneity concerns in this context.

We overcome the estimation challenges and mitigate the potential endogeneity of app failures using the novel features of a unique dataset from a large omnichannel retailer of video games, consumer electronics, and wireless services. We exploit a natural experiment of server error-induced systemwide exogenous failures in the retailer’s mobile app to estimate the causal effects of app failure. Conditional on signing in on the day of the failure, whether a user experienced a failure or not was a function of whether they attempted to use the app during the time window of the failure, which they could not have anticipated in advance. We take advantage of the resulting quasi-randomness in incidences of failures to estimate the causal effects of failures on the mobile app. We employ a difference-in-differences (DID) approach that compares the pre- and post-failure outcomes for the failure experiencers with those of failure non-experiencers to estimate the effects of the app failure. Through a series of robustness checks, we confirm that failure non-experiencers act as a valid control for failure experiencers, providing the exogenous variation needed to find causal answers to our research questions.

We investigate the potential mechanisms and moderators of the effects of failures on shopping behavior by exploiting the panel nature of our dataset. We test for the moderating effects of factors such as relationship with the firm and prior digital channel use on the effects of service failures. These factors have been explored for services in general (e.g., Hansen et al. 2018; Ma et al. 2015) but not in the digital or mobile app contexts. In addition, we recover the heterogeneity of effects at the individual level using data-driven machine learning methods.

Our results show that app failures have a significant overall negative effect on shoppers’ frequency, quantity, and monetary value of purchases across channels, but the effects are heterogeneous across channels and shoppers. A significant decrease in app engagement (e.g., number of app sessions, dwell time, and number of app features used) post failure explains the overall drop in purchases. Interestingly, the overall decreases in purchases across channels are driven by purchase reductions in stores, rather than in the online channel. The fall in store purchases after app failure is consistent with brand preference dilution, while the preservation of purchases in the online channel is consistent with channel substitution. Shoppers experiencing the failure when they are farther away from purchase (e.g., browsing product information) experience greater negative effects of a failure than those closer to purchase (e.g., checking out in the app). Surprisingly, the basket size and value of purchases rise for a small group of shoppers who were close to the retailer’s store at the time of app failure. Furthermore, shoppers with a higher monetary value of past purchases and less recent purchases are less sensitive to app failures. Finally, most shoppers (96%) react negatively to failures, but about 47% of these shoppers contribute about 70% of the losses in annual revenues, which amount to $2.4-$3.4 million.

In the remainder of the paper, we first discuss the literature related to service failures, cross-channel spillovers, and consumer interaction with mobile apps. Next, we discuss the data in detail, summarizing them and highlighting their unique features. Subsequently, we describe our empirical strategy, lay out and test the key identification strategy, and conduct our empirical analysis of the effects of app failures. We explore the potential mechanisms behind the results. We then conduct robustness checks to rule out alternative explanations. We conclude by discussing the implications of our results for managers.

BACKGROUND AND RELATED LITERATURE

Services Marketing and Service Failures

The nature of services has evolved considerably since academics first started to study services marketing. For a long time, the production and consumption of services remained inseparable, primarily because services were performed by humans. However, of late, technology-enabled services have risen in importance, leading to two important shifts (Dotzel et al. 2013). First, services that can be delivered without human or interpersonal interaction have grown tremendously. Online and mobile retailing no longer require shoppers to interact with human associates to make purchases. Second, closely related to this idea is the fact that services are increasingly powered by technologies such as mobile apps that allow anytime-anywhere access and convenience.

With growing reliance on technologies for service delivery and the complexity of the technology environment in which these services are delivered, service failures are attracting greater attention. A service failure can be defined as service performance that falls below customer expectations (Hoffman and Bateson 1997). Service failures are widespread and are expensive to mend. Service failures resulting from deviations between expected and actual performance damage customer satisfaction and brand preference (Smith and Bolton 1998). Post-failure satisfaction tends to be lower even after a successful recovery and is further negatively impacted by the severity of the initial failure (Andreassen 1999; McCollough et al. 2000). In interpersonal service encounters, human interactions and employee behaviors influence both the effect of the failure and recovery (Bitner et al. 1990; Meuter et al. 2000). In technology-based encounters, such as those in e-tailing and with self-service technologies (e.g., automated teller machines [ATMs]), the opportunity for human interaction is typically small after experiencing failure (Forbes et al. 2005; Forbes 2008). However, there may be significant heterogeneity in how consumers react to service failures (Halbheer et al. 2018).

In the mobile context, specifically for mobile apps, it is difficult to predict the direction and extent of the impact of a service failure on shopping outcomes. First, mobile apps are accessible at any time and in any location through an individual’s mobile device. On the one hand, because a shopper can tap, interact, engage, or transact multiple times at little additional cost on a mobile app, the shopper may treat any one service failure as acceptable without significantly altering her subsequent shopping outcomes. Such an experience differs from that with a self-service technological device such as an ATM, which may require the shopper to travel to a specific location or incur other hassle costs that may not exist in the mobile app context. On the other hand, the costs of switching to a competitor are also much lower in the mobile app context, where a typical shopper uses and compares multiple apps. Thus, a service failure in any one app may aggravate the shopper’s frustration with the app, leading to strong negative effects on outcomes such as purchases from the relevant app provider.

Second, a mobile app is one of the many touchpoints available to shoppers in today’s omnichannel shopping environment. Thus, a shopper who experiences a failure in the app could move to the web-based channel or even the offline or store channel. In such cases, the impact of a failure on the app could be zero or even positive (if the switch to the other channel leads to greater engagement of the shopper with the retailer). By contrast, if the channels act as complements (e.g., if the shopper uses one channel for researching products and another for purchasing) or if the failure impacts the preference for the retailer brand, a failure in one channel could impede the shopper’s engagement in other channels. Thus, it is difficult to predict the effects of app failure, in particular how they might spill over to other channels.

Channel Choice and Channel Migration

A shopper’s experience in one channel can influence her behavior in other channels. Prior research on cross-channel effects is mixed, showing both substitution and complementarity effects, leading to positive and negative synergies between channels (e.g., Avery et al. 2012; Pauwels and Neslin 2015). The relative benefits of channels determine whether shoppers continue using existing channels or switch to a new channel (Ansari et al. 2008; Chintagunta et al. 2012). When a bricks-and-clicks retailer opens an offline store or an online-first retailer opens an offline showroom, its offline presence drives sales in online stores (Bell et al. 2018; Wang and Goldfarb 2017).2 This is particularly true for shoppers in areas with low brand presence prior to store opening and for shoppers with an acute need for the product. However, local shoppers may switch from purchasing online to offline after an offline store opens, even becoming less sensitive to online discounts (Forman et al. 2009). In the long run, the store channel shares a complementary relationship with the Internet and catalog channels (Avery et al. 2012).

While the relative benefits of one channel may lead shoppers to buy more in other channels, the costs associated with one channel may also have implications for purchases beyond that channel. In a truly integrated omnichannel retailing environment, the distinctions between physical and online channels blur, with the online channel representing a showroom without walls (Brynjolfsson et al. 2013). Mobile technologies are at the forefront of these shifts. More than 80% of shoppers use a mobile device while shopping even inside a store (Google M/A/R/C Study 2013). As a result, if there are substantial costs associated with using a mobile channel (e.g., those induced by app failures), such costs may spill over to other channels. If shoppers use the different channels in complementary ways, the disruption of one of those channels could negatively impact their engagement with the other channels as well. However, if shoppers treat the channels as substitutes, failures in one channel may drive them to purchase in another channel. If an app failure dilutes shoppers’ preference for the retailer brand, it may lead to negative consequences across channels. Overall, the direction of the effect of app failures on outcomes in other channels, such as brick-and-mortar stores and online channels, depends on which of these competing and potentially co-existing mechanisms is dominant.

2 A bricks-and-clicks retailer is a retailer with both offline (“bricks”) and online (“clicks”) presence.

Mobile Apps

The nascent but evolving research on mobile apps shows positive effects of mobile app channel introduction and use on engagement and purchases in other channels (Kim et al. 2015; Narang and Shankar 2019; Xu et al. 2016) and on coupon redemptions (Andrews et al. 2015; Fong et al. 2015; Ghose et al. 2019) under different contingencies.

To our knowledge, only one study has examined the effect of crashes in a mobile app on shoppers’ app use. Shi et al. (2017) find that while crashes have a negative impact on future engagement with the app, this effect is lower for those with greater prior usage experience and for less persistent crashes. However, while they look at subsequent engagement of the shoppers with the mobile app, they do not examine purchases. Thus, our research adds to Shi et al. (2017) in several ways.

First, we focus on estimating the causal effects of failure. To this end, we exploit the random variation in failures induced by systemwide failures. Second, we quantify the value of app failure’s effects on subsequent purchases. The outcomes we study include the frequency, quantity, and value of purchases, while the key outcome in that study is app engagement. Third, we examine the cross-channel effects of mobile app failures, including in physical stores, while Shi et al. (2017) study subsequent engagement with the app provider only within the app. Finally, we explore the mechanisms behind the effects of failure, examine the moderating effects of relationship with the retailer and prior digital channel use, and examine heterogeneity in shoppers’ sensitivity to failures using a machine learning approach.

To summarize, our study (1) focuses on the effect of app failure on purchases, (2) quantifies the effects on multiple outcomes such as frequency, quantity, and monetary value of purchases, (3) addresses the outcomes in each channel and across all channels (substitution and complementary effects), and (4) uncovers the mechanisms behind and moderators of the effects of app failure on shopping outcomes and heterogeneity in effects across shoppers. All these characteristics are novel, contributing to the research streams on services marketing, channel choice, and mobile apps.

RESEARCH SETTING AND DATA

Research Setting

We obtained the dataset for our empirical analysis from a large U.S.-based retailer. In the following paragraphs, we describe the retailer, the mobile app, and the channel sales mix.

The retailer sells a variety of products, including software such as video games, hardware such as video game consoles and controllers, downloadable content, consumer electronics, and wireless services, and serves 32 million customers. The gaming industry is large ($99.6 billion in annual revenues), and the retailer is a major player in this industry, offering us a rich setting. The retailer has a large offline presence and, in this respect, is similar to Walmart, PetSmart, or any other brick-and-mortar chain with an omnichannel strategy. The retailer’s primary channel is its store network comprising 4,175 brick-and-mortar stores across the U.S. Additionally, it has a large ecommerce website and the mobile app that is the focus of our study.

The app allows shoppers to browse the retailer’s product catalog, get deals, order online through a mobile browser, locate nearby stores, and make purchases through the app itself. The app is typical of the mobile apps of large retailers (e.g., PetSmart, Costco) in features and consumer interactions. The growth in the adoption of the app has also been similar to that of many large retailers: the app adoption rate started small and grew over time. Figure 1 shows some screenshots from the app.

Figure 1 APP SCREENSHOTS

The online and offline channel sales mix of the retailer in our data is typical of most large retailers. About 76% of the total sales of the top 100 largest retailers in the U.S. come from similar retailers with a store network of 1,000 or more stores (National Retail Federation 2018). Most large retailers have a predominant brick-and-mortar presence. For these retailers, while most of the transactions and revenues come from the offline channels, online sales exhibit rapid growth. For example, Walmart’s online revenues constitute 3.8% of all revenues, 1.3% of all PetSmart’s sales come from the online channel, Home Depot generates 6.8% of all revenues from ecommerce, and 5.4% of Target’s sales are through the online channel.3 For the retailer in our data, online sales comprised 10.2% of overall revenues, somewhat higher than that for similar large retailers. Furthermore, about 26% of the shoppers bought online in the 12 months before the failure event we study. The retailer’s online sales displayed a 13% average annual growth rate in the last five years, similar to these retailers, which also exhibited double-digit growth (Barron’s 2018). Its annual online sales revenues are also substantial at $1.1 billion. Therefore, our research context offers a rich setting to examine the cross-channel effects of a mobile app failure.

3 Source: eMarketer Retail, https://retail-index.emarketer.com/

Data and Sample

We study the impact of a systemwide failure that occurred on April 11, 2018.4 The firm provided us with mobile app use data and transactional data across all channels for all the app users who logged into the app on the failure day. The online channel represents purchases at the retailer’s website, including those using the mobile browser. Nested within the app use data are data on events that shoppers experience, along with their timestamps. The mobile dataset recorded the app failure event as ‘server error.’ Thus, this event represents an exogenous app breakdown, and the data allow us to identify which of the shoppers who logged in experienced the systemwide app failure.

Table 1 provides the descriptive statistics for the variables of interest. Over a period of 14 days pre- and post-failure, shoppers make an average of a little less than one purchase comprising about 1.6 items for a value of about $43. In the 12 months before failure, shoppers make purchases worth $623 and, on average, buy .66 times in the online channel. Overall, 52% of the shoppers experience the failure during our focal failure event.

Table 1 SUMMARY STATISTICS

Variable                               Mean      Std. dev.
Frequency of purchases                 .82       1.34
Quantity of purchases                  1.61      3.32
Value of purchases ($)                 43.31     96.42
App failure/Failure experiencer        .52       .50
Recency of past purchases (in days)    -45.68    68.83
Value of past purchases ($)            629.60    699.38
Frequency of past online purchases     .66       1.97

Notes: These statistics of the variables are over the 14 days pre- and post-failure. The past purchases are computed over a one-year period. N = 273,378.

4 We verified that this failure was systemwide and exogenous through our conversations with company executives.

EMPIRICAL STRATEGY

Overall Empirical Strategy

As outlined earlier, we leverage the exogenous systemwide shock to estimate the causal effect of app failure on shopping outcomes. The main idea behind our empirical approach is that conditional on the attempted usage of the app on the day of the failure, the experience of a failure by a specific shopper is random. We examine this assumption in the data by testing for balance between shoppers who experience a failure and those who do not, using a set of pre-failure variables. We find no systematic difference in these variables between shoppers who experienced failures and those who did not, supporting our identification strategy. To determine the treatment effect of a failure, we conduct a DID analysis, comparing the post-failure behaviors with the pre-failure behaviors of shoppers who logged in on the day of the failure and experienced it (akin to a treatment group) relative to those who logged in on that day but did not experience the failure (akin to a control group).

To analyze the treatment effects within and across channels, we repeat this analysis with the same outcome variables separately for the offline and online channels. To understand the underlying mechanisms for the effects, we examine two explanations, brand preference dilution and channel substitution, using the data on shoppers’ app engagement, closeness to purchase, location at the time of failure, time to next purchase, and shipping costs to check for consistency with these mechanisms. To analyze heterogeneity in treatment effects, we first perform a moderator analysis using a priori factors identified in the literature, such as prior relationship strength and digital channel use, followed by a data-driven machine learning (causal forest) approach to fully explore all sources of heterogeneity across shoppers. Finally, we carry out multiple robustness checks.

Exogeneity of Failure Shock

To verify that there is no systematic difference between shoppers who experience the failure shock and those who do not, we examine two types of evidence. First, we present plots of shopping behavior trends for both failure experiencers and non-experiencers in the 14 days before the app failure. Figure 2 depicts the monetary value of daily purchases by those who experienced the failure and those who did not. The purchase trends in the pre-period are parallel for the two groups (p > .10), providing assurance that these shoppers do not systematically differ across the two groups. The trends are similar for the frequency and quantity of purchases, and the proportion of online purchases (see Web Appendix Figure D1).

Figure 2 COMPARISON OF FAILURE-EXPERIENCERS’ AND NON-EXPERIENCERS’ PURCHASES 14 DAYS BEFORE FAILURE

Note: The red line represents failure experiencers, while the solid black line represents the failure non-experiencers.

Second, we compare the failure experiencers with non-experiencers across shopping behaviors, such as recency of purchases and frequency of past online purchases (see Figure 3) and past app usage sessions (see Figures 4 and 5). We also compare their observed demographic variables, such as gender and membership in the loyalty program. We do not find any significant differences in these variables across the groups.

Figure 3 COMPARISON OF FAILURE-EXPERIENCERS AND NON-EXPERIENCERS

[Bar chart comparing treated (failure-experiencing) and control shoppers on gender (female), loyalty program membership, recency of past purchase/30, and past online purchase frequency; the group means are similar on all four variables.]

Note: Loyalty program level represents whether shoppers were enrolled (=1) or not (=0) in an advanced reward program.

Figure 4 PAST DAILY AVERAGE APP SESSION TRENDS OF FAILURE EXPERIENCERS VS. NON-EXPERIENCERS

Figure 5 PAST DAILY AVERAGE NON-PURCHASE-RELATED APP SESSION TRENDS OF FAILURE EXPERIENCERS VS. NON-EXPERIENCERS

Note: Non-purchase-related app sessions involve browsing pages whose actions are farther from purchase, such as browsing products or obtaining store-related information.

To summarize, we find no systematic differences between the failure experiencers and those who do not experience failures, either in their trends of outcomes before the failure event or in other variables that we observe prior to the event. This pre-trend analysis gives us confidence in the validity of our empirical strategy.
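For concreteness, this balance comparison can be run as a series of two-sample tests on pre-failure covariates. The following is a minimal sketch in Python; the file name and column names (e.g., failure_experiencer) are illustrative placeholders rather than the actual variable names in our dataset.

```python
# Minimal sketch of the covariate balance check between failure experiencers
# (treated) and non-experiencers (control); file and column names are hypothetical.
import pandas as pd
from scipy import stats

shoppers = pd.read_csv("shoppers.csv")  # one row per shopper (hypothetical file)
covariates = ["gender_female", "loyalty_program",
              "recency_of_past_purchases", "past_online_purchase_frequency"]

for col in covariates:
    treated = shoppers.loc[shoppers["failure_experiencer"] == 1, col]
    control = shoppers.loc[shoppers["failure_experiencer"] == 0, col]
    # Welch's two-sample t-test for a difference in means between the groups
    t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
    print(f"{col}: t = {t_stat:.2f}, p = {p_value:.3f}")
```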

Econometric Model and Identification

As described in the previous section, we estimate the effects of app failure on shopping outcomes by relying on a quasi-experimental research design with a DID approach (e.g., Angrist and Pischke 2009). Specifically, we leverage a systemwide failure shock and compare app users who experience this shock with those who do not, given that they accessed the app on the day of the failure.

Our two-period linear DID regression takes the following form:

(1) $Y_{it} = \alpha_0 + \alpha_1 F_i + \alpha_2 P_t + \alpha_3 F_i P_t + \vartheta_{it}$

where i indexes shoppers, t indexes time periods (pre- or post-failure), $Y_{it}$ is the outcome variable (frequency, quantity, or monetary value of purchases), $F_i$ is a dummy variable denoting treatment (1 if shopper i experienced the app failure and 0 otherwise), $P_t$ is a dummy variable denoting the period (1 for the period after the systemwide app failure and 0 otherwise), the $\alpha$'s are coefficients, and $\vartheta_{it}$ is an error term. We cluster standard errors at the shopper level, following Bertrand et al. (2004).5 The coefficient of $F_i P_t$, i.e., $\alpha_3$, is the treatment effect of the app failure.

The assumptions underlying the identification of this treatment effect are: (1) the failure is random conditional on a shopper logging into the app during the time window of the failure shock, and (2) the change in outcomes for the non-failure-experiencing app users is a valid counterfactual for the change in outcomes that would have been observed for failure-experiencing app users in the absence of the failure.

5 Because we analyze the short-term effect of a service failure (14 and 30 days), we do not have an adequate number of observations per shopper post failure to estimate shopper fixed effects in our analysis.
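For concreteness, Equation (1) with shopper-clustered standard errors can be estimated as follows. This is a minimal sketch in Python's statsmodels, assuming a shopper-by-period data frame; the file and column names are illustrative rather than the actual variables in our data.

```python
# Minimal sketch of the two-period DID regression in Equation (1);
# file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("shopper_periods.csv")  # one row per shopper x period
# F = 1 if the shopper experienced the failure; P = 1 for the post-failure period
model = smf.ols("value_of_purchases ~ F * P", data=panel)
# Cluster standard errors at the shopper level, following Bertrand et al. (2004)
result = model.fit(cov_type="cluster", cov_kwds={"groups": panel["shopper_id"]})
print(result.summary())  # the F:P coefficient is the DID treatment effect (alpha_3)
```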

EMPIRICAL ANALYSIS RESULTS

Relationship between App Failures and Purchases

We first examine the overall differences in post-failure behaviors between shoppers who experienced failures and those who did not, using model-free evidence from 14 days pre- and post-failure. We choose a 14-day window because this two-week period is close to the mean interpurchase time of 11 days in our dataset and equally captures any “day of the week” effects in shopping.6

Table 2 reports the raw comparisons of post-failure vs. pre-failure purchase outcome variables for both failure experiencers (70,568 treated) and non-experiencers (66,121 control) among the set of consumers who accessed the app on the day of the failure. We find that post-failure, shoppers who experienced the systemwide failure had .04 (p < .001) lower purchase frequency, .07 (p < .001) lower purchase quantity, and $2.42 (p < .001) lower monetary value than shoppers who did not experience the failure. A simple comparison of shopping outcomes across the two groups shows that the average monetary value of purchases increased by 81.8% ($30.41 to $55.28) for failure experiencers, while it increased by 87.6% ($30.75 to $57.70) for non-experiencers post failure relative to the pre-period (p < .001).7 Given our identification strategy, the diminished growth in the monetary value of purchases for failure experiencers relative to non-experiencers comes from the exogenous failure shock.

6 We also estimated a model with dynamic treatment effects for a longer period of four weeks pre- and post- the failure shock and found similar effects (see Figure 6 and Table D1).
7 The increasing sales trend between the pre- and post-period for both groups is partially due to the April 19 weekend in the post-period, which witnessed the release of a new game.

Table 2 MODEL-FREE EVIDENCE: MEANS OF OUTCOME VARIABLES FOR TREATED AND CONTROL GROUPS

Variable                            Treated pre   Treated post   Control pre   Control post
Frequency of purchases              .74           .89            .75           .93
Quantity of purchases               1.52          1.69           1.52          1.76
Value of purchases ($)              30.41         55.28          30.75         57.70
Frequency of purchases – Online     .03           .04            .03           .04
Quantity of purchases – Online      .05           .06            .05           .07
Value of purchases – Online ($)     1.34          2.93           1.50          3.17
Frequency of purchases – Offline    .70           .85            .71           .88
Quantity of purchases – Offline     1.47          1.63           1.47          1.69
Value of purchases – Offline ($)    29.07         52.35          29.25         54.53

Notes: These statistics are based on the 14 days pre- and post-failure. N = 273,378.

Main Diff-in-Diff Model Results

The results from the DID model in Table 3 show a negative and significant effect of app failure on the frequency ($\alpha_3$ = -.024, p < .01), quantity ($\alpha_3$ = -.057, p < .01), and monetary value of purchases ($\alpha_3$ = -2.181, p < .01) across channels. Relative to the pre-period for the control group, the treated group experiences a decline in frequency of 3.20% (p < .01), quantity of 3.74% (p < .01), and monetary value of 7.1% (p < .01).8

Table 3 DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES ACROSS CHANNELS

Variable                      Frequency of   Quantity of   Value of
                              purchases      purchases     purchases
Failure experiencer           -.024**        -.057**       -2.181**
  x Post shock (DID)          (.008)         (.020)        (.681)
Failure experiencer           -.021**        -.030         -.694*
                              (.007)         (.018)        (.302)
Post shock                    .178***        .236***       26.947***
                              (.006)         (.014)        (.497)
Intercept                     .750***        1.523***      30.755***
                              (.005)         (.012)        (.219)
R squared                     .004           .001          .018
Effect size                   -3.20%         -3.74%        -7.09%
Mean Y                        .82            1.61          43.31

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.

8 We calculate the percentage change by dividing the treatment coefficient by the intercept. For instance, the treatment coefficient for value of purchases (2.18) divided by the intercept (30.76) amounts to a 7.1% change.
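As a worked instance of this calculation, the Table 3 estimates for the value of purchases give:

$$\frac{\hat{\alpha}_3}{\hat{\alpha}_0} = \frac{-2.181}{30.755} \approx -7.1\%$$

which matches the effect size reported in Table 3.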

Next, we examine the channel spillover effects of app failures in greater depth. We split the total value of purchases into offline and online purchases. Table 4 reports the results for these alternative channel-based dependent variables. There is a negative and significant effect of app failure on the frequency ($\alpha_3$ = -.02, p < .01), quantity ($\alpha_3$ = -.05, p < .01), and monetary value of purchases ($\alpha_3$ = -2.09, p < .01) in the offline channel. Interestingly, we do not find a significant (p > .10) effect of app failure on any of the purchase outcomes in the online channel. Because there is no corresponding increase in the online channel and because overall purchases drop, we conclude that the decreases in overall purchases across channels are largely due to declines in in-store purchases.

Table 4 DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES BY CHANNEL

                            Offline                                      Online
Variable                    Frequency of  Quantity of  Value of          Frequency of  Quantity of  Value of
                            purchases     purchases    purchases         purchases     purchases    purchases
Failure experiencer         -.022**       -.055**      -2.088**          -.002         -.002        -.093
  x Post shock (DID)        (.008)        (.019)       (.660)            (.002)        (.003)       (.154)
Failure experiencer         -.018**       -.025        -.527             -.003*        -.005*       -.167**
                            (.006)        (.017)       (.293)            (.001)        (.002)       (.064)
Post shock                  .170***       .221***      25.275***         .009***       .015***      1.672***
                            (.005)        (.014)       (.482)            (.001)        (.002)       (.113)
Intercept                   .714***       1.470***     29.255***         .036***       .054***      1.500***
                            (.005)        (.012)       (.213)            (.001)        (.002)       (.048)
R squared                   .0038         .0001        .0169             .0016         .0002        .0003
Effect size                 -3.08%        -3.74%       -7.14%            -             -            -
Mean Y                      .78           1.56         41.08             .04           .06          2.23

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.

Mechanisms Behind the Effects of Failures on Shopping Outcomes

We now provide descriptive evidence for the potential mechanisms behind the results. The overall negative effect of app failure on shopping outcomes across channels could be due to decreases in intermediate outcomes such as shoppers’ engagement after failure. To explore this possibility, we examine the effect of app failure on app engagement variables such as the number of app sessions, the average dwell time per session, and the average number of app features used in each session. The results of the corresponding DID model appear in Table 5. The treatment effect of failure for each of the three variables is negative and significant (p < .001), suggesting that app failure is associated with diminished app engagement.

Table 5 DID MODEL RESULTS FOR APP ENGAGEMENT VARIABLES

Variable                    No. of app   Average dwell      Average no. of
                            sessions     time per session   app features used
Failure experiencer         -.689***     -7.444***          -4.678***
  x Post shock (DID)        (.005)       (.072)             (.024)
Failure experiencer         .651***      7.041***           4.508***
                            (.005)       (.065)             (.021)
Post shock                  -.624***     -3.525***          -2.558***
                            (.003)       (.047)             (.016)
Intercept                   .727***      4.654***           3.067***
                            (.003)       (.040)             (.013)
R squared                   .4211        .1833              .4843
Mean Y                      .57          4.61               2.91

Notes: Robust standard errors clustered by shoppers are in parentheses; the app engagement variables are measured 5 hours pre- and post-failure; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.

The differential effect of app failure across the channels could be explained by the co-occurrence of two countervailing forces: channel substitution and brand preference dilution. The channel substitution effect occurs when app failure experiencers move to the mobile web browser, the desktop website, or the physical store to complete their intended purchase. If shoppers switch channels to complete their intended purchase, we should not observe negative effects of the failure in the channels of their subsequent purchases. We may even see positive effects if the switch to the other channel leads to greater purchases in that channel than would have occurred in the online channel. Brand preference dilution happens when app failure experiencers get annoyed or dissatisfied with the retailer and lower their future purchases overall, including in the store. It is possible that the channel substitution and brand preference dilution effects operate when shoppers experience the app failure at different stages of the purchase funnel. Shoppers who are close to purchase at the time of app failure may quickly switch channels and complete their purchase through the mobile or desktop website forms of the online channel. However, shoppers who are far from purchase when the app fails may come to prefer the retailer brand less and buy less than they had planned to in the future, perhaps because they switch to competing retailers instead.

To explore the role of stage in the purchase funnel in explaining the differential effects of app failure, we first examine the effects of app failure across shoppers based on whether they are close to or far from purchase at the time of failure. For this analysis, we utilize information in the data about the type of page the shopper was on when the failure occurred. Table 6 reports the DID model results when the app failure occurred on purchase-related and non-purchase-related pages. Purchase-related pages in an app are pages that are closer to purchase, such as those relating to adding a product to the shopping cart, clicking checkout, or making payments. In contrast, non-purchase-related pages involve actions that are farther from purchase, such as browsing products or obtaining store-related information. The effect of app failure is negative and significant (p < .001) on all the outcome variables for shoppers who experience the failure on a non-purchase-related page, but not for shoppers who experience the failure on a purchase-related page. Shoppers who already have a strong purchase intent and are on a purchase-related page right before the failure are not as negatively affected as those without a strong purchase intent or on a non-purchase-related page.

Table 6 DID MODEL RESULTS FOR FAILURES OCCURRING ON PURCHASE AND NON-PURCHASE RELATED PAGES

                            Failure on purchase-related page              Failure on non-purchase-related page
Variable                    Frequency of  Quantity of  Value of           Frequency of  Quantity of  Value of
                            purchases     purchases    purchases          purchases     purchases    purchases
Failure experiencer         .000          -.016        .907               -.053***      -.108***     -4.627***
  x Post shock (DID)        (.013)        (.038)       (1.195)            (.009)        (.022)       (.763)
Failure experiencer         -.019         -.016        -.246              -.041***      -.075***     -1.208***
                            (.011)        (.034)       (.52)              (.007)        (.019)       (.341)
Post shock                  .178***       .236***      26.947***          .178***       .236***      26.947***
                            (.006)        (.014)       (.497)             (.006)        (.014)       (.497)
Intercept                   .750***       1.523***     30.755***          .750***       1.523***     30.755***
                            (.005)        (.012)       (.219)             (.005)        (.012)       (.219)
R squared                   .004          .001         .019               .004          .001         .018
Mean Y                      .836          1.637        44.270             .813          1.591        42.850

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences. N = 160,662 for failure on a purchase-related page. N = 217,418 for failure on a non-purchase-related page.

To further explore the role of the purchase funnel, we compare the change in the value of purchases between the post and pre app failure time periods for two groups of shoppers, close to and far from purchase, based on a median split of re-login attempts during the failure window. The median number of attempts is three. The negative effect of failure for shoppers who make more re-login attempts is smaller (value of purchases(post-pre, high attempt) = 28.03 vs. value of purchases(post-pre, control) = 26.95, p > .01) than for shoppers who make fewer re-login attempts (value of purchases(post-pre, low attempt) = 20.33 vs. value of purchases(post-pre, control) = 26.95, p < .001). The group of shoppers who are close to purchase at the time of app failure are likely to repeatedly attempt to re-login during the failure duration to complete their intended purchase. Such shoppers may eventually make the purchase in another channel, resulting in channel substitution. However, the group of shoppers who are far from purchase at the time of failure make fewer attempts to log back in during the failure time window. A greater negative effect of app failure for such shoppers may be due to brand preference dilution.

Failure experiencers who were close to a purchase or had purchase intent would have had to decide whether to complete the transaction, and if so, whether to do it online or offline. For shoppers who typically buy online, the cost of going to the retailer’s website to complete a purchase interrupted by the app failure is smaller than that of going to the store to complete the purchase. Therefore, these shoppers will likely complete the transaction online and not exhibit any significant decrease in shopping outcomes in the online channel post failure. Thus, the channel substitution effect likely explains the insignificant effects of app failure in the online channel. By contrast, shoppers who typically buy in the retailer’s brick-and-mortar stores and who experience the app failure will likely have a diminished perception of the retailer, with fewer incentives to buy from the stores in the future. Thus, the brand preference dilution effect may prevail for these shoppers after app failure. This effect is due to a negative spillover from the app channel to the offline channel for shoppers experiencing the failure, even if they are primarily offline shoppers. Indeed, a negative message or experience can have an adverse spillover effect on attributes or contexts outside the realm of the message or experience (Ahluwalia et al. 2001).

To further explore channel substitution toward the online channel, we examine the time elapsed between the occurrence of the failure and the subsequent purchase in the online channel. Failure experiencers’ inter-purchase time online (Mean(treated) = 162.8 hours) is much shorter than non-experiencers’ (Mean(control) = 180.7 hours) (p = .003). This result further suggests that after an app failure, shoppers look to complete their intended purchases in the online channel.

Next, to understand channel substitution toward the offline channel, we examine the effect of app failure for shoppers who were geographically close to a physical store at the time of failure. Shoppers who are closer to a store when they experience the app failure can more easily complete their purchase in the store than shoppers farther from a store. Table 7 reports the DID model for the subsample of shoppers located within two miles of the retailer’s store at the time of failure. The results show that shoppers closer to the store are not negatively affected by the failure. Rather surprisingly, both the basket size and the monetary value of purchases for shoppers close to a store are significantly higher after app failure (p < .05). This result suggests that shoppers who experience a failure close to or at a physical store end up buying additional items in the store. Thus, app failure has an unintended positive effect on such shoppers. An implication is that channel substitution to a store leads to more purchases, but such channel substitution is less likely for shoppers who are farther from the store. However, the proportion of shoppers close to the store at the time of failure is very small (2.4%), so the overall effects of app failure on offline purchases and all purchases are still negative.

Table 7 DID MODEL RESULTS FOR VALUE OF PURCHASES AND BASKET SIZE BY CHANNEL FOR SHOPPERS CLOSE TO A STORE (< 2 MILES) AT THE TIME OF FAILURE

                            Offline                        Online
Variable                    Value of      Basket size      Value of      Basket size
                            purchases                      purchases
Failure experiencer         13.542*       .134*            .885          .023
  x Post shock (DID)        (5.307)       (.058)           (1.178)       (.020)
Failure experiencer         2.419         .061             -1.096*       -.012
                            (2.171)       (.048)           (.479)        (.013)
Post shock                  37.833***     .109**           2.382**       .019
                            (3.150)       (.036)           (.882)        (.012)
Intercept                   32.458***     .846***          2.251***      .059***
                            (1.336)       (.031)           (.390)        (.009)
R squared                   .0395         .0064            .0027         .0012
Mean Y                      55.00         .95              3.18          .07

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 6,572. DID = Difference-in-Differences. Two miles is the median distance from the store at the time of failure.

To further analyze the role of distance to the store at the time of failure, we present in Table 8 the contrast analysis between shoppers who were less than two miles and those who were more than two miles from the nearest store at the time of failure. The basket sizes of these groups of shoppers do not differ post failure. However, shoppers closer to the store spend more than those farther from the store post failure, suggesting that the app failure is associated with channel substitution in purchases for shoppers closer to the store.

Table 8 CONTRAST ANALYSIS BASED ON DISTANCE TO STORE AT THE TIME OF FAILURE FOR FAILURE EXPERIENCERS

Variable                    Offline value   Offline
                            of purchases    basket size
Close to store              14.130*         .083
  x Post shock (DID)        (5.707)         (.065)
Close to store              2.011           .051
                            (2.357)         (.051)
Post shock                  40.511***       .159**
                            (3.675)         (.046)
Intercept                   34.021***       .855***
                            (1.593)         (.036)
R squared                   .0432           .0064
Mean Y                      54.96           .98

Note: N = 5,650. Closeness to store is defined using the median distance of 2 miles. There are 1,298 failure experiencers within 2 miles of the store at the time of failure and 1,527 failure experiencers who are 2 miles or farther from the store among those who opt in for location sharing. *** p < .001, ** p < .01, * p < .05.

Finally, to better understand how the channel substitution and brand preference dilution effects may act on different failure-experiencing shopper groups purchasing in different channels, we compare the effects of app failure on orders above the free-shipping threshold value ($35) in the online and offline channels. Failure experiencers who intended to order items valued above the threshold can quickly substitute the app channel with the Web channel without additional cost, so we expect the app failure to have little effect on their frequency of online purchases. The results of a DID model for the frequency of online and offline purchases above the free-shipping threshold appear in Table 9. Indeed, app failure has no significant (p > .10) effect on the frequency of purchases in the online channel but a negative and significant (p < .05) effect in the offline channel, suggesting offline shoppers significantly lower their preference and purchases after the app failure. Thus, channel substitution appears to explain the null effect of app failure online, while brand preference dilution seems to account for the negative effect of failure offline.

Table 9 DID MODEL RESULTS FOR THE AVERAGE NUMBER OF ORDERS ABOVE THE FREE SHIPPING ORDER VALUE THRESHOLD

Variable                    Average number     Average number
                            of online orders   of offline orders
                            above $35          above $35
Failure experiencer         .021               -.010*
  x Post shock (DID)        (.021)             (.005)
Failure experiencer         -.008              .007
                            (.016)             (.004)
Post shock                  .118***            .103***
                            (.015)             (.004)
Intercept                   .414***            .444***
                            (.011)             (.003)
R-squared                   .017               .013
N                           8,178              109,836
Mean Y                      .50                .48

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

We realize that much of the evidence for the mechanisms is descriptive and suggestive in nature. Nevertheless, overall, the evidence is consistent with the asymmetry in the effect of app failure on shopping outcomes across the two channels.

Next, we examine the heterogeneity in the sensitivity of shoppers to app failures in two ways: a theory-based moderator approach and a data-driven machine learning approach.

Moderators: Relationship Strength and Prior Digital Use

The literatures on relationship marketing and service recovery suggest two factors may moderate the impact of app failures on outcomes: relationship strength and prior digital channel use.

Relationship Strength. The service marketing literature offers mixed evidence on the moderating role of the strength of the customer’s relationship with the firm in the effect of service failure on shopping outcomes. Some studies suggest that a stronger relationship may aggravate the effect of failures on product evaluation, satisfaction, and purchases (Chandrashekaran et al. 2007; Gijsenberg et al. 2015; Goodman et al. 1995). Other studies show that a stronger relationship attenuates the negative effect of service failures (Hess et al. 2003; Knox and van Oest 2014). Consistent with the direct marketing literature (Bolton 1998; Schmittlein et al. 1987), we operationalize customer relationship using the RFM (recency, frequency, and monetary value) dimensions. Because of the high correlation between the interactions of frequency with (failure experiencer x post shock) and value of purchases with (failure experiencer x post shock) (r = .90, p < .001), and because value of purchases is more important for the retailer, we drop frequency of past purchases.

Prior Digital Channel/Online Use/Experience. The moderating effect of a shopper’s prior digital channel/online use or experience with the retailer on app failure’s impact on shopping outcomes could be positive or negative. On the one hand, more digitally experienced app users may be less susceptible to the negative impact of an app crash on subsequent engagement with the app than less digitally experienced app users (Shi et al. 2017) because they are conditioned to expect some level of technology failure, consistent with the product harm crises literature (Cleeren et al. 2013; Liu and Shankar 2015) and expectation-confirmation theory (Cleeren et al. 2008; Oliver 1980; Tax et al. 1998). On the other hand, prior digital exposure and experience with the firm may heighten shopper expectations and make them less tolerant of failures. We operationalize this variable as the cumulative number of purchases that the shopper made from the retailer’s website prior to experiencing a failure.

The results of the model with relationship strength and past digital channel use as moderators appear in Table 10. Consistent with our expectation, the monetary value of past purchases has positive and significant interaction coefficients with the DID model variable across all the outcome variables (p < .001). Thus, app failures have a smaller effect on shoppers with a stronger relationship with the retailer, consistent with the results of Ahluwalia et al. (2001). Recency has negative coefficients (p < .001), suggesting that more recent shoppers are less tolerant of failure. A failure shock also affects the frequency, quantity, and value of purchases (p < .001) of shoppers with greater digital channel or online purchase experience with the retailer more.

Table 10 DID MODEL RESULTS OF FAILURE SHOCKS FOR PURCHASES ACROSS CHANNELS: MODERATING EFFECTS OF RELATIONSHIP WITH RETAILER AND PAST ONLINE PURCHASE FREQUENCY

Variable                            Frequency of   Quantity of   Value of
                                    purchases      purchases     purchases
Failure experiencer x Post shock    -.193***       -.389***      -12.935***
  (DID)                             (.012)         (.03)         (.879)
DID x Past value of purchases       .000***        .000***       .017***
                                    (.000)         (.000)        (.001)
DID x Recency of purchases          -.001***       -.003***      -.022***
                                    (.000)         (.000)        (.006)
DID x Past online purchase          -.019***       -.029***      -1.344***
  frequency                         (.003)         (.007)        (.220)
Past value of purchases             .000***        .001***       .024***
                                    (.000)         (.000)        (.000)
Recency                             .005***        .009***       .206***
                                    (.000)         (.000)        (.003)
Past online purchase frequency      .002           .007          -.638***
                                    (.001)         (.004)        (.105)
Failure experiencer                 .007           .037*         .589
                                    (.007)         (.017)        (.497)
Post shock                          .178***        .236***       26.947***
                                    (.007)         (.017)        (.505)
Intercept                           .640***        1.141***      25.034***
                                    (.006)         (.015)        (.446)
R squared                           .159           .122          .093
Mean Y                              .839           1.643         44.070

Notes: DID = Difference-in-Differences. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378.
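The moderated specification behind Table 10 augments Equation (1) with interactions between the DID term and the moderators. A minimal sketch, extending the earlier regression sketch (column names again illustrative, not the actual variables in our data):

```python
# Minimal sketch of the moderated DID specification in Table 10;
# file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("shopper_periods.csv")  # hypothetical shopper x period panel
formula = ("value_of_purchases ~ F * P"
           " + F:P:past_value + F:P:recency + F:P:past_online_freq"
           " + past_value + recency + past_online_freq")
result = smf.ols(formula, data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["shopper_id"]})
print(result.params.filter(like="F:P"))  # DID effect and its moderator interactions
```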

Heterogeneity in Shoppers’ Sensitivity to App Failures (Treatment Effect)

In addition to the moderator variables from the service marketing literature examined earlier, we also explore heterogeneity in treatment effects relating to additional managerially useful observed variables (e.g., gender, loyalty level) not fully examined by prior research. Unfortunately, including these variables as additional moderators in the DID analysis greatly multiplies the number of main and interaction effects.

Recent methods of causal inference using machine learning, such as the "causal forest," allow us to recover individual-level conditional average treatment effects (CATE) (Athey et al. 2017; Wager and Athey 2018). The causal forest is an ensemble of causal trees that averages the treatment-effect predictions produced by each of thousands of trees.9 It has been applied in marketing to model customer churn and information disclosure (Ascarza 2018; Guo et al. 2018).
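The paper does not name the software behind its causal forest; one possible sketch, using the econml package's generalized random forest implementation on synthetic data with hypothetical variable roles, is:

```python
import numpy as np
from econml.grf import CausalForest

# Synthetic covariates standing in for the paper's moderators (past value,
# recency, online frequency, gender, loyalty). econml is an assumed
# implementation choice, not necessarily the authors' software.
rng = np.random.default_rng(1)
m = 6000
X = rng.normal(size=(m, 5))
T = rng.integers(0, 2, size=m)            # failure-experiencer indicator
tau_true = -1.5 + 0.5 * X[:, 0]           # heterogeneous effect by construction
y = X[:, 1] + tau_true * T + rng.normal(size=m)

train = np.arange(2 * m // 3)             # two-thirds training sample
test = np.arange(2 * m // 3, m)           # one-third held-out test sample
cf = CausalForest(n_estimators=1000, honest=True, random_state=0)
cf.fit(X[train], T[train], y[train])
cate_hat = cf.predict(X[test]).ravel()    # individual-level CATE estimates
print(cate_hat.mean(), (cate_hat < 0).mean())
```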

The estimates from the causal forest using 1,000 trees appear in Web Appendix Table A1. About 96% of the shoppers have a negative CATE, with an average of -1.739. The distribution of CATE across shoppers appears in Web Appendix Figure A1. The shopper quintiles based on CATE levels, shown in Web Appendix Figure A2, reflect this distribution; Segment 1, comprising the most sensitive shoppers, exhibits higher variance than the rest.

Next, we regress the CATE estimates on the covariate space to identify the covariates that best explain treatment heterogeneity. The results appear in Web Appendix Table A2. They show that all the covariates, including gender and loyalty, are significant (p < .001). Shoppers with a higher value of past purchases and more frequent online purchases are less sensitive to an app failure than others. Shoppers who bought more recently are less tolerant of an app failure. Some of these results complement those from the moderator analysis.
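The post-hoc step of regressing CATE estimates on covariates could look like the following sketch, continuing the synthetic causal forest example above (column names remain hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

# Post-hoc OLS of the estimated CATE on covariates (cf. Web Appendix
# Table A2), using the held-out test sample from the earlier sketch.
cols = ["past_value", "recency", "online_freq", "gender", "loyalty"]
test_df = pd.DataFrame(X[test], columns=cols)
posthoc = sm.OLS(cate_hat, sm.add_constant(test_df)).fit()
print(posthoc.summary())
```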

The causal forest-derived CATE regression differs from the moderator DID regression in important ways. First, the moderator regression uses the entire sample for estimation, while the causal forest, the basis for the CATE regression, uses a subset of the data (the training sample) for estimation. Second, the causal forest underlying the CATE regression splits the training data further to estimate an honest tree, estimating from an even smaller subset of the moderator regression sample. Third, relative to the linear moderator regression, the CATE regression can handle a much larger number of covariates. Because of these differences, the results of the CATE regression model may not exactly mirror those of the moderator regression model.

9 In Web Appendix A, we provide an overview of causal trees and describe the algorithm for estimating a single causal tree, followed by bagging a large number of causal trees into a forest.

ROBUSTNESS CHECKS AND RULING OUT ALTERNATIVE EXPLANATIONS

We perform several robustness checks and tests to rule out alternative explanations for the effect of app failure on purchases.

Alternative model specifications. Although the failure in our data is exogenous, as an additional check we estimate, in addition to our proposed DID model, models with shopper covariates to recover the treatment effect of interest. Additionally, we estimate Poisson count data models for the frequency and quantity variables. The results from these models replicate the findings from Tables 3 and 4 and appear in Web Appendix Tables B1-B2 and C1-C2, respectively. The coefficients of the treatment effect in Tables B1 and C1 represent changes in outcomes due to app failures, conditional on covariates. These results are substantively similar to those in Tables 3 and 4. The insensitivity of the results to control variables suggests that the effect of unobservables relative to these observed covariates would have to be very large to significantly change our results (Altonji et al. 2005). Similarly, the results are robust to a Poisson specification, reported in Tables B2 and C2.
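A Poisson DID of the kind reported in Tables B2 and C2 could be estimated as in this sketch, reusing the synthetic panel df and generator rng from the earlier sketches (the count column is hypothetical):

```python
import statsmodels.formula.api as smf

# Poisson DID for a count outcome, with cluster-robust standard errors
# by shopper; the count outcome here is simulated for illustration.
df["frequency"] = rng.poisson(0.8, len(df))
pois = smf.poisson("frequency ~ treat * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["shopper_id"]})
print(pois.params)
```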

Outliers. We re-estimate the models after removing outliers (extremely heavy spenders whose pre-period monetary value of purchases is more than three standard deviations from the mean) from our data. Web Appendix Tables B3 and C3 report these results. We find the results to be consistent with, and even stronger than, those reported earlier.
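A minimal sketch of this trimming rule, assuming the synthetic panel from the earlier sketches:

```python
# Trim outliers: drop shoppers whose pre-period spending is more than
# three standard deviations from the mean.
pre = df.loc[df["post"] == 0].groupby("shopper_id")["value"].sum()
z = (pre - pre.mean()) / pre.std()
df_trimmed = df[df["shopper_id"].isin(z.index[z.abs() <= 3])]
```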

Existing shoppers. Another possible explanation for app failures' effect is that only new or dormant shoppers, perhaps because of low switching costs, are sensitive to failures. Therefore, we remove shoppers with no purchases in the 12 months before the failure and re-estimate the models on the remaining existing shoppers. Indeed, Web Appendix Tables B4 and C4 report substantively similar results after excluding the new or dormant shoppers.

Alternative measures of digital channel use moderators. In lieu of past online purchase frequency as a measure of prior digital channel use, we use measures based on a median split of the number of online purchases, the share of online purchases, and prior app usage in the time between app launch and the server failure. The results for the alternative online purchase measures are almost the same as our proposed model results, except for prior app usage: shoppers who use the app more frequently appear to be less sensitive to failures, as shown in Web Appendix Tables B5 and C5.

Regression discontinuity analysis. To ensure that there are no unobservable differences between failure experiencers and non-experiencers based on the time of login, we carry out a 'regression discontinuity' (RD) style analysis around the start time of the service failure. We consider only app users in the neighborhood of this time, using as the control group users who logged in within one hour before or after the failure period and as the treated group users who logged in during the failure period. The results are substantively similar to our main model results and are reported in Web Appendix Tables B6 and C6.
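A sketch of this sample construction follows; the timestamps are purely illustrative, since the retailer's actual failure window is not disclosed:

```python
import pandas as pd

# RD-style sample around a hypothetical two-hour failure window, with
# synthetic login times spread over the hour before, the window itself,
# and the hour after.
start, end = pd.Timestamp("2018-06-01 18:00"), pd.Timestamp("2018-06-01 20:00")
df["login"] = (start - pd.Timedelta(hours=1)
               + pd.to_timedelta(rng.uniform(0, 4 * 3600, len(df)), unit="s"))
treated = (df["login"] >= start) & (df["login"] < end)  # logged in during failure
# All remaining logins fall within one hour before or after the window,
# so they serve as the local control group.
rd_df = df.assign(treat=treated.astype(int))
```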

Longer-term effect of failures. Our main analysis shows 14-day effects of app failures. To explore whether these effects continue over longer periods, we examine the outcomes four weeks pre- and post-failure. There is a steep fall in the period immediately after the failure. Purchases climb back over the next three weeks but settle at levels below the pre-period average. Thus, a diminished but persistent effect of failure remains over time. These patterns appear in Figure 6 and Web Appendix Table D1. The table shows the coefficients of the interactions of weekly dummies with TREAT in a DID regression. Because an app failure occurs every 7-8 weeks, we estimate the effects four weeks pre and post so that our pre- and post-periods do not overlap with other failures.
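The weekly event-study regression underlying Table D1 could be set up as in this sketch; the shopper-week panel and its column names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Event-study DID: dummies for relative weeks -4..+4 (base week -5)
# interacted with TREAT, as in Table D1. The panel is synthetic, and
# shopper fixed effects are absorbed by within-shopper demeaning.
shoppers = df[["shopper_id", "treat"]].drop_duplicates()
panel = shoppers.merge(pd.DataFrame({"rel_week": np.arange(-5, 5)}), how="cross")
panel["value"] = (rng.normal(15, 4, len(panel))
                  - 1.0 * panel["treat"] * (panel["rel_week"] >= 0))
panel["value_dm"] = (panel["value"]
                     - panel.groupby("shopper_id")["value"].transform("mean"))
es = smf.ols("value_dm ~ C(rel_week, Treatment(reference=-5)):treat"
             " + C(rel_week)", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["shopper_id"]})
print(es.params.filter(like="treat"))
```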


Figure 6 APP FAILURE EFFECTS ON VALUE OF PURCHASES OVER FOUR WEEKS

Note: The effects for all but one of the pre-app-failure weeks are insignificant. The horizontal line is the average treatment effect.

Stacked model for channel effects. The results for online and offline purchases in Table 4 do not show the relative sizes of the effects across the two channels. To examine these relative effects, we estimate a stacked model of online and offline outcomes that includes a channel dummy. The results for this model appear in Web Appendix Table D2. We interpret the effects as a proportion of the purchases within the channel and conclude that the effects in the offline channel are more negative than those in the online channel (p < .01). We also estimated a DID regression model with value of purchases in the offline channel as a proportion of total purchases and found negative and significant effects of failure (p < .01).
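A sketch of the stacked specification, again on synthetic data with hypothetical channel columns (the channel split of the synthetic outcome is arbitrary and illustrative):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stack online and offline outcomes into one long panel with a channel
# dummy (1 = offline), mirroring the structure of Web Appendix Table D2.
df["value_off"], df["value_on"] = 0.95 * df["value"], 0.05 * df["value"]
long_df = pd.concat([
    df.assign(value_c=df["value_off"], channel=1),
    df.assign(value_c=df["value_on"], channel=0),
], ignore_index=True)
stacked = smf.ols("value_c ~ treat * post * channel", data=long_df).fit(
    cov_type="cluster", cov_kwds={"groups": long_df["shopper_id"]})
print(stacked.params)
```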

DISCUSSION, MANAGERIAL IMPLICATIONS, AND LIMITATIONS

Summary

In this paper, we addressed novel research questions: What is the effect of a service failure in a retailer's mobile app on the frequency, quantity, and monetary value of purchases in online and offline channels? What possible mechanisms may explain these effects? How do shoppers' relationship strength and prior digital channel use moderate these effects? How heterogeneous is shoppers' sensitivity to failures? By answering these questions, our research fills an important gap at the crossroads of three disparate streams of research in different stages of development: the mature stream of service failures, the growing stream of omnichannel marketing, and the nascent stream of mobile marketing. We leveraged a random systemwide failure in the app to measure the causal effect of app failure. To our knowledge, this is the first study to causally estimate the effects of digital service failure using real-world data. Using unique data spanning online and offline retail channels, we examined the spillover effects of such failures across channels and the heterogeneity in these effects across channels and shoppers.

Our results reveal that app failures have a significant negative effect on shoppers' frequency, quantity, and monetary value of purchases across channels. These effects are heterogeneous across channels and shoppers. Interestingly, the overall decreases in purchases across channels are driven by reductions in store purchases and not in digital channels. Furthermore, we find that shoppers with a higher monetary value of past purchases are less sensitive to app failures.

Overall, our nuanced analyses of the mechanisms by which an app failure affects purchases offer new and insightful explanations in a cross-channel context. Our findings are consistent with the view that some customers may be tolerant of technological failures (Meuter et al. 2000).

Finally, our study offers novel insights into the cross-channel implications of app failures.

Economic Significance

The economic effects of failures are sizeable enough for any retailer to revisit its service failure prevention and recovery strategies. Based on our estimates, the economic impact of an app failure is a revenue loss of about $.48 million.10 The retailer experiences about 5-7 failures each year, resulting in an annual loss of $2.4-$3.4 million. This loss may not amount to a sizeable portion of the retailer's annual revenues. However, given the low retail margins and the retailer's vulnerable financial condition, it is a substantial amount for the retailer.

10 We compute this figure by using the weekly effect coefficients in Table D1, i.e., $(.82 + 1.70 + .58 + .53)*N for the first four weeks and $(.64)*5*N, assuming that the fifth week's effect persists for another five weeks until the next failure, for N = 70,568 failure experiencers, totaling about $.48 million.
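Restating the footnote's arithmetic explicitly, using only the numbers reported there:

$\text{Loss per failure} = (0.82 + 1.70 + 0.58 + 0.53)\,N + (0.64 \times 5)\,N = 6.83 \times 70{,}568 \approx \$482{,}000$

so that 5 failures per year imply roughly $2.4 million and 7 failures roughly $3.4 million.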

The economic effect is meaningful for several reasons. First, because retailers operate on thin margins (2-3% in many categories) and are cost-conscious, such an economic loss is impactful. Second, the effect size of 7.1% from our results is consistent with, and even higher than, those from other similar causal studies. For example, exposure to banner advertising has been shown to lift purchase intention by .473%, worth 42 cents per click to the firm (Goldfarb and Tucker 2011). Goldfarb and Tucker (2011) argue, "although the coefficient may seem small, it suggests an economically important impact of online advertising." Third, in the mobile context, being in a crowd of five people per square meter, relative to two, when receiving a mobile promotion results in an economically meaningful 2.2% more clicks (Andrews et al. 2015). Fourth, Akca and Rao (2020) argue that a revenue drop of $5.32 million is economically significant even for a large company such as Orbitz. Fifth, as sales through the mobile app and online channels grow rapidly, this effect will only get larger. Sixth, our estimates are for a single two-hour app failure. Finally, the effects continue over a longer five-week period.

Managerial Implications

Service failure and low-quality service likely lead customers to terminate their relationship with the provider (Sriram et al. 2015). The insights from our research better inform executives in managing their mobile apps and channels and offer practitioner implications for service failure prevention and recovery strategies.

Preventive Strategies. Managers can use the estimate that an app failure results in a 7.1% decrease in the monetary value of purchases to budget resources for their efforts to prevent or reduce app failures. The result that the adverse effect of failure is lower for shoppers closer to purchase and purchasing less recently offers interesting pointers for retailers seeking to prevent damage to their brand and revenues. In general, managers should encourage shoppers to use the app more, get closer to purchase, and purchase more through the app. Managers could offer limited-time incentives to shoppers who have not clicked the checkout or purchase tabs in the app.

By identifying failure-sensitive shoppers based on relationship strength, prior digital use, and individual-level CATE estimates, managers can take proactive actions to prevent these shoppers from reducing their shopping intensity with the firm. Figure 7 represents the loss of revenues (spending) from each percentile of shoppers at different levels of failure sensitivity.

Figure 7 RETAILER’S REVENUE LOSS BY PERCENTILE OF SHOPPERS EXPERIENCING APP FAILURE

Note: CATE = Conditional Average Treatment Effect.

About 70% of the losses in revenues due to failure arise from just 47% of the shoppers.

Managers can manage these shoppers' expectations through email and app notification messaging channels. Warning shoppers of the typical number of disruptions in the app can preempt negative attributions and attitudes, and limit the potential brand dilution and drop in revenues due to app failure.

Recovery Strategies. The finding that app failures result in reduced purchases across channels suggests that managers should develop interventions and recovery strategies that mitigate the negative effects of app failures not just in the mobile channel, but also in other channels, in particular, the offline channel. Thus, seamlessly integrating data from a mobile app with data from its stores and websites can help an omnichannel retailer build continuity in shoppers' experiences.

Immediately after a shopper experiences an app failure, the manager of the app should provide gentle nudges and even incentives for the shopper to complete an abandoned transaction on the app. Typically, a manager may need to provide these nudges and incentives through other communication channels such as email, phone calls, or face-to-face chat. These nudges are similar in spirit and execution to those from firms like Fitbit and Amazon, which remind customers through email to reconnect when they disconnect their watch or smart speaker, respectively. If the store is a dominant channel for the retailer, the retailer should use its store associates to reassure or incentivize shoppers. In some cases, managers can even offer incentives in other channels to complete a transaction disrupted by an app failure.

Because diminished purchases after a failure result from reduced engagement, managers should aim to enhance engagement after a systemwide failure. Once service is restored, managers could induce shoppers to use the app more through gamification features in the app or by providing enhanced loyalty points for logging back into the app.

The finding that an app failure can enhance spending for shoppers experiencing the failure close to a store offers useful cross-selling opportunities for the retailer. After a systemwide failure is resolved, retailers can proactively promote products, selected from each failure-experiencing shopper's purchase history, in the store nearest to that shopper.

Managers should mitigate the negative effects of app failures for the most sensitive shoppers first. They should proactively identify failure-sensitive shoppers and design preemptive strategies to mitigate any adverse effects. We find that shoppers with a weaker relationship with the provider are more sensitive to failures. Thus, firms should address such shoppers for recovery after a careful cost-benefit analysis. This is important because apps serve as a gateway for future purchases for these shoppers.

Finally, our analysis of heterogeneity in shoppers' sensitivity to app failures suggests that managers should first satisfy the shoppers with the largest negative CATE estimates, that is, the most failure-sensitive shoppers. Interventions targeted at the 47% of the shoppers who contribute to 70% of losses could lead to higher returns.

Limitations

Our study has limitations that future research can address. First, we have data on a limited number of failures, so we could not fully explore failures of varying durations. Second, our results are most informative for similar retailers that have a large brick-and-mortar presence but growing online and in-app purchases. If data are available, future research could study app failures for primarily online retailers with an expanding offline presence (e.g., Bonobos, Warby Parker). Third, we do not have data on competing apps that shoppers may use. Additional research could study shoppers' switching behavior if data on competing apps are available.

Fourth, our data contain a relatively small number of purchases in the mobile channel. For better generalizability of the extent of spillover across channels, our analysis could be extended to contexts in which a substantial portion of purchases is made within the app. Fifth, we do not have data on purchases made through the app vs. the mobile browser. Studying differences between these two mobile sub-channels is a fruitful avenue for future research. Finally, mobile apps may be an effective way to recover from the adverse effects of service failures (Tucker and Yu 2019). Our approach also provides a way to identify app-failure-sensitive shoppers, but we do not have data on shoppers' responses to service recovery to recommend the best mitigation strategy. The strategies we do recommend could be tested in ethically permissible field studies.

REFERENCES

Ahluwalia, Rohini, H. Rao Unnava, and Robert E. Burnkrant (2001), "The Moderating Role of Commitment on the Spillover Effect of Marketing Communications," Journal of Marketing Research, 38 (4), 458–70.
Akca, Selin and Anita Rao (2020), "Value of Aggregators," Marketing Science, 39 (5), 893–922.
Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber (2005), "Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools," Journal of Political Economy, 113 (1), 151–84.
Andreassen, Tor Wallin (1999), "What Drives Customer Loyalty with Complaint Resolution?" Journal of Service Research, 1 (4), 324–32.
Andrews, Michelle, Xueming Luo, Zheng Fang, and Anindya Ghose (2015), "Mobile Ad Effectiveness: Hyper-Contextual Targeting with Crowdedness," Marketing Science, 35 (2), 218–33.
Angrist, Joshua D. and Jörn-Steffen Pischke (2009), Mostly Harmless Econometrics: An Empiricist's Companion, Princeton: Princeton University Press.
Ansari, Asim, Carl F. Mela, and Scott A. Neslin (2008), "Customer Channel Migration," Journal of Marketing Research, 45 (1), 60–76.
Athey, Susan and Guido Imbens (2016), "Recursive Partitioning for Heterogeneous Causal Effects," Proceedings of the National Academy of Sciences, 113 (27), 7353–60.
Athey, Susan, Guido Imbens, Thai Pham, and Stefan Wager (2017), "Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges," American Economic Review, 107 (5), 278–81.
Avery, Jill, Thomas J. Steenburgh, John Deighton, and Mary Caravella (2012), "Adding Bricks to Clicks: Predicting the Patterns of Cross-Channel Elasticities Over Time," Journal of Marketing, 76 (3), 96–111.
Barron's (2018), "Walmart: Can It Meet Its Digital Sales Growth Targets?" (accessed November 5, 2020), [available at https://www.barrons.com/articles/walmart-can-it-meet-its-digital-sales-growth-targets-1519681783].
Bell, David R., Santiago Gallino, and Antonio Moreno (2018), "Offline Showrooms in Omni-channel Retail: Demand and Operational Benefits," Management Science, 64 (4), 1629–51.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004), "How Much Should We Trust Differences-In-Differences Estimates?" The Quarterly Journal of Economics, 119 (1), 249–75.
Bitner, Mary Jo, Bernard H. Booms, and Mary Stanfield Tetreault (1990), "The Service Encounter: Diagnosing Favorable and Unfavorable Incidents," Journal of Marketing, 54 (1), 71–84.
Blancco (2016), "The State of Mobile Device Performance and Health: Q2," (accessed November 5, 2020), [available at https://www2.blancco.com/en/research-study/state-of-mobile-device-performance-and-health-trend-report-q2-2016].
Bolton, Ruth N. (1998), "A Dynamic Model of the Duration of the Customer's Relationship with a Continuous Service Provider: The Role of Satisfaction," Marketing Science, 17 (1), 45–65.
Brynjolfsson, Erik, Yu Jeffery Hu, and Mohammad S. Rahman (n.d.), "Competing in the Age of Omnichannel Retailing," MIT Sloan Management Review, (accessed November 5, 2020), [available at https://sloanreview.mit.edu/article/competing-in-the-age-of-omnichannel-retailing/].
Chandrashekaran, Murali, Kristin Rotte, Stephen S. Tax, and Rajdeep Grewal (2007), "Satisfaction Strength and Customer Loyalty," Journal of Marketing Research, 44 (1), 153–63.

Chintagunta, Pradeep K., Junhong Chu, and Javier Cebollada (2011), "Quantifying Transaction Costs in Online/Off-line Grocery Channel Choice," Marketing Science, 31 (1), 96–114.
Cleeren, Kathleen, Marnik G. Dekimpe, and Kristiaan Helsen (2008), "Weathering Product-Harm Crises," Journal of the Academy of Marketing Science, 36 (2), 262–70.
Cleeren, Kathleen, Harald J. van Heerde, and Marnik G. Dekimpe (2013), "Rising from the Ashes: How Brands and Categories Can Overcome Product-Harm Crises," Journal of Marketing, 77 (2), 58–77.
Computerworld (2014), "iOS 8 App Crash Rate Falls 25% Since Release," (accessed November 5, 2020), [available at https://www.computerworld.com/article/2841794/ios-8-app-crash-rate-falls-25-since-release.html].
Dimensional Research (2015), "Mobile User Survey: Failing to Meet User Expectations," TechBeacon, (accessed November 5, 2020), [available at https://techbeacon.com/resources/survey-mobile-app-users-report-failing-meet-user-expectations].
Dotzel, Thomas, Venkatesh Shankar, and Leonard L. Berry (2013), "Service Innovativeness and Firm Value," Journal of Marketing Research, 50 (2), 259–76.
Fong, Nathan M., Zheng Fang, and Xueming Luo (2015), "Geo-Conquesting: Competitive Locational Targeting of Mobile Promotions," Journal of Marketing Research, 52 (5), 726–35.
Forbes, Lukas P. (2008), "When Something Goes Wrong and No One Is Around: Non-Internet Self-Service Technology Failure and Recovery," Journal of Services Marketing, 22 (4), 316–27.
Forbes, Lukas P., Scott W. Kelley, and K. Douglas Hoffman (2005), "Typologies of E-Commerce Retail Failures and Recovery Strategies," Journal of Services Marketing, 19 (5), 280–92.
Forman, Chris, Anindya Ghose, and Avi Goldfarb (2009), "Competition between Local and Electronic Markets: How the Benefit of Buying Online Depends on Where You Live," Management Science, 55 (1), 47–57.
Ghose, Anindya, Hyeokkoo Eric Kwon, Dongwon Lee, and Wonseok Oh (2018), "Seizing the Commuting Moment: Contextual Targeting Based on Mobile Transportation Apps," Information Systems Research, 30 (1), 154–74.
Gijsenberg, Maarten J., Harald J. Van Heerde, and Peter C. Verhoef (2015), "Losses Loom Longer than Gains: Modeling the Impact of Service Crises on Perceived Service Quality over Time," Journal of Marketing Research, 52 (5), 642–56.
Goldfarb, Avi and Catherine Tucker (2011), "Online Display Advertising: Targeting and Obtrusiveness," Marketing Science, 30 (3), 389–404.
Google (2020), "Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed," Think with Google, (accessed November 5, 2020), [available at https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-page-speed-new-industry-benchmarks/].
Google M/A/R/C Study (2013), "Mobile In-Store Research: How In-Store Shoppers Are Using Mobile Devices," Google M/A/R/C.
Guo, Tong, S. Sriram, and Puneet Manchanda (2017), "The Effect of Information Disclosure on Industry Payments to Physicians," SSRN Scholarly Paper, Rochester, NY: Social Science Research Network.
Halbheer, Daniel, Dennis L. Gärtner, Eitan Gerstner, and Oded Koenigsberg (2018), "Optimizing Service Failure and Damage Control," International Journal of Research in Marketing, 35 (1), 100–15.

Hansen, Nele, Ann-Kristin Kupfer, and Thorsten Hennig-Thurau (2018), "Brand Crises in the Digital Age: The Short- and Long-Term Effects of Social Media Firestorms on Consumers and Brands," International Journal of Research in Marketing, 35 (4), 557–74.
Hess, Ronald L., Shankar Ganesan, and Noreen M. Klein (2003), "Service Failure and Recovery: The Impact of Relationship Factors on Customer Satisfaction," Journal of the Academy of Marketing Science, 31 (2), 127–45.
Hoffman, K. Douglas and John E. G. Bateson (2001), Essentials of Services Marketing: Concepts, Strategies and Cases, Fort Worth: South-Western College Pub.
Kim, Su Jung, Rebecca Jen-Hui Wang, and Edward C. Malthouse (2015), "The Effects of Adopting and Using a Brand's Mobile Application on Customers' Subsequent Purchase Behavior," Journal of Interactive Marketing, 31, 28–41.
Knox, George and Rutger van Oest (2014), "Customer Complaints and Recovery Effectiveness: A Customer Base Approach," Journal of Marketing, 78 (5), 42–57.
Liu, Yan and Venkatesh Shankar (2015), "The Dynamic Impact of Product-Harm Crises on Brand Preference and Advertising Effectiveness: An Empirical Analysis of the Automobile Industry," Management Science, 61 (10), 2514–35.
Ma, Liye, Baohong Sun, and Sunder Kekre (2015), "The Squeaky Wheel Gets the Grease—An Empirical Analysis of Customer Voice and Firm Intervention on Twitter," Marketing Science, 34 (5), 627–45.
McCollough, Michael A., Leonard L. Berry, and Manjit S. Yadav (2000), "An Empirical Investigation of Customer Satisfaction after Service Failure and Recovery," Journal of Service Research, 3 (2), 121–37.
Meuter, Matthew L., Amy L. Ostrom, Robert I. Roundtree, and Mary Jo Bitner (2000), "Self-Service Technologies: Understanding Customer Satisfaction with Technology-Based Service Encounters," Journal of Marketing, 64 (3), 50–64.
Narang, Unnati and Venkatesh Shankar (2019), "Mobile App Introduction and Online and Offline Purchases and Product Returns," Marketing Science, 38 (5), 756–72.
National Retail Federation (2018), "Top 100 Retailers 2018," NRF, (accessed November 5, 2020), [available at https://nrf.com/resources/top-retailers/top-100-retailers/top-100-retailers-2018].
Neumann, Nico, Catherine E. Tucker, and Timothy Whitfield (2019), "Frontiers: How Effective Is Third-Party Consumer Profiling? Evidence from Field Studies," Marketing Science, 38 (6), 918–26.
Oliver, Richard L. (1980), "A Cognitive Model of the Antecedents and Consequences of Satisfaction Decisions," Journal of Marketing Research, 17 (4), 460–69.
Pauwels, Koen and Scott A. Neslin (2015), "Building With Bricks and Mortar: The Revenue Impact of Opening Physical Stores in a Multichannel Environment," Journal of Retailing, 91 (2), 182–97.
Schmittlein, David C., Donald G. Morrison, and Richard Colombo (1987), "Counting Your Customers: Who Are They and What Will They Do Next?" Management Science, 33 (1), 1–24.
Shi, S., K. Kalyanam, and M. Wedel (2017), "What Does Agile and Lean Mean for Customers? An Analysis of Mobile App Crashes," Working Paper, Santa Clara University.
Smith, Amy K. and Ruth N. Bolton (1998), "An Experimental Investigation of Customer Reactions to Service Failure and Recovery Encounters: Paradox or Peril?" Journal of Service Research, 1 (1), 65–81.
Sriram, S., Pradeep K. Chintagunta, and Puneet Manchanda (2015), "Service Quality Variability and Termination Behavior," Management Science, 61 (11), 2739–59.

Tax, Stephen S., Stephen W. Brown, and Murali Chandrashekaran (1998), "Customer Evaluations of Service Complaint Experiences: Implications for Relationship Marketing," Journal of Marketing, 62 (2), 60–76.
Tucker, Catherine E. and Shuyi Yu (2019), "Does IT Lead to More Equal Treatment? An Empirical Study of the Effect of Smartphone Use on Customer Complaint Resolution," Working Paper, Massachusetts Institute of Technology.
Wager, Stefan and Susan Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests," Journal of the American Statistical Association, 113 (523), 1228–42.
Wang, Kitty and Avi Goldfarb (2017), "Can Offline Stores Drive Online Sales?" Journal of Marketing Research, 54 (5), 706–19.
Xu, Kaiquan, Jason Chan, Anindya Ghose, and Sang Pil Han (2016), "Battle of the Channels: The Impact of Tablets on Digital Commerce," Management Science, 63 (5), 1469–92.

WEB APPENDIX A CAUSAL FOREST

Causal Trees: Overview

A causal tree is similar to a regression tree. The typical objective of a regression tree is to build accurate predictions of the outcome variable by recursively splitting the data into subgroups that differ the most on the outcome variable based on covariates. A regression tree has decision (internal/split) nodes characterized by binary conditions on covariates and leaf (terminal) nodes at the bottom of the tree. The regression tree algorithm repeatedly partitions the data, evaluating at each node (a) whether further splits would improve prediction, and (b) the covariate, and the value of that covariate, on which to split. The goodness-of-fit criterion used to evaluate the splitting decision at each node is the mean squared error (MSE), computed from the deviations of the observed outcomes from the predicted outcomes. The tree algorithm continues making further splits as long as the MSE decreases by more than a specified threshold.

The causal tree model adapts the regression tree algorithm in several ways to make it amenable to causal inference. First, it bases the goodness-of-fit criterion on treatment effects rather than on the MSE of the outcome measure. Second, it employs "honest" estimates, that is, the data on which the tree is built (splitting data) are separate from the data on which it is tested for prediction of heterogeneity (estimating data). Thus, the tree is honest if, for a unit i in the training sample, it uses the response Yi either to estimate the within-leaf treatment effect or to decide where to place the splits, but not both (Athey and Imbens 2016; Athey et al. 2017). To avoid overfitting, we use cross-validation approaches in the tree-building stage.

Importantly, the goodness-of-fit criterion for causal trees is the difference between the estimated and the actual treatment effect at each node. While this criterion ensures that all the degrees of freedom are used well, it is challenging because we never observe the true effect.

Causal Tree: Goodness-of-fit Criterion

Following Wager and Athey (2018), suppose we have n independent and identically distributed training examples labeled $i = 1, \ldots, n$, each consisting of a feature vector $X_i \in [0,1]^d$, a response $Y_i \in \mathbb{R}$, and a treatment indicator $W_i \in \{0,1\}$. The CATE at $x$ is:

(2) $\tau(x) = E[\,Y_i(1) - Y_i(0) \mid X_i = x\,]$

We assume unconfoundedness, i.e., conditional on $X_i$, the treatment $W_i$ is independent of the potential outcomes of $Y_i$. Because the true treatment effect is not observed, we cannot directly compute the goodness-of-fit criterion for creating splits in a tree. This goodness-of-fit criterion is:

(3) $\text{MSE}_{\tau} = E\big[\,(\tau(X_i) - \hat{\tau}(X_i))^2\,\big]$

Because $\tau(X_i)$ is not observed, we follow Athey and Imbens's (2016) approach and create a transformed outcome $Y_i^*$ whose expectation equals the true treatment effect. Assume the treatment indicator $W_i$ is a random variable and that each unit $i$ has a 50% probability of being in the treated or the control group. An unbiased estimate of the treatment effect for that unit can then be obtained from its outcome $Y_i$ alone. Let

(4) $Y_i^* = 2Y_i$ if $W_i = 1$, and $Y_i^* = -2Y_i$ if $W_i = 0$.

It follows that:

(5) $E[Y_i^*] = 2 \cdot \tfrac{1}{2}\,E[Y_i(1)] - 2 \cdot \tfrac{1}{2}\,E[Y_i(0)] = E[Y_i(1) - Y_i(0)] = E[\tau_i]$

Therefore, we can compute the goodness-of-fit criterion for deciding node splits in a causal tree using the expectation of the transformed outcome (Athey and Imbens 2016). Once we generate causal trees, we can compute the treatment effect within each leaf because each leaf has a finite number of observations and standard asymptotics apply within it. The difference between the treated and control units' outcomes within each leaf produces the treatment effect in that leaf.
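To make the transformed-outcome idea concrete, the following minimal sketch fits an ordinary regression tree to $Y^*$ under a 50/50 randomized treatment, on synthetic data. Unlike the honest trees described above, this simple version uses the same data for splitting and estimation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Transformed-outcome sketch for equations (4)-(5), with synthetic data
# and a 50/50 randomized treatment W.
rng = np.random.default_rng(2)
n = 10000
X = rng.uniform(size=(n, 4))
W = rng.integers(0, 2, size=n)
tau = -2.0 + 3.0 * X[:, 0]                 # true heterogeneous effect
Y = X[:, 1] + tau * W + rng.normal(size=n)

Y_star = np.where(W == 1, 2 * Y, -2 * Y)   # equation (4)
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=500)
tree.fit(X, Y_star)                        # E[Y*|X] = tau(X), per eq. (5)
print(tree.predict(X[:5]))                 # leaf-level effect estimates
```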

Causal Forest Ensemble

In the final step, we create an ensemble of trees using ideas from model averaging and bagging. Specifically, we take predictions from thousands of trees and average over them (Guo et al. 2018). This step retains the unbiased, honest nature of tree-based estimates but reduces the variance. The forest averages the estimates from B trees in the following manner:

(6) $\hat{\tau}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{\tau}_b(x)$

Because monetary value of purchases is the key outcome variable of interest to the retailer, we estimate the individual-level treatment effect on value of purchases for each failure experiencer separately using the observed covariate data. These covariates include gender and loyalty program membership in addition to the three theoretically driven moderators, namely, value of past purchases, recency of past purchases, and online buying/digital experience.11 These individual attributes are important for identifying individual-level effects and for developing targeting approaches (e.g., Neumann et al. 2019). We use a random sample of two-thirds of our data as training data and the remaining one-third as test data for predicting CATE. We use half of the training data to maintain honest estimates and for cross-validation to avoid overfitting. The results appear in Tables A1 and A2.

Table A1
CAUSAL FOREST RESULTS: SUMMARY OF INDIVIDUAL SHOPPER TREATMENT EFFECT FOR VALUE OF PURCHASES

Group          N(test)   Mean     SD
τ̂ (all)        45,563   -1.660   1.136
τ̂ | τ̂ < 0      43,748   -1.739   1.089
τ̂ | τ̂ > 0       1,815     .239    .198

Note: τ̂ represents the estimated Conditional Average Treatment Effect (CATE) for each individual in the test data.

11 Age and zip code information were not available for all the shoppers in our data period because the retailer followed strict privacy guidelines.

Table A2
RESULTS OF CAUSAL FOREST POST-HOC CATE REGRESSION FOR VALUE OF PURCHASES

Variable                         Coefficient (Standard Error)
Intercept                        -.958*** (.012)
Past value of purchases           .000*** (.000)
Recency of purchases             -.005*** (.000)
Past online purchase frequency    .037*** (.002)
Gender (female)                  -.190*** (.008)
Loyalty program                  -.340*** (.011)
R squared                         .493

Note: *** p < .001. N = 45,563.

Figure A1 CAUSAL FOREST RESULTS: INDIVIDUAL CATE

Figure A2 CAUSAL FOREST RESULTS: QUINTILES BY CATE

Note: Segment 1 represents shoppers most adversely affected by failure while Segment 5 represents those who are least adversely affected.


WEB APPENDIX B ROBUSTNESS CHECK FOR TABLE 3 (MAIN TREATMENT EFFECT) RESULTS

In this section, we present the results for robustness checks for the main estimation in Table 3 relating to: (a) alternative models with covariates and using Poisson model (Tables B1-B2), (b) outliers (Table B3), (c) existing shoppers (Table B4), (d) alternative measures for prior use of digital channels (Table B5), and (e) regression-discontinuity style analysis (Table B6).

Table B1
ROBUSTNESS OF TABLE 3 RESULTS TO INCLUSION OF COVARIATES ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.025* (.011)            -.063* (.027)           -2.092** (.763)
Failure experiencer                      -.018* (.008)            -.024 (.019)            -.624 (.539)
Post shock                                .180*** (.008)           .238*** (.019)          27.182*** (.549)
Gender                                   -.050*** (.011)          -.112*** (.028)         -3.367*** (.809)
Loyalty program                          -.171*** (.006)          -.416*** (.015)         -8.733*** (.415)
Intercept                                 .813*** (.006)           1.678*** (.014)         33.900*** (.395)
R squared                                 .0114                    .0081                   .0221
Mean Y                                    .82                      1.61                    43.31

Notes: N = 273,378. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

Table B2
DID POISSON MODEL RESULTS ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases
Failure experiencer x Post shock (DID)   -.0209* (.0124)          -.0309*** (.0158)
Failure experiencer                      -.0282*** (.0089)        -.0197*** (.0117)
Post shock                                .2133*** (.0089)         .1440*** (.0111)
Intercept                                -.2872*** (.0063)         .4209*** (.0081)
Log pseudo-likelihood                    -378,710                 -711,963
Mean Y                                    .82                      1.61

Notes: Robust standard errors in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.


Table B3
ROBUSTNESS OF TABLE 3 RESULTS TO OUTLIER SPENDERS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.023* (.010)            -.055* (.025)           -2.139** (.723)
Failure experiencer                      -.022** (.007)           -.031 (.018)            -.742 (.511)
Post shock                                .184*** (.007)           .256*** (.018)          27.439*** (.519)
Intercept                                 .739*** (.005)           1.489*** (.013)         29.907*** (.367)
R squared                                 .004                     .001                    .012
Mean Y                                    .81                      1.59                    42.69

Notes: N = 272,706. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

Table B4
ROBUSTNESS OF TABLE 3 RESULTS TO EXISTING SHOPPERS ACROSS CHANNELS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.025* (.01)             -.061* (.026)           -2.283** (.744)
Failure experiencer                      -.021** (.007)           -.029 (.018)            -.684 (.526)
Post shock                                .178*** (.007)           .233*** (.019)          27.191*** (.534)
Intercept                                 .766*** (.005)           1.556*** (.013)         31.414*** (.378)
R squared                                 .0039                    .0010                   .0181
Mean Y                                    .84                      1.64                    44.07

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.


Table B5
ROBUSTNESS OF TABLE 3 RESULTS TO ALTERNATIVE MEASURES OF DIGITAL CHANNEL USE BASED ON APP USE FREQUENCY BEFORE FAILURE

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.244*** (.024)          -.467*** (.064)         -22.284*** (1.837)
DID x Value of past purchases             .000*** (.000)           .000*** (.000)           .019*** (.001)
DID x Recency of purchases               -.003*** (.000)          -.007*** (.001)          -.137*** (.015)
DID x Past app use frequency             -.004 (.006)             -.034* (.017)             1.682*** (.476)
Value of past purchases                   .000*** (.000)           .001*** (.000)           .021*** (.000)
Recency of purchases                      .008*** (.000)           .016*** (.000)           .347*** (.007)
Past app use frequency                    .041*** (.001)           .080*** (.001)           1.535*** (.038)
Failure experiencer                       .088*** (.010)           .226*** (.027)           3.272*** (.788)
Post shock                                .182*** (.010)           .214*** (.027)           35.054*** (.766)
Intercept                                 .616*** (.010)           1.059*** (.026)          22.264*** (.744)
R squared                                 .2019                    .1508                    .1072
Mean Y                                    .84                      1.64                     44.07

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; each moderator interacts with the difference-in-differences (DID) term failure experiencer x post shock; *** p < .001, * p < .05. The observations include those of shoppers with at least one purchase in the past for computing recency.

Table B6
ROBUSTNESS OF TABLE 3 RESULTS TO REGRESSION DISCONTINUITY STYLE ANALYSIS

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.045** (.016)           -.09* (.04)             -3.169** (1.167)
Failure experiencer                      -.04** (.012)            -.071* (.029)           -1.261 (.825)
Post shock                                .178*** (.015)           .231*** (.037)          26.385*** (1.075)
Intercept                                 .759*** (.011)           1.538*** (.026)         31.287*** (.760)
R squared                                 .0032                    .0008                   .0160
Mean Y                                    .80                      1.56                    42.07

Notes: N = 198,432. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

WEB APPENDIX C ROBUSTNESS CHECK FOR TABLE 4 (BY CHANNEL) RESULTS

In this section, we present the results for robustness checks for the cross-channel estimation in Table 4 relating to (a) alternative models with covariates and using Poisson model (Tables C1- C2), (b) outliers (Table C3), (c) existing shoppers (Table C4), (d) alternative measures for prior use of digital channels (Table C5), and (e) regression-discontinuity style analysis (Table C6).

Table C1
ROBUSTNESS OF TABLE 4 RESULTS TO INCLUSION OF COVARIATES BY CHANNEL

Offline
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.023* (.010)            -.059* (.026)           -1.967** (.739)
Failure experiencer                      -.015* (.007)            -.019 (.018)            -.431 (.523)
Post shock                                .171*** (.007)           .222*** (.019)          25.462*** (.532)
Gender                                   -.05*** (.011)           -.109*** (.028)         -3.304*** (.784)
Loyalty program                          -.165*** (.006)          -.404*** (.014)         -8.306*** (.403)
Intercept                                 .775*** (.005)           1.620*** (.014)         32.260*** (.382)
R squared                                 .0112                    .0079                   .0208
Mean Y                                    .78                      1.56                    41.08

Online
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.002 (.002)             -.003 (.004)            -.125 (.165)
Failure experiencer                      -.003* (.001)            -.006* (.003)           -.194 (.117)
Post shock                                .009*** (.001)           .016*** (.003)          1.720*** (.119)
Gender                                   -.001 (.002)             -.002 (.004)            -.063 (.175)
Loyalty program                          -.006*** (.001)          -.012*** (.002)         -.427*** (.090)
Intercept                                 .038*** (.001)           .058*** (.002)          1.640*** (.085)
R squared                                 .0006                    .0005                   .0018
Mean Y                                    .04                      .06                     2.23

Notes: N = 273,378. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

Table C2
DID POISSON MODEL RESULTS BY CHANNEL

                                         Offline                                   Online
Variable                                 Frequency           Quantity              Frequency           Quantity
Failure experiencer x Post shock (DID)   -.0209* (.0127)     -.0311*** (.016)      -.019 (.0482)       -.0184 (.0643)
Failure experiencer                      -.0253*** (.009)    -.0169*** (.0119)     -.0885** (.0358)    -.0995*** (.0469)
Post shock                                .2133*** (.0091)    .1401*** (.0113)      .213*** (.0337)     .2453*** (.046)
Intercept                                -.3363*** (.0065)    .3851*** (.0083)     -3.3261*** (.025)   -2.9257*** (.033)
Log pseudo-likelihood                    -368,831            -698,829              -46,197             -69,852
Mean Y                                    .78                 1.56                  .04                 .06

Notes: Robust standard errors in parentheses; *** p < .001, ** p < .01, * p < .05. N = 273,378. DID = Difference-in-Differences.


Table C3
ROBUSTNESS OF TABLE 4 RESULTS TO OUTLIER SPENDERS

Offline
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.021* (.010)            -.053* (.024)           -2.068** (.701)
Failure experiencer                      -.018** (.007)           -.026 (.017)            -.571 (.496)
Post shock                                .175*** (.007)           .24*** (.017)           25.727*** (.504)
Intercept                                 .704*** (.005)           1.437*** (.012)         28.479*** (.356)
R squared                                 .0042                    .0012                   .0180
Mean Y                                    .78                      1.53                    40.51

Online
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.001 (.002)             -.002 (.004)            -.071 (.156)
Failure experiencer                      -.003* (.001)            -.006* (.003)           -.171 (.110)
Post shock                                .009*** (.001)           .016*** (.003)          1.712*** (.112)
Intercept                                 .035*** (.001)           .052*** (.002)          1.428*** (.079)
R squared                                 .0000                    .0000                   .0020
Mean Y                                    .04                      .06                     2.18

Notes: N = 272,706. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

Table C4
ROBUSTNESS OF TABLE 4 RESULTS TO EXISTING SHOPPERS BY CHANNEL

Offline
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.024* (.010)            -.059* (.025)           -2.166** (.721)
Failure experiencer                      -.018* (.007)            -.024 (.018)            -.515 (.510)
Post shock                                .170*** (.007)           .218*** (.018)          25.499*** (.518)
Intercept                                 .730*** (.005)           1.501*** (.013)         29.882*** (.366)
R squared                                 .0037                    .0009                   .0169
Mean Y                                    .80                      1.58                    41.81

Online
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.002 (.002)             -.003 (.004)            -.117 (.161)
Failure experiencer                      -.003* (.001)            -.005 (.003)            -.169 (.114)
Post shock                                .008*** (.001)           .015*** (.003)          1.693*** (.115)
Intercept                                 .037*** (.001)           .055*** (.002)          1.532*** (.082)
R squared                                 .0003                    .0002                   .0016
Mean Y                                    .04                      .06                     2.26

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.


Table C5
ROBUSTNESS OF TABLE 4 RESULTS TO ALTERNATIVE MEASURE OF DIGITAL CHANNEL USE BASED ON APP USAGE FREQUENCY BEFORE FAILURE

Offline
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.216*** (.024)          -.415*** (.063)         -19.428*** (1.784)
DID x Value of past purchases             .000*** (.000)           .000*** (.000)           .017*** (.001)
DID x Recency of purchases               -.003*** (.000)          -.006*** (.001)          -.118*** (.015)
DID x Past app use frequency             -.008 (.006)             -.04* (.016)              1.146* (.462)
Value of past purchases                   .000*** (.000)           .001*** (.000)           .021*** (.000)
Recency of purchases                      .008*** (.000)           .016*** (.000)           .335*** (.007)
Past app use frequency                    .037*** (.000)           .073*** (.001)           1.358*** (.037)
Failure experiencer                       .087*** (.010)           .222*** (.027)           3.240*** (.765)
Post shock                                .176*** (.010)           .202*** (.026)           32.959*** (.743)
Intercept                                 .583*** (.010)           1.028*** (.025)          21.406*** (.723)
R squared                                 .1944                    .1450                    .1026
Mean Y                                    .80                      1.58                     41.81

Online
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.029*** (.005)          -.052*** (.010)         -2.856*** (.425)
DID x Value of past purchases             .000*** (.000)           .000*** (.000)           .001*** (.000)
DID x Recency of purchases                .000*** (.000)           .000*** (.000)          -.019*** (.004)
DID x Past app use frequency              .005*** (.001)           .007* (.003)             .536*** (.110)
Value of past purchases                   .000*** (.000)           .000*** (.000)           .000*** (.000)
Recency of purchases                      .000*** (.000)           .001*** (.000)           .012*** (.002)
Past app use frequency                    .003*** (.000)           .007*** (.000)           .177*** (.009)
Failure experiencer                       .001 (.002)              .004 (.004)              .032 (.182)
Post shock                                .006** (.002)            .011* (.004)             2.096*** (.177)
Intercept                                 .034*** (.002)           .031*** (.004)           .859*** (.172)
R squared                                 .0151                    .0146                    .0080
Mean Y                                    .04                      .06                      2.26

Notes: N = 267,534. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences. The observations include those of shoppers with at least one purchase in the past for computing recency.

Table C6
ROBUSTNESS OF TABLE 4 RESULTS TO REGRESSION DISCONTINUITY STYLE ANALYSIS

Offline
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.044** (.016)           -.087* (.040)           -3.105** (1.131)
Failure experiencer                      -.034** (.011)           -.062* (.028)           -1.017 (.800)
Post shock                                .172*** (.015)           .218*** (.036)          24.835*** (1.042)
Intercept                                 .720*** (.010)           1.482*** (.026)         29.704*** (.736)
R squared                                 .0031                    .0007                   .0150
Mean Y                                    .76                      1.51                    39.94

Online
Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.001 (.003)             -.003 (.006)            -.064 (.254)
Failure experiencer                      -.006** (.002)           -.009* (.004)           -.244 (.179)
Post shock                                .006* (.003)             .013* (.005)            1.550*** (.234)
Intercept                                 .039*** (.002)           .056*** (.004)          1.583*** (.165)
R squared                                 .0002                    .0002                   .0014
Mean Y                                    .04                      .05                     2.12

Notes: N = 198,432. Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. DID = Difference-in-Differences.

WEB APPENDIX D OTHER ROBUSTNESS CHECKS

Table D1
EFFECTS OF APP FAILURE ON AVERAGE VALUE OF PURCHASES EACH WEEK

Variable           Estimate (Standard error)
Treat x Week -4    -.09 (.28)
Treat x Week -3    -.58* (.25)
Treat x Week -2    -.45 (.24)
Treat x Week -1    -.37 (.25)
Treat x Week 0     -.82* (.37)
Treat x Week 1     -1.70** (.56)
Treat x Week 2     -.58¹ (.31)
Treat x Week 3     -.53¹ (.31)
Treat x Week 4     -.64* (.29)
Intercept          13.19*** (.09)
Mean Y             15.92

Notes: Robust standard errors clustered by shoppers are in parentheses; week and individual fixed effects are included. *** p < .001, ** p < .01, * p < .05, ¹ p < .1. N = 1,366,890. Week 5 in the pre-failure period (Week -5) is the base week.

Table D2
RESULTS OF DID MODEL WITH STACKED ONLINE AND OFFLINE PURCHASES AND CHANNEL DUMMIES

Variable                                 Frequency of purchases   Quantity of purchases   Value of purchases
Failure experiencer x Post shock (DID)   -.001 (.002)             -.003 (.003)            -.093 (.154)
DID x Channel dummy                      -.021** (.008)           -.052** (.019)          -1.994** (.674)
Failure experiencer                      -.003* (.001)            -.005* (.002)           -.167** (.064)
Post shock                                .009*** (.001)           .015*** (.002)          1.672*** (.113)
Channel dummy                             .678*** (.005)           1.416*** (.012)         27.755*** (.217)
Failure experiencer x Channel dummy      -.015* (.006)            -.020 (.017)            -.360 (.299)
Post shock x Channel dummy                .161*** (.006)           .206*** (.014)          23.602*** (.493)
Intercept                                 .036*** (.001)           .054*** (.002)          1.500*** (.048)
R squared                                 .1398                    .0949                   .0911
Mean Y                                    .82                      1.61                    43.31

Notes: Robust standard errors clustered by shoppers are in parentheses; *** p < .001, ** p < .01, * p < .05. N = 546,756. DID = Difference-in-Differences. Channel dummy is 1 for offline purchases and 0 for online purchases.

Figure D1. PRE-PERIOD PURCHASE TRENDS FOR FAILURE EXPERIENCERS AND NON-EXPERIENCERS

(a) Past Frequency of Purchases

(b) Past Quantity of Purchases

(c) Past Proportion of Online Purchases

Note: The unit of the X axis is the number of days before the failure event.