Choice Set Confounding in Discrete Choice

Kiran Tomlinson (Cornell University) [email protected] · Johan Ugander (Stanford University) [email protected] · Austin R. Benson (Cornell University) [email protected]

ABSTRACT
Standard methods in preference learning involve estimating the parameters of discrete choice models from data of selections (choices) made by individuals from a discrete set of alternatives (the choice set). While there are many models for individual preferences, existing learning methods overlook how choice set assignment affects the data. Often, the choice set itself is influenced by an individual's preferences; for instance, a consumer choosing a product from an online retailer is often presented with options from a recommender system that depend on information about the consumer's preferences. Ignoring these assignment mechanisms can mislead choice models into making biased estimates of preferences, a phenomenon that we call choice set confounding. We demonstrate the presence of such confounding in widely-used choice datasets.

To address this issue, we adapt methods from causal inference to the discrete choice setting. We use covariates of the chooser for inverse probability weighting and/or regression controls, accurately recovering individual preferences in the presence of choice set confounding under certain assumptions. When such covariates are unavailable or inadequate, we develop methods that take advantage of structured choice set assignment to improve prediction. We demonstrate the effectiveness of our methods on real-world choice data, showing, for example, that accounting for choice set confounding makes choices observed in hotel booking and commute transportation more consistent with rational utility maximization.

CCS CONCEPTS
• Mathematics of computing → Probabilistic inference problems; • Information systems → Recommender systems.

KEYWORDS
discrete choice; causal inference; preference learning

ACM Reference Format:
Kiran Tomlinson, Johan Ugander, and Austin R. Benson. 2021. Choice Set Confounding in Discrete Choice. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3447548.3467378

arXiv:2105.07959v2 [cs.LG] 17 Aug 2021

KDD '21, August 14–18, 2021, Virtual Event, Singapore
2021. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore, https://doi.org/10.1145/3447548.3467378.

1 INTRODUCTION
Individual choices drive the success of businesses and public policy, so predicting and understanding them has far-reaching applications in, e.g., environmental policy [12], marketing [3], Web search [22], and recommender systems [57]. The central task of discrete choice analysis is to learn individual preferences over a set of available items (the choice set), given observations of people's choices. In recent years, machine learning approaches have enabled more accurate choice modeling and prediction [11, 38, 43, 50]. However, observational choice data analysis has thus far overlooked a crucial fact: the choice set assignment mechanism underlying a dataset can have a significant impact on the generalization of learned choice models, in particular their validity on counterfactuals.

Understanding how new choice sets affect preferences in such counterfactuals is key to many applications, such as determining which alternative-fuel vehicles to subsidize or which movies to recommend. In particular, chooser-dependent choice set assignment coupled with heterogeneous preferences can severely mislead choice models, as they do not model the influence of preferences on choice set assignment. Recommender systems are one extreme case, where items are selected specifically to appeal to a user. Such situations also arise in transportation decisions, online shopping, and personalized Web search, resulting in widespread (but often invisible) error in choice models learned from this data.

Drawing on connections with causal inference [24], we term the issue of chooser-dependent choice set assignment choice set confounding. Choice set confounding is a major issue for recent machine learning methods whose success is due to capturing deviations from the traditional principles of rational utility maximization that underlie the workhorse multinomial logit model [31]. (Unlike older econometric models of "irrational" behavior [52, 56], these recent methods are practical for modern, large-scale datasets.) These deviations are known as context effects, and occur whenever the choice set has an influence on a chooser's preferences. Examples include the asymmetric dominance effect [21], where superior options are made to look even better by including inferior alternatives, and the compromise effect [45], where intermediate options are preferred (e.g., choosing a medium-priced bottle of wine). While context effects are widespread and worth capturing, choice set confounding can result in spurious effects and over-fitting, and it is unclear whether recent machine learning models are learning true effects or simply being misled by chooser-dependent choice set assignment.

In this paper, we formalize when choice set confounding is an issue and show that it can result in arbitrary systems of choice probabilities, even if choosers are rational utility-maximizers (in contrast, tractable choice models only describe a tiny fraction of possible choice systems). We also provide strong evidence of choice set confounding in two transportation datasets commonly used to demonstrate the presence of context effects and to test new models [8, 26, 36, 43]. Then, to manage choice set confounding, we first adapt two causal inference methods—inverse probability weighting (IPW) and regression controls—to train choice models in the presence of confounding. These methods require chooser covariates satisfying certain assumptions that differ from the traditional causal inference setting. For instance, given access to the same covariates used by a recommender system to construct choice sets, we can reweight the dataset to learn a choice model as if choice sets had been user-independent. Alternatively, we can incorporate covariates into the choice model itself, recovering individual preferences as long as those covariates capture preference heterogeneity.

We also show how to manage choice set confounding without such covariates, as many observational datasets have little information about the individuals making choices. We demonstrate a link between models accounting for context effects and models for choice systems induced by choice set confounding. For example, we derive the context-dependent random utility model (CDM) [44] from the perspective of choice set confounding, by treating the choice set as a vector of substitute covariates (e.g., "someone who is offered item i") in a multinomial logit model. We develop spectral clustering methods typically used for co-clustering [14] that exploit choice set assignment as a signal for chooser preferences, as a way to improve counterfactual predictions for observed choosers. To show why and when this can work, we frame the problem of finding sufficient chooser covariates as a problem of recovering latent cluster membership in a stochastic block model (SBM) of the bipartite graph that connects choosers to the items in their choice sets.

In addition to theoretical analysis, we demonstrate the efficacy of our methods on real-world choice data. We provide evidence that IPW reduces confounding when modeling hotel booking data, making the choice system more consistent with utility maximization and making inferred parameters more plausible. For example, the confounded data overweights the importance of price, since many users are shown hotels matching their preferences and select the cheapest one. Factors such as star rating would play a more important role in counterfactuals. We also evaluate our clustering approach on online shopping data. By training separate models for different chooser clusters, we outperform a mixture model that attempts to discover preference heterogeneity from choices alone, ignoring the signal from choice set assignment.

All of our code, results, and links to our data are available at https://github.com/tomlinsonk/choice-set-confounding.

1.1 Additional related work
This research is inspired by recent computational advances in learning context-dependent preferences [11, 35, 38, 43, 50]. These methods exhibit strong gains by exploiting context effects but are often evaluated on data with possible choice set confounding. Similar confounding issues are well-studied in rating and ranking data within recommender systems [30, 42, 53, 54], but those approaches do not directly apply to choice data. The causal inference ideas that we develop are based on long-standing methods [23, 24]; the challenge we address is how to adapt them for discrete choice data.

The role of choice set assignment does occasionally appear in the choice literature. For instance, Manski used choice set assignment probabilities to derive random utility models [28]. More often, traditional choice theory has focused on latent consideration sets, which are subsets of alternatives that are actually considered by choosers [6, 10] and where non-uniform choice set probabilities play a key role. In another setting, Manski and Lerman [29] used an approach similar to our inverse probability weighting. They were concerned with "choice-based samples," where we first sample an item and then get an observation of a chooser who selected that item (usually, we sample a chooser and then observe their choice) [29].

The use of regression controls in discrete choice (i.e., including chooser covariates in the utility function) is standard in econometrics [9, 47, 51]. However, in these settings, regression aims to understand how the attributes of an individual affect decision-making, which can incidentally help address confounding. This may explain why choice set confounding has not been widely recognized (additionally, in an interview, Manski discusses that choice set generation has been under-explored [48]). We formalize when and how regression adjusts for choice set confounding.

2 DISCRETE CHOICE BACKGROUND
We start with some notation and the basics of discrete choice models. Let U denote a universe of n items and A a population of individuals. In a discrete choice setting, a chooser a ∈ A is presented a nonempty choice set C ⊆ U and they choose one item i ∈ C. Specifically, a is sampled with probability Pr(a), then C is presented to a with probability Pr(C | a), and finally a selects i with probability Pr(i | a, C). Most discrete choice analysis focuses only on Pr(i | a, C) or Pr(i | C), but we consider this entire process. A discrete choice dataset D is a collection of tuples (C, i) generated by this process. We use C_D to denote the set of unique choice sets in D.

Discrete choice models posit a parametric form for choice probabilities, with parameters learned from data. The universal logit [33] can express any system of choice probabilities (called a choice system). Under a universal logit, each chooser a has a scalar utility u_i(C, a) for item i in choice set C. Choice probabilities are then a softmax of these utilities: Pr(i | a, C) = exp(u_i(C, a)) / Σ_{j ∈ C} exp(u_j(C, a)). This arises from a notion of rational utility maximization [51]. Specifically, these are the choice probabilities if a observes random utilities u_i(C, a) + ε (where the ε are i.i.d. Gumbel-distributed for each item and choice) and selects the item with maximum observed utility. The above model has too many degrees of freedom to be practical (e.g., it has entirely separate parameters for every chooser a), and typically one assumes utilities are fixed across sets and individuals. This is the logit model [31], where u_i(C, a) = u_i for all C and a.

Other discrete choice models come from different assumptions on u_i(C, a), trading off descriptive power for ease of inference and interpretation. For example, we may have access to a vector of covariates x_a ∈ R^(d_x) for person a. Similarly, an item i may be described by a vector of features y_i ∈ R^(d_y). We can write u_i(C, a) as a function of x_a, y_i, or both, yielding several choice models (Table 1)—the multinomial logit (MNL), conditional logit (CL), and conditional multinomial logit (CML).¹ All of these models obey a common assumption, the independence of irrelevant alternatives (IIA) [51]. IIA states that relative choice probabilities are conserved across choice sets: Pr(i | a, C) / Pr(j | a, C) = Pr(i | a, C′) / Pr(j | a, C′). To be precise, this is individual-level rather than group-level IIA.

¹ "MNL" sometimes refers to logit and conditional logit. Here, we follow the convention [20] that "multinomial" means chooser covariates are used and "conditional" means item features are used. Additionally, for CML, we assume γ_i = B^T y_i, which reduces the number of parameters from d_y + n·d_x to d_y(d_x + 1), allowing us to use the model when the number of items is prohibitively large.
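The softmax form above is simple to state in code. The following is a minimal sketch (standard library only; the item features, the mode names, and θ are made-up numbers, not learned parameters), showing a conditional-logit-style model and the IIA property it inherits from set-independent utilities:

```python
import math

def softmax_choice_probs(utilities):
    """Pr(i | C) = exp(u_i) / sum_{j in C} exp(u_j): softmax over a choice set."""
    exps = [math.exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_logit_utility(theta, y_i):
    """Conditional logit utility u_i(C, a) = y_i^T theta: no dependence on C or a."""
    return sum(t * y for t, y in zip(theta, y_i))

# Hypothetical 2-d item features and a hypothetical fitted theta.
theta = [1.0, -0.5]
items = {"bus": [0.2, 0.1], "car": [0.5, 0.4], "walk": [0.9, 0.3]}

def choice_probs(choice_set):
    utils = [conditional_logit_utility(theta, items[i]) for i in choice_set]
    return dict(zip(choice_set, softmax_choice_probs(utils)))

full = choice_probs(["bus", "car", "walk"])
pair = choice_probs(["bus", "car"])
# IIA: the bus/car probability ratio is identical in both choice sets.
```

Because utilities ignore C, dropping "walk" rescales the remaining probabilities without changing their ratio; set-dependent utilities such as those of the CDM or LCL break exactly this property.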

Table 1: Discrete choice models. The item and chooser feature vectors y_i and x_a are part of the dataset, while u_i ∈ R, θ ∈ R^(d_y), γ_i ∈ R^(d_x), and B ∈ R^(d_y × d_x) are learned parameters.

  Model                      u_i(C, a)                 # Parameters
  logit                      u_i                       n
  multinomial logit (MNL)    u_i + x_a^T γ_i           n(d_x + 1)
  conditional logit (CL)     y_i^T θ                   d_y
  cond. mult. logit (CML)    y_i^T (θ + B x_a)         d_y(d_x + 1)

Table 2: Context effect models. p_ij ∈ R, γ_i ∈ R^(d_x), θ ∈ R^(d_y), A ∈ R^(d_y × d_y), and B ∈ R^(d_y × d_x) are learned parameters.

  Model              u_i(C, a)                             # Parameters
  CDM [43]           Σ_{j ∈ C\i} p_ij                      n(n − 1)
  mult. CDM (MCDM)   Σ_{j ∈ C\i} p_ij + x_a^T γ_i          n(n + d_x)
  LCL [50]           y_i^T (θ + A ȳ_C)                     d_y(d_y + 1)
  mult. LCL (MLCL)   y_i^T (θ + A ȳ_C + B x_a)             d_y(d_y + d_x + 1)

Among the models in Table 1, the latter (group-level IIA) is only obeyed by the logit and the conditional logit. In general, models obey individual-level IIA if utility is independent of C, i.e., u_i(C, a) = u_i(a), and obey group-level IIA if u_i(C, a) is independent of both C and a.

While the IIA assumption is convenient, it is commonly violated through context effects [8, 21, 46]. Due to the ubiquity of context effects, models incorporating information from the choice set have become increasingly popular and have shown considerable success [11, 38, 43, 50]. Other models allow IIA violations without explicitly modeling effects of the choice set [8, 32, 36].

We briefly introduce two of these context effect models, the context-dependent random utility model (CDM) [43] and the linear context logit (LCL) [50], which we use extensively. In the CDM, each item in the choice set exerts a pull on the utility of every other item: u_i(C, a) = Σ_{j ∈ C\i} p_ij. The CDM can be derived as a second-order approximation to the universal logit (where the plain logit is the first-order approximation) [43]. The LCL instead operates in settings with item features, adjusting the conditional logit parameter θ according to a linear transformation of the choice set's mean feature vector: u_i(C, a) = y_i^T (θ + A ȳ_C), where ȳ_C = (1/|C|) Σ_{j ∈ C} y_j. To incorporate chooser covariates, we define multinomial versions of these models (Table 2). For this paper, the LCL and CDM should be thought of as the simplest context effect models with and without item features.

In contrast, mixed logit [32] accounts for group-level rather than individual-level IIA violations and has a different structure than any of the models introduced thus far. A (discrete) mixed logit is a mixture of K logits with mixing proportions π_1, ..., π_K such that Σ_{k=1}^K π_k = 1. With u_i(a_k) denoting the utility of the kth component for item i, a mixed logit has choice probabilities

  Pr(i | C) = Σ_{k=1}^K π_k · exp(u_i(a_k)) / Σ_{j ∈ C} exp(u_j(a_k)).   (1)

This can result in a choice system violating IIA, but not because any individual chooser experiences context effects. Rather, the aggregation of several choosers, each obeying IIA, can result in IIA violations.

3 CHOICE SET CONFOUNDING
The traditional approach to choice modeling is to learn a single model for Pr(i | C) (such as a logit) and assume it represents overall choice behavior, namely, that the model accurately reflects average choice probabilities E_a[Pr(i | a, C)]. However, Pr(i | C) need not represent average choice behavior at all, as this is only guaranteed under restrictive independence assumptions.

Observation 1. If, for all a ∈ A, C ∈ C_D, i ∈ C, at least one of
(1) Pr(C) = Pr(C | a) (chooser-independent choice sets) or
(2) Pr(i | a, C) = Pr(i | C) (chooser-independent preferences)
holds, then Pr(i | C) = E_a[Pr(i | a, C)]. If both conditions are violated, then this equality can fail.

Appendix A has a proof of this fact and of other theoretical statements presented later. When we have both chooser-dependent sets and preferences, observed choice probabilities Pr(i | C) can differ significantly from true aggregate choice probabilities E_a[Pr(i | a, C)]. We call this phenomenon choice set confounding and provide the following toy example as an illustration.

Example 1. Let U = {cat, dog, fish}. Choosers are either cat people or dog people choosing a pet, with choice probabilities

               {cat, dog}    {cat, dog, fish}
  cat person   3/4, 1/4      3/4, 1/4, 0
  dog person   1/4, 3/4      1/4, 3/4, 0

Note that the preferences of cat and dog people do not change when fish are included in the choice set. Choice sets are assigned non-independently: cat people see {cat, dog} w.p. 3/4 and {cat, dog, fish} w.p. 1/4 (vice-versa for dog people). Let the population consist of 1/4 cat people and 3/4 dog people. If we only observe samples (C, i) without knowing who is a cat person and who is a dog person, then

  Pr(dog | {cat, dog}) = 1/2 · 1/4 + 1/2 · 3/4 = 1/2
  Pr(dog | {cat, dog, fish}) = 1/10 · 1/4 + 9/10 · 3/4 = 7/10.

However,

  E_a[Pr(dog | a, {cat, dog})] = 1/4 · 1/4 + 3/4 · 3/4 = 5/8
  E_a[Pr(dog | a, {cat, dog, fish})] = 1/4 · 1/4 + 3/4 · 3/4 = 5/8.

This mismatch is especially problematic for models that use choice-set-dependent utilities u_i(C), such as those designed to account for context effects. From the above data, we might conclude that the presence of a fish causes a dog to become a more appealing option. This spurious context effect would be seized upon by context-based models and would even result in improved predictive performance on test data drawn from the same distribution. However, these models would make biased predictions on counterfactual examples where sets are chosen from a different distribution. In reality, no one's choice would be affected by adding fish to their choice set—it's a red herring.

This is a causal inference problem. We want to know the cause of a choice, but we are being misled as to whether the change in preferences between the {cat, dog} and {cat, dog, fish} choice sets is due to the presence of fish or to a hidden confounder: the underlying preferences of cat and dog people, coupled with chooser-dependent choice set assignment.

Extending this idea, the equality in Observation 1 can fail dramatically. If the population consists of individuals each of whom obeys
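The arithmetic in Example 1 can be checked mechanically. Below is a small sketch with exact rational arithmetic; the dictionary encoding is our own convention, but the probabilities are exactly those of the example:

```python
from fractions import Fraction as F

pop = {"cat person": F(1, 4), "dog person": F(3, 4)}            # Pr(a)
sets = {                                                        # Pr(C | a)
    "cat person": {"{cat, dog}": F(3, 4), "{cat, dog, fish}": F(1, 4)},
    "dog person": {"{cat, dog}": F(1, 4), "{cat, dog, fish}": F(3, 4)},
}
pr_dog = {"cat person": F(1, 4), "dog person": F(3, 4)}         # Pr(dog | a, C), same for both sets

def observed(C):
    """Confounded Pr(dog | C): averages over Pr(a | C), which depends on C."""
    pr_C = sum(pop[a] * sets[a][C] for a in pop)
    return sum(pop[a] * sets[a][C] / pr_C * pr_dog[a] for a in pop)

def aggregate():
    """True E_a[Pr(dog | a, C)]: independent of how sets were assigned."""
    return sum(pop[a] * pr_dog[a] for a in pop)
```

Here `observed` returns 1/2 and 7/10 for the two sets, while the true aggregate stays at 5/8 for both: the apparent boost from adding fish is pure confounding.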

IIA (i.e., chooses according to a logit), then E_a[Pr(i | a, C)] is exactly the mixed logit choice probability. On the other hand, Pr(i | C) can express an arbitrary choice system with choice set confounding.

Theorem 2. Mixed logit with chooser-dependent choice sets is powerful enough to express any system of choice probabilities.

Arbitrary choice systems are much more powerful than mixed logits (even ones with continuous mixtures). For example, it is impossible for a mixed logit to violate regularity, the condition that Pr(i | C) ≥ Pr(i | C ∪ {j}) for all C ⊆ U, i ∈ C, j ∈ U, as choice probabilities for i can only go down in each mixture component when we include j. On the other hand, even Example 1 has a regularity violation (picking a dog is more likely when a fish is available), despite there being only two types of choosers, both adhering to IIA.

We have shown that choice set confounding is an issue in theory, and we now demonstrate it to be a problem in practice. We present evidence of choice set confounding in two transportation choice datasets, sf-work and sf-shop [26]. These datasets consist of San Francisco (SF) resident surveys for preferred transportation mode to work or shopping, where the choice set is the set of modes available to a respondent.

Table 3: Likelihood gains in sf-work, sf-shop, and expedia from covariates and context, with likelihood ratio test (LRT) p-values. Δℓ denotes improvement in log-likelihood.

  Comparison      Testing      Controlling    Δℓ      LRT p
  sf-work
  Logit to MNL    covariates   —              883     < 10^-10
  Logit to CDM    context      —              85      < 10^-10
  CDM to MCDM     covariates   context        819     < 10^-10
  MNL to MCDM     context      covariates     20      0.08
  sf-shop
  Logit to MNL    covariates   —              343     < 10^-10
  Logit to CDM    context      —              96      < 10^-10
  CDM to MCDM     covariates   context        276     < 10^-10
  MNL to MCDM     context      covariates     29      0.36
  expedia
  CL to CML       covariates   —              1218    < 10^-10
  CL to LCL       context      —              2345    < 10^-10
  LCL to MLCL     covariates   context        1167    < 10^-10
  CML to MLCL     context      covariates     2294    < 10^-10
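The likelihood-ratio tests in Table 3 follow the standard recipe: twice the log-likelihood gain of the larger nested model is compared against a chi-squared distribution whose degrees of freedom equal the number of extra parameters (Wilks' theorem). A dependency-free sketch, using the closed-form chi-squared survival function that exists for even degrees of freedom (an assumption we make to avoid numerical libraries; the numbers below are illustrative, not taken from Table 3):

```python
import math

def chi2_sf(x, df):
    """P(X > x) for X ~ chi-squared(df), via the closed form for even df:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    assert df > 0 and df % 2 == 0, "this closed form needs even df"
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def likelihood_ratio_test(ll_small, ll_big, extra_params):
    """p-value for rejecting the smaller of two nested models."""
    return chi2_sf(2.0 * (ll_big - ll_small), extra_params)

# Illustrative: a log-likelihood gain of 10 with 4 extra parameters.
p = likelihood_ratio_test(-1000.0, -990.0, 4)
```

A tiny p-value says the likelihood gain is too large to be chance; in Table 3, the MNL-to-MCDM comparisons on the SF data are the ones that fail to reach significance.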
The SF datasets are common testbeds for choice models violating IIA [8, 26, 36, 43] and in choice applications [2, 49]. The SF data have regularity violations (Table 5 in Appendix C), ruling out the possibility that the IIA violations in these datasets are just due to mixtures of choosers obeying IIA. Thus, these datasets either have (1) true context effects or (2) choice set confounding. So far, the literature has focused on (1), but we argue that (2) is more likely. We compare the likelihoods of logit, MNL, CDM, and MCDM (recall Tables 1 and 2) on these datasets through likelihood-ratio tests (Table 3). MNL and MCDM both account for chooser-dependent preferences through covariates, while CDM and MCDM both account for context effects. With true context effects, we would expect CDM to be significantly more likely than logit and MCDM to be significantly more likely than MNL. However, this is not the case. While CDM is significantly more likely than logit, MCDM is not significantly more likely than MNL in either SF dataset. Thus, context effects only appear significant before controlling for preference heterogeneity through covariates. This is exactly what we would expect if the IIA violations in these datasets are due to choice set confounding rather than context effects. In contrast, we see significant context effects in the expedia hotel-booking dataset [25] even after controlling for covariates (this dataset uses item features, hence the different models in Table 3), so context effects are likely there. This dataset consists of search results (choice sets) and hotel bookings (choices), and we explore it further in Section 4.4.

The choice set confounding leads to a key question: how were choice sets constructed in sf-work and sf-shop? According to Koppelman and Bhat [26], choice sets were imputed based on chooser covariates from the survey. For instance, walking was included as an option if a respondent's distance to the destination was < 4 miles, and driving was included if they had a driver's license and at least one car in their household [26]. This choice set assignment is highly chooser-dependent, resulting in strong choice set confounding.

Example 1 and the SF datasets highlight how confounding can lead to spurious context effects and incorrect average choice probabilities. Next, in Section 4, we adapt methods from causal inference so that chooser covariates can correct choice probability estimates. And in Section 5, we address what can be done without covariates if we want to (1) make predictions under chooser-dependent choice set assignment mechanisms or (2) make counterfactual predictions for previously observed choosers.

4 CAUSAL INFERENCE METHODS
In traditional causal inference [23, 24, 39], we wish to estimate the causal effect of an intervention (e.g., a medical treatment) from observational data. However, we cannot simply compare the outcomes of the treated and untreated cohorts if treatment was not randomly assigned—confounders might affect both whether someone was treated and their outcome. There are many methods to debias treatment effect estimation, including matching [37, 39], inverse probability weighting (IPW) [19], and regression [40]. One can also combine methods, such as IPW and regression, which is the basis for doubly robust estimators [5].

Here, we adapt causal inference methods to estimate unbiased discrete choice models from data with choice set confounding. First, we adapt IPW to learn unbiased models that do not use chooser covariates in the utility function. After, we show an equivalence between incorporating chooser covariates in the utility function and regression for causal inference. Finally, we combine these methods for doubly robust choice model estimation. For discrete choice, these methods require new assumptions and have different guarantees. We first provide a brief introduction to causal inference terminology in the binary treatment setting, such as an observational medical study (in contrast, we will think of choice sets as treatments).

In potential outcomes notation [41], each person i has covariates X_i and is either treated (T_i = 1) or untreated (T_i = 0). At some point after treatment, we measure the outcome Y_i(T_i). A typical goal of the causal inference methods above is to estimate the average treatment effect E_i[Y_i(1) − Y_i(0)]. All of these methods rely on untestable assumptions; in particular, they rely on strong ignorability [23, 37] (also called unconfoundedness or no unmeasured confounders), which

(1) 푎 (2) 푎 (3) 푎 (4) 푎 Just as in standard IPW, we also need positivity (of choice set chooser propensities). Under these assumptions, IPW guarantees that empir- ˜ 풙풂 풙풂 ical choice probabilities in the pseudo-dataset D reflect aggregate covariates choice probabilities in the true population. To formalize this, we in- 풙풂 퐶 풙풂 퐶 troduce D∗, an idealized dataset with uniformly random choice set 퐶 퐶 D D∗ choice assignment for every chooser (of the same size as ). consists set of |D| independent samples (푎,퐶, 푖) each occuring with probability 푢푖 (퐶, 푎) 푢푖 (퐶, 푎) 푢푖 (퐶, 푎) 푢푖 (퐶, 푎) 1 Pr(푎) |C | Pr(푖 | 푎,퐶). We now show that the IPW-weighted log- utilities D likelihood (eq. (2)) is, in expectation, the same as the log-likelihood 푖 푖 푖 푖 function over D∗. Since D∗ has chooser-independent choice sets, choice we can train a model for Pr(푖 | 퐶) using eq. (2) and expect it to Figure 1: Graphical representations of chooser covariate as- capture unbiased aggregate choice probabilities (by Observation 1). sumptions: (1) ignorability; (2) choice set ignorability; (3) preference ignorability; (4) no ignorability. Shaded nodes Theorem 4. If, for all 푎 ∈ A,퐶 ∈ CD, are observed, dashed nodes are deterministic. (1) 0 < Pr(퐶 | 풙풂) < 1 (positivity), and ( | ) = ( | ) requires that the treatment is independent from the outcome, con- (2) Pr 퐶 푎, 풙풂 Pr 퐶 풙풂 (choice set ignorability), ˜ ∗ ditioned on observed covariates: Pr(푇푖 | 푋푖, 푌푖 ) = Pr(푇푖 | 푋푖 ), ∀푖. then 퐸D [ℓ(휃; D)] = 퐸D∗ [ℓ(휃; D )]. Choice set ignorability is crucial to the success of IPW, so we 4.1 Inverse probability weighting should assess when this assumption is reasonable. If choice sets are IPW estimation commonly requires estimating propensity scores de- generated by an exogenous process (such as a recommender system, scribing the probability of each treatment assignment given individ- as in the expedia dataset), then as long as we have access to the same ual covariates. 
The true probabilities Pr(T_i | X_i) are unknown, so estimated "propensities" P̂r(T_i | X_i) are learned from observed data, typically via logistic regression [4]. Propensities can then be used to estimate average treatment effects or, as in our case, to re-weight a model's training data [16]. By weighting each sample by the inverse of its propensity, we effectively construct a pseudo-population where treatment is assigned independently from covariates. In addition to ignorability, IPW requires positivity, the assumption that all propensities satisfy 0 < Pr(T_i | X_i) < 1.

In the discrete choice setting, we think of choice sets as treatments. By Observation 1, we need chooser-independent choice sets in order to learn an unbiased choice model. Our idea of IPW for discrete choice is to create a pseudo-dataset in which this is true and to learn a choice model over that pseudo-dataset. To do this, we model choice set assignment probabilities Pr(C | a). We can then replace each sample (i, a, C) with 1/[|C_D| Pr(C | a)] copies, creating a pseudo-dataset D̃ with uniformly random choice sets (note that we allow "fractional samples," since we don't explicitly construct D̃). However, we cannot hope to learn Pr(C | a) in datasets with only a single observation per chooser (which is very often the case). We instead need to rely on observed covariates x_a. We thus learn Pr(C | x_a) and use these propensities to construct D̃. For the analysis, we assume we know the true propensities, but a correctly specified choice set assignment model learned from data is sufficient.

To learn a choice model from D̃, we can simply add weights to the model's log-likelihood function, resulting in

  ℓ(θ; D̃) = Σ_{(i,C,a) ∈ D} log Pr_θ(i | C) / [|C_D| Pr(C | x_a)].   (2)

In order for Pr(C | x_a) to be an effective stand-in for Pr(C | a), we need the following assumption (see Figure 1).

Definition 3. Choice set ignorability is satisfied if choice sets are independent of choosers, conditioned on chooser covariates: Pr(C | a, x_a) = Pr(C | x_a).

If choice set assignment is algorithmic and we observe the same covariates as that process, choice set ignorability holds, although learning the propensities may still be a challenge. However, in other datasets, choice sets are formed through self-directed browsing (e.g., clicking around an online shop, as in the yoochoose dataset we examine later). In those cases, basic covariates (age, gender, etc.) are unlikely to fully capture choice set generation, since sets result from the complexities of human behavior rather than the simpler algorithmic behavior of a recommender system. As in traditional causal inference, the validity of choice set ignorability must be determined by the practitioner applying the method.

4.2 Regression

An alternative to using chooser covariates to learn choice set propensities is to incorporate covariates directly into the utility formulation, as in the multinomial or conditional multinomial logit models. If chooser covariates fully capture their preferences and the choice model is correctly specified, then the model that we learn is consistent. We formalize the first condition as follows (see Figure 1).

Definition 5. Preference ignorability is satisfied if choice probabilities are independent of choosers, conditioned on chooser covariates: Pr(i | a, x_a, C) = Pr(i | x_a, C).

Given correct specification and preference ignorability, the choice model will be consistent in terms of aggregate choice probabilities and result in accurate individual choice probability estimates.

Theorem 6. If Pr(i | a, x_a, C) = Pr(i | x_a, C) for all a ∈ A, C ∈ C, i ∈ C (preference ignorability), then the MLE of a correctly specified (and well-behaved, in the standard MLE sense [55, Theorem 9.13]) choice model that incorporates chooser covariates x_a is consistent: lim_{|D|→∞} P̂r(i | x_a, C) = Pr(i | a, C).

While the guarantee of regression is stronger than IPW, preference ignorability is more challenging to satisfy in practice. Instead of needing all covariates used to generate choice sets, we need covariates to fully describe choice behavior.

KDD '21, August 14–18, 2021, Virtual Event, Singapore. Kiran Tomlinson, Johan Ugander, and Austin R. Benson
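To make the weighted objective in eq. (2) concrete, here is a minimal sketch (not the paper's implementation) of an IPW-weighted log-likelihood for a plain logit with one utility parameter per item. The `propensity` callback and `n_distinct_sets` argument (standing in for |C_D|) are hypothetical names introduced for illustration:

```python
import numpy as np

def logit_choice_probs(theta, choice_set):
    """Softmax of the item utilities theta restricted to the choice set."""
    u = theta[choice_set]
    e = np.exp(u - u.max())
    return e / e.sum()

def ipw_log_likelihood(theta, samples, propensity, n_distinct_sets):
    """IPW-weighted log-likelihood as in eq. (2): each sample (i, C, x_a)
    contributes log Pr_theta(i | C) / (|C_D| * Pr(C | x_a))."""
    total = 0.0
    for i, choice_set, x_a in samples:
        probs = logit_choice_probs(theta, choice_set)
        total += np.log(probs[choice_set.index(i)]) / (
            n_distinct_sets * propensity(choice_set, x_a))
    return total
```

When propensities are uniform the weights are constant, so maximizing this objective reduces to ordinary maximum likelihood; non-uniform propensities up-weight choices made from rarely assigned sets.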

4.3 Doubly robust estimation

A constraint of both IPW and regression is correct model specification, either of the choice set propensity model or of the choice model. In traditional causal inference, one can combine both methods to provide guarantees if either model is correctly specified, producing doubly robust estimators [5, 17]. In the same way, we can combine IPW and regression for choice models and achieve their respective guarantees if their respective conditions are satisfied. In other words, the two methods do not interfere with each other. However, this increases the variance of estimates, so it may be advisable to only use one method if we are confident in one of the assumptions.

4.4 Empirical analysis of IPW and regression

We begin by evaluating regression and IPW adjustments in synthetic data, and then apply our methods to the expedia dataset (training details in Appendix C).

Figure 2: Mean prediction quality of models on synthetic data with both context effects and choice set confounding, with IPW (bold) and without IPW (light). Left: out-of-sample predictions on data with confounding. Right: counterfactual predictions of models trained on confounded data. Shaded regions show standard error over 16 trials.

Counterfactual evaluation in synthetic data. We generate synthetic data with heterogeneous preferences, CDM-style context effects, and choice set confounding. Specifically, we use 20 items with

embeddings y_i ∈ R^2 sampled uniformly from the unit circle. We also generate embeddings x_a in the same way for each chooser a. Each chooser a picks items according to an MCDM, where the utility for i is a sum of x_a^T y_i plus a CDM term shared by all choosers, with each "push/pull" term p_ij ~ Uniform(−1, 1). To generate a choice set for a, we sample a uniformly random set with probability 0.25 (to satisfy positivity) and otherwise include each item with probability 1/(1 + e^{−c x_a^T y_i}), where c is the confounding strength (we condition on having at least two items in the choice set). Higher confounding strength results in sets containing items more preferred by a. Each trial consists of 10000 samples. Item embeddings are unobserved, but chooser embeddings are used as covariates. We train models on a confounded portion of the data and measure prediction quality on a held-out confounded subset as well as a counterfactual portion with uniformly random choice sets. For IPW, we estimate choice set propensities via per-item logistic regression, multiplying item propensities to get set propensities.

To measure prediction quality, we use the mean relative position of the true choice in the list of predictions sorted in descending probability order. A value of 1 says that the true choices were all predicted as most likely. As confounding strength increases, prediction quality increases in the confounded data for logit, MNL, and CDM, while decreasing on counterfactual data (Figure 2).

Figure 3: Preference coefficients θ in expedia for CL and LCL (top row, no regression); and CML and MLCL (bottom row, with regression), with and without IPW. A higher coefficient means choosers prefer higher values of the feature.
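The confounded choice set generation described above can be sketched as follows. This is an illustrative reconstruction rather than the paper's code; in particular, we read "a uniformly random set" as including each item independently with probability 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, c = 20, 4.0  # c is the confounding strength

def unit_circle(n):
    """Embeddings sampled uniformly from the unit circle."""
    angles = rng.uniform(0, 2 * np.pi, n)
    return np.column_stack([np.cos(angles), np.sin(angles)])

items = unit_circle(n_items)

def sample_choice_set(x_a):
    """With prob 0.25, a uniformly random set (positivity); otherwise
    include item i w.p. 1 / (1 + exp(-c * x_a . y_i)). Rejection-sample
    to condition on at least two items in the set."""
    while True:
        if rng.random() < 0.25:
            mask = rng.random(n_items) < 0.5
        else:
            probs = 1.0 / (1.0 + np.exp(-c * (items @ x_a)))
            mask = rng.random(n_items) < probs
        if mask.sum() >= 2:
            return np.flatnonzero(mask)
```

Higher c makes the included items' utilities x_a^T y_i systematically larger than those of a uniformly drawn set, which is exactly the confounding the experiment needs.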
For logit and MNL, IPW leads to models that generalize better to counterfactual data. For CDM, IPW correctly prevents the illusion of increased performance with more confounding (although variance caused by IPW appears to result in a small dip in performance at low confounding). Since preference ignorability is satisfied, IPW is unnecessary for MCDM: regression with the correctly specified model successfully generalizes despite confounding.

Empirical data with chooser covariates. We now consider the expedia hotel choice dataset [25] from Section 3, using five hotel features: star rating, review score, location score, price, and promotion status. This allows us to use feature-based choice models (CL, CML, LCL, and MLCL; Tables 1 and 2). The dataset includes information about chooser searches, such as the number of adults and children in their party, which likely have strong effects on choice sets (i.e., search results). This is an excellent testbed for IPW since these covariates are likely informative about choice sets, making choice set ignorability more reasonable than preference ignorability.

We do not have counterfactual choices for the expedia data, but we still consider several types of analysis. First, we recall the results from Table 3 to see if apparent context effects are accounted for by chooser covariates. There, in contrast to the SF datasets, context effects still appear significant after controlling for covariates. In fact, context effects provide a larger likelihood boost than the chooser covariates. Thus, either (1) there are true context effects or (2) the chooser covariates in expedia do not satisfy preference ignorability (or both). Based on the nature of the covariates, (2) seems very likely: the number of children in the chooser's party and the length of their stay are unlikely to fully describe hotel preferences.

Since regression is inconclusive, we also apply IPW. To learn choice set propensities, we use a probabilistic model of the mean feature vectors of choice sets. We assume these vectors follow a multivariate Gaussian conditioned on chooser covariates, with

mean W x_a + z for some W ∈ R^{d_y × d_x}, z ∈ R^{d_y}. Given observed mean choice set vectors y_C and corresponding chooser covariates x_a, we compute the maximum-likelihood W, z, and covariance matrix (see Appendix B). This model gives us propensities for any (a, C) pair.

Using IPW with these propensities dramatically decreases the negative impact of high price in all four models (Figure 3). After adjusting for confounding, the models indicate that users are more willing to book more expensive hotels. This makes sense if Expedia is recommending relevant hotels: among a set of hotels matching a user's desired characteristics (such as location and star rating), we would expect them to select the cheapest option. On the other hand, if we presented users with a set of random hotels, location and star rating might play a stronger role in determining their choice, since a random set might have many cheap hotels that are undesirable for other reasons. In addition to the preference coefficients, IPW affects the context effect matrix A in the LCL and MLCL (Figure 4). In both models, IPW decreases (but does not entirely eliminate) the strong price context effects. This is evidence that some of the apparent context effects in the dataset are due to choice set confounding.

Figure 4: LCL and MLCL context effect matrix A in expedia with and without IPW. A higher value means choosers prefer a row feature more in a set where the mean column feature (abbreviated) is high; 0 indicates no context effect.

Finally, the estimated likelihoods of the models under IPW are significantly better than without IPW (Table 4). We normalize the IPW-weighted log-likelihood by the sum of the IPW weights, which provides an estimate of what the IPW-trained model's log-likelihood would be given random sets. The gap between likelihood with no IPW and estimated likelihood with IPW dwarfs the gaps between different choice models, indicating that accounting for choice set confounding makes the data much more consistent with the random utility maximization principle underlying all four models. (By Theorem 2, choice set confounding can result in choice systems far from rational behavior, even when choosers are rational.)

Table 4: Log-likelihoods and estimated random-set log-likelihoods with IPW on expedia. After adjusting for confounding, the data is far easier to explain.

Model | Confounded | IPW-adjusted
CL    | −839499    | −786653
CML   | −838281    | −785753
LCL   | −837154    | −784770
MLCL  | −835986    | −783928

5 MANAGING WITHOUT COVARIATES

So far, we have used chooser covariates to correct for choice set confounding. However, in some choice data, there are no covariates available, or we are not willing to make ignorability assumptions. Here, we show what can be done in this setting.

5.1 Within-distribution prediction

Unfortunately, by Theorem 2, it is impossible to determine whether IIA violations are caused by choice set confounding or true context effects in the absence of chooser information. Nonetheless, we can still exploit IIA violations—whatever their origin—to improve prediction, as long as we are careful not to make counterfactual predictions. This is essentially what researchers developing context effect models [11, 36, 38, 43, 50] have been doing (without a framework for understanding the possibility of choice set confounding and the associated risks for counterfactual prediction). Beyond emphasizing a need for caution, we also establish a duality between models accounting for context effects and models accounting for choice set confounding; specifically, we show that a model equivalent to the CDM—which was designed with context effects in mind—can be derived purely from the perspective of choice set confounding.

In a multinomial logit (MNL), we learn a latent parameter vector γ_i for each item i ∈ U and model utilities as u_i(a) = x_a^T γ_i (omitting the intercept term). Suppose we don't have any chooser covariates, but we know choice set assignment depends on choosers. We could then use the choice set itself as a surrogate for user covariates (e.g., one covariate could be "someone who is offered item i"). Let 1_{C_a} be a binary encoding of the choice set C_a of a chooser a (a length-|U| vector with a 1 in position i if i ∈ C_a). Consider treating 1_{C_a} as a substitute for the user covariates x_a. Then the MNL model is

  Pr(i | C_a) = exp(1_{C_a}^T γ_i) / Σ_{j ∈ C_a} exp(1_{C_a}^T γ_j).

The utility of i in set C_a is Σ_{j ∈ C_a} γ_ij, which is exactly the CDM (with self-pulls, since the sum is over C_a rather than C_a \ i), a model designed to capture choice-set-dependent utilities. Thus, the CDM can either be thought of as accounting for pairwise interactions between items or using the choice set as a stand-in for user covariates.

One natural question this duality raises is how the set of choice systems expressible by CDM (or other context-effect models) compares to the choice systems induced by mixed populations of IIA choosers with choice set confounding, which take the form

  Pr(i | C) = Σ_{a ∈ A} Pr(a | C) · exp(u_i(a)) / Σ_{j ∈ C} exp(u_j(a)).   (3)

Mixtures of logits such as eq. (3) are notoriously hard to analyze (even in the two-component case [13]), so no simple equivalence between a context-effect model and such a mixture is likely. In fact, eq. (3) is even trickier than standard mixed logit (eq. (1)), since the mixture weights depend on the choice set.

Nonetheless, some progress in this direction is possible. Here, we provide an instance where the LCL approximates a choice system induced by choice set confounding (of the form of eq. (3)). Recall that the LCL has utilities u_i(a, C) = (θ + A y_C)^T y_i, where y_C is the mean feature vector over the choice set. If we make Gaussian assumptions on the distribution of features and on choice set assignment, and if chooser utilities are inner products of chooser and item vectors,
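The set-covariate reading of the CDM can be illustrated numerically (an illustrative sketch with hypothetical names, not the paper's code): feeding an MNL the binary set encoding 1_C as its covariates yields set-dependent utilities u_i(C) = Σ_{j∈C} Γ_ij, which violate IIA across different choice sets.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 4
Gamma = rng.normal(size=(n_items, n_items))  # Gamma[i, j]: "pull" of item j on item i

def cdm_probs(choice_set, Gamma):
    """MNL whose covariates are the binary encoding of the choice set:
    u_i(C) = sum_{j in C} Gamma[i, j] (the CDM with self-pulls)."""
    n = Gamma.shape[0]
    one_hot = np.zeros(n)
    one_hot[list(choice_set)] = 1.0
    u = Gamma[list(choice_set)] @ one_hot  # utility of each i in C
    e = np.exp(u - u.max())
    return e / e.sum()
```

Swapping item 2 for item 3 changes the relative odds of items 0 and 1, so this "MNL over set indicators" exhibits exactly the context effects the CDM was built to capture.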

then the LCL is a mean-field approximation to the induced choice system. In particular, we assume choice sets are generated to be similar to items the chooser a would like (as in a recommender system) by sampling items from a Gaussian with mean x_a.

Theorem 7. Let items and choosers both be represented by vectors in R^d. Suppose chooser covariates x_a are distributed in the population according to a multivariate Gaussian N(μ, Σ_0), and a choice set for chooser a is constructed by sampling k items from the multivariate Gaussian N(x_a, Σ). Additionally, assume choosers have the utility function u_i(a, C) = x_a^T y_i. Then the expected chooser given a choice set C, x_a* = E[x_a | C], has LCL choice probabilities, with

  θ = (1/k) Σ (Σ_0 + (1/k) Σ)^{−1} μ,   A = Σ_0 (Σ_0 + (1/k) Σ)^{−1}.

Thus, the LCL can either be thought of as a context effect model, or as an approximation to the choice system induced by recommender-style preferred item overrepresentation.

Figure 5: yoochoose log-likelihood comparison. Spectral and random cluster results are averaged over eight trials, with one standard deviation shaded.
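Theorem 7 rests on the conjugate-Gaussian posterior mean used in its proof: with S = Σ_0 + Σ/k, the form Σ_0 S^{−1} ȳ_C + (Σ/k) S^{−1} μ agrees with the standard conditional-Gaussian expression μ + Σ_0 S^{−1}(ȳ_C − μ), and splits into the claimed θ and A. A quick numerical check of this identity (an illustrative sketch, with symbols as in the theorem):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 3, 5
mu = rng.normal(size=d)

def rand_spd(d):
    """Random symmetric positive-definite covariance matrix."""
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)

Sigma0, Sigma = rand_spd(d), rand_spd(d)
S_inv = np.linalg.inv(Sigma0 + Sigma / k)
y_bar = rng.normal(size=d)  # a mean choice set vector

# Posterior mean as written in the proof of Theorem 7 ...
post = Sigma0 @ S_inv @ y_bar + (Sigma / k) @ S_inv @ mu
# ... and the standard conditional-Gaussian form.
standard = mu + Sigma0 @ S_inv @ (y_bar - mu)

# The LCL decomposition of the same quantity.
theta = (Sigma / k) @ S_inv @ mu
A = Sigma0 @ S_inv
```

The agreement follows because I − Σ_0 S^{−1} = (S − Σ_0) S^{−1} = (Σ/k) S^{−1}.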

5.2 Counterfactuals for known choosers

To make counterfactual predictions without chooser covariates, or with insufficiently descriptive covariates (preventing us from applying IPW or regression), we develop a clustering method for the challenge of choice set confounding. Suppose a recommender system suggests two sets of movies to two users: {Romance A, Romance B} to a_1 and {Drama A, Drama B} to a_2. While we know nothing about a_1 or a_2, we might be inclined to think a_1 is likely to pick Romance A from {Romance A, Drama A}, while a_2 is likely to pick Drama A from the same choice set. Similar to the CDM derivation in the previous section, the choice set is a signal for chooser preferences. We can also apply collaborative filtering principles, with the distinction that instead of thinking that similar users like similar items, we assume similar choosers are shown similar choice sets. There is a limitation, though, as this approach only lets us make predictions for choosers who appear in the original dataset. While there are many ways of using information from choice set assignment, we highlight an approach for the case where we have corresponding types of choosers and items (e.g., "romance fans" for "romance movies").

Suppose that choosers are more likely to have an item in their choice set if it matches their type. Define the m × n matrix R, where R_ij = 1 if the ith choice set includes item j and R_ij = 0 otherwise. We can think of R as the upper right block of the adjacency matrix of a bipartite graph between choosers and items, in which an edge (a, i) means that i is in a's choice set. With fixed choice set inclusion probabilities for each type, clustering choosers into types based on their choice sets is then an instance of the bipartite stochastic block model (SBM) recovery problem [1, 27].

In Theorem 8, we apply a classic exact recovery result due to McSherry [34] to show how a choice system with discrete types can be deconfounded without access to chooser covariates (i.e., knowledge of type membership), but any bipartite SBM clustering algorithm could be used (see Abbe [1] for a survey of SBM results).

Theorem 8. Suppose items and choosers are jointly split into k types. Let s be the smallest number of items or choosers of any type and let n = |A| + |U|. Suppose that for each chooser a ∈ A, i ∈ U is included in a's choice set with probability p if a and i are of the same type and with probability q otherwise. There exists a constant C such that for large enough n, if

  s(p − q)^2 > Ck(n/s + log(n/δ)),   (4)

then w.p. 1 − δ, we can efficiently learn the type of every item and every chooser given a dataset D with one choice from each a ∈ A.

While McSherry's algorithm has strong theoretical guarantees, a more practical implementation is spectral co-clustering [14], which performs well for our purposes. Once we recover type memberships, we train separate models for each type of chooser and use the model for a chooser's type for deconfounded counterfactual predictions.

5.3 Empirical data without chooser covariates

We apply our spectral method to the yoochoose online shopping dataset [7]. The dataset consists of all items clicked on in a session and an indicator of whether each item was purchased. We consider each purchase to be a choice from the set of all items viewed in the session. We group items by category (e.g., sports equipment), removing those with fewer than 100 purchases, leaving 29 categories. We then perform spectral co-clustering [14] on the choice set matrix R with 2 to 10 chooser clusters and train a separate logit on each cluster. We ignore the item clusters. We compare against random clustering with the cluster sizes found by spectral clustering and mixed logit with the same number of components.

Spectral clustered logit describes the data much better than random clustering or even mixed logit (Figure 5). Note that the clusters are based only on choice set assignment, not choice behavior. In contrast, mixed logit bases its mixture components solely on choices. The strong performance of spectral clustering indicates that choice sets are informative about preferences, and our use of this information is much easier than learning a mixture model.

6 DISCUSSION

Choice set confounding is widespread and can affect choice probability estimates, alter or introduce context effects, and lead to poor generalization. Existing models ignoring chooser covariates are particularly susceptible, but plugging in covariates is not a universal solution. We saw that covariates may be more informative about choice sets than preferences, making IPW more viable than regression. An important contribution is formalizing and demonstrating choice set confounding, as it has significant implications for discrete choice modeling. For instance, initial research on the SF transportation data used extensive nested logit modeling to account for IIA violations [26], which we can manage with choice set confounding.

Our methods are a first step in addressing confounding. A challenge was learning choice set propensities for IPW. Simple logistic regression can work for binary treatments, but estimating exponentially many choice set propensities is difficult. In expedia, we learned a distribution over mean choice set feature vectors as an approximation. Other methods for learning set assignment probabilities would be valuable. Instrumental variables are another causal inference approach [18] that could be used in our setting, but identifying instruments for choice data is difficult. Alternatively, a matching approach [23] could compare pairs of similar choosers with different choice sets. Other directions for future investigation include rigorous methods of detecting choice set confounding, or verifying that it has been successfully accounted for, and of testing assumptions.

ACKNOWLEDGMENTS

This research was supported by ARO MURI, ARO Awards W911NF19-1-0057 and 73348-NS-YIP, NSF Award DMS-1830274, the Koret Foundation, and JP Morgan Chase & Co. We thank Spencer Peters for helpful discussions.

REFERENCES

[1] Emmanuel Abbe. 2017. Community detection and stochastic block models: recent developments. JMLR 18, 1 (2017).
[2] Arpit Agarwal, Prathamesh Patil, and Shivani Agarwal. 2018. Accelerated spectral ranking. In ICML.
[3] Greg M Allenby and Peter E Rossi. 1998. Marketing models of consumer heterogeneity. J. Econometrics 89, 1-2 (1998).
[4] Peter C Austin. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res (2011).
[5] Heejung Bang and James M Robins. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61, 4 (2005).
[6] Moshe Ben-Akiva and Bruno Boccara. 1995. Discrete choice models with latent choice sets. Int. Journal of Research in Marketing 12, 1 (1995).
[7] David Ben-Shimon et al. 2015. RecSys challenge 2015 and the YOOCHOOSE dataset. In RecSys.
[8] Austin R Benson, Ravi Kumar, and Andrew Tomkins. 2016. On the relevance of irrelevant alternatives. In WWW.
[9] Chandra R Bhat and Rachel Gossen. 2004. A mixed multinomial logit model analysis of weekend recreational episode type choice. Transp Res Part B (2004).
[10] Michel Bierlaire, Ricardo Hurtubia, and Gunnar Flötteröd. 2010. Analysis of implicit choice set generation using a constrained multinomial logit model. Transportation Research Record 2175, 1 (2010).
[11] Amanda Bower and Laura Balzano. 2020. Preference Modeling with Context-Dependent Salient Features. In ICML.
[12] David Brownstone et al. 1996. A transactions choice model for forecasting demand for alternative-fuel vehicles. Res. Transp. Econ. 4 (1996).
[13] Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. 2018. Learning a mixture of two multinomial logits. In ICML.
[14] Inderjit S Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD.
[15] Richard O Duda, Peter E Hart, and David G Stork. 2001. Pattern Classification.
[16] David A Freedman and Richard A Berk. 2008. Weighting regressions by propensity scores. Evaluation Rev. 32, 4 (2008).
[17] Michele Jonsson Funk et al. 2011. Doubly robust estimation of causal effects. American Journal of Epidemiology 173, 7 (2011), 761–767.
[18] Miguel A Hernán and James M Robins. 2006. Instruments for causal inference: an epidemiologist's dream? Epidemiology (2006).
[19] Keisuke Hirano, Guido W Imbens, and Geert Ridder. 2003. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 4 (2003).
[20] Saul D Hoffman and Greg J Duncan. 1988. Multinomial and conditional logit discrete-choice models in demography. Demography 25, 3 (1988), 415–427.
[21] Joel Huber, John W Payne, and Christopher Puto. 1982. Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research 9, 1 (1982).
[22] Samuel Ieong, Nina Mishra, and Or Sheffet. 2012. Predicting preference flips in commerce search. In ICML.
[23] Guido W Imbens. 2004. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Econ. and Stat. 86, 1 (2004).
[24] Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
[25] Kaggle. 2013. Personalize Expedia Hotel Searches — ICDM 2013. https://www.kaggle.com/c/expedia-personalized-sort.
[26] Frank S Koppelman and Chandra Bhat. 2006. A self instructing course in mode choice modeling: multinomial and nested logit models. (2006).
[27] Daniel B Larremore, Aaron Clauset, and Abigail Z Jacobs. 2014. Efficiently inferring community structure in bipartite networks. Phys. Rev. E 90, 1 (2014).
[28] Charles F Manski. 1977. The structure of random utility models. Theory and Decision 8, 3 (1977).
[29] Charles F Manski and Steven R Lerman. 1977. The estimation of choice probabilities from choice based samples. (1977).
[30] Benjamin M Marlin, Richard S Zemel, Sam Roweis, and Malcolm Slaney. 2007. Collaborative filtering and the missing at random assumption. In UAI.
[31] Daniel McFadden. 1974. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics (1974).
[32] Daniel McFadden and Kenneth Train. 2000. Mixed MNL models for discrete response. Journal of Applied Econometrics 15, 5 (2000).
[33] Daniel McFadden, William B Tye, and Kenneth Train. 1977. An Application of Diagnostic Tests for the Independence From Irrelevant Alternatives Property of the Multinomial Logit Model. Transp Res Rec (1977).
[34] Frank McSherry. 2001. Spectral partitioning of random graphs. In FOCS.
[35] Karlson Pfannschmidt, Pritha Gupta, and Eyke Hüllermeier. 2019. Learning choice functions: Concepts and architectures. arXiv (2019).
[36] Stephen Ragain and Johan Ugander. 2016. Pairwise choice Markov chains. In NeurIPS.
[37] Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983).
[38] Nir Rosenfeld, Kojin Oshiba, and Yaron Singer. 2020. Predicting Choice with Set-Dependent Aggregation. In ICML.
[39] Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Edu. Psych. 66, 5 (1974), 688.
[40] Donald B Rubin. 1977. Assignment to treatment group on the basis of a covariate. J. Educational Stat. 2, 1 (1977).
[41] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
[42] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: debiasing learning and evaluation. In ICML.
[43] Arjun Seshadri, Alex Peysakhovich, and Johan Ugander. 2019. Discovering Context Effects from Raw Choice Data. In ICML.
[44] Arjun Seshadri, Stephen Ragain, and Johan Ugander. 2020. Learning Rich Rankings. NeurIPS (2020).
[45] Itamar Simonson. 1989. Choice based on reasons: The case of attraction and compromise effects. J. Consumer Research 16, 2 (1989).
[46] Itamar Simonson and Amos Tversky. 1992. Choice in context: Tradeoff contrast and extremeness aversion. Journal of Marketing Research 29, 3 (1992), 281–295.
[47] Leslie S Stratton, Dennis M O'Toole, and James N Wetzel. 2008. A multinomial logit model of college stopout and dropout behavior. Econ. Edu. Rev. 27, 3 (2008).
[48] Elie Tamer. 2019. The ET Interview: Professor Charles Manski. Econometric Theory 35, 2 (2019).
[49] Kiran Tomlinson and Austin Benson. 2020. Choice Set Optimization Under Discrete Choice Models of Group Decisions. In ICML.
[50] Kiran Tomlinson and Austin R Benson. 2021. Learning Interpretable Feature Context Effects in Discrete Choice. In KDD.
[51] Kenneth E Train. 2009. Discrete choice methods with simulation.
[52] A Tversky. 1972. Elimination by aspects: A theory of choice. Psych. Rev. (1972).
[53] Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. 2019. Doubly robust joint learning for recommendation on data missing not at random. In ICML.
[54] Yixin Wang, Dawen Liang, Laurent Charlin, and David M Blei. 2020. Causal Inference for Recommender Systems. In RecSys.
[55] Larry Wasserman. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
[56] Chieh-Hua Wen and Frank S Koppelman. 2001. The generalized nested logit model. Transportation Research Part B: Methodological 35, 7 (2001).
[57] Shuang-Hong Yang et al. 2011. Collaborative competitive filtering: learning recommender using context of user choice. In SIGIR.

A PROOFS

Proof of Observation 1. Conditioning over choosers yields Pr(i | C) = Σ_a Pr(i | a, C) Pr(a | C). Meanwhile, E_a[Pr(i | a, C)] = Σ_a Pr(i | a, C) Pr(a). These are equal if condition (1) holds (since independence also implies Pr(a) = Pr(a | C)). If condition (2) holds, then we directly have E_a[Pr(i | a, C)] = Pr(i | C). See Example 1 for an instance where this equality fails when neither (1) nor (2) hold. □

Proof of Theorem 2. First, notice that any universal logit choice probabilities aggregated over a population can be expressed by set-dependent utilities u_i*(C) for each C ⊆ U, i ∈ C. For every choice set C ⊆ U, construct a chooser a_C with fixed utilities u_i(a_C) = u_i*(C). Let Pr(C | a_C) = 1 and Pr(C′ | a_C) = 0 for all other C′ ≠ C. The choice probabilities of this mixture with chooser-dependent sets is the same as in the original system, and the mixture has finitely many (2^{|U|} − 1) components, one for each nonempty choice set C. □

Proof of Theorem 4. We need (1) in order for IPW (and therefore D̃) to be well-defined. Fix i and C. Consider the coefficient of log Pr_θ(i | C) in ℓ(θ; D*). In expectation, this term appears |D| Pr(i, C) times. Expanding this:

  |D| Pr(i, C) = |D| Σ_{a∈A} Pr(i, C | a) Pr(a)
              = |D| Σ_{a∈A} Pr(i | C, a) Pr(C | a) Pr(a)
              = (|D| / |C_D|) Σ_{a∈A} Pr(i | C, a) Pr(a),

where the last step follows from D* having uniformly random choice sets. Now consider the coefficient of log Pr_θ(i | C) in ℓ(θ; D̃). By IPW, this coefficient is

  Σ_{(a,C′,i′) ∈ D : i′=i, C′=C} 1 / (|C_D| Pr(C | x_a)) = Σ_{a∈A} Σ_{(a′,C′,i′) ∈ D : a′=a, C′=C, i′=i} 1 / (|C_D| Pr(C | x_a)).

In expectation, the sample (a, C, i) occurs |D| Pr(a, C, i) times. Additionally, by choice set ignorability, Pr(C | x_a) = Pr(C | a). We thus have that the expected coefficient is

  Σ_{a∈A} |D| Pr(a, C, i) / (|C_D| Pr(C | x_a))
  = (|D| / |C_D|) Σ_{a∈A} Pr(a) Pr(C | a) Pr(i | C, a) / Pr(C | a)
  = (|D| / |C_D|) Σ_{a∈A} Pr(a) Pr(i | C, a),

which matches the coefficient in ℓ(θ; D*). Since the expected coefficients agree for all i and C, we then have the equality. □

Proof of Theorem 6. By the consistency of the MLE, as |D| → ∞, parameter estimates for a correctly specified choice model converge to the true parameters. Thus, estimated choice probabilities also converge:

  lim_{|D|→∞} P̂r(i | x_a, C) = Pr(i | x_a, C) = Pr(i | a, x_a, C) (by preference ignorability) = Pr(i | a, C). □

Proof of Theorem 7. Observing the choice set gives us a noisy measurement of x_a, which we can adjust using our knowledge of the distribution of x_a. The posterior of a Gaussian with a Gaussian prior is also Gaussian—in particular, x_a | C is Gaussian, with mean

  E[x_a | C] = Σ_0 (Σ_0 + (1/k) Σ)^{−1} y_C + (1/k) Σ (Σ_0 + (1/k) Σ)^{−1} μ

[15, Section 3.4.3]. Thus, the expected chooser x_a* has utilities

  u_i(a*, C) = [Σ_0 (Σ_0 + (1/k) Σ)^{−1} y_C + (1/k) Σ (Σ_0 + (1/k) Σ)^{−1} μ]^T y_i.

This is exactly an LCL with θ and A as claimed. □

Proof of Theorem 8. Consider the bipartite graph whose left nodes are choosers and whose right nodes are items, each split into blocks according to their type. The choice set assignment process above defines a bipartite SBM on this graph with intra-type probabilities p and inter-type probabilities q (between chooser nodes and item nodes). Recovering types from choice sets can then be viewed as an instance of the planted partition problem [34]. We can thus directly apply Theorem 4 of McSherry [34]² to achieve the desired result given Equation (4), with the caveat that the algorithm is random and succeeds with probability 1/k. Repeating the algorithm ck times achieves failure probability (1 − 1/k)^{ck} ≤ 1/e^c, which is smaller than δ if c > log(1/δ). We can thus make δ smaller by a factor of 2 (absorbing this into the constant C in eq. (4)) and we are left with the guarantee as stated, only increasing the running time by a factor k log(1/δ). □

²Notice that s(p − q)² is a lower bound on the squared 2-norm of the columns of the SBM edge probability matrix required by [34, Theorem 4]. Additionally, we use the crude variance upper bound σ² = 1 for simplicity.

B AFFINE-MEAN GAUSSIAN CHOICE SET MODEL

For estimating choice set propensities in expedia, we model the distribution of mean choice set features using an affine-mean Gaussian. Here, we show how this model can be easily estimated from data.

Proposition 9. Given a dataset D, the model y_C ~ N(W x_a + z, Σ) is identifiable iff there are m + 1 choosers in D with affinely independent covariates. If the model is identified, the maximum likelihood parameters W*, z* are the solution to the least-squares problem

  (W*, z*) = argmin_{W ∈ R^{n×m}, z ∈ R^n} Σ_{(a,C) ∈ D} ||y_C − (W x_a + z)||_2^2,   (5)

which have the closed form:

  W* = [Σ_{(a,C) ∈ D} (y_C − ȳ_D) x_a^T] [Σ_{(a,C) ∈ D} (x_a − x̄_D) x_a^T]^{−1}   (6)
  z* = ȳ_D − W* x̄_D,   (7)

where x̄_D = (1/|D|) Σ_{(a,C) ∈ D} x_a and ȳ_D = (1/|D|) Σ_{(a,C) ∈ D} y_C. Additionally, the maximum likelihood covariance matrix is the sample covariance:

  Σ* = (1/|D|) Σ_{(a,C) ∈ D} (y_C − W* x_a − z*)(y_C − W* x_a − z*)^T.   (8)

Proof sketch. This can be derived following the same steps as the standard Gaussian MLE proof (with a bit of extra matrix calculus): (1) take partial derivatives of the log-likelihood with respect to W and z, (2) set them to zero, (3) solve for z, (4) plug this in to solve for W, (5) do the same to solve for Σ in its partial derivative. This works since the log-likelihood is still convex after adding in the affine transformation. We omit the details as they are tedious and unenlightening. □

C EXPERIMENT DETAILS

We implemented all choice models with PyTorch and (except mixed logit) train them using Rprop with no minibatching to optimize the log-likelihood for 500 epochs or until convergence (squared gradient norm < 10^{−8}), whichever comes first. We use ℓ2 regularization with coefficient λ = 10^{−4} for all models to ensure identifiability. For mixed logit, we use an expectation-maximization (EM) algorithm [51] with a one hour timeout. Our code, results, and links to data are available at https://github.com/tomlinsonk/choice-set-confounding.

Table 5: Regularity violations in sf-work and sf-shop, impossible under mixed logit. Including additional item(s) appears to increase the probability that DA or DA/SR is chosen. The differences are significant according to Fisher's exact test (sf-work: p = 6.5 × 10^{−9}, sf-shop: p = 0.005).

sf-work
Choice set (C) | Pr(DA | C) | N
{DA, SR 2, SR 3+, Transit} | 0.72 | 1661
{DA, SR 2, SR 3+, Transit, Bike} | 0.83 | 829

sf-shop
Choice set (C) | Pr(DA/SR | C) | N
{DA, DA/SR, SR 2, SR 3+, SR 2/SR 3+, Transit} | 0.17 | 534
{DA, DA/SR, SR 2, SR 3+, SR 2/SR 3+, Transit, Bike, Walk} | 0.23 | 1315

DA: drive alone. SR: shared ride, number indicates car occupancy. Slashes indicate different mode used for outbound and inbound trips.
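The closed form (6)-(7) of Proposition 9 is an ordinary least-squares fit with an intercept, so it can be sanity-checked against a standard solver. The following sketch (illustrative, with synthetic data; variable names are ours, not the paper's) compares the closed form to `numpy.linalg.lstsq` on a design matrix with an appended intercept column:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, d_x, d_y = 200, 2, 3

X = rng.normal(size=(n_samples, d_x))   # chooser covariates x_a
W_true = rng.normal(size=(d_y, d_x))
z_true = rng.normal(size=d_y)
# Mean choice set feature vectors y_C with affine-mean Gaussian noise.
Y = X @ W_true.T + z_true + 0.1 * rng.normal(size=(n_samples, d_y))

# Closed form (6)-(7): center, then solve the normal equations.
x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
W_hat = ((Y - y_bar).T @ X) @ np.linalg.inv((X - x_bar).T @ X)
z_hat = y_bar - W_hat @ x_bar

# The same fit via least squares with an explicit intercept column.
design = np.column_stack([X, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)  # shape (d_x + 1, d_y)
```

The two fits coincide because Σ (y_C − ȳ) x̄^T = 0, so eq. (6) equals the usual covariance-based OLS solution.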