Modeling human decision-making in spatial and temporal systems

by Nathan Gene Sandholtz

M.Sc., Brigham Young University, 2016
B.Sc., Brigham Young University, 2012

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in the Department of Statistics and Actuarial Science
Faculty of Science

© Nathan Gene Sandholtz 2020
SIMON FRASER UNIVERSITY
Summer 2020

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Nathan Gene Sandholtz

Degree: Doctor of Philosophy (Statistics)

Title: Modeling human decision-making in spatial and temporal systems

Examining Committee:

Chair: Jean-François Bégin
Assistant Professor

Luke Bornn, Senior Supervisor
Associate Professor

Derek Bingham, Supervisor
Professor

Tim Swartz, Internal Examiner
Professor, Department of Statistics and Actuarial Science

Robert B. Gramacy, External Examiner
Professor, Department of Statistics, Virginia Polytechnic Institute and State University

Date Defended: August 20, 2020

Abstract

The first project in this thesis explores how efficiently players in a lineup collectively allocate shots. We propose a new metric for allocative efficiency by comparing a player's field goal percentage (FG%) to their field goal attempt (FGA) rate in the context of both their four teammates on the court and the spatial distribution of their shots. Leveraging publicly available data provided by the National Basketball Association (NBA), we estimate player FG% at every location in the offensive half court using a Bayesian hierarchical model. By ordering a lineup's estimated FG%s and pairing these rankings with the lineup's empirical FGA rate rankings, we detect areas where the lineup exhibits inefficient shot allocation.

In the second basketball application, we model basketball plays as episodes from team-specific nonstationary Markov decision processes (MDPs) with shot clock dependent transition probabilities. Bayesian hierarchical models are employed in the parametrization of the transition probabilities to borrow strength across players and through time. To keep computation feasible, we combine lineup-specific MDPs into team-average MDPs using a novel transition weighting scheme. We then use these nonstationary MDPs to build a basketball play simulator with uncertainty propagated via posterior samples of the model components. After calibration, we simulate seasons both on policy and under altered policies and explore the net changes in efficiency and production under the alternate policies.

In the final project, we explore the inverse problem of Bayesian optimization. Specifically, we seek to estimate an agent's latent acquisition function based on their observed search paths. After introducing a probabilistic solution framework for the problem, we illustrate our method by analyzing human behavior from an experiment designed to force subjects to balance exploration and exploitation in search of a global optimum. We find that subjects exhibit a wide range of acquisition preferences; however, some subjects' behavior does not map well to any of the candidate acquisition functions we consider. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.

Keywords: Basketball statistics, Bayesian hierarchical model, Bayesian optimization, Inverse optimization, Markov decision process, Rankings and orderings

Dedication

To the future students I’ll teach and mentor, and to my family—those who are already here and those who are yet to come.

Acknowledgements

I owe thanks to many people, without whom this work would never have happened. First of all, thank you to the Brigham Young University statistics faculty who helped prepare me for a PhD program. In particular, I'm grateful to Shane Reese and William Christensen for helping me discover the statistics program at Simon Fraser.

I'd like to thank the statistics faculty here at Simon Fraser University. I'd particularly like to thank Tim Swartz, who has been very generous to me with his time and resources since the day I met him. Most of all, I'm indebted to my supervisors, Luke Bornn and Derek Bingham, for their examples, guidance, support, and mentorship. They were always actively engaged and willing to help at a moment's notice. I'd particularly like to thank Luke for the rich opportunities he's provided me. Thanks to him this program has been a singular experience, and I've received opportunities that I never anticipated before I began the program.

I'd like to thank my coauthors, Jacob Mortensen, Luke Bornn, Yohsuke Miyamoto, and Maurice Smith. Their contributions to the chapters in this thesis were invaluable. I'd also like to thank the referees and editors who reviewed Chapters 2 and 3 of this thesis. Their input strengthened each project immensely.

I'm grateful to my brother Wayne, who was a constant source of support to me during this program. A huge thank you to my friend and fellow student Jacob Mortensen. This program would have been much more difficult and overwhelming without him. More than anyone else, thank you to my wife Christina. I couldn't have done it without her support.

Table of Contents

Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Connecting ideas between projects
  1.2 Background
    1.2.1 Hierarchical modeling
    1.2.2 Markov decision processes
    1.2.3 Bayesian optimization

2 Measuring Spatial Allocative Efficiency in Basketball
  2.1 Introduction
    2.1.1 Related work
    2.1.2 Data and code
  2.2 Models
    2.2.1 Estimating FG% surfaces
    2.2.2 Determining FGA rate surfaces
  2.3 Allocative efficiency metrics
    2.3.1 Spatial rankings within a lineup
    2.3.2 Lineup points lost
    2.3.3 Player LPL contribution
    2.3.4 Empirical implementation
  2.4 Optimality - discussion and implications
    2.4.1 Do lineups minimize LPL?
    2.4.2 Does LPL relate to offensive production?
    2.4.3 How can LPL inform strategy?
    2.4.4 Is minimizing LPL always optimal?
  2.5 Conclusion

3 Markov Decision Processes with Dynamic Transition Probabilities: An Analysis of Shooting Strategies in Basketball
  3.1 Introduction
    3.1.1 Related work and contributions
    3.1.2 Description of data
    3.1.3 Outline
  3.2 Decision process framework
    3.2.1 Markov decision processes
    3.2.2 State and action space
    3.2.3 Defining the average chain
    3.2.4 Transition and policy tensors
  3.3 Hierarchical modeling and inference
    3.3.1 Shot policy model
    3.3.2 Transition probability model
    3.3.3 Transition model two-stage approximation
    3.3.4 Reward function
    3.3.5 Inference and validation
    3.3.6 Model fit
  3.4 Simulating plays
    3.4.1 Play simulation algorithm
    3.4.2 Calibration
  3.5 Altering policies
    3.5.1 Game theory
    3.5.2 Shot policy changes
    3.5.3 Passing policy changes
  3.6 Conclusion

4 Inverse Decision Problems
  4.1 Inverse optimization in operations research
  4.2 Examples
    4.2.1 Multi-armed bandit
    4.2.2 Fourth down decisions in American football

5 Inverse Bayesian Optimization: Learning Human Acquisition Preferences in an Exploration vs. Exploitation Search Task
  5.1 Introduction
    5.1.1 Background
    5.1.2 Hotspot search task
    5.1.3 Identifying risk preferences
  5.2 Bayesian optimization
    5.2.1 Choosing a surrogate function
    5.2.2 Updating the surrogate via Bayesian inference
    5.2.3 Choosing an acquisition function
  5.3 Inverse Bayesian optimization
    5.3.1 IBO under perfect acquisition
    5.3.2 IBO under imperfect acquisition
    5.3.3 Search task implementation
    5.3.4 Incorporating human tendencies
  5.4 Model Extensions
    5.4.1 Perception error
    5.4.2 Informative priors
    5.4.3 Search task implementation (version 2)
  5.5 Conclusion

6 Conclusion
  6.1 Summary
  6.2 Future work

Bibliography

Appendix A Additional Details
  A.1 IBO Prior Specification

List of Tables

Table 2.1 Approximate p-values for H0 vs. HA for each team's starting lineup in the 2016-17 NBA regular season.
Table 3.1 Players were independently clustered by shot volume and shot region propensity. The table shows three players in each group after crossing the resulting clusters.
Table 3.2 MCMC details and diagnostics for each fitted model. P1(·) and P2(·) refer to the 1st and 2nd stages of P(·). Specifically, P2(·) refers to the 2nd stage fit on the Cavaliers TPT.
Table 3.3 Out-of-sample log-likelihoods for four models of increasing complexity over each component of the MDP.

List of Figures

Figure 1.1 Decision tree for a binary decision in a probabilistic environment. The top square node represents the initial state. The two circular nodes denote the actions (x and y) available to the agent in state s1. The bottom square nodes represent the next state of the environment after the decision is made (which is a random variable prior to the decision). The solid blue edges connecting the initial state node to the action nodes represent that the agent controls their choice, while the dashed green and red lines show that the subsequent reward and state (respectively) are determined randomly. The different shades of green and red communicate that the probability distributions differ based on which action the agent chooses.

Figure 2.1 Left: overall relationship between field goal attempt rate (x-axis) and points per shot (y-axis). Right: same relationship conditioned on various court regions. The Cleveland Cavaliers 2016-17 starting lineup is highlighted in each plot. The weighted least squares fit of each scatter plot is overlaid in each plot by a dotted line.
Figure 2.2 Deterministic bases resulting from the non-negative matrix factorization of P. The plots are arranged such that the bases closest to the hoop are on the left (e.g. Under Hoop) and the bases furthest from the hoop are on the right (e.g. Arc 3). The residual basis, comprising court locations where shots are infrequently attempted from, is shown in the bottom-right plot.
Figure 2.3 LeBron James 2016-17 FG% posterior mean (left) and posterior standard deviation (right) projected onto the offensive half court. The prediction surfaces shown here and throughout the figures in this paper utilize projections onto a spatial grid of 1 ft. by 1 ft. cells.
Figure 2.4 Left: Kyrie Irving's FGA rate per 36 minutes in the starting lineup (in which he shared the most minutes with LeBron James). Center: Kyrie Irving's FGA rate per 36 minutes in the lineup for which he played the most minutes without LeBron James. Right: The difference of the center surface from the left surface.
Figure 2.5 Top: 20% quantiles of the Cleveland Cavaliers starting lineup posterior distributions of FG% ranks. Middle: medians of these distributions. Bottom: 80% quantiles.
Figure 2.6 Top: Estimated FG% ranks for the Cleveland Cavaliers' starting lineup. Bottom: Deterministic FGA rate ranks.
Figure 2.7 Rank correspondence surfaces for the Cleveland Cavaliers' starting lineup.
Figure 2.8 A toy LPL computation in an arbitrary 3-point court region for the Cleveland Cavaliers' starting lineup. The players are ordered from left to right according to FG% (best to worst). Below each player's picture is the number of actual shots the player took from this location. The black arrows show how the function g(·) reallocates these shots according to the players' FG% ranks. The filled gray dots show the number of shots the player would be allocated according to the proposed optimum. Below the horizontal black line, each player's actual expected points and optimal expected points are calculated by multiplying their FG% by the corresponding number of shots and the point value of the shot. LPL is the difference (in expectation) between the optimal points and the actual points.
Figure 2.9 $\widehat{\mathrm{LPL}}$ and $\widehat{\mathrm{LPL}}^{\mathrm{Shot}}$ surfaces for the Cleveland Cavaliers starting lineup.
Figure 2.10 Histogram of $\sum_{i=1}^{M} \mathrm{LPL}_i$ for the Cleveland Cavaliers starting lineup. 500 posterior draws from each $\xi_{ij}$, where $i \in \{1, \ldots, M\}$ and $j \in \{1, \ldots, 5\}$, were used to compute the 500 variates of $\sum_{i=1}^{M} \mathrm{LPL}_i$ comprising this histogram.
Figure 2.11 Left: 20% quantile LPL surfaces for the Cleveland Cavaliers starting lineup. Middle: median LPL surfaces. Bottom: 80% quantile LPL surfaces. The top rows show $\mathrm{LPL}^{36}$ while the bottom rows show $\mathrm{LPL}^{\mathrm{Shot}}$.
Figure 2.12 $\widehat{\mathrm{PLC}}^{\mathrm{Shot}}$ surfaces for the Cleveland Cavaliers starting lineup.
Figure 2.13 Top: Empirical FG% ranks for the Cleveland Cavaliers starting lineup. Middle: Empirical FGA ranks. Bottom: Rank correspondence.
Figure 2.14 Top: Empirical LPL and $\mathrm{LPL}^{\mathrm{Shot}}$ for the Cleveland Cavaliers starting lineup. Bottom: Empirical PLC for the Cleveland Cavaliers starting lineup.
Figure 2.15 Permutation test for the Cleveland Cavaliers' 2016-17 starting lineup. The gray bars show a histogram of the variates from (2.12). The approximate p-value for the Cavaliers starting lineup (i.e. the proportion of variates that are less than 0) is 1/500 or 0.002.
Figure 2.16 Estimated density of actual points lost per game for every team's 82 games in the 2016-17 NBA regular season.
Figure 2.17 Utah Jazz 2016-17 starting lineup $\widehat{\mathrm{LPL}}$, $\widehat{\mathrm{LPL}}^{\mathrm{Shot}}$, and $\widehat{\mathrm{PLC}}^{\mathrm{Shot}}$ surfaces.
Figure 2.18 Oklahoma City 2016-17 starting lineup $\widehat{\mathrm{PLC}}^{\mathrm{Shot}}$ surfaces.

Figure 3.1 (a) Breakdown of court locations as used in our models and simulations. Rim: Within a six-foot radius of the center of the hoop; Paint: Outside the restricted area but within the key; Midrange: Outside of the paint but within the 3-point line; Corner 3: Beyond the 3-point line but below the break of the 3-point arc; Arc 3: Beyond the 3-point line, above the arc 3 break but within 30 feet of the hoop; Backcourt: All locations beyond the arc 3 region. (b) Empirical league-average shot policies for the 2015–16 NBA regular season. We see lower shot probabilities in the midrange and arc 3 regions because the on-ball events in these regions are dominated by passes and dribbles.
Figure 3.2 Illustration of the components of the MDP for a single player in the context of a basketball play. The blue circles represent states, the solid green circles represent actions (shots) and the curved blue lines represent transition probabilities between states. The green lines of varying width connecting the blue state circles to the green action circles represent the policy. The purple lines connecting the green action circles to the squares represent the reward function. Players may pass the ball to another player (not shown), which is also a transition to a nonterminal state.
Figure 3.3 A concept illustration of P(·) for the Cleveland Cavaliers' most common starting lineup in the 2015–16 NBA regular season. For illustration purposes the row and column space of the TPT has been condensed to five single-player states; however, in our models a typical team has a state space of over 200 states. Each slice represents an approximation of the state-to-state transition probabilities during a 3-second interval of the shot clock. Hence in this example, $p_8^{AE}$ represents the average probability of Kyrie Irving passing the ball to Tristan Thompson when the shot clock is in the interval (21,24].
Figure 3.4 A graphical representation of our model for π(·). The observable random variable is denoted by a gray box while parameters are denoted by unfilled circles. $A_n^{(x,y,z)}$ denotes the action of player x in region y with defensive pressure z during time interval $t_n$. It is governed by the corresponding state's propensity parameter, $\theta_{t_n}^{(x,y,z)}$. These player/region/defense parameters θ are modeled by a multi-stage hierarchical prior, including a layer for position/region/defense parameters β and region/defense parameters γ. The parameters θ, β and γ all have a temporal dimension of $n_{TPT} = 8$, which we convey horizontally in each layer of the graph. These multivariate parameter vectors have latent AR(1) covariance matrices, $\Sigma_\theta$, $\Sigma_\beta$, $\Sigma_\gamma$. As shown by the plates in the figure, player-specific parameters are nested within position-specific parameters, while location and defensive pressure comprise the base model state space.
Figure 3.5 Estimated league-wide transition probability tensor for the top level of the hierarchy on which each team's TPT is built. Within each plot frame the 95% credible interval of the origin to destination transition probability is shown in dark gray and the posterior mean is shown by a black line. Within each plot frame the x-axis represents time on the shot clock, while the y-axis represents the transition probability. Across plot frames the y-axis represents the origin state and the x-axis represents the destination state. Corner 3, paint and rim states are omitted to maintain a practical size for the figure.
Figure 3.6 Estimated shot policies (95% credible intervals) and reward functions (posterior densities) for LeBron James and Kyrie Irving in three sample states. The shot policies are overlaid with dots corresponding to their empirical shot policies; the sizes of the dots are relative to how many shots they took within that time interval from the indicated state. The reward functions are also overlaid with the empirical points per shot and the number of shots they took from each state is given in the legend.
Figure 3.7 300 simulated (gray) season-total transition counts over the shot clock overlaid with the corresponding observed counts (black) for the 2015–2016 season. Within each plot frame, the x-axis represents time on the shot clock, while the y-axis represents total transition counts. Across plot frames, the y-axis represents the origin state and the x-axis represents the destination state.
Figure 3.8 Left to right: distribution of simulated contested mid-range shots, 3-point shots, expected points per shot, and expected points per play.
Figure 3.9 Left to right: distributions of simulated transitions from Irving to James, Irving's total shots, James' total shots, and expected points per play.

Figure 4.1 A multi-arm bandit with six arms. The expected value of the payout for each choice is 0; however, the distributions governing the reward for each arm are different, as shown by the violin plots for each respective action. The colored lines overlaid on each violin plot show the .05 (blue), .25 (purple), .75 (red), and .95 (green) quantiles of each distribution. The left plot shows the reward distributions for a single play, while the right plot shows the distributions of cumulative reward after 30 plays.
Figure 4.2 Estimated game-state value for each down/yard line combination. The x-axis shows the number of yards a team is from their own endzone and the y-axis denotes value. Each line plots the values for a different down as a function of yardline (red-1st down; green-2nd down; blue-3rd down; purple-4th down).
Figure 4.3 The next state value distribution for "go for it" (top), field goal attempt (middle), and punt (bottom) given initial state s = (4th, 62 yard line, 8 to go). In each panel, the x-axis shows possible values of the next state and the y-axis shows the likelihood of any given value. The colored lines denote different quantiles of each respective distribution (blue = 20th quantile, green = expected value, red = 80th quantile). Note that the value of a made field goal is not 3. This is because the team must immediately kick off to the opposing team, essentially diminishing the value of the made field goal.

Figure 5.1 An example round of the experiment. The red target shows the invisible hotspot location and the dots track the subject's search path. The score at the hotspot is the score the subject would be given if they sampled the hotspot. Subjects always begin the search in the center of the click region (shown in blue) to minimize effects of the task-region borders guiding subject search behavior.
Figure 5.2 Left: initial state for an example round of the task. The starting score is 93. Moving in any direction represents exploration, as shown by the dashed green lines. Right: the subject moves to the edge of the move 1 click-boundary, receiving a score of 98. This creates a conflict between exploration and exploitation on move 2. Moving perpendicular to the direction of move 1 represents pure exploration (move 2a), while moving along the same direction of move 1 represents pure exploitation (move 2b). Any move between these extremes represents a combination of exploration and exploitation (move 2c).
Figure 5.3 Move 2 behavior for three subjects in the experiment (Top: Subject 10; Middle: Subject 17; Bottom: Subject 19). The dots show the locations of the subjects' 2nd moves relative to their first moves. For visualization purposes, we display the first move along the horizontal axis. The color gradient denotes the degree of exploration/exploitation. Pure exploration (green) corresponds to moving directly perpendicular on move 2. Pure exploitation (blue) corresponds to moving along the horizontal axis on move 2 (i.e. the same direction as move 1). The left-hand plots show the subjects' moves after receiving a negative change in score on their first move, while the right-hand plots are conditional on a positive change in score from move 0 to move 1.
Figure 5.4 Left: the global objective function for an example round of the game. Right: the objective function in the local region in the click-region of the first move (i.e. zoomed in to the green circle in the left plot).
Figure 5.5 Example posterior predictive mean (left) and standard deviation surfaces (right) for move 2, given an initial score of 93, a move 1 score of 98, and $\sigma_s = 0.01$, projected onto a fine grid over the click-region.
Figure 5.6 Acquisition surfaces for the PI, EI, and UCB functions defined in (5.11)-(5.13) given $r_0 = 93$ and $r_1 = 98$. Each surface is on a different scale: the PI surface (left) is on the probability scale, EI (middle) is in terms of points over 99 (since $\xi_{EI} = 1$), and UCB (right) is in terms of the 95th percentile of the surrogate. The click-region boundary is shown by the black circle encompassing the colored surfaces. The arg max(s) of each surface is denoted by a green star.

Figure 5.7 Acquisition surfaces over the range of possible $\Delta r_1$ values for the PI, EI, and UCB acquisition functions. The horizontal axes denote the change in reward from $m_0$ to $m_1$. The vertical axes show the angle of the second move relative to the first move. Move 2 is assumed to be made on the click-region boundary. Color indicates the acquisition function value for any given $(\Delta r_1, \theta_2)$ pair. High acquisition values are red while low values are shown in blue. In each panel, the pink curve denotes the angle that yields the maximum of the acquisition surface as a function of $\Delta r_1$. The vertical black lines at $\Delta r_1 = 5$ correspond to the circular black lines denoting the click-region boundaries in Figure 5.6. Similarly, the green stars correspond to the stars in Figure 5.6.
Figure 5.8 (a): The left and right plots show the move 2 behavior for subjects 57 and 60 respectively. In each scatterplot, dots represent the subject's move 2 angles given the changes in score on move 1 for each round of the task. The horizontal axis denotes $\Delta r_1$, the change in reward from the starting spot to the first sampled location, and the vertical axis shows $\theta_2$, the angle of the second move relative to the first move. (b): Each plot shows four candidate acquisition curves under various exploration parameter values for the corresponding family: PI (green), EI (blue), and UCB (red).

Figure 5.9 $(\Delta r_1, \theta_2)$ pairs for subjects 46 (left), 54 (middle), and 57 (right). The horizontal axis shows the change in score from move 0 to move 1 while the vertical axis shows the angle of their second move relative to their first move. The subtitle of the plot lists the acquisition family and value of the exploration parameter within that family for the subject's MAP acquisition function. Each scatterplot is overlaid with the subject's corresponding MAP acquisition curve in color. Green curves denote PI, blue denotes EI (not shown), and red denotes UCB. Around each curve is the 95% highest density posterior prediction interval in light gray. The out-of-sample log likelihood of the fit is shown for each subject in the lower right corner of the plot.
Figure 5.10 All pairs of $(\Delta r_1, \theta_2)$ data for subjects 46 (left), 54 (middle), and 57 (right) as shown in Figure 5.9. Each scatterplot is overlaid with the subject's corresponding MAP piecewise augmented acquisition curve in color. Green curves denote PI, blue denotes EI, and red denotes UCB. Around each curve is the 95% highest density posterior prediction interval in light gray. The out-of-sample log likelihood of the fit is shown for each subject in the lower right corner of the plot.
Figure 5.11 Left: the joint prior density of β in experiment 2 as given by (5.25)-(5.26). Right: the joint posterior density of β given $\Delta r_1 = 0$ after the first move. In both plots, the dashed pink circles represent the boundaries of the uniform distribution from which the reward gradient K is drawn for each new round of the task (i.e. $\sqrt{\beta_x^2 + \beta_y^2} = \tfrac{3}{4}$ and $\sqrt{\beta_x^2 + \beta_y^2} = \tfrac{15}{4}$).
Figure 5.12 All pairs of $(\Delta r_1, \theta_2)$ data for subjects 10 (left), 17 (middle), and 19 (right) in the second experiment. The top row shows each subject's data overlaid with their MAP acquisition curve when considering the original candidate set U. In the bottom row the scatter plots are overlaid with the piecewise MAP acquisition curves over the augmented set $\tilde{U}$. Green curves denote PI, blue denotes EI, and red denotes UCB. Around each curve is the 95% highest density posterior prediction interval in light gray. The out-of-sample log likelihood of the fit is shown for each subject in the lower right corner of the plot.

Chapter 1

Introduction

Decisions are a defining characteristic of human experience. Studying the decision-making process as an outsider looking in can be complicated, even for very basic decisions. This is particularly true when all one has to work with are observational data. For example, the variables that influence a person's choices may be unobserved, the decision-maker's goals may be unclear, and the consequences of their actions may be hard to measure. Under these conditions, modeling the dynamics of a decision process solely from observational data can be difficult, if not impossible.

Sports offer an attractive environment to study decision-making from an observational lens. The environment is highly controlled, leaving fewer confounding variables. Furthermore, the agents' objectives are clear: the goal is to win. Despite these appealing features, prescriptive analytics in sports is anything but simple, particularly for spatially-dynamic, continuous-time, multi-agent team sports like basketball, hockey, and soccer. Decisions in these sports not only have to be made constantly (e.g. movement), they are also extremely complex. Players, for example, must subconsciously weigh how every action affects their team's chances of winning, while knowing that the outcomes of their decisions are often distributional in nature (e.g. making or missing a shot) and that there will be hundreds, if not thousands, of additional actions that will take place before the end of the contest. Modeling the decision process in these environments is a significant statistical challenge.

The first two projects in this thesis are statistical applications to basketball. In the first project, we introduce a novel measure of how efficiently players in a basketball lineup collectively allocate shots. In the second project, we combine several models of various basketball processes (e.g. shooting, movement, etc.) to create a basketball play simulator, allowing us to test alterations to shot policies and explore the net changes in efficiency and production. While both of these projects offer practical tools and valuable insights, the data we use are observational, so the prescriptive conclusions that can be drawn are limited.

The ideal setting to study decision-making is a controlled experiment. When carried out properly, experiments allow researchers to isolate and quantify treatment effects by randomizing treatments to subjects. In a decision-making context, this allows researchers to study how people make choices under different conditions of interest. Unfortunately, controlled experiments in sports are hard to come by. An open area of research in basketball analytics, and sports analytics more broadly, is the creation of simulated environments that can reliably quantify the causal effects of strategic intervention (Terner & Franks 2020).

The final project takes a small step toward this goal. It uses a type of controlled experiment that aims to understand decision-making processes in games. The experiment was designed to test how people balance exploration and exploitation when searching for the maximum of an unknown function in two-dimensional continuous space. We propose a framework for learning each subject's risk preferences based on their observed behavior in the game.

1.1 Connecting ideas between projects

The three projects in this thesis each relate to human decision-making in a spatial and temporal environment, but the goals of the analysis in each one are different. In order to connect the ideas presented across the three projects, consider the simple decision tree in Figure 1.1.

[Figure 1.1 diagram: the initial state s1 branches to actions a1 = x and a1 = y, each leading to a random reward R2 and next state S2.]

Figure 1.1: Decision tree for a binary decision in a probabilistic environment. The top square node represents the initial state. The two circular nodes denote the actions (x and y) available to the agent in state s1. The bottom square nodes represent the next state of the environment after the decision is made (which is a random variable prior to the decision). The solid blue edges connecting the initial state node to the action nodes represent that the agent controls their choice, while the dashed green and red lines show that the subsequent reward and state (respectively) are determined randomly. The different shades of green and red communicate that the probability distributions differ based on which action the agent chooses.

The process begins in initial state s1, in which the agent is faced with a binary decision. The solid blue edges connecting the initial state to the action nodes represent the agent's ability to control their choice. However, the agent does not control the consequences of their choice (i.e. the resulting reward and state given their choice of action), as denoted by the dashed green and red lines. These are random variables, denoted by uppercase R2 and S2. The probability distributions governing these random variables come from a latent mapping:

\begin{align}
R_{t+1} &: s_t \times a_t \to F_{R_{t+1}}, \tag{1.1} \\
S_{t+1} &: s_t \times a_t \to F_{S_{t+1}}, \tag{1.2}
\end{align}

where $F_{R_{t+1}}$ denotes the cumulative distribution function (CDF) of the random variable $R_{t+1}$, and $F_{S_{t+1}}$ denotes the CDF of $S_{t+1}$. Assuming certain axioms of rational behavior (Von Neumann & Morgenstern 2007), the decision is an optimization problem where the optimal decision for a given step t of the process can be represented as:

$$a_t^* = \operatorname*{arg\,max}_{a \in \mathcal{A}} \, g(R_{t+1}), \tag{1.3}$$

where g is a function of the probability distribution governing the reward (potentially including future rewards) for a given state-action pair, and $\mathcal{A}$ denotes the binary decision space {x, y}. In practice, g is often the expected value, but other loss functions can be used, including Wald's criterion (Wald 1939), Hurwicz's criterion (Hurwicz 1951), and many others. Eq. (1.3) can equivalently be expressed in terms of regret, the observed deviation from the best possible reward value: $R'_{t+1} = \max(R_{t+1}) - R_{t+1}$. Under the regret formulation, (1.3) becomes:

$$a_t^* = \operatorname*{arg\,min}_{a \in \mathcal{A}} \, h(R'_{t+1}), \tag{1.4}$$

where h is analogous to g defined above.

The three projects in this thesis can all be described in terms of (1.3) and/or (1.4). Leveraging publicly available shot data from the NBA, Chapter 2 introduces a metric called 'lineup points lost' (LPL) to measure how well a lineup collectively allocates its shot attempts. This metric takes a regret formulation; we propose an optimal shot allocation strategy and then measure each lineup's deviation from this optimum. In terms of (1.4), we evaluate how drastically each lineup deviates from $a^*$ by comparing the optimal LPL value to the values of $h(\hat{r}')$ we estimate from the data. Admittedly, since LPL is calculated over the entire season, relating season-aggregate shot allocation behavior to this decision framework is a stretch; the decision process the players actually experience in-game is far more complex. We incorporate more of this complexity into the models of Chapter 3.

In Chapter 3, we utilize high-resolution NBA player tracking data in the formulation and estimation of a Markov decision process (MDP) model of a basketball possession. In an MDP, the analogue of $g(R_{t+1})$ in (1.3) is the action-value function, which is the expected value of the return over an infinite horizon. Rather than estimating the action-value function, we build a basketball possession simulator which can be used to estimate (via sampling) the action-value distribution for any given policy π. We provide a brief overview of MDPs and introduce notation in Section 1.2.

In the final project we take a different approach to the decision analysis. In broad terms, while Chapters 2 and 3 evaluate how far the observed decisions a fall from the optimum $a^*$, Chapter 5 assumes that the agents' observed actions are optimal but that each agent's function g of the reward random variable used in the optimization is latent. The goal of the analysis is to make inference on each agent's g. We term this type of problem, where decisions are assumed to be optimal and the goal is to estimate features of the optimization such that the optimality assumption holds, an inverse decision problem. Specifically, we explore the inverse problem of Bayesian optimization. Based on an agent's observed search paths, the goal is to estimate their latent acquisition function. We introduce a probabilistic solution framework for the problem and illustrate our method on human decision-making behavior from an experiment.
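
To make the notation in (1.3) and (1.4) concrete, the short sketch below works through the binary decision of Figure 1.1 with both g and h taken to be expectations. The reward distributions are invented for illustration and are not drawn from any of the applications in this thesis.

```python
import numpy as np

# Hypothetical reward distributions for the two actions in Figure 1.1:
# each action leads to one of a few reward values with known probabilities.
rewards = {
    "x": (np.array([0, 2, 3]), np.array([0.5, 0.3, 0.2])),  # (values, probabilities)
    "y": (np.array([0, 1, 3]), np.array([0.2, 0.5, 0.3])),
}

# g(R_{t+1}): expected reward of each action, maximized as in (1.3).
expected = {a: float(np.dot(v, p)) for a, (v, p) in rewards.items()}
a_star = max(expected, key=expected.get)

# h(R'_{t+1}): expected regret relative to the best possible reward, minimized as in (1.4).
best_possible = max(v.max() for v, _ in rewards.values())
expected_regret = {a: best_possible - expected[a] for a in rewards}
a_star_regret = min(expected_regret, key=expected_regret.get)

print(expected, a_star)                 # {'x': 1.2, 'y': 1.4} -> choose y
print(expected_regret, a_star_regret)   # the same action is optimal under regret
```

Under the expected-value choice of g and h, the two formulations pick the same action, which is the sense in which (1.4) is simply a re-expression of (1.3).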

1.2 Background

1.2.1 Hierarchical modeling

The first two projects in this thesis are innovations in basketball statistics. These projects rely on statistical models of player skills such as shooting propensity and shot-make probabilities. Because the state spaces of our models are large, small sample sizes pose a significant challenge to estimating parameters; there are many states for which we have little or no data about certain players' skills. Fortunately, in basketball there is a higher-order structure that provides information about players' skills even in the absence of data. Each player is classified with a position, and players within a given position generally exhibit similar roles and high-level behavior on the court. Statistical methods have been developed to model data with these characteristics; in instances when some model parameters are known to be related (i.e. there exists dependence among them in the joint likelihood), Bayesian hierarchical models can reduce noise in the parameter estimates (Efron & Morris 1975). We leverage this type of model in the basketball applications via position information.

Let yij be an observation (e.g. a made or missed shot by player j) and let θj be a parameter governing the data generating process for yij (e.g. player j's shooting skill), where i ∈ {1, . . . , nj} and j ∈ {1, . . . , J}. Assume the yij are i.i.d. for a given j. While the latent parameters θ1, θ2, . . . , θJ are fixed, in the Bayesian paradigm we express our uncertainty about them via probability distributions called prior distributions. Assuming Θ = {θ1, θ2, . . . , θJ} are related (e.g. players 1 through J share the same position), we assume a common prior distribution governed by a hyperparameter ϕ. The uncertainty in ϕ is also modeled with a probability distribution, which we term a hyperprior distribution. A two-stage Bayesian hierarchical model can be formulated as follows:

\begin{align}
\text{Likelihood:} \quad & y_{ij} \mid \theta_j, \phi \sim f(y_{ij} \mid \theta_j, \phi) \tag{1.5} \\
\text{Stage I Prior:} \quad & \theta_j \mid \phi \sim P(\theta_j \mid \phi) \tag{1.6} \\
\text{Stage II Prior:} \quad & \phi \sim P(\phi). \tag{1.7}
\end{align}

The likelihood depends on the joint prior distribution P(θj, ϕ), but the dependence on ϕ only occurs through θj. Therefore, by Bayes' theorem and the definition of conditional probability, the posterior distribution is proportional to:

\begin{align}
P(\phi, \Theta \mid y) &\propto \prod_{j=1}^{J} \prod_{i=1}^{n_j} f(y_{ij} \mid \theta_j, \phi) \, P(\theta_j, \phi) \tag{1.8} \\
&\propto \prod_{j=1}^{J} \prod_{i=1}^{n_j} f(y_{ij} \mid \theta_j) \, P(\theta_j \mid \phi) \, P(\phi). \tag{1.9}
\end{align}

This hierarchical structure is critical for the models we use in the basketball applications comprising Chapters 2 and 3. These models have massive player-specific parameter spaces, and many of the model parameters are not directly informed by the data. This happens for a number of reasons. For example, some players only play a handful of minutes over the course of a season. Even for high-usage players, there are some model states that are rarely visited, such as shots taken well behind the 3-point line. Often, the only information available for these parameters comes through the hierarchical structure imposed via the prior distribution specification. This induces "shrinkage" of these players' parameter estimates toward the estimates of the players they are related to, as defined by the hierarchical setup of the model. As will be seen in Chapter 3, this can lead to large improvements in model fit.
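
As a loose illustration of this pooling effect, the sketch below simulates shot data for a single position group and shrinks each player's raw FG% toward the position mean. For simplicity it substitutes a beta-binomial, empirical-Bayes analogue for the logit-normal hierarchies actually used in Chapters 2 and 3; all player counts and hyperparameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a "position" of J players whose true FG% values come from a common
# prior (Stage I), then binomial shot outcomes for each player (likelihood).
J = 30
theta = rng.beta(20.0, 20.0, J)            # Stage I: latent player skills
attempts = rng.integers(5, 400, J)         # very uneven sample sizes
makes = rng.binomial(attempts, theta)

# No pooling: each player's raw FG%.
raw = makes / attempts

# Empirical Bayes: estimate beta hyperparameters by the method of moments,
# then shrink each player's estimate toward the position mean.
m, v = raw.mean(), raw.var()
common = m * (1 - m) / v - 1               # estimate of alpha + beta
alpha_hat, beta_hat = m * common, (1 - m) * common
shrunk = (alpha_hat + makes) / (alpha_hat + beta_hat + attempts)

# Players with few attempts move the most toward the position mean.
for j in np.argsort(attempts)[:3]:
    print(f"attempts={attempts[j]:3d}  raw={raw[j]:.3f}  shrunk={shrunk[j]:.3f}")
```

The low-attempt players are pulled strongly toward the group mean while high-volume players are left essentially at their raw rates, which is the behavior the hierarchical priors in the following chapters are designed to produce.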

1.2.2 Markov decision processes

Markov decision processes (MDPs) are utilized in many modern reinforcement learning problems to characterize interactions between an agent and their environment. We will restrict our attention to finite MDPs, which can be represented as ⟨S, A, P(·), R(·)⟩, where S represents a discrete and finite set of states, A represents a finite set of actions the agent can take, P(·) defines the transition probabilities between states, and R(·) defines the immediate reward the agent receives for any given state/action pair. The agent operates in the environment according to a policy, π(·), which defines the probabilities that govern the agent's choice of action based on the current state of the environment. The policy is the only aspect of the system that the agent controls. We can define these functions succinctly in mathematical terms:

\begin{align}
P(s, a, s') &= \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a\right], \tag{1.10} \\
R(s, a) &= \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right], \tag{1.11} \\
\pi(s, a) &= \mathbb{P}\left[A_t = a \mid S_t = s\right], \tag{1.12}
\end{align}

where nonitalicized uppercase letters (e.g., St, At and Rt+1) represent random variables and italicized lowercase letters represent realized values of the corresponding random variables.

Specifically, St represents the agent's state in step t of an episode from the MDP, At represents the action taken in that step and Rt+1 is the subsequent reward given that state-action pair. Typically, the agent's goal is to maximize their cumulative rewards over the long run by making iterative modifications to their current policy in search of an optimal policy, guided by the feedback they receive from their actions. In order to operationalize the search for an optimal policy, we require a way to value state-action pairs not only with respect to the immediate reward from taking action a in state s, but also with respect to the additional cumulative rewards that can be expected by following a given policy π thereafter. We term this the action-value function, which can be defined as:

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right], \tag{1.13}$$

where γ ∈ [0, 1] is a discount factor which modifies the value of future rewards. Note that the left-hand side of (1.13) does not depend on t. Solving the MDP amounts to finding the optimal policy π∗ ∈ Π for which

$$q_{\pi^*}(s, a) = \max_{\pi \in \Pi} q_\pi(s, a) \tag{1.14}$$

for all s ∈ S and a ∈ A. There are many methods that can be used to solve (1.14), such as dynamic programming, Monte Carlo methods, and temporal-difference learning. For an expansive introduction to these methods, as well as many more topics relating to reinforcement learning and Markov decision processes, we refer the reader to Sutton & Barto (2018) and Puterman (2014).
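
The sketch below makes (1.10)-(1.14) concrete on a deliberately tiny MDP with two states, two actions, and invented transition probabilities and rewards (it is unrelated to the basketball MDP of Chapter 3). It evaluates q_π by iterating the Bellman expectation equation and then improves the policy greedily, a bare-bones form of dynamic programming.

```python
import numpy as np

# Toy finite MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, 3.0]])

def action_values(policy, tol=1e-10):
    """Evaluate q_pi(s, a) for a stochastic policy pi[s, a] by iterating
    q = R + gamma * P v_pi, where v_pi(s') = sum_a' pi(s', a') q(s', a')."""
    q = np.zeros((n_states, n_actions))
    while True:
        v = (policy * q).sum(axis=1)
        q_new = R + gamma * P @ v
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

# Start from a uniform-random policy and improve it greedily (policy iteration).
policy = np.full((n_states, n_actions), 0.5)
for _ in range(20):
    q = action_values(policy)
    greedy = np.eye(n_actions)[q.argmax(axis=1)]   # deterministic greedy policy
    if np.allclose(greedy, policy):
        break
    policy = greedy

print(q)        # action values of the converged (optimal) policy
print(policy)   # one-hot optimal action per state
```

In Chapter 3 the state space is far larger and the transition probabilities vary with the shot clock, so rather than solving (1.14) directly we simulate episodes from the estimated MDP; the loop above is only meant to fix the notation.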

In Chapter 3, we utilize high-resolution NBA player tracking data in the formulation and estimation of a Markov decision process (MDP) model of a basketball possession. However, rather than estimating the action-value function, we build a basketball possession simulator which can be used to estimate the action-value distribution for any given policy π. We show how the simulator can be used to test realistic alternate policies in comparison to the on-policy setting, thus informing decision-making for players and coaches.

1.2.3 Bayesian optimization

Consider the problem of finding the maximum of an unknown objective function f over a bounded space $\mathcal{X} \subset \mathbb{R}^d$:

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} f(x). \tag{1.15}$$

Bayesian optimization provides an efficient framework to balance exploration and exploitation in a sequential search for the optimum (Jones et al. 1998). Two components make up the Bayesian optimization paradigm: 1) a statistical model, or surrogate, $\hat{f}$, by which we approximate (and update, as more information is obtained) the latent objective, and 2) an acquisition function, which defines how we synthesize our uncertainty about f (as represented by $\hat{f}$) when selecting a new location to sample.

In most Bayesian optimization applications, the default for $\hat{f}$ is a Gaussian process. This is sensible in cases where the latent response is assumed to be non-linear; however, for our application (presented in Chapter 5) the shape of the objective function is approximately linear in the region in which the optimizer is constrained to explore. As additional locations are sampled and new feedback is received, the surrogate is updated to reflect the additional knowledge gained about the objective function.

After updating the surrogate, the optimizer selects a new location to sample in search of the optimum. This process is governed by a criterion called the "acquisition function", which we denote by u(·). For a given step t of an optimization routine, the acquisition function is a function of a proposed location x and the updated surrogate, $\hat{f}_t$ (which itself is a function of the data and outcomes already observed). Acquisition functions are typically constructed such that high values of the function correspond to potentially high values of the objective, either because the predicted mean is high, the uncertainty is high, or both (Brochu et al. 2010). The experimenter maximizes this function over the space of potential locations $\mathcal{X}$ to obtain a new location to sample:

$$x_{t+1} = \operatorname*{arg\,max}_{x \in \mathcal{X}} u(x \mid \mathcal{D}_t, \hat{f}_t), \tag{1.16}$$

where $\hat{f}_t$ represents the updated surrogate at step t, and $\mathcal{D}_t$ represents the locations and outcomes already observed (i.e. $\{X_{0:t}, y_{0:t}\}$, where $y_t = f(x_t)$). After the objective is sampled at the resulting location, the surrogate is updated and the process is repeated until some stopping criterion is reached. Algorithm 1 shows pseudo-code for this process.

Algorithm 1: Bayesian optimization

Input: $\mathcal{D}_0$, $\hat{f}_0$, $u$
Output: $x^+ = \operatorname{arg\,max}_{x_t \in X_{0:T}} f(x_t)$

for t = 0, ..., T - 1 do
    determine new location: $x_{t+1} = \operatorname{arg\,max}_{x \in \mathcal{X}} u(x \mid \hat{f}_t, \mathcal{D}_t)$
    sample objective: $y_{t+1} = f(x_{t+1})$
    augment data: $\mathcal{D}_{t+1} = \{\mathcal{D}_t, (x_{t+1}, y_{t+1})\}$
    update surrogate: $\hat{f}_{t+1} = \hat{f}_t \mid \mathcal{D}_{t+1}$
end
return $x^+ = \operatorname{arg\,max}_{x_t \in X_{0:T}} f(x_t)$
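
A minimal one-dimensional Python sketch of Algorithm 1 follows, assuming a Gaussian process surrogate (via scikit-learn) and an upper confidence bound acquisition maximized over a grid of candidate locations. The objective function, kernel length-scale, UCB weight, and iteration count are all arbitrary choices made only for illustration; they are not the surrogate or acquisition used in Chapter 5.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def f(x):                        # latent objective (unknown to the optimizer)
    return -np.sin(3 * x) - x**2 + 0.7 * x

def ucb(mu, sd, kappa=2.0):      # acquisition u(.): mean plus an exploration bonus
    return mu + kappa * sd

grid = np.linspace(-2.0, 2.0, 401).reshape(-1, 1)   # candidate locations X
X = rng.uniform(-2.0, 2.0, (2, 1))                  # D_0: two initial samples
y = f(X).ravel()

for t in range(15):
    # Update the surrogate f_hat_t given D_t.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
    gp.fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    # Choose x_{t+1} by maximizing the acquisition over the candidate grid.
    x_next = grid[np.argmax(ucb(mu, sd))]
    # Sample the objective and augment the data.
    X = np.vstack([X, x_next.reshape(1, 1)])
    y = np.append(y, f(x_next)[0])

best = X[np.argmax(y)]
print(f"best sampled location: {best[0]:.3f}, value: {y.max():.3f}")
```

Swapping the UCB line for probability of improvement or expected improvement changes only the acquisition step; the surrogate update and data augmentation steps are unchanged, which is the structure Chapter 5 inverts when inferring a subject's acquisition function from their observed choices.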

Chapter 2

Measuring Spatial Allocative Efficiency in Basketball

∗ A version of Chapter 2 is pending publication in the Journal of Quantitative Analysis in Sports. The paper was coauthored with Jacob Mortensen and Luke Bornn.

2.1 Introduction

From 2017 to 2019, the Oklahoma City Thunder faced four elimination games across three playoff series. In each of these games, Russell Westbrook attempted over 30 shots and had an average usage rate of 45.5%.1 The game in which Westbrook took the most shots came in the first round of the 2017-18 National Basketball Association (NBA) playoffs, where he scored 46 points on 43 shot attempts in a 96-91 loss to the Utah Jazz. At the time, many popular media figures conjectured that having one player dominate field goal attempts in this way would limit the Thunder’s success. While scoring 46 points in a playoff basketball game is an impressive feat for any one player, its impact on the overall game score is moderated by the fact that it required 43 attempts. Perhaps not coincidentally, the Thunder lost three of these four close-out games and never managed to make it out of the first round of the playoffs. At its core, this critique is about shot efficiency. The term ‘shot efficiency’ is used in various contexts within the basketball analytics community, but in most cases it has some reference to the average number of points a team or player scores per shot attempt. Modern discussion around shot efficiency in the NBA typically focuses on either shot selection or individual player efficiency. The concept of shot selection efficiency is simple: 3-pointers and shots near the rim have the highest expected points per shot, so teams should prioritize these high-value shots. The idea underlying individual player efficiency is also straightforward;

1. Usage percentage is an estimate of the percentage of team plays used by a player while they were on the floor. For a detailed formula see www.basketball-reference.com/about/glossary.html

scoring more points on the same number of shot attempts increases a team's overall offensive potential. However, when discussing a player's individual efficiency it is critical to do so in the context of the lineup. Basketball is not a 1-v-1 game, but a 5-v-5 game. Therefore, when a player takes a shot, the opportunity cost not only includes all other shots this player could have taken later in the possession, but also the potential shots of their four teammates. So regardless of a player's shooting statistics relative to the league at large, a certain dimension of shot efficiency can only be defined relative to the abilities of a player's teammates. Applying this to the Oklahoma City Thunder example above, if Westbrook were surrounded by dismal shooters, 43 shot attempts might not only be defensible but also desirable. On the other hand, if his inordinate number of attempts prevented highly efficient shot opportunities for his teammates, then he caused shots to be inefficiently distributed and decreased his team's scoring potential. This aspect of efficiency, the optimal way to allocate shots within a lineup, is the primary focus of our paper.

Allocative efficiency is spatially dependent. As illustrated in Figure 2.1, the distribution of shots within a lineup is highly dependent on court location. The left plot in Figure 2.1 shows the overall relationship between shooting frequency (x-axis) and shooting skill (y-axis), while the four plots on the right show the same relationship conditioned on various court regions. Each dot represents a player, and the size of the dot is proportional to the number of shots the player took over the 2016-17 NBA regular season. To emphasize how shot allocation within lineups is spatially dependent, we have highlighted the Cleveland Cavaliers starting lineup, consisting of LeBron James, Kevin Love, Kyrie Irving, JR Smith, and Tristan Thompson.

[Figure 2.1 panels: Overall FGA Rate by Points Per Shot (PPS); Restricted Area; 3-pointers; Paint; Mid-range. Axes: Field Goal Attempt Rate (per 36 minutes) vs. Points per Shot.]

Figure 2.1: Left: overall relationship between field goal attempt rate (x-axis) and points per shot (y-axis). Right: same relationship conditioned on various court regions. The Cleveland Cavaliers 2016-17 starting lineup is highlighted in each plot. The weighted least squares fit of each scatter plot is overlaid in each plot by a dotted line.

When viewing field goal attempts without respect to court location (left plot), Kyrie Irving appears to shoot more frequently than both Tristan Thompson and LeBron James, despite scoring fewer points per shot (PPS) than either of them. However, after conditioning on court region (right plots), we see that Irving only has the highest FGA rate in the mid-range region, which is the region for which he has the highest PPS for this lineup. James takes the most shots in the restricted area and paint regions, the regions in which he is the most efficient scorer. Furthermore, we see that Thompson's high overall PPS is driven primarily by his scoring efficiency from the restricted area and that he has few shot attempts outside this area. Clearly, understanding how to efficiently distribute shots within a lineup must be contextualized by spatial information.

Notice that in the left panel of Figure 2.1, the relationship between field goal attempt (FGA) rate and PPS appears to be slightly negative, if there exists a relationship at all. Once the relationship between FGA rate and PPS is spatially disaggregated (see the right-hand plots of Figure 2.1), the previously negative relationship between these variables becomes positive in every region. This instance of Simpson's paradox has non-trivial implications in the context of allocative efficiency which we will discuss in the following section.

The goal of our project is to create a framework to assess the strength of the relationship between shooting frequency and shooting skill spatially within lineups and to quantify the consequential impact on offensive production. Using novel metrics we develop, we quantify how many points are being lost through inefficient spatial lineup shot allocation, visualize where they are being lost, and identify which players are responsible.
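
The following toy calculation illustrates the Simpson's paradox noted above with invented numbers for two hypothetical players and two court regions: within each region the player who shoots more also scores more per shot, yet in the aggregate the higher-volume shooter has the lower points per shot.

```python
# Hypothetical per-36-minute shot profiles for two players in two court regions.
fga = {"A": {"rim": 8, "mid": 1}, "B": {"rim": 2, "mid": 9}}
pps = {"A": {"rim": 1.30, "mid": 0.75}, "B": {"rim": 1.20, "mid": 0.85}}

for region in ["rim", "mid"]:
    for p in ["A", "B"]:
        print(f"{region:>3} {p}: FGA/36 = {fga[p][region]:2d}, PPS = {pps[p][region]:.2f}")

for p in ["A", "B"]:
    total_fga = sum(fga[p].values())
    total_pts = sum(fga[p][r] * pps[p][r] for r in fga[p])
    print(f"overall {p}: FGA/36 = {total_fga:2d}, PPS = {total_pts / total_fga:.2f}")
# Within both regions, the higher-volume shooter has the higher PPS, but overall
# player B shoots more (11 vs. 9 FGA/36) while scoring less per shot (0.91 vs. 1.24),
# because B's shot diet is concentrated in the lower-value region.
```

The sign flip arises purely from how the two players distribute their attempts across regions, which is why the metrics in this chapter are built on spatially disaggregated FG% and FGA surfaces.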

11 2.1.1 Related work

In recent years, a number of metrics have been developed which aim to measure shot efficiency, such as true shooting percentage (Kubatko et al. 2007), qSQ, and qSI (Chang et al. 2014). Additionally, metrics have been developed to quantify individual player efficiency, such as Hollinger's player efficiency rating (Sports Reference LLC n.d.). While these metrics intrinsically account for team context, there have been relatively few studies which have looked at shooting decisions explicitly in the context of a lineup, and none spatially.

Goldman & Rao (2011) coined the term 'allocative efficiency', modeling the decision to shoot as a dynamic mixed-strategy equilibrium weighing both the continuation value of a possession and the value of a teammate's potential shot. They propose that a team achieves optimal allocative efficiency when, at any given time, the lineup cannot reallocate the ball to increase productivity on the margin. Essentially, they argue that lineups optimize over all dimensions of an offensive strategy to achieve equal marginal efficiency for every shot. The left plot of Figure 2.1 is harmonious with this theory: there appears to be no relationship between player shooting frequency and player shooting skill when viewed on the aggregate. However, one of the most important dimensions the players optimize over is court location. Once we disaggregate the data by court location, as shown in the right plots of Figure 2.1, we see a clear relationship between shooting frequency and shooting skill. A unique contribution of our work is a framework to assess this spatial component of allocative efficiency.

Shot satisfaction (Cervone et al. 2016) is another rare example of a shot efficiency metric that considers lineups. Shot satisfaction is defined as the expected value of a possession conditional on a shot attempt (accounting for various contextual features such as the shot location, shooter, and defensive pressure at the time of the shot) minus the unconditional expected value of the play. However, since shot satisfaction is marginalized over the allocative and spatial components, these factors cannot be analyzed using this metric alone. Additionally, shot satisfaction is dependent on proprietary data, which limits its availability to a broad audience.

2.1.2 Data and code

The data used for this project are publicly available from the NBA stats API (stats.nba.com). Shooter information and shot (x, y) locations are available through the 'shotchartdetail' API endpoint, while lineup information can be constructed from the 'playbyplayv2' endpoint. Code for constructing lineup information from play-by-play data is available at: https://github.com/jwmortensen/pbp2lineup. Using this code, we gathered a set of 224,567 shots taken by 433 players during the 2016-17 NBA regular season, which is the data used in this analysis. Code used to perform an empirical version of the analysis presented in this paper is also available online: https://github.com/nsandholtz/lpl.

12 2.2 Models

The foundation of our proposed allocative efficiency metrics rests on spatial estimates of both player FG% and field goal attempt (FGA) rates. With some minor adjustments, we implement the FG% model proposed in Cervone et al. (2016). As this model is the backbone of the metrics we propose in Section 2.3, we thoroughly detail the components of their model in Section 2.2.1. In Section 2.2.2, we present our model for estimating spatial FGA rates.

2.2.1 Estimating FG% surfaces

Player FG% is a highly irregular latent quantity over the court space. In general, players make more shots the closer they are to the hoop, but some players are more skilled from a certain side of the court and others specialize from very specific areas, such as the corner 3-pointer. In order to capture these kinds of non-linear relationships, Cervone et al. (2016) summarize the spatial variation in player shooting skill by a Gaussian process represented by a low-dimensional set of deterministic basis functions. Player-specific weights are estimated for the basis functions using a Bayesian hierarchical model (Gelman et al. 2013). This allows the model to capture nuanced spatial features that player FG% surfaces tend to exhibit, while maintaining a feasible dimensionality for computation.

We model the logit of π_j(s), the probability that player j makes a shot at location s, as a linear model:

log( π_j(s) / (1 − π_j(s)) ) = β′x + Z_j(s).    (2.1)

Here β is a 4 × 1 vector of covariate effects and x is a 4 × 1 vector of observed covariates for the shot containing an intercept, player position (i.e. center, power forward, small forward, shooting guard, and point guard), shot distance, and the interaction of player position and shot distance. Z_j(s) is a Gaussian process which accounts for the impact of location on the probability of player j making a shot and is modeled using a functional basis representation,

Z_j(s) = w_j′ Λ Ψ(s),    (2.2)

where w_j = (w_j1, ..., w_jD)′ denotes the latent basis function weights for player j and Λ Ψ(s) denotes the basis functions. Specifically, Λ = (λ_1, ..., λ_D)′ is a D × K matrix, where each row vector λ_d represents the projection of the dth basis function onto a triangular mesh with K vertices over the offensive half court (more details on the construction of Λ follow below). We use the mesh proposed in Cervone et al. (2016), which was selected specifically for modeling offensive spatial behaviour in basketball. Ψ(s) = (ψ_1(s), ..., ψ_K(s))′ is itself a vector of basis functions where each ψ_k(s) is 1 at mesh vertex k, 0 at all other vertices, and values at the interior points of each triangle are determined by linear interpolation between vertices (see Lindgren et al. (2011) for details). Finally, we assume w_j ∼ N(ω_j, Σ_j), which

makes (2.2) a Gaussian process with mean ω_j′ Λ Ψ(s) and covariance function Cov(s_1, s_2) = Ψ(s_1)′ Λ′ Σ_j Λ Ψ(s_2). Following Miller et al. (2014), the bases of shot taking behavior, Λ, are computed through a combination of smoothing and non-negative matrix factorization (NMF) (Lee & Seung 1999). Using integrated nested Laplace approximation (INLA) as the engine for our inference, we first fit a log Gaussian Cox process (LGCP) (Banerjee et al. 2015) independently to each player's point process defined by the (x, y) locations of their made shots using the aforementioned mesh.² Each player's estimated intensity function is evaluated at each vertex, producing a K-dimensional vector for each of the L = 433 players in our data. These vectors are exponentiated and gathered (by rows) into the L × K matrix P, which we then factorize via NMF:

P ≈ B Λ,    (2.3)

where B is L × D and Λ is D × K.

This yields Λ, the deterministic bases we use in (2.2). While the bases from (2.3) are constructed solely with respect to the spatial variation in the FGA data (i.e. no basketball-specific structures are induced a priori), the constraint on the number of bases significantly impacts the basis shapes. In general, the NMF tends to first generate bases according to shot distance. After accounting for this primary source of variation, other systematic features of variation begin to appear in the bases, notably asymmetry. We use D = 16 basis functions, aligning with Miller et al. (2014) which suggests the optimal number of basis functions falls between 15 and 20. Collectively, these bases comprise a comprehensive set of shooting tendencies, as shown in Figure 2.2. We have added labels post hoc to provide contextual intuition.
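As a rough illustration of the factorization step in (2.3), the sketch below assumes a precomputed L × K matrix P of exponentiated player intensity surfaces (a hypothetical object here) and uses the NMF package; it is a minimal sketch, not the exact pipeline used in the thesis.

    # Sketch of the basis construction in (2.3). `P` (an L x K matrix of exponentiated
    # player intensity surfaces evaluated at the mesh vertices) is assumed to be available.
    library(NMF)
    D <- 16
    nmf_fit <- nmf(P, rank = D, seed = 1)   # non-negative matrix factorization: P ~ B Lambda
    B      <- basis(nmf_fit)                # L x D player loadings (used below for the CAR prior)
    Lambda <- coef(nmf_fit)                 # D x K deterministic bases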

² Players who took fewer than five shots in the regular season are treated as “replacement players.”

[Figure 2.2 panel labels — top row: Under Hoop, Hoop Right, Lower Paint, Top of Key, Right Baseline, Right Corner 3, Right Arc 3, Center Arc 3; bottom row: Hoop Front, Hoop Left, Upper Paint, Elbow Jumpers, Left Baseline, Left Corner 3, Left Arc 3, Residual.]

Figure 2.2: Deterministic bases resulting from the non-negative matrix factorization of P. The plots are arranged such that the bases closest to the hoop are on the left (e.g. Under Hoop) and the bases furthest from the hoop are on the right (e.g. Center Arc 3). The residual basis, comprising court locations from which shots are infrequently attempted, is shown in the bottom-right plot.

Conceptually, the Z_j(s) term in (2.1) represents a player-specific spatial ‘correction’ to the global regression model β′x. These player-specific surfaces are linear combinations of the bases shown in Figure 2.2. The weights of these combinations, w_j, are latent parameters which are jointly estimated with β. Since these player weights can be highly sensitive for players with very little data, it is imperative to introduce a regularization mechanism on them, which is accomplished using a conditionally autoregressive (CAR) prior. Conveniently, the NMF in (2.3) provides player-specific loadings onto these bases, B, which we use in constructing this CAR prior on the basis weights, w_j (Besag 1974). The purpose of using a CAR prior on the basis weights is to shrink the FG% estimates of players with similar shooting characteristics toward each other. This is integral for obtaining realistic FG% estimates in areas where a player took a low volume of shots. With only a handful of shots from an area, a player's empirical FG% can often be extreme (e.g. near 0% or 100%). The CAR prior helps to regularize these extremes by borrowing strength from the player's neighbors in the estimation. In order to get some notion of shooting similarity between players, we calculate the Euclidean distance between the player loadings contained in B and, for a given player, define the five players with the closest player loadings as their neighbors. This is intentionally chosen to be fewer than the number of neighbors selected by Cervone et al. (2016), recognizing that more neighbors define a stronger prior and limit player-to-player variation in the FG% surfaces. We enforce symmetry in the nearest-neighbors relationship by assuming that if player j is a neighbor of player ℓ, then player ℓ is also a neighbor of player j, which results in some players having more than five neighbors. These relationships are encoded in a

player adjacency matrix H where entry (j, ℓ) is 1 if player ℓ is a neighbor of player j and 0 otherwise. The CAR prior on w_j can be specified as

(w_j | w_{−(j)}, τ²) ∼ N( (1/n_j) Σ_{ℓ: H_jℓ = 1} w_ℓ, (τ²/n_j) I_D ),    (2.4)
τ² ∼ InvGam(1, 1),

where n_j is the total number of neighbors for player j. Lastly, we set a N(0, 0.001 × I) prior on β, and fit the model using INLA. This yields a model that varies spatially and allows us to predict player-specific FG% at any location in the offensive half court. In order to get high resolution FG% estimates, we partition the court into 1 ft. by 1 ft. grid cells (yielding a total of M = 2350 cells) and denote player j's FG% at the centroid of grid cell i as ξ_ij. The projection of the FG% posterior mean (ξ̂_j) for LeBron James is depicted in Figure 2.3. In order to have sufficient data to reliably estimate these surfaces, we assume that player FG%s are lineup independent. We recognize this assumption may be violated in some cases, as players who draw significant defensive attention can improve the FG%s of their teammates by providing them with more unguarded shot opportunities. Additionally, without defensive information about the shot opportunities, the FG% estimates are subject to systematic bias. Selection bias is introduced by unequal amounts of defensive pressure applied to shooters of different skill levels. The Bayesian modeling framework can amplify selection bias as well. Since the FG% estimates are regularized in our model via a CAR prior, players' FG% estimates shrink toward their neighbors (which we have defined in terms of FGA rate). While this feature stabilizes estimates for players with low sample sizes, it can be problematic when entire neighborhoods have low sample sizes from specific regions. For example, there are many centers who rarely or never shoot from long range. Consequently, the entire neighborhood shrinks toward the global mean 3-point FG%, inaccurately inflating these players' FG%s beyond the 3-point line. These are intriguing challenges and represent promising directions for future work.
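To make the neighbor construction described above concrete, the following is a minimal base-R sketch of building the symmetric adjacency matrix H from the NMF loadings; `B` is assumed to be the L × D loading matrix from (2.3).

    # Sketch of the neighbor construction for the CAR prior, assuming `B` is the L x D
    # matrix of player loadings from the NMF in (2.3).
    d_mat <- as.matrix(dist(B))            # Euclidean distances between player loadings
    L <- nrow(B)
    H <- matrix(0L, L, L)
    for (j in 1:L) {
      nbrs <- order(d_mat[j, ])[2:6]       # five nearest neighbors (position 1 is the player himself)
      H[j, nbrs] <- 1L
    }
    H <- pmax(H, t(H))                     # enforce symmetry; some players gain extra neighbors
    n_j <- rowSums(H)                      # neighbor counts n_j used in (2.4)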

Figure 2.3: LeBron James 2016-17 FG% posterior mean (left) and posterior standard de- viation (right) projected onto the offensive half court. The prediction surfaces shown here and throughout the figures in this paper utilize projections onto a spatial grid of 1 ft. by 1 ft. cells.

2.2.2 Determining FGA rate surfaces

We determine a player’s FGA rate surface by smoothing their shot attempts via a LGCP. This model has the form

log λ(s) = β_0 + Z(s),

where λ(s) is the Poisson intensity indicating the number of expected shots at location s, β_0 is an intercept, and Z(s) is a Gaussian process. We fit this model separately for each player using INLA, following the approach in Simpson et al. (2015). In brief, they demonstrate that the likelihood for the LGCP can be approximated using a finite-dimensional Gaussian random field, allowing Z(s) to be represented by the basis function expansion Z(s) = Σ_{b=1}^{B} z_b ϕ_b(s). The basis function ϕ_b(s) projects shot location onto a triangular mesh akin to the one detailed for (2.2). The expected value of λ(s) integrated over the court is equal to the number of shots a player has taken; however, there can be small discrepancies between the fitted intensity function and the observed number of shots. In order to ensure consistency, we scale the resulting intensity function to exactly yield the player's observed number of shot attempts in that lineup. We normalize the surfaces to FGA per 36 minutes by dividing by the total number of minutes played by the associated lineup and multiplying by 36, allowing us to make meaningful comparisons between lineups that differ in the number of minutes played. As with the FG% surfaces (ξ), we partition the court into 1 ft. by 1 ft. grid cells and denote player j's FGA rate at the centroid of grid cell i as A_ij. Note that we approach the FGA rate estimation from a fundamentally different perspective than the FG% estimation. We view a player's decision to shoot the ball as being completely within their control and hence non-random. As such, we incorporate no uncertainty in the estimated surfaces. We use the LGCP as a smoother for observed shots rather

than as an estimate of a player's true latent FGA rate. Other smoothing methods (e.g. kernel-based methods (Diggle 1985)) could be used instead. A player's shot attempt profile can vary drastically from lineup to lineup. Figure 2.4 shows Kyrie Irving's estimated FGA rate surfaces in the starting lineup (left) and the lineup in which he played the most minutes without LeBron James (middle). Based on these two lineups, Irving took 9.2 more shots per 36 minutes when he didn't share the court with James. He also favored the left side of the court far more, a region which James tends to dominate when on the court.

Figure 2.4: Left: Kyrie Irving’s FGA rate per 36 minutes in the starting lineup (in which he shared the most minutes with LeBron James). Center: Kyrie Irving’s FGA rate per 36 minutes in the lineup for which he played the most minutes without LeBron James. Right: The difference of the center surface from the left surface.

Clearly, player shot attempt rates are not invariant to their teammates on the court. We therefore restrict player FGA rate estimation to lineup-specific data. Fortunately, the additional sparsity introduced by conditioning on lineup is a non-issue. If a player has no observed shot attempts from a certain region (e.g., Tristan Thompson from 3-point range), this simply means they chose not to shoot from that region—we don't need to borrow strength from neighboring players to shed light on this area of “incomplete data”.
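As a concrete illustration of how a fitted intensity surface becomes a lineup-specific FGA rate surface (the rescaling and per-36 normalization described earlier in this subsection), here is a minimal sketch; `lambda`, `n_shots`, and `minutes` are hypothetical inputs for a single player-lineup combination.

    # Sketch of the rescaling and per-36 normalization of an FGA rate surface. `lambda`
    # (fitted intensity at the M grid-cell centroids), `n_shots` (the player's observed
    # attempts in that lineup), and `minutes` (the lineup's minutes) are hypothetical inputs.
    lambda_scaled <- lambda * n_shots / sum(lambda)   # match the observed number of attempts
    A_per36 <- lambda_scaled / minutes * 36           # attempts per 36 minutes in each grid cell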

2.3 Allocative efficiency metrics

The models for FG% and FGA rate described in Section 2.2 are the backbone of the allocative efficiency metrics we introduce in this section: lineup points lost (LPL) and player LPL contribution (PLC). Before getting into the details, we emphasize that these metrics are agnostic to the underlying FG% and FGA models; they can be implemented using even crude estimates of FG% and FGA rate, for example, by dividing the court into discrete regions and using the empirical FG% and FGA rate within each region. Also note that the biases affecting FG% and FGA rate described in Section 2.2 may affect the allocative efficiency metrics as well. Section 2.4 includes a discussion of the causal limitations of the approach.

LPL is the output of a two-step process. First, we redistribute a lineup's observed distribution of shot attempts according to a proposed optimum. This optimum is based on ranking the five players in the lineup with respect to their FG% and FGA rate and then redistributing the shot attempts such that the FG% ranks and FGA rate ranks match. Second, we estimate how many points could have been gained had a lineup's collection of shot attempts been allocated according to this alternate distribution. In this section, we go over each of these steps in detail and conclude by describing PLC, which measures how individual players contribute to LPL.

2.3.1 Spatial rankings within a lineup

With models for player FG% and player-lineup FGA rate, we can rank the players in a given lineup (from 1 to 5) on these metrics at any spot on the court. For a given lineup, let R_i^ξ be a discrete transformation of ξ_i—the lineup's FG% vector in court cell i—yielding each player's FG% rank relative to their four teammates. Formally,

R_ij^ξ = (n_ξi + 1) − {k : ξ_ij ≡ ξ_i^(k)},    (2.5)

where n_ξi is the length of ξ_i, the vector being ranked (this length will always be 5 in our case), and ξ_i^(k) is the kth order statistic of ξ_i (k ∈ {1, 2, 3, 4, 5}). Since ξ_ij is a stochastic quantity governed by a posterior distribution, R_ij^ξ is also distributional; however, its distribution is discrete, the support being the integers {1, 2, 3, 4, 5}. The distribution of R_ij^ξ can be approximated by taking posterior samples of ξ_i and ranking them via (2.5). Figure 2.5 shows the 20% quantiles, medians, and 80% quantiles of the resulting transformed variates for the Cavaliers starting lineup.
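The rank transformation in (2.5) and its posterior distribution can be computed directly from posterior draws; the following is a minimal base-R sketch assuming `xi_draws` is a hypothetical S × 5 matrix of posterior samples of ξ_i for the five players in one court cell.

    # Sketch of the posterior FG% rank distribution (2.5) for a single court cell.
    # `xi_draws` (an S x 5 matrix of posterior draws of xi_i) is a hypothetical input.
    rank_draws <- t(apply(xi_draws, 1, function(xi) 6 - rank(xi, ties.method = "random")))
    apply(rank_draws, 2, quantile, probs = c(0.2, 0.5, 0.8), type = 1)   # quantiles shown in Figure 2.5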

Figure 2.5: Top: 20% quantiles of the Cleveland Cavaliers starting lineup posterior distributions of FG% ranks. Middle: medians of these distributions. Bottom: 80% quantiles.

We obtain ranks for FGA rates in the same manner as for FG%, except these will instead be deterministic quantities since the FGA rate surfaces, A, are fixed. We define R_ij^A as

R_ij^A = (n_Ai + 1) − {k : A_ij ≡ A_i^(k)},    (2.6)

where n_Ai is the length of A_i and A_i^(k) is the kth order statistic of A_i. Figure 2.6 shows the estimated maximum a posteriori³ (MAP) FG% rank surfaces, R̂^ξ, and the deterministic FGA rate rank surfaces, R^A, for the Cleveland Cavaliers starting lineup.

³ For the FG% rank surfaces we use the MAP estimate in order to ensure the estimates are always in the support of the transformation (i.e. to ensure R̂_ij^ξ ∈ {1, ..., 5}). For parameters with continuous support, such as ξ̂, the hat symbol denotes the posterior mean.

Figure 2.6: Top: Estimated FG% ranks for the Cleveland Cavaliers’ starting lineup. Bottom: Deterministic FGA rate ranks.

The strong correspondence between R̂^ξ and R^A shown in Figure 2.6 is not surprising; all other factors being equal, teams would naturally want their most skilled shooters taking the most shots and the worst shooters taking the fewest shots in any given location. By taking the difference of a lineup's FG% rank surface from its FGA rate rank surface, R^A − R̂^ξ, we obtain a surface which measures how closely the lineup's FG% ranks match their FGA rate ranks. Figure 2.7 shows these surfaces for the Cavaliers' starting lineup.

Figure 2.7: Rank correspondence surfaces for the Cleveland Cavaliers’ starting lineup.

Note that rank correspondence ranges from -4 to 4. A value of -4 means that the worst shooter in the lineup took the most shots from that location, while a positive 4 means the best shooter took the fewest shots from that location. In general, positive values of rank correspondence mark areas of potential under-usage and negative values show potential over-usage. For the Cavaliers, the positive values around the 3-point line for Kyrie Irving suggest that he may be under-utilized as a 3-point shooter. On the other hand, the negative

values for LeBron James in the mid-range region suggest that he may be over-used in this area. We emphasize, however, that these conclusions should be drawn carefully. Though inequality between the FG% and FGA ranks may be indicative of sub-optimal shot allocation, this interpretation may not hold in every situation due to bias introduced by confounding variables (e.g. defensive pressure, shot clock, etc.).

2.3.2 Lineup points lost

By reducing the FG% and FGA estimates to ranks, we compromise the magnitude of player-to-player differences within lineups. Here we introduce lineup points lost (LPL), which measures deviation from perfect rank correspondence while retaining the magnitudes of player-to-player differences in FG% and FGA. LPL is defined as the difference in expected points between a lineup's actual distribution of FG attempts, A, and a proposed redistribution, A*, constructed to yield perfect rank correspondence (i.e. R^{A*} − R^ξ = 0). Formally, we calculate LPL in the ith cell as

LPL_i = Σ_{j=1}^{5} v_i · ξ_ij · ( A_{i[g(R_ij^ξ)]} − A_ij )    (2.7)
      = Σ_{j=1}^{5} v_i · ξ_ij · ( A*_ij − A_ij ),    (2.8)

where v_i is the point value (2 or 3) of a made shot, ξ_ij is the FG% for player j in cell i, A_ij is player j's FG attempts (per 36 minutes) in cell i, and g(R_ij^ξ) = {k : R_ij^ξ ≡ R_ik^A}. The function g(·) reallocates the observed shot attempt vector A_i such that the best shooter always takes the most shots, the second best shooter takes the second most shots, and so forth. Figure 2.8 shows a toy example of how LPL is computed for an arbitrary 3-point region, contextualized via the Cleveland Cavaliers starting lineup. In this hypothetical scenario, James takes the most shots despite both Love and Irving being better shooters from this court region. When calculating LPL for this region, Irving is allocated James' nine shots since he is the best shooter in this area. Love, as the second best shooter, is allocated Irving's four shots (which was the second most shots taken across the lineup). James, as the third best shooter, is allocated the third most shot attempts (which is Love's three shots). Smith and Thompson's shot allocations are unchanged since their actual number of shots harmonizes with the distribution imposed by g(·). Each player's actual expected points and optimal expected points are calculated by multiplying their FG% by the corresponding number of shots and the point-value of the shot (3 points in this case). LPL is the difference (in expectation) between the optimal points and the actual points, which comes out to 0.84.

                         Kyrie Irving   Kevin Love   LeBron James   JR Smith   Tristan Thompson
FG%:                          40%           38%           35%           32%           25%
Actual shots taken:             4             3             9             2             1
Optimal redistribution:         9             4             3             2             1

Actual points:  ((.40 × 4) + (.38 × 3) + (.35 × 9) + (.32 × 2) + (.25 × 1)) × 3 = 20.34
Optimal points: ((.40 × 9) + (.38 × 4) + (.35 × 3) + (.32 × 2) + (.25 × 1)) × 3 = 21.18
Lineup points lost (LPL): Optimal points − Actual points = 0.84

Figure 2.8: A toy LPL computation in an arbitrary 3-point court region for the Cleveland Cavaliers' starting lineup. The players are ordered from left to right according to FG% (best to worst). Below each player's picture is the number of actual shots the player took from this location. The black arrows show how the function g(·) reallocates these shots according to the players' FG% ranks. The filled gray dots show the number of shots the player would be allocated according to the proposed optimum. Below the horizontal black line, each player's actual expected points and optimal expected points are calculated by multiplying their FG% by the corresponding number of shots and the point value of the shot. LPL is the difference (in expectation) between the optimal points and the actual points.
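The toy computation in Figure 2.8 can be reproduced in a few lines; this minimal sketch simply re-implements (2.7)–(2.8) with the values shown in the figure.

    # The toy LPL computation of Figure 2.8, using the FG%s and shot counts shown there.
    fg_pct <- c(Irving = 0.40, Love = 0.38, James = 0.35, Smith = 0.32, Thompson = 0.25)
    shots  <- c(Irving = 4,    Love = 3,    James = 9,    Smith = 2,    Thompson = 1)
    v <- 3   # point value of a made shot in this region

    # g(.) : best shooter gets the most shots, second best the second most, and so on
    optimal <- unname(sort(shots, decreasing = TRUE))[rank(-fg_pct, ties.method = "first")]

    sum(v * fg_pct * (optimal - shots))   # lineup points lost: 0.84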

The left plot of Figure 2.9 shows estimated LPL (per 36 minutes) over the offensive half court for Cleveland's starting lineup, computed using the posterior mean of ξ. Notice that the LPL values are highest around the rim and along the 3-point line. These regions tend to dominate LPL values because the density of shot attempts is highest in these areas. If we re-normalize LPL with respect to the number of shots taken in each court cell, we can identify areas of inefficiency that do not stand out due to low densities of shot attempts:

LPL_i^Shot = LPL_i / Σ_{j=1}^{5} A_ij.    (2.9)

This formulation yields the average lineup points lost per shot from region i, as shown in the right plot of Figure 2.9.

Figure 2.9: Estimated LPL per 36 minutes (left) and LPL per shot (right) surfaces for the Cleveland Cavaliers starting lineup.

Since LPL_i is a function of ξ_i, which is latent, the uncertainty in LPL_i is governed by the posterior distribution of Σ_{j=1}^{5} ξ_ij. Figures 2.10 and 2.11 illustrate the distributional nature of LPL.

Figure 2.10: Histogram of Σ_{i=1}^{M} LPL_i for the Cleveland Cavaliers starting lineup. 500 posterior draws from each ξ_ij, where i ∈ {1, ..., M} and j ∈ {1, ..., 5}, were used to compute the 500 variates of Σ_{i=1}^{M} LPL_i comprising this histogram.

Figure 2.11: Left: 20% quantile LPL surfaces for the Cleveland Cavaliers starting lineup. Middle: median LPL surfaces. Right: 80% quantile LPL surfaces. The top row shows LPL per 36 minutes while the bottom row shows LPL per shot.

LPL incorporates an intentional constraint—for any court cell i, A*_i is constrained to be a permutation of A_i. This ensures that no single player can be reallocated every shot that was taken by the lineup (unless a single player took all of the shots from that region to begin with). It also ensures that the total number of shots in the redistribution will always equal the observed number of shots from that location (i.e. Σ_{j=1}^{5} A*_ij = Σ_{j=1}^{5} A_ij for all i). Ultimately, LPL aims to quantify the points that could have been gained had a lineup adhered to the shot allocation strategy defined by A*. However, as will be detailed in Section 2.4, there is not a 1-to-1 relationship between ‘lineup points’ as defined here, and actual points. In other words, reducing the total LPL of a team's lineup by 1 doesn't necessarily correspond to a 1-point gain in their actual score. In fact, we find that a 1-point reduction in LPL corresponds to a 0.6-point gain (on average) in a team's actual score. One reason for this discrepancy could be that LPL is influenced by contextual variables that we are unable to account for in our FG% model, such as the shot clock and defensive pressure. Another may be due to a tacit assumption in our definition of LPL. By holding each player's FG% constant despite changing their volume of shots when redistributing the vector of FG attempts, we implicitly assume that a player's FG% is independent of their FGA rate. The basketball analytics community generally agrees that this assumption does not hold—that the more shots a player is allocated, the less efficient their shots become. This concept, referred to as the ‘usage-curve’ or ‘skill-curve’, was introduced in Oliver (2004) and has

been further examined in Goldman & Rao (2011). Incorporating usage curves into LPL could be a promising area of future work.

2.3.3 Player LPL contribution

LPL summarizes information from all players in a lineup into a single surface, compromising our ability to identify how each individual player contributes to LPL. Fortunately, we can parse out each player's contribution to LPL and distinguish between points lost due to undershooting and points lost due to overshooting. We define player j's LPL contribution (PLC) in court location i as

PLC_ij = LPL_i × ( (A*_ij − A_ij) / Σ_{j=1}^{5} |A*_ij − A_ij| ),    (2.10)

where all terms are as defined in the previous section. The parenthetical term in (2.10) apportions LPL_i among the 5 players in the lineup proportional to the size of their individual contributions to LPL_i. Players who are reallocated more shots under A*_i compared to their observed number of shot attempts will have PLC_ij > 0. Therefore, positive PLC values indicate potential undershooting and negative values indicate potential overshooting. As in the case of LPL, if we divide PLC by the sum of shot attempts in cell i, we obtain average PLC per shot from location i:

PLC_ij^Shot = PLC_ij / Σ_{j=1}^{5} A_ij.    (2.11)
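Continuing the toy example of Figure 2.8, the 0.84 lineup points lost can be apportioned among the players via (2.10); the values below are the same assumed numbers used earlier.

    # Apportioning the toy example's 0.84 lineup points lost via (2.10).
    fg_pct  <- c(Irving = 0.40, Love = 0.38, James = 0.35, Smith = 0.32, Thompson = 0.25)
    shots   <- c(Irving = 4, Love = 3, James = 9, Smith = 2, Thompson = 1)
    optimal <- c(Irving = 9, Love = 4, James = 3, Smith = 2, Thompson = 1)
    v <- 3

    lpl <- sum(v * fg_pct * (optimal - shots))                 # 0.84, as before
    plc <- lpl * (optimal - shots) / sum(abs(optimal - shots))
    round(plc, 2)   # Irving 0.35 (undershooting), Love 0.07, James -0.42 (overshooting)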

The PLC per shot surfaces for the Cleveland Cavaliers' 2016-17 starting lineup are shown in Figure 2.12. We see that Kyrie Irving is potentially being under-utilized from beyond the arc and that LeBron James is potentially over-shooting from the top of the key, which is harmonious with our observations from Figure 2.7. However, it is worth noting that the LPL per 36 plot (left plot in Figure 2.9) shows very low LPL values from the mid-range region since the Cavaliers have a very low density of shots from this area. So while it may be true that LeBron tends to overshoot from the top of the key relative to his teammates, the lineup shoots so infrequently from this area that the inefficiency is negligible.

Figure 2.12: PLC per shot surfaces for the Cleveland Cavaliers starting lineup.

For every red region in Figure 2.12 (undershooting) there are corresponding blue regions (overshooting) among the other players. This highlights the fact that LPL is made up of balancing player contributions from undershooting and overshooting; for every player who overshoots, another player (or combination of players) undershoots. By the nature of how LPL is constructed, there cannot be any areas where the entire lineup overshoots or undershoots. For this reason, our method does not shed light on shot selection. LPL and PLC say nothing about whether shots from a given region are efficient or not; instead, they measure how closely a lineup adheres to optimal allocative efficiency given the shot attempts from that region.

2.3.4 Empirical implementation

To illustrate some important considerations associated with this approach, we present a brief example of LPL and PLC using empirical FG% and FGA rates. This example demonstrates that these quantities are agnostic to the underlying FG% model. In order to obtain FG% and FGA rate estimates, we divide the court into twelve discrete regions and calculate the empirical values for each player within these regions. We defined these regions based on our understanding of the court, but it is worth noting that defining these regions requires many of the same considerations as with any histogram-style estimator; namely, that increasing the number of regions will decrease bias at the expense of increasing variance. In some cases, a player may have only one or two shots within an area, resulting in either unrealistically high or low field goal percentage estimates. As an ad hoc solution to this, we give all players one made field goal and five field goal attempts within each region, which means that players with just a handful of shots in a region will have their associated field goal percentage anchored near 20 percent. Rather than perform smoothing for the field goal attempt estimates, we simply count up the number of attempts for each player within each section, and normalize them to get the attempts per 36 minutes, as before. With these FG% and FGA estimates, we can replicate the analysis detailed in the previous sections.

Figure 2.13 shows the empirical ranks for this lineup, as well as the rank correspondence. Generally, it shows the same patterns as the model-based analysis in Figures 2.6 and 2.7. However, there are some key differences, including Tristan Thompson having a higher field goal percentage rank from the right midrange and a corresponding reduction in rank for Kevin Love in the same area. This pattern is also manifest in Figure 2.14, which shows the empirical LPL. We observe that most lineup points appear to be lost in the right midrange and on above-the-break three-point shots. Finally, considering the empirical PLC in Figure 2.14, we notice that in addition to the Love-Thompson tradeoff in the midrange, JR Smith appears to be overshooting from the perimeter, while Kyrie Irving and LeBron James both exhibit undershooting.

Figure 2.13: Top: Empirical FG% ranks for the Cleveland Cavaliers starting lineup. Middle: Empirical FGA ranks. Bottom: Rank correspondence.

The persistence of the Love-Thompson connection in the midrange in this empirical analysis, and its divergence from what we saw in the model-based analysis, merits a brief discussion. Kevin Love and Tristan Thompson both had a low number of shots from the far-right midrange region, with Love shooting 8 for 26 and Thompson shooting 4 for 6. Because they both took so few shots, even with the penalty of one make and four misses added to each region, Thompson appears far better. This highlights the

fact that although LPL and PLC are model-agnostic, the underlying estimates for field goal percentage do matter, and raw empirical estimates alone may be too noisy to be useful in calculating LPL. One simple solution may be to use a threshold and only consider players in a region if the number of their field goal attempts passes that threshold.

Figure 2.14: Top: Empirical LPL per 36 minutes and LPL per shot for the Cleveland Cavaliers starting lineup. Bottom: Empirical PLC for the Cleveland Cavaliers starting lineup.

2.4 Optimality - discussion and implications

We have now defined LPL and given the theoretical interpretation (i.e. overuse and underuse), but we have not yet established that this interpretation is valid in practice. The utility of LPL as a diagnostic tool hinges on the answers to four questions, which we explore in detail in this section:

1. Do lineups minimize LPL?
2. Does LPL relate to offensive production?
3. How can LPL inform strategy?
4. Is minimizing LPL always optimal?

2.4.1 Do lineups minimize LPL?

In Figure 2.9, cell values range from 0 to 0.008, and the sum over all locations in the half court is 0.68. While this suggests that the Cavaliers’ starters were minimizing LPL, we need

a frame of reference to make this claim with certainty. The frame of reference we will use for comparison is the distribution of LPL under completely random shot allocation. This is not to suggest offenses select shooting strategies randomly. Rather, a primary reason why lineups fail to effectively minimize LPL is because the defense has the opposite goal; defenses want to get the opposing lineup to take shots from places they are bad at shooting from. In other words, while the offense is trying to minimize LPL, the defense is trying to maximize LPL. Comparing observed LPL against random allocation provides a general test of whether offenses are able to pull closer to the minimum than defenses are able to pull toward the maximum, or the absolute worst allocation possible. In statistical terms, this comparison can be stated as a hypothesis test. We are interested in testing the null hypothesis that offenses minimize and defenses maximize LPL with equal magnitudes. We consider a one-sided alternative—that the offensive minimization outweighs the defensive response (as measured by LPL). A permutation test allows us to test these hypotheses by comparing a lineup's observed total LPL (summing over all court locations, Σ_{i=1}^{M} LPL_i, where M is the total number of 1 ft. by 1 ft. cells in the half court) against the total LPL we would expect under completely random shot allocation. To ensure the uncertainty in ξ is accounted for, we simulate variates of the test statistic T as

T = Σ_{i=1}^{M} LPL_i^{H0} − Σ_{i=1}^{M} LPL_i    (2.12)
  = Σ_{i=1}^{M} Σ_{j=1}^{5} v_i · ξ̃_ij · (A*_ij − A†_ij) − Σ_{i=1}^{M} Σ_{j=1}^{5} v_i · ξ̃_ij · (A*_ij − A_ij)    (2.13)
  = Σ_{i=1}^{M} Σ_{j=1}^{5} v_i · ξ̃_ij · (A_ij − A†_ij),    (2.14)

where ξ̃_ij is a sample from player j's posterior distribution of FG% in cell i, A†_ij is the jth element of a random permutation of the observed FGA rate vector A_i, and all other symbols are defined as in (2.7)-(2.8). Note that a different random permutation is drawn for each court cell i. After simulating 500 variates from the null distribution, we approximate the one-sided p-value of the test as the proportion of variates that are less than 0. Figure 2.15 illustrates this test for the Cleveland Cavaliers' starting lineup. The gray bars show a histogram of the variates from (2.12). Bars to the left of the dashed line at 0 represent variates for which random allocation outperforms the observed allocation. The approximate p-value of the test in this case is 1/500, or 0.002. We can therefore say with high certainty that the Cleveland starters minimize LPL beyond the defense's ability to prevent them from doing so.
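A minimal sketch of this permutation test, using the simplified form (2.14), is given below; `xi_draws`, `A`, and `v` are hypothetical inputs standing in for the posterior FG% draws, observed FGA rates, and shot point values.

    # Sketch of the permutation test using the simplified form (2.14). `xi_draws` (an
    # S x M x 5 array of posterior FG% draws), `A` (the M x 5 matrix of observed FGA rates),
    # and `v` (the M-vector of shot point values) are hypothetical inputs.
    S <- dim(xi_draws)[1]
    T_var <- numeric(S)
    for (s in 1:S) {
      A_perm <- t(apply(A, 1, sample))                    # random within-cell reallocation (A-dagger)
      T_var[s] <- sum(v * xi_draws[s, , ] * (A - A_perm))
    }
    mean(T_var < 0)   # approximate one-sided p-value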

Figure 2.15: Permutation test for the Cleveland Cavaliers’ 2016-17 starting lineup. The gray bars show a histogram of the variates from (2.12). The approximate p-value for the Cavaliers starting lineup (i.e. the proportion of variates that are less than 0) is 1/500 or 0.002.

The computational burden of performing the test precludes running it for every lineup, but we did perform the test for each team's 2016-17 starting lineup. The results are shown in Table 2.1. Across the NBA's starting lineups, only two teams had no variates less than 0—the Golden State Warriors and the Portland Trail Blazers. The Sacramento Kings showed the worst allocative efficiency with an approximate p-value of 0.44 for their starting lineup. Based on these results we are confident that most lineups employ shot allocation strategies that minimize LPL to some degree, though it appears that some teams do so better than others.

Team  GSW    POR    CLE    LAC    ATL    HOU    TOR    IND    LAL    DET    DEN    NOP    CHA    UTA    OKC
p̂     0.000  0.000  0.002  0.002  0.014  0.014  0.016  0.020  0.022  0.024  0.028  0.030  0.030  0.038  0.042

Team  DAL    MIA    MIN    BOS    NYK    ORL    SAS    BKN    PHI    MIL    WAS    PHX    MEM    CHI    SAC
p̂     0.044  0.046  0.054  0.056  0.058  0.064  0.104  0.106  0.130  0.134  0.144  0.148  0.170  0.210  0.442

Table 2.1: Approximate p-values for H0 vs. HA for each team’s starting lineup in the 2016-17 NBA regular season.

2.4.2 Does LPL relate to offensive production?

We next want to determine whether teams with lower LPL values tend to be more proficient on offense. In order to achieve greater discriminatory power, we’ve chosen to make this assessment at the game level. Specifically, we regress a team’s total game score against their

total LPL generated in that game, accounting for other relevant covariates including the team's offensive strength, the opponent's defensive strength, and home-court advantage. This framework is analogous to the model proposed in Dixon & Coles (1997). We calculate game LPL (GLPL) by first dividing the court into three broad court regions (restricted area, mid-range, and 3-pointers). Then, for a given game and lineup, we calculate GLPL in each of these court regions (indexed by c) by redistributing the lineup's observed vector of shot attempts using a weighted average of each player's ξ̂_j:

GLPL_c = Σ_{j=1}^{5} v_c · f_c(ξ̂_j) · (A*_cj − A_cj),    (2.15)

where

f_c(ξ̂_j) = Σ_{i∈c} w_ij ξ̂_ij / Σ_{i∈c} w_ij.    (2.16)

In (2.16), w_ij is a weight proportional to player j's total observed shot attempts in court cell i over the regular season. The notation i ∈ c means we are summing over all the 1 ft. by 1 ft. grid cells that are contained in court region c. Finally, for a given game g and team a, we calculate the team's total game LPL (TGLPL) by summing GLPL_c over all court regions c and all lineups ℓ:

TGLPL_ag = Σ_{ℓ=1}^{L_a} Σ_{c∈C} GLPL_c^ℓ,    (2.17)

where C = {restricted area, mid-range, 3-pointers} and L_a is the total number of team a's lineups. This process is carried out separately for the home and away teams, yielding two TGLPL observations per game. Equipped with a game-level covariate measuring aggregate LPL, we model team a's game score against opponent b in game g as

Score_abg = µ + α_a + β_b + γ × I(Home_ag) + θ × TGLPL_ag + ϵ_abg,    (2.18)
ϵ_abg ∼ N(0, σ²),    (2.19)

where µ represents the global average game score, α_a is team a's offensive strength parameter, β_b is team b's defensive strength parameter, γ governs home court advantage,

θ is the effect of TGLPL, and ϵ_abg is a normally distributed error term. θ is the parameter that we are primarily concerned with. We fit this model in a Bayesian framework using Hamiltonian Monte Carlo methods implemented in Stan (Carpenter et al. 2017). Our prior distributions are as follows: µ ∼ N(100, 10²); α_a, β_b, γ, θ ∼ N(0, 10²); σ ∼ Gamma(shape = 2, rate = 0.2).
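For readers who want a quick sense of the regression structure in (2.18), a rough frequentist analogue can be fit in one line; this is only a sketch with a hypothetical `games` data frame, not the Bayesian Stan fit actually used here.

    # Rough frequentist analogue of (2.18); the thesis fits the Bayesian version in Stan.
    # `games` is a hypothetical data frame with one row per team-game and columns:
    # score, offense (team factor), defense (opponent factor), home (0/1), tglpl.
    fit <- lm(score ~ offense + defense + home + tglpl, data = games)
    coef(fit)["tglpl"]   # analogue of theta: the effect of TGLPL on game score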

The 95% highest posterior density interval for θ is (-1.08, -0.17) and the posterior mean is -0.62. Therefore, we estimate that for each additional lineup point lost, a team loses 0.62 actual points. Put differently, by shaving roughly 3 points off of their TGLPL, a team could gain an estimated 2 points in a game. Given that 10% of games were decided by 2 points or less in the 2016-17 season, this could have a significant impact on a team's win-loss record and could even have playoff implications for teams on the bubble. Figure 2.16 shows the estimated density of actual points lost per game for every team's 82 games in the 2016-17 NBA regular season (i.e. the density of θ̂ × TGLPL_ag, g ∈ {1, ..., 82}, for each team a).

Figure 2.16: Estimated density of actual points lost per game for every team’s 82 games in the 2016-17 NBA regular season.

Houston was the most efficient team, losing only about 1 point per game on average due to inefficient shot allocation. Washington, on the other hand, lost over 3 points per game on average.

2.4.3 How can LPL inform strategy?

At this point, we offer some ideas for how coaches might use these methods to improve their teams' offense. First, for lineups with high LPL, coaches could explore the corresponding PLC plots to ascertain which players are primarily responsible. If the coach determines that the LPL values do indeed represent areas of inefficiency, they could consider interventions targeting the player's shooting habits in these areas. This short-term intervention could be coupled with long-term changes to their practice routines; coaches could work with players on improving their FG% in the areas shown by the PLC plots. Also, by exploring lineup PLC charts, coaches could identify systematic inefficiency in their offensive schemes, which could prompt changes either in whom to draw plays for or whether to change certain play designs altogether. Coaches are not the only parties who could gain value from these metrics; players and front office personnel could utilize them as well. Players could use PLC plots to evaluate their shooting habits and assess whether they exhibit over-confident or under-confident shot-taking behavior from certain areas of the court. Front office personnel may find trends in the metrics that indicate a need to sign players that better fit their coach's strategy. LPL and PLC could help them identify which players on their roster to shop and which players to pursue in free agency or the trade market. Consider these ideas in the context of the Utah Jazz LPL/PLC charts for the 2016-17 regular season shown in Figure 2.17. On reviewing the LPL per shot plot for the starting lineup, the coach might flag the left baseline and top of the key as areas of potential inefficiency to investigate. On exploring the corresponding PLC plots, they would see Derrick Favors as the driving force behind the high LPL numbers from these regions. Interestingly, from the 2013-14 season through 2016-17, the Derrick Favors baseline and elbow jump shots were go-to plays for the Jazz. Across these four seasons, Favors took over 1500 mid-range shots for an average of 0.76 points per shot (PPS). In the 2017-18 and 2018-19 seasons, the Jazz drastically altered Favors' shot policy from the mid-range. Beginning in 2017, the Jazz started focusing on running plays for 3-pointers and shots at the rim, a trend that was becoming popular throughout the league. As part of this change in play-style, they tried turning Favors into a stretch four⁴; he went from taking a total of 21 3-point shots over the previous four seasons, to 141 3-point shots in these two seasons alone. Unfortunately, their intervention for Favors appears to have been misguided;

⁴ A stretch four is a player at the power forward position who can generate offense farther from the basket than a traditional power forward.

his average PPS for these 141 shots was 0.66. The front office eventually determined that Favors wasn't the best fit for their coach's offensive strategy; they opted not to re-sign Favors at the end of the 2019 season.

Figure 2.17: Utah Jazz 2016-17 starting lineup LPL per 36 minutes, LPL per shot, and PLC per shot surfaces.

This process took place over six years—perhaps it could have been expedited had LPL and PLC been available to the coaches and front office staff.

2.4.4 Is minimizing LPL always optimal?

While we have demonstrated that lower LPL is associated with increased offensive production, we stress that LPL is a diagnostic tool that should be used to inform basketball experts rather than as a prescriptive measure that should be strictly adhered to in all circumstances. As mentioned previously, the LPL and PLC values presented in this paper are influenced by contextual variables that we are unable to account for because they are not available in public data sources, such as the shot clock and defensive pressure. Additionally, there are certain game situations where minimizing LPL may be sub-optimal. One such situation is illustrated in Figure 2.18, which shows the PLC per shot surfaces for the Oklahoma City 2016-17 starting lineup.

Figure 2.18: Oklahoma City 2016-17 starting lineup PLC per shot surfaces.

The first panel from the left in this figure shows positive PLC values for Russell Westbrook in the corner 3-point regions, suggesting that Westbrook should be taking more shots from these areas. However, anyone who watched the Thunder play that season will know that many of these corner 3-point opportunities were created by Westbrook driving to the basket, drawing extra defenders toward him, then kicking the ball out to an open teammate in the corner. Obviously, Westbrook cannot both drive to the rim and simultaneously pass to himself in another area of the court. In this case, strictly minimizing LPL would reduce the number of these drive-and-kick plays, potentially attenuating their offensive firepower. Shot-creation is not accounted for by LPL and should be carefully considered when exploring LPL and PLC. There are game theoretic factors to be considered as well. Beyond the defensive elements discussed in Section 2.4.1, rigid adherence to minimizing LPL could lead to a more predictable offense and thus make it easier to defend (D'Amour et al. 2015). Needless to say, offensive game-planning should be informed by more than LPL metrics alone.

2.5 Conclusion

Our research introduces novel methods to evaluate allocative efficiency spatially and shows that this efficiency has a real impact on game outcomes. We use publicly available data and have made an empirical demonstration of our methods available online, making them immediately accessible. Also, since LPL and PLC do not depend on specific models for FG% and FGA rate, they could readily be calculated at the G League, NCAA, and international levels using a simplified model of FG% and FGA rate. We hope that the methods introduced in this chapter will be built upon and improved.

Chapter 3

Markov Decision Processes with Dynamic Transition Probabilities: An Analysis of Shooting Strategies in Basketball

∗ A version of Chapter 3 is pending publication in the Annals of Applied Statistics. The paper was coauthored with Luke Bornn.

3.1 Introduction

A basketball game can be framed as a collection of episodes from complex stochastic processes. Each episode, or play, comprises a finite number of transitions between players and locations, ultimately terminating in a shot, turnover, or foul. An integral attribute of the game is that it is nonstationary; the transition probabilities are not constant over the 24 seconds in which a team has to shoot the ball. For example, consider the relationship between time on the shot clock, which counts down these 24 seconds, and the probability of taking a shot, as shown in Figure 3.1 for the National Basketball Association (NBA) 2015-16 regular season. The plot in panel (b) shows empirical league-average shot policies, which we define as the probability that any on-ball event (i.e., dribbles, passes and shots) will be a shot, for the set of court regions defined in panel (a). As the shot clock counts down, the probability of shooting increases—quite dramatically in the final seconds of the shot clock.


Figure 3.1: (a) Breakdown of court locations as used in our models and simulations. Rim: Within a six-foot radius of the center of the hoop; Paint: Outside the restricted area but within the key; Midrange: Outside of the paint but within the 3-point line; Corner 3: Beyond the 3-point line but below the break of the 3-point arc; Arc 3: Beyond the 3-point line, above the arc 3 break but within 30 feet of the hoop; Backcourt: All locations beyond the arc 3 region. (b) Empirical league-average shot policies for the 2015–16 NBA regular season. We see lower shot probabilities in the midrange and arc 3 regions because the on-ball events in these regions are dominated by passes and dribbles.

Determining optimal policies for player shooting is a critical problem in the game of basketball, and it remains an active area of research (Skinner & Goldman 2019, Goldman & Rao 2014a). However, the inherent nonstationarity introduced by the shot clock makes assessing shot selection optimality a complex problem. In this project we propose a method to test and to compare shot policies which accounts for the dynamic nature of a basketball play. Two critical assumptions underlying our approach are that shot policies are both time-varying and malleable. Basketball analysts often focus on the less flexible component of shot efficiency—field goal percentage (i.e., the percentage of a player's shots that he makes). Improving shooting skill can take years of practice, whereas the shot policy is comparatively controllable; players choose where and how often they shoot when they have the ball in their possession. Given the malleable nature of shot policies, we explore what could have happened if a player's shot policy had changed. To enable this exploration, we model plays as episodes from latent Markov decision processes (MDPs) with dynamic within-episode transition probabilities. We approximate these functional transition probabilities via transition probability tensors (TPTs), then estimate the latent components of the MDP using Bayesian hierarchical models. Our method involves combining several Markov chains with overlapping state

spaces into an average Markov chain, which we derive subject to the constraint that the expected total transition count for an arbitrary state-pair is equal to the weighted sum of the expected counts of the separate chains. We then develop a method to simulate from these processes not simply by outcome but rather at the subsecond level, incorporating every intermediary and terminal on-ball event over the course of a play. The uncertainty in our estimation of the MDP gets propagated into the simulations via posterior samples of the MDP model. Ultimately, our method allows us to make distributional estimates of counterfactual scenarios such as, “What could have happened if a team took contested midrange shots less frequently early in the shot clock?” While we focus primarily on shot policies in this paper, narrowing in on a player's choice to shoot or not at any given instant, the framework presented here can be altered to accommodate the whole space of decisions players can make on-ball, including movement and passing.

3.1.1 Related work and contributions

This paper adds to the growing literature on spatiotemporal analyses of team invasion sports (e.g., basketball, football, soccer and hockey). We refer the reader to Gudmundsson & Horton (2017) for a survey. Within this body of work, Markov models have a significant presence: Goldner (2012) uses a Markov model as a framework for evaluating plays in American football; Hirotsu & Wright (2002) use Markov processes to determine optimal substitution patterns in an English Premier League match; and Thomas et al. (2013) use a semi-Markov process to model team scoring rates in hockey. The landmark work of Cervone et al. (2016) is particularly relevant to the methods we introduce in this paper. The state space and hierarchical models we develop have similarities to the coarsened possession process they employ. However, our ultimate goals are fundamentally different. Cervone et al. (2016) aim to estimate instantaneous point values of possessions whereas we utilize a decision process framework to estimate the macro effects if player decisions were to change. We approach the problem similarly to Routley (2015), who applies a Markov game formalism to value player actions in hockey, incorporating context and a lookahead window in time. As in Routley (2015), we do not aim to compute optimal strategies. Instead, we provide a basketball play simulator by which alternate policies can be explored. Since the defense is not an adversarial agent in our model but is built into the system via the probabilistic components of the MDP, this simulator is proposed as an exploratory tool as opposed to a mechanism to compute policy optima. Several papers have endeavored to simulate a basketball game using Markov models (Štrumbelj & Vračar 2012, Vračar et al. 2016). Our simulator is unique among these studies in a number of ways. We account for the uncertainty in every estimated parameter, propagating this uncertainty through to the simulations. Also, though Vračar et al. (2016)

incorporate game-clock time, these simulation methods do not account for the inherent nonstationarity within a possession introduced by the shot clock. We propose a novel method to account for the nonstationarity of basketball plays using tensors in the MDP framework. By incorporating this dependency in our model, we can explore far more detailed policy changes with correspondingly more accurate results, particularly with respect to shot clock violations and time-specific policy changes within plays. This work also contributes to the literature and practical application of discrete absorbing Markov chains. We formalize a method to construct and estimate an average chain from several independent Markov chains with overlapping state spaces. The term “average,” as used here, refers to a chain that yields the same number of state-to-state transition counts in expectation for all unique state pairs spanned by the set of independent chains. This allows us to reliably estimate aggregate counts across the system of chains without having to estimate each chain individually. This result is critical in making our method both parsimonious and computationally feasible.

3.1.2 Description of data

We use high-resolution optical tracking data collected by STATS LLC from the 2015–2016 NBA regular season. These data include the x, y coordinates of all 10 players on the court and the x, y, z coordinates of the ball at a frequency of 25 observations per second. These data are further annotated with features such as shots, passes, dribbles, fouls, etc. For this project we subset the data to observations with tagged ball-events, including dribbles, passes, rebounds, turnovers and shots. This significantly reduces the number of observations while retaining the core structure of a possession. Sandholtz & Bornn (2020) provides a simplified walkthrough in R of the methods presented in this chapter and one game of data provided by STATS LLC. This can also be found on GitHub at https://github.com/nsandholtz/nba_replay.

3.1.3 Outline

The rest of the chapter is outlined as follows: In Section 3.2 we give a brief overview of Markov decision processes and describe how we incorporate tensors in the framework of an MDP. We also detail the construction of an average chain from several independent Markov chains with overlapping state spaces. In Section 3.3 we define probabilistic models for each latent component of the MDP. The inference procedures used in fitting the models are detailed, and we illustrate the model fits. In Section 3.4 we describe how we simulate plays from team-specific MDPs and show calibration results from our simulations. In Section 3.5 we show the results of our simulations under various altered policies and discuss potential game-theoretic ramifications of altering policies. Our concluding remarks comprise Section 3.6.

3.2 Decision process framework

We begin this section with a brief overview of Markov decision processes and frame the process in basketball terms. Next, we show how to construct an average chain from several independent Markov chains with overlapping state spaces. The section concludes with our framework for incorporating tensors into the transition dynamics and action policies in order to account for the nonstationarity of these probabilities in time.

3.2.1 Markov decision processes

Markov decision processes are utilized in many modern reinforcement learning problems to characterize interactions between an agent and his environment. In this paper we restrict our attention to finite MDPs, which can be represented as a tuple ⟨S, A, P(·), R(·)⟩, where S represents a discrete and finite set of states, A represents a finite set of actions the agent can take, P(·) defines the transition probabilities between states and R(·) defines the immediate reward the agent receives for any given state/action pair. The agent operates in the environment according to a policy, π(·), which defines the probabilities that govern the agent's choice of action based on the current state of the environment. π(·) is the only aspect of the system that the agent controls. Typically, the agent's goal is to maximize their long-term rewards by making iterative modifications to their policy based on the rewards they receive. We can define these functions succinctly in mathematical terms:

P(s, a, s′) = P[ S_{n+1} = s′ | S_n = s, A_n = a ],   (3.1)

R(s, a) = E[ R_{n+1} | S_n = s, A_n = a ],   (3.2)

π(s, a) = P[ A_n = a | S_n = s ].   (3.3)

In (3.1)–(3.3), nonitalicized uppercase letters (e.g., S_n, A_n and R_{n+1}) represent random variables while italicized lowercase letters represent realized values of the corresponding random variables. Specifically, S_n represents the agent's state in step n of an episode from the MDP, A_n represents the action taken in that step and R_{n+1} is the subsequent reward given that state-action pair. For our basketball context, we define an episode as the sequence of events comprising a single play from start to finish. A play is initialized in some state s_0, and π(·) determines the probability that the ballcarrier takes a shot (or other actions, as we explore later) given that state. If he decides not to shoot, P(·) governs the probability of the ball entering any other state given the current state. If he does take a shot, R(·) dictates the expected point value of that shot. Figure 3.2 shows a graph of these relationships in context of a basketball play.

Figure 3.2: Illustration of the components of the MDP for a single player in context of a basketball play. The blue circles represent states, the solid green circles represent actions (shots) and the curved blue lines represent transition probabilities between states. The green lines of varying width connecting the blue state circles to the green action circles represent the policy. The purple lines connecting the green action circles to the squares represent the reward function. Players may pass the ball to another player (not shown) which is also a transition to a nonterminal state.

In our case the governing probabilities are unobserved for each of these components and, hence, must be estimated. We refer the reader to Sutton & Barto (2018) and Puterman (2014) for a more expansive introduction to reinforcement learning and Markov decision processes.

3.2.2 State and action space

Following Cervone et al. (2016), at any given moment the state of a team's MDP is given by the identity of the ballcarrier, his court region and an indicator of his defensive pressure (open or contested). Formally, S^{team} is defined by the Cartesian product of three sets,

S^{team} = S^{player} × S^{region} × S^{defense},   (3.4)

where S^{player} is the set of all players on the team's roster, S^{region} comprises the six regions shown in Figure 3.1(b) and defensive pressure is binary (“open” or “contested”). At any given moment n of a play, S_n^{player} is determined by the ballcarrier, S_n^{region} is a function of his x, y coordinates and S_n^{defense} is based on both his nearest defender's distance (n.d.d.) and his court region at that moment. The specific rules we follow for determining contested vs. open shots are given by (3.5):

S_n^{defense} =
\begin{cases}
\text{contested}, & s_n^{region} = \text{rim} \ \& \ \text{n.d.d.} < 3\,\text{ft}, \\
\text{contested}, & s_n^{region} = \text{paint} \ \& \ \text{n.d.d.} < 3.5\,\text{ft}, \\
\text{contested}, & s_n^{region} = \text{mid-range} \ \& \ \text{n.d.d.} < 4\,\text{ft}, \\
\text{contested}, & s_n^{region} \in \{\text{corner 3, arc 3, back-court}\} \ \& \ \text{n.d.d.} < 5\,\text{ft}, \\
\text{open}, & \text{otherwise}.
\end{cases}
\qquad (3.5)
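To make the rule in (3.5) concrete, the small R sketch below classifies a ballcarrier's defensive pressure from his court region and nearest defender's distance. The function and argument names are illustrative only and are not taken from the companion code.

defense_state <- function(region, ndd) {
  # Distance cutoffs (in feet) from equation (3.5)
  cutoff <- switch(region,
                   "rim"        = 3,
                   "paint"      = 3.5,
                   "mid-range"  = 4,
                   "corner 3"   = 5,
                   "arc 3"      = 5,
                   "back-court" = 5)
  if (ndd < cutoff) "contested" else "open"
}

defense_state("mid-range", 3.2)   # "contested"
defense_state("arc 3", 6.1)       # "open"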

To avoid these lengthy superscripts, going forward we will index players by x ∈ S^{player}, court regions by y ∈ S^{region} and defensive pressure by z ∈ S^{defense}. Consequently, from this point onward, x and y no longer refer to Cartesian coordinates but rather to players and court regions, respectively. Putting this all together, s_n^{(x,y,z)} denotes an observed state from S^{team} at the nth step of an episode from the team's MDP. As we are primarily interested in shooting decisions, we have chosen a binary action space.

At each step in the process, the ballcarrier decides either to shoot or not to shoot (A = {“Shoot,” “Not Shoot”}) according to his policy π(·). If a_n = “Shoot,” the play terminates; otherwise, the subsequent transition is generated by P(·). Later, we explore changes to passing probabilities via perturbations to P(·).
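As a quick illustration of (3.4), the Cartesian product defining S^{team} can be enumerated directly; the roster below is a placeholder rather than an actual team, and the snippet is a sketch of the idea rather than the thesis code.

players <- paste0("player_", 1:15)     # hypothetical 15-player roster
regions <- c("rim", "paint", "mid-range", "corner 3", "arc 3", "back-court")
defense <- c("contested", "open")

S_team <- expand.grid(player = players, region = regions, defense = defense,
                      stringsAsFactors = FALSE)
nrow(S_team)   # 180 states; the turnover state enters later as a terminal state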

3.2.3 Defining the average chain

Because most teams use upward of 500 lineups over the course of a season, we assume that P(·), R(·) and π(·) are invariant to the lineup, that is, other players do not impact transitions and rewards. This allows us to approximate the data generating process (i.e., state transitions and player actions) at any given point in time using a single team-average transition probability matrix, which is essential in developing both a tractable model for player-specific transitions and a computationally feasible method to simulate multiple seasons' worth of plays. We construct this team-average chain such that it yields the same number of state-pair transitions in expectation as the sum across all the independent chains for every unique state pair spanned by the set of lineups. Technically, for a fixed interval in time a basketball play cannot yield an arbitrarily high number of transient transitions before absorption due to the shot clock. In this sense the episodes we observe come from censored Markov chains. However, as this result could be useful in many Markov chain applications outside of basketball, we will detail this derivation in general limiting terms.

Consider a process defined by iterative episodes from two absorbing Markov chains, M^1 and M^2 (with initial distributions d^1 and d^2), where each successive episode is randomly determined to come from M^1 or M^2 with weights w^1 and w^2 = (1 − w^1), respectively.1 Each chain, indexed by ℓ ∈ {1, 2}, has transient states T^ℓ = {t_1^ℓ, . . . , t_T^ℓ} and absorbing states A^ℓ = {a_1^ℓ, . . . , a_A^ℓ}. For simplicity, we assume that each chain has the same total number of transient states and absorbing states (i.e., |T^ℓ| = T and |A^ℓ| = A for each ℓ ∈ {1, 2}); however, we note that set T^1 is not equal to set T^2, though there may exist overlapping states.2 Our task is to construct a chain, M^{AVG}, such that the expected transition count for an arbitrary state pair from one episode of this chain is equal to the weighted average of the expected transition counts of the separate chains for this same state pair. We first write M^1 and M^2 in canonical form,

M^1 = \begin{pmatrix} Q^1 & U^1 \\ 0 & I \end{pmatrix}, \qquad M^2 = \begin{pmatrix} Q^2 & U^2 \\ 0 & I \end{pmatrix},

with row and column blocks ordered (T^ℓ, A^ℓ), where I is an A-by-A identity matrix, 0 is an A-by-T zero matrix, each U^ℓ is a nonzero T-by-A matrix containing the absorption probabilities for chain ℓ and each Q^ℓ is a T-by-T matrix of the transient-state to transient-state probabilities. Each individual (j, k) entry of M^ℓ is the probability of immediately transitioning to state t_k^ℓ (or a_k^ℓ) given current state t_j^ℓ. Following conventional notation (e.g., Grimstead & Snell (1997)), we define the fundamental matrix for chain M^ℓ as N^ℓ = (I − Q^ℓ)^{−1}. The n_{ij}^ℓ entry of N^ℓ is the expected number of times that the chain visits transient state t_j^ℓ given that the episode is initialized in state t_i^ℓ. Next, we define the matrix of expected state-pair transition counts, S^ℓ, in a cellwise manner such that the (j, k) cell of S^ℓ equals

s_{jk}^ℓ = ∑_i d_i^ℓ n_{ij}^ℓ m_{jk}^ℓ,   (3.6)

1There are many terms throughout this paper that must be indexed across multiple dimensions. For example, in this derivation we need to differentiate one transition probability matrix M from another while also being able to index the (j, k) elements of these separate matrices. To communicate indexes over various dimensions, we have chosen to use superscripts in addition to subscripts to differentiate these dimensions that must be indexed. For this reason, superscripts do not refer to exponents throughout this paper with the exception of inverses (e.g., X^{−1}) and variance parameters, which are exclusively denoted by σ^2.

2A^ℓ, a^ℓ and A, as defined here, are not the same quantities defined previously in Sections 3.2.1 and 3.2.2. We felt that this abuse of notation was warranted in order to make the derivation clearer to the reader.

where d^ℓ = {d_1^ℓ, . . . , d_T^ℓ} is the initial distribution of M^ℓ. In (3.6), i ∈ {1, . . . , T} indexes the initial starting state of the chain, j ∈ {1, . . . , T} indexes the origin state and k ∈ {1, . . . , T + A} indexes the destination state. Hence, S^ℓ is a T-by-(T + A) matrix. Conceptually, to create the average chain M^{AVG}, we will combine S^1 and S^2 proportional to the number of episodes that come from each chain using their respective weights, then normalize the rows of the resulting matrix. We then define matrix S^{AVG} such that any entry of this matrix equals the weighted average of the expected transition counts of the separate chains. This requires a bit more notation. For each chain M^ℓ, we define set V^ℓ as the outer product of its transient state space T^ℓ with its total state space T^ℓ ∪ A^ℓ. Specifically, V^ℓ = T^ℓ × {T^ℓ ∪ A^ℓ} = {(t_1^ℓ, t_1^ℓ), (t_1^ℓ, t_2^ℓ), (t_2^ℓ, t_1^ℓ), . . .}. S^{AVG} can then be defined cellwise as

s_{j′k′}^{AVG} =
\begin{cases}
w^1 s_{j′k′}^1 + (1 − w^1) s_{j′k′}^2, & (t_{j′}, t_{k′}) ∈ V^1 ∩ V^2, \\
w^1 s_{j′k′}^1, & (t_{j′}, t_{k′}) ∈ V^1 ∩ (V^2)^c, \\
(1 − w^1) s_{j′k′}^2, & (t_{j′}, t_{k′}) ∈ (V^1)^c ∩ V^2, \\
0, & \text{otherwise},
\end{cases}
\qquad (3.7)

where j′ ∈ {1, . . . , |T^1 ∪ T^2|} indexes the collective set of origin states (i.e., T^1 ∪ T^2) and k′ ∈ {1, . . . , |(T^1 ∪ T^2) ∪ (A^1 ∪ A^2)|} indexes the collective set of destination states. Therefore, S^{AVG} is a |T^1 ∪ T^2|-by-|(T^1 ∪ T^2) ∪ (A^1 ∪ A^2)| matrix. Finally, let

M^{AVG} = \begin{pmatrix} Q^{AVG} & U^{AVG} \\ 0 & I_{|A^1 ∪ A^2|} \end{pmatrix},

with row and column blocks ordered (T^1 ∪ T^2, A^1 ∪ A^2), where I is an |A^1 ∪ A^2|-by-|A^1 ∪ A^2| identity matrix and the (j′, k′) cells of Q^{AVG} and U^{AVG} are defined

q_{j′k′}^{AVG} = s_{j′k′}^{AVG} / ∑_i s_{j′i}^{AVG},   for k′ ≤ |T^1 ∪ T^2|,   (3.8)

u_{j′k′}^{AVG} = s_{(j′)(k′+|T^1 ∪ T^2|)}^{AVG} / ∑_i s_{j′i}^{AVG},   for k′ > |T^1 ∪ T^2|.   (3.9)

In (3.8)–(3.9), i ∈ {1, . . . , |(T^1 ∪ T^2) ∪ (A^1 ∪ A^2)|} and j′ and k′ are indexed as in (3.7). Clearly, the rows of this matrix sum to 1, and all entries are nonnegative, thereby making it a valid transition probability matrix. By (3.7) it is trivial that the expected transition count for an arbitrary state pair from M^{AVG} is equal to the weighted average of the M^1 and M^2 expected state-pair transition counts. Finally, the j′th element of the initial distribution,

d^{AVG}, for M^{AVG} is

d_{j′}^{AVG} =
\begin{cases}
w^1 d_{j′}^1 + (1 − w^1) d_{j′}^2, & t_{j′} ∈ T^1 ∩ T^2, \\
w^1 d_{j′}^1, & t_{j′} ∈ T^1 ∩ (T^2)^c, \\
(1 − w^1) d_{j′}^2, & t_{j′} ∈ (T^1)^c ∩ T^2,
\end{cases}
\qquad (3.10)

where j′ ∈ {1, . . . , |T^1 ∪ T^2|}. While we have shown this only for two chains, we can make this same argument, recursively, with the current iteration of M^{AVG} and a subsequent chain to incorporate into the average, M^{n+1}, showing that this result holds for an arbitrary number of Markov chains. The average chain allows us to accurately estimate aggregate counts (e.g., over the course of the regular season) across all lineups without having to estimate each lineup's transition probabilities individually, making the problem tractable while still retaining enough detail to explore the nuanced questions we are interested in.
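To make the construction above concrete, the following R sketch assembles an average chain from two absorbing chains supplied in canonical form. It is a minimal illustration of equations (3.6)–(3.10), not the code used in the thesis; the helper names (build_S, average_chain) are invented here, and the sketch assumes that Q, U and d are ordered consistently with the supplied transient- and absorbing-state labels.

# Expected state-pair transition counts S^l (equation 3.6) for one chain:
# rows are transient states, columns are transient then absorbing states.
build_S <- function(Q, U, d) {
  N <- solve(diag(nrow(Q)) - Q)        # fundamental matrix N = (I - Q)^{-1}
  visits <- as.vector(d %*% N)         # expected visits to each transient state
  visits * cbind(Q, U)                 # s_jk = (sum_i d_i n_ij) * m_jk
}

average_chain <- function(Q1, U1, d1, Q2, U2, d2, w1,
                          trans1, abs1, trans2, abs2) {
  w2 <- 1 - w1
  trans_all <- union(trans1, trans2)   # T^1 union T^2
  abs_all   <- union(abs1, abs2)       # A^1 union A^2
  # Weighted expected counts on the union state space (equation 3.7)
  S <- matrix(0, length(trans_all), length(trans_all) + length(abs_all),
              dimnames = list(trans_all, c(trans_all, abs_all)))
  S[trans1, c(trans1, abs1)] <- S[trans1, c(trans1, abs1)] + w1 * build_S(Q1, U1, d1)
  S[trans2, c(trans2, abs2)] <- S[trans2, c(trans2, abs2)] + w2 * build_S(Q2, U2, d2)
  # Row-normalize to recover [Q^AVG U^AVG] (equations 3.8-3.9);
  # rows of all zeros (states unreachable from either d) would need special care.
  P_avg <- S / rowSums(S)
  # Weighted initial distribution on the union of transient states (equation 3.10)
  d_avg <- setNames(numeric(length(trans_all)), trans_all)
  d_avg[trans1] <- d_avg[trans1] + w1 * d1
  d_avg[trans2] <- d_avg[trans2] + w2 * d2
  list(P = P_avg, d = d_avg)
}

Applying average_chain recursively, folding each additional lineup's chain into the running average with its weight, extends the construction to an arbitrary number of chains, as argued above.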

3.2.4 Transition and policy tensors

In most MDP applications the transition dynamics, P(·), are treated as being static while R(·) is assumed to vary temporally. However, in this paper we assume the opposite; only the reward function is time-independent. The reason for this is that in our case, time—or the shot clock rather—resets with each new episode of the process, as opposed to continuing globally across episodes. We are concerned about within-episode temporal dynamics, whereas most MDP applications consider time globally. As such, the way we consider nonstationarity is quite different than how it is primarily treated in the literature. Our framework requires a functional form of P(·) and π(·), whereas these are conventionally modeled statically.3 To incorporate within-episode nonstationarity in P(·) and π(·), we propose using tensors to allow for dynamic transition probabilities and shot policies over the shot clock. In the stochastic processes literature the term “transition probability tensor” arises (albeit infrequently) in reference to the series of m transition probability matrices induced by an mth-order Markov chain (e.g., Li & Ng (2014)). This is not what we mean by this term. Rather, we refer to a transition probability tensor (TPT) as an approximation to a dynamic transition probability function of a continuous temporal covariate which in this case is the shot clock.

To this end, we approximate P(·) and π(·) as tensors with n_TPT matrix slices, each representing a transition probability matrix for a 24/n_TPT-second interval of the shot clock.

3In reinforcement learning applications π(·) typically gets updated as the agent learns more about his environment. In this sense π(·) is dynamic, but this is different than the within-episode functional form for π(·) that we refer to here.

We want to choose n_TPT to be as small as possible while still allowing our model to adequately describe the nonstationary dynamics over the 24-second shot clock. From a model parsimony and computational feasibility point of view, minimizing the number of TPT slices is desirable; each additional TPT slice adds roughly one million more model parameters, which leads to a significant increase in the computational burden when fitting the model.

We settled on nTPT = 8, meaning that each TPT slice summarizes the transition dynamics over a three-second interval of the shot clock.

We effectuate this approximation via a simple indexing function T(c_n),

T(c_n) =
\begin{cases}
1, & c_n ∈ (0, 3], \\
\ \vdots & \\
8, & c_n ∈ (21, 24],
\end{cases}
\qquad (3.11)

where c_n represents the shot-clock time at the nth moment of a play. Going forward, we will represent realized values of T(c_n) as t_n. An illustrative eight-slice TPT is shown in Figure 3.3 for the Cleveland Cavaliers 2015–16 most common starting lineup.
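In code, the indexing function (3.11) and the tensor lookup amount to very little; the sketch below uses a random placeholder array with the (180 × 181 × 8) dimensions given later in Section 3.3.2 for a 15-player roster, purely to illustrate how a slice would be selected.

slice_index <- function(c_n, n_tpt = 8) {
  ceiling(c_n / (24 / n_tpt))           # c_n in (0, 3] -> 1, ..., (21, 24] -> 8
}

# Placeholder TPT: origin state x destination state (incl. turnover) x shot-clock slice
P_tensor <- array(runif(180 * 181 * 8), dim = c(180, 181, 8))

t_n     <- slice_index(17.4)            # 17.4 seconds on the shot clock -> slice 6
P_slice <- P_tensor[, , t_n]            # transition matrix for the interval (15, 18]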

Figure 3.3: A concept illustration of P(·) for the Cleveland Cavaliers most common starting lineup in the 2015–16 NBA regular season. For illustration purposes the row and column space of the TPT has been condensed to five single-player states; however, in our models a typical team has a state space of over 200 states. Each slice represents an approximation of the state-to-state transition probabilities during a 3-second interval of the shot clock. Hence, in this example p_8^{AE} represents the average probability of Kyrie Irving passing the ball to Tristan Thompson when the shot clock is in the interval (21, 24].

The policy tensor is virtually identical to the transition probability tensor (TPT) in form, the only difference being the column space. Since the ballcarrier makes only a binary decision at every step of the process, the shot policy, given any time interval t_n, is a matrix

with row space equal to that of the corresponding TPT slice and a column space of length two (“Shoot” and “Not shoot”). This tensor framework is the key to accurately exploring the effects of altering shot policies. The efficiency of a shot is dependent on the time remaining on the shot clock, and this model framework allows us to account for this temporal dependency and tailor our policy alterations accordingly.

3.3 Hierarchical modeling and inference

In this section we propose models for the latent components of the MDP—P (·),R(·) and π(·)—with temporally varying dynamics as outlined in Section 3.2.4. For each component we employ a Bayesian hierarchical modeling approach, which provides a natural way to share strength across parameters that are alike. While we fit each model independently of the others, they each employ a common hierarchical structure—player-specific parameters borrow strength from position-specific parameters (e.g., point guards, power forwards, etc.) which in turn borrow strength from global location and defensive pressure parameters.

3.3.1 Shot policy model

For an arbitrary play from a given team and letting n_TPT = 8, we model the probability that the nth action of the play is a shot as a function of the ballcarrier's propensity to take a shot given the current state of the MDP (defined by the ballcarrier, his court region and defensive pressure) and the shot clock time interval:

π(s, a) = P( A_n = “Shoot” | s_n^{(x,y,z)}, t_n, θ ) = expit( θ_{t_n}^{(x,y,z)} ).   (3.12)

In this equation A_n is a Bernoulli random variable for whether the nth action of a play is a shot, s_n^{(x,y,z)} is as defined in Section 3.2.2, t_n represents the interval of the shot clock at the nth moment of the play, θ_{t_n}^{(x,y,z)} denotes player x's propensity to shoot the ball when in court region y under defensive pressure z and expit(·) is the inverse logit function: expit(u) = exp(u)/(1 + exp(u)). Note that θ is a large parameter matrix of dimension |S^{team}|-by-n_TPT. For a 15-player roster using three-second shot-clock intervals for the TPT, this results in a 180-by-8 matrix.4

4Due to the extreme infrequency of backcourt shots we don’t estimate player-specific coefficients for backcourt shot policies and field goal percentages. For notational simplicity we have omitted this technicality in the model definition.

We employ multilevel hierarchical priors for θ to pool information across similar states of the MDP:

θ^{(x,y,z)} ∼ N_8( β^{(G(x),y,z)}, Σ_θ ),   (3.13)
β^{(g,y,z)} ∼ N_8( γ^{(y,z)}, Σ_β ),   (3.14)
γ^{(y,z)} ∼ N_8( 0, Σ_γ ),   (3.15)

where G(x) returns the position type g of player x (e.g., point guard, center, etc.). In (3.13) the eight-dimensional state-specific shot propensity parameter θ^{(x,y,z)} is given a multivariate normal prior with mean vector β^{(G(x),y,z)}, denoting the average shot propensity parameter vector for all players who have the same position as player x (i.e., all players {x_1, . . . , x_{N_g}} for whom G(x_i) = G(x)). This effectively shrinks player x's estimated shooting propensity toward players who share his same position, and the shrinkage is more pronounced if he has less observed data. The second layer of the hierarchical prior has virtually an identical structure to (3.13); only here, the position-specific parameter vector β^{(g,y,z)} is given a multivariate normal prior with mean vector γ^{(y,z)}, which denotes the average propensity to shoot in court region y under defensive pressure z, regardless of player or position. For the final stage of the hierarchy, (3.15), we use the eight-dimensional zero-vector as the prior mean, yielding a 0.5 prior probability of shooting in any given region/defense combination. While this is an unrealistic shot probability for many states, our model is not sensitive to the values for this prior mean given the immense amount of data we have for each region/defense combination across the season in conjunction with weakly informative priors on each Σ in the hierarchy.

Each Σ in (3.13)–(3.15) is an AR(1) covariance matrix with correlation parameters and variance parameters corresponding to their respective levels of the hierarchy (i.e., ρ_θ, ρ_β, ρ_γ, and σ_θ^2, σ_β^2, σ_γ^2). We chose this covariance structure under the assumption that player shooting behavior is more similar over intervals of the shot clock that are close in time and that this correlation diminishes as the spacing between intervals increases. The AR(1) structure provides a natural construct to model this type of correlation. Specifically, ρ represents the correlation between shot propensity parameters in the same state of the MDP for adjacent intervals of the shot clock. We use half-Cauchy priors (Polson et al. 2012) on all the scale parameters and Uniform[0, 1) priors on the temporal correlation parameters:

σ_θ, σ_β, σ_γ ∼ half-Cauchy(0, 2.5),   ρ_θ, ρ_β, ρ_γ ∼ Uniform[0, 1).

The half-Cauchy(0, 2.5) prior contains 90% of its mass in the interval (0, 15.78), which is sufficiently noninformative for our application, and we only consider positive correlation for ρ given that temporal trends in shooting behavior are generally smooth.

Figure 3.4 shows a graphical representation of the model for π(·).
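The following R sketch simulates once from the prior hierarchy (3.13)–(3.15) for a single player/region/defense state, mainly to show the role of the AR(1) covariance over the eight shot-clock slices. The σ and ρ values are arbitrary placeholders; in the model they carry the half-Cauchy and Uniform[0, 1) priors above and are estimated in Stan.

library(MASS)   # for mvrnorm

ar1_cov <- function(sigma, rho, k = 8) {
  sigma^2 * rho^abs(outer(1:k, 1:k, "-"))   # AR(1) covariance over the 8 slices
}

gamma_yz  <- mvrnorm(1, mu = rep(0, 8), Sigma = ar1_cov(1.0, 0.9))   # region/defense level
beta_gyz  <- mvrnorm(1, mu = gamma_yz,  Sigma = ar1_cov(0.7, 0.9))   # position level
theta_xyz <- mvrnorm(1, mu = beta_gyz,  Sigma = ar1_cov(0.5, 0.9))   # player level

expit <- function(u) exp(u) / (1 + exp(u))
expit(theta_xyz)   # implied shot probabilities across the eight shot-clock intervals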

If a shot is taken (i.e., if a_n = “Shoot”), then the MDP episode terminates and the reward, R_{n+1}, is determined by R(·). Otherwise, r_{n+1} = 0 and the next state is determined by P(·).

Figure 3.4: A graphical representation of our model for π(·). The observable random variable is denoted by a gray box while parameters are denoted by unfilled circles. A_n^{(x,y,z)} denotes the action of player x in region y with defensive pressure z during time interval t_n. It is governed by the corresponding state's propensity parameter, θ_{t_n}^{(x,y,z)}. These player/region/defense parameters θ are modeled by a multistage hierarchical prior, including a layer for position/region/defense parameters β and region/defense parameters γ. The parameters θ, β and γ all have a temporal dimension of n_TPT = 8, which we convey horizontally in each layer of the graph. These multivariate parameter vectors have latent AR(1) covariance matrices, Σ_θ, Σ_β, Σ_γ. As shown by the plates in the figure, player-specific parameters are nested within position-specific parameters, while location and defensive pressure comprise the base model state space.

3.3.2 Transition probability model

Conditional on a_n = “Not shoot,” we model the probability that the play transitions to state s^{(x′,y′,z′)} as a function of the ballcarrier's latent propensity to transition to state s^{(x′,y′,z′)} given his current state s_n^{(x,y,z)} and the current interval of the shot clock t_n:

P(s, a, s′) = P( S_{n+1} = s^{(x′,y′,z′)} | a_n, s_n^{(x,y,z)}, t_n, λ ) = exp( λ_{t_n}^{((x,y,z),(x′,y′,z′))} ) / ∑_{i,j,k} exp( λ_{t_n}^{((x,y,z),(i,j,k))} ),   (3.16)

where S_{n+1} is a categorical random variable and λ_{t_n}^{((x,y,z),(x′,y′,z′))} denotes player x's propensity to transition to s^{(x′,y′,z′)} when in court region y under defensive pressure z. The symbols a_n, s_n^{(x,y,z)} and t_n are all as defined in Section 3.3.1. The support of S_{n+1} is equivalent to the state space S^{team}, defined in Section 3.2.2, with one additional state—the terminal state representing a turnover. Note that λ is a massive three-dimensional parameter array with dimensions |S^{team}| × (|S^{team}| + 1) × n_TPT. For a 15-player roster using three-second shot clock intervals for the TPT, this results in a (180 × 181 × 8) array, yielding a minimum of 260,640 player-specific parameters to estimate for a single team. As with the models for π(·) and R(·), we employ multilevel hierarchical priors for λ:

λ^{((x,y,z),(x′,y′,z′))} ∼ N_8( ζ^{((G(x),y,z),(G(x′),y′,z′))}, Σ_λ ),   (3.17)
ζ^{((g,y,z),(g′,y′,z′))} ∼ N_8( ω^{((y,z),(y′,z′))}, Σ_ζ ),   (3.18)
ω^{((y,z),(y′,z′))} ∼ N_8( 0, Σ_ω ).   (3.19)

In (3.17) ζ^{((G(x),y,z),(G(x′),y′,z′))} denotes the eight-dimensional average propensity for all players who have the same position as player x in court region y under defensive pressure z to transition to players with the same position as player x′ in court region y′ under defensive pressure z′. In (3.18) the position-specific parameter vector ζ^{((g,y,z),(g′,y′,z′))} is given a multivariate normal prior with mean vector ω^{((y,z),(y′,z′))}, which can be similarly defined at the global region/defense level. As in the model for π(·), we use the eight-dimensional zero-vector as the prior mean for ω^{((y,z),(y′,z′))}, and each Σ is an AR(1) covariance matrix with correlation parameters and variance parameters corresponding to their respective levels of the hierarchy (i.e., ρ_λ, ρ_ζ, ρ_ω, and σ_λ^2, σ_ζ^2, σ_ω^2). Due to the computational burden of jointly fitting this three-stage model (i.e., jointly fitting all 30 teams' transition dynamics—a total of nearly 10 million parameters), we use a two-stage modeling approach for P(·), which was imperative given our computational constraints. This enables us to fit each team's transition dynamics separately while still allowing us to borrow strength from the position-specific and location/defense levels of the hierarchy. The details of the two-stage fitting process for P(·) are described in the next section.
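For a single current state and shot-clock slice, evaluating (3.16) from a draw of λ is just a softmax over the destination states. The vector below is a random placeholder standing in for one row of the λ array, not a fitted value.

softmax <- function(x) {
  z <- exp(x - max(x))       # subtract the max for numerical stability
  z / sum(z)
}

lambda_row <- rnorm(181)                          # 180 team states + 1 turnover state
p_next     <- softmax(lambda_row)                 # transition probabilities given "Not shoot"
next_state <- sample(seq_along(p_next), 1, prob = p_next)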

3.3.3 Transition model two-stage approximation

This section describes our two-stage approximation for the transition probability function P(·). We first lay out both stages in mathematical terms, then define all terms and the mechanics of the fitting process:

STAGE 1

ζ^{((g,y,z),(g′,y′,z′))} ∼ N_8( ω^{((y,z),(y′,z′))}, Σ_ζ ),   (3.20)
ω^{((y,z),(y′,z′))} ∼ N_8( 0, Σ_ω ),   (3.21)

STAGE 2

λ^{((x,y,z),(x′,y′,z′))} ∼ N_8( ζ̂^{((G(x),y,z),(G(x′),y′,z′))}, S_λ^{approx.} ).   (3.22)

In stage 1 all symbols are as defined in the main text for equations (3.18) and (3.19). We fit these two layers of P(·) exactly as described for π(·) in Section 3.3.1—the hyperpriors for σ_ζ and σ_ω are half-Cauchy(0, 2.5) and the corresponding correlation parameters are Uniform[0, 1). After fitting stage 1 we can fit stage 2 separately for each team by using a fixed covariance matrix S_λ^{approx.} in the multivariate normal prior on λ. For a given team we initialize the prior means of the player-specific parameters, λ, at the corresponding posterior means of the position-specific parameter estimates, ζ̂, estimated in stage 1. Choosing the prior variance for λ requires greater discretion, since the covariance matrix governs how strongly λ will shrink toward each player's position-specific parameter estimate ζ̂ from stage 1. As in our model for π(·), S_λ^{approx.} is given an AR(1) structure; therefore, fixing the covariance matrix amounts to selecting values for σ_λ and ρ_λ. We assume that the temporal covariance structure governing the shot policy model π(·) resembles that of the transition model P(·). This allows us to leverage our knowledge of the ratio between σ_θ and σ_β in selecting a value for σ_λ.

Using the posterior means as estimates, (σ̂_θ / σ̂_β) = 1.8; hence, the player-specific parameters have a standard deviation roughly twice as big as that of the group-specific parameters in the policy model. Since we have an estimate for the group-level standard deviation of the transition model from stage 1, σ̂_ζ, we multiply this value by the ratio from the policy model, 1.8, to select an appropriate value for σ_λ. Hence, we fix σ_λ = σ̂_ζ × 1.8 = 2.14. Likewise, we fix ρ_λ at the corresponding value in the policy model, which is 0.94. We can then create the covariance matrix S_λ^{approx.} and fit stage 2 of the approximation to get estimates for λ.

3.3.4 Reward function

In context of a basketball play, (3.2) can be restated as, “How many immediate points do we expect when a player in state s takes action a?” If the action is a shot, then this expected value is his expected points per shot from the given state; otherwise, it is zero.5 This allows us to define the reward function of the MDP completely in terms of a shot-efficiency model. Prior to formally defining R(·), we propose a model for the probability that a shot is made. Given a shot, we model the make-probability as a function of the shooter's skill and a region-specific additive effect if the shot was uncontested or open,

Make(s) = P( M_n = 1 | s_n^{(x,y,z)}, µ, ξ ) = expit( µ^{(x,y)} + I(z_n = “Open”) × ξ^{(y)} ),   (3.23)

where M_n is a Bernoulli random variable for whether the attempted shot in moment n was made, µ^{(x,y)} denotes player x's contested shooting skill in court region y, I(·) is an indicator function of the defensive pressure z_n in moment n and ξ^{(y)} is the effect in court region y if the shot is uncontested. Note that, in this model, defensive pressure is a region-specific additive effect rather than being built into the player-specific parameters. This is because detecting player-specific differences in how defensive pressure affects shot-make probability is impracticable. While minor differences certainly exist, it would take massive amounts of shot data—more than we have—to detect these differences with statistical confidence.

The reward function R(·) is the scaled make-probability for each state if a shot is taken (scaled by 2 or 3, depending on the court region) and 0 in the case that a shot is not taken or a turnover occurs,

R(s, a) =
\begin{cases}
3 × Make(s), & y ∈ \{\text{3-pointer}\}, \ a_n = “Shot”, \\
2 × Make(s), & y ∈ \{\text{2-pointer}\}, \ a_n = “Shot”, \\
0, & \text{otherwise}.
\end{cases}
\qquad (3.24)
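A direct R rendering of (3.23)–(3.24) is given below. The values of µ and ξ are placeholders rather than posterior estimates, and the mapping of regions to point values simply treats the two 3-point regions as worth three points.

expit <- function(u) exp(u) / (1 + exp(u))

make_prob <- function(mu_xy, xi_y, open) {
  expit(mu_xy + as.numeric(open) * xi_y)          # equation (3.23)
}

reward <- function(region, shoot, mu_xy, xi_y, open) {
  if (!shoot) return(0)                           # no shot (or turnover): zero points
  pts <- if (region %in% c("corner 3", "arc 3")) 3 else 2
  pts * make_prob(mu_xy, xi_y, open)              # equation (3.24)
}

reward("arc 3", shoot = TRUE, mu_xy = -0.6, xi_y = 0.25, open = TRUE)   # about 1.24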

As with P (·) and π(·), we use a multi-stage hierarchical prior for the player-specific parameters µ: ( ) (x,y) ∼ (H(x),y) 2 µ N ψ , σµ , (3.25) ( ) (h,y) ∼ y 2 ψ N φ , σψ , (3.26) ( ) y ∼ 2 φ N 0, σφ . (3.27)

In (3.25) µ^{(x,y)} is given a normal prior with mean ψ^{(H(x),y)}, which denotes the average shooting skill of all players who share the same group as player x (i.e., all players {x′_1, . . . , x′_{N_h}} for whom H(x′_i) = H(x)) in location y.

5In our analysis we have omitted plays that ended in fouls and any free throw situations.

We emphasize that, while P(·) and π(·) define player groups g using naive player positions (center, power forward, point guard, etc.), this model uses new groups, h, on which to base the regularization. The reason for this change is that a player's shooting skill does not have as clear a correspondence to his naive position. As such, we create customized groupings to ensure sensitivity to this variation. We will detail how we constructed these new groups shortly. In the second layer of the hierarchical prior, ψ^{(h,y)} is given a normal prior with mean φ^{y}, which denotes the global average shooting skill from court region y. As in the models for π(·) and P(·), the final stage of the hierarchy is given a mean-0 normal prior with variance σ_φ^2. Note that all the parameters in this model are given univariate priors rather than multivariate priors since we model a player's shooting skill as being constant over the shot clock. Next, we define the priors for ξ, the region-specific additive effects for open (uncontested) shots,

ξ^{y} ∼ half-normal( 0, σ_ξ^2 ).   (3.28)

We use a half-normal prior distribution since, all else being equal, open shots have higher make probabilities than contested shots. Finally, as with the scale hyperpriors for π(·), we use half-Cauchy priors for the scale parameters in R(·),

σ_µ, σ_ψ, σ_φ, σ_ξ ∼ half-Cauchy(0, 2.5).

To create the new player groupings h, we first clustered players into three categories based exclusively on the volume of shots they took over the course of the season, irrespective of the shot locations. Next, we reclustered players into six shot propensity categories based on the proportional breakdown of their shots by court region, irrespective of volume. In both clustering procedures we used the k-means algorithm initialized at cluster centroids calculated via Ward linkage (Ward Jr 1963). We then crossed these clusters, giving a total of 18 groups which differentiate players by how much they shoot and from where they tend to shoot. Table 3.1 shows three example players in each cluster. We also assume independence across plays for shot make probabilities, which is a debated area of research (e.g., Neiman & Loewenstein (2011)).
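The two clustering steps can be sketched as follows; the shot-count matrix here is simulated, and for brevity the k-means calls use random initializations rather than the Ward-linkage centroids used in the paper.

set.seed(1)
# Hypothetical season shot counts: one row per player, one column per court region
shot_counts <- matrix(rpois(450 * 6, lambda = 40), nrow = 450,
                      dimnames = list(NULL, c("rim", "paint", "mid-range",
                                              "corner 3", "arc 3", "back-court")))

volume     <- rowSums(shot_counts)            # total shots, irrespective of location
proportion <- shot_counts / volume            # regional shot mix, irrespective of volume

vol_cluster  <- kmeans(volume, centers = 3)$cluster       # three shot-volume groups
prop_cluster <- kmeans(proportion, centers = 6)$cluster   # six shot-region groups

h <- interaction(vol_cluster, prop_cluster)   # crossed groups (up to 18 levels)
table(h)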

Table 3.1: Players were independently clustered by shot volume and shot-region propensity. The table shows three players in each group after crossing the resulting clusters.

High volume —
  Equal Balance: L. James, R. Westbrook, D. Cousins
  3-point Heavy: D. Lillard, K. Love, J. Harden
  Mid Heavy: J. Wall, D. Nowitzki, K. Leonard
  Rim Heavy: A. Davis, I. Thomas, D. Wade
  3-point Specialist: S. Curry, T. Ariza, W. Matthews
  Rim Specialist: A. Drummond, G. Monroe, J. Okafor

Medium volume —
  Equal Balance: L. Barbosa, L. Stephenson, N. Jokic
  3-point Heavy: P. Beverly, E. Ilyasova, O. Porter
  Mid Heavy: R. Rubio, M. Speights, M. Belinelli
  Rim Heavy: D. Favors, E. Turner, B. Portis
  3-point Specialist: K. Korver, J. Terry, N. Mirotic
  Rim Specialist: D. Jordan, T. Booker, C. Capela

Low volume —
  Equal Balance: A. Roberson, D. Motiejunas, K. McDaniels
  3-point Heavy: J. Jerebko, B. Jennings, D. Augustin
  Mid Heavy: M. Muscala, C. Watson, T. Prince
  Rim Heavy: A. Miller, A. Varejao, D. Powell
  3-point Specialist: J. Ingles, J. Ennis, B. Rush
  Rim Specialist: A. Bogut, B. Bass, J. McGee

3.3.5 Inference and validation

After removing plays we are not interested in modeling (plays that terminated in fouls, timeouts, jump balls or in the backcourt), we have 155,656 plays (≈ 1.93 million observations) on which we fit our models. We held out a sample of approximately 28,000 plays to use for model validation. We fit our models using Stan, an open-source software package which offers a suite of MCMC methods for statistical inference (Carpenter et al. 2017). For each model we initialized two chains and let them mix long enough to ensure we had a potential scale reduction factor < 1.05 for every parameter. Effective sample sizes ranged from 48 to 15,000 across the set of parameters. Details on the Stan model scripts for π(·), P(·) and R(·) can be found in the companion GitHub repository for this paper.

Table 3.2: MCMC details and diagnostics for each fitted model. P1(·) and P2(·) refer to the 1st and 2nd stages of P(·). Specifically, P2(·) refers to the 2nd-stage fit on the Cavaliers TPT.

Parameter                    π(·)     P1(·)    P2(·)    R(·)
MCMC samples (per chain)     2000     16,000   1500     11,000
Burn-in                      500      1000     500      1000
Minimum eff. sample size     48       76       371      181
Maximum R̂                    1.039    1.022    1.000    1.026

Following Franks et al. (2015) and Cervone et al. (2016), both of which utilize Bayesian hierarchical models in conjunction with NBA optical tracking data, we use out-of-sample log-likelihood as a mechanism for model validation. Table 3.3 shows out-of-sample log-likelihoods for four models of increasing complexity for each component of the MDP. The transition model column, P(·), represents log-likelihoods computed using only the Cleveland Cavaliers TPT model, whereas the other columns comprise the entire league.

For all three components, the models with player-specific shrinkage perform best. All subsequent references to MDP model components refer to the models in row D of Table 3.3.

Table 3.3: Out-of-sample log-likelihoods for four models of increasing complexity over each component of the MDP.

Model                                π(·)       P(·)       R(·)
A. Empirical model                   −36,808    −17,299    −5956
B. Model A + location shrinkage      −25,187    −38,702    −4571
C. Model B + position shrinkage      −24,467    −25,099    −4561
D. Model C + player shrinkage        −21,553    −13,478    −4540

3.3.6 Model fit

Figure 3.5 shows 95% credible intervals for the transition probabilities in the base level of P (·).

Figure 3.5: Estimated league-wide transition probability tensor for the top level of the hierarchy on which each team’s TPT is built. Within each plot frame the 95% credible interval of the origin to destination transition probability is shown in dark gray and the posterior mean is shown by a black line. Within each plot frame the x-axis represents time on the shot clock, while the y-axis represents the transition probability. Across plot frames the y-axis represents the origin state and the x-axis represents the destination state. Corner 3, paint and rim states are omitted to maintain a practical size for the figure.

As shown in the block-diagonal frames of Figure 3.5, the highest transition probabilities are to the same state, due to the predominant influence of dribbles in the data. Conversely, it is improbable for the ball to transition immediately to a state which is not directly geographically adjacent. Interestingly, the defensive pressure of the destination state appears to have a much larger impact on transition probabilities than the defensive pressure of the origin state. The estimated shot policies and reward functions for LeBron James and Kyrie Irving of the Cleveland Cavaliers are shown in Figure 3.6. The strong temporal autocorrelation captured by the model (ρ̂_θ = 0.94) significantly smooths jagged empirical policies, yielding more plausible results. The two players' policies look quite similar, with the exception that Irving tends to take contested midrange shots more frequently than James.

[Figure 3.6 panels: “Contested Shot Policies” (top) and “Contested Reward Functions” (bottom) for LeBron James and Kyrie Irving. Legend shot counts — James: Arc 3 = 93, Rim = 269, Mid-range = 152; Irving: Arc 3 = 74, Rim = 92, Mid-range = 144.]
Figure 3.6: Estimated shot policies (95% credible intervals) and reward functions (posterior densities) for LeBron James and Kyrie Irving in three sample states. The shot policies are overlaid with dots corresponding to their empirical shot policies; the sizes of the dots are relative to how many shots they took within that time interval from the indicated state. The reward functions are also overlaid with the empirical points per shot and the number of shots they took from each state is given in the legend.

Interestingly, Irving also takes contested midrange shots more frequently than he takes arc 3-point shots. In general, this is considered poor shot selection because most players have a higher expected points per shot (EPPS) from beyond the arc than from the midrange. However, Irving appears to be an anomaly in this respect; his midrange reward distribution is greater than his arc 3 distribution in expectation. His estimated shot policy suggests that he knows his strengths and acts accordingly.

3.4 Simulating plays

Having fit the models for the latent components of the MDP, our next task is to simulate plays using these models. In this section we first describe our algorithm for simulating a basketball play, and we conclude the section by comparing our simulations to the observed trajectories.

3.4.1 Play simulation algorithm

The algorithm requires seven inputs: s_0, θ, λ, µ, ξ, c_0 and L(·). The first piece, s_0, denotes the starting state of a play. For these we use the observed starting states for each team's collection of plays in the 2015–2016 NBA regular season. Note that we do not treat the number of plays in a season nor the states in which plays begin as random. Consequently, we do not analyze rebounding; the number of plays is fixed beforehand, and once a turnover happens or a shot is taken, the play ends.

Next, we require parameters to govern the components of the MDP which stochastically generate the states visited in plays, when shots occur and the point values of taken shots. For these inputs we use the model fits for these components described in Section 3.3 (i.e., θ for π(·), λ for P(·) and (µ, ξ) for R(·)). Specifically, we take random draws from each parameter's respective set of Markov chains used to approximate its posterior distribution, f̂(·|D), where D denotes the training data. By using random draws from f̂(·|D) for each parameter rather than a single functional of the estimated posterior distribution (e.g., the posterior means), the uncertainty in our estimation gets propagated through to our simulations. In other words, a range of plausible parameter values are used in the simulations rather than point estimates. This is critical in accurately quantifying the uncertainty in our simulations. We denote a posterior draw from an arbitrary parameter's posterior distribution by a tilde over the parameter (e.g., λ̃ denotes a posterior draw from f̂(λ|D)).

Lastly, we must account for the shot clock, including starting shot-clock times for all plays and a mechanism to take time off of the shot clock at each step of the MDP. As in the case for s_0, we use the observed shot-clock times at the start of each play, denoted c_0, for each team's collection of plays in the 2015–2016 NBA regular season. To take time off of the shot clock at each step of the MDP, we sample the team's empirical distribution of time-lapses between events conditional on the current interval of the shot clock. This component of the simulator makes performing analytical operations on the process intractable because the distribution of time lapses between events does not lend itself to a parametric distribution.

We denote this empirical distribution by L(·), which is a function of t_n. Algorithm 2 details the conceptual structure of the simulation process.

Algorithm 2: Basketball play simulator.
Input: s_0, θ̃, λ̃, µ̃, ξ̃, c_0, L(·)
Output: List of the simulated states (terminal and intermediary), actions and rewards

n ← 0
while s_n ≠ Turnover do
    t_n ← T(c_n)
    a_n ← Bernoulli variate from π(· | θ̃, s_n, t_n)
    if a_n = Shot then
        r_{n+1} ← R(s_n, a_n | µ̃, ξ̃)
        break loop
    else
        r_{n+1} ← 0
    end
    lapse ← draw from L(t_n)
    c_{n+1} ← c_n − lapse
    if c_{n+1} < 0 then
        s_{n+1} ← Turnover (shot clock violation)
        a_{n+1} ← NULL
        r_{n+1} ← 0
    else
        s_{n+1} ← categorical variate from P(· | λ̃, s_n, t_n)
    end
    n ← n + 1
end
return {s, a, r}

In Algorithm 2, n indexes the sequential events in the play. The while-loop iteratively generates actions, states and zero-valued rewards until either: (1) a shot is taken, at which point the reward is determined by R(·) and the loop breaks (i.e., the play terminates), or (2) the play transitions to the “Turnover” state which can also occur by the shot clock expiring. For each play we keep track of all generated states, actions and rewards.
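A compact R rendering of the loop in Algorithm 2 is sketched below. Every input is a placeholder standing in for one posterior draw or empirical quantity: shoot_prob is a (state × 8) matrix of shot probabilities, trans_prob a (state × state × 8) array, epps a vector of expected points per shot by state (zero for the turnover state) and lapse_by_slice a list of eight vectors of observed time lapses. None of these names come from the companion repository; the sketch only illustrates how the pieces fit together.

simulate_play <- function(s0, c0, shoot_prob, trans_prob, epps,
                          lapse_by_slice, turnover_state) {
  slice <- function(c) ceiling(c / 3)                 # T(c_n) with n_TPT = 8
  s <- s0; c <- c0; reward <- 0; states <- s0
  while (s != turnover_state) {
    t_n <- slice(c)
    if (runif(1) < shoot_prob[s, t_n]) {              # a_n = "Shoot"
      reward <- epps[s]                               # R(s_n, a_n)
      break
    }
    c <- c - sample(lapse_by_slice[[t_n]], 1)         # draw a time lapse from L(t_n)
    if (c <= 0) {
      s <- turnover_state                             # shot clock violation
    } else {
      s <- sample(seq_along(epps), 1, prob = trans_prob[s, , t_n])
    }
    states <- c(states, s)
  }
  list(states = states, reward = reward)
}

Repeating a call like this over a team's observed (s_0, c_0) pairs, with fresh posterior draws for each replicated season, would produce simulated season totals of the kind used in the calibration and policy comparisons below.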

3.4.2 Calibration

We can be extremely detailed in checking the calibration of our simulations since we keep track of all simulated intermediary and terminal transitions. To assess the calibration, we simulate 300 seasons for the Cleveland Cavaliers using the observed starting states of all their 2015–16 plays and compare our simulations to the actual transition counts. Note that these simulations are on-policy, meaning they are computed using variates of the shot policy estimated on the observed data. In making this comparison we must be cognizant of overfitting; the empirical model will always yield optimal calibration because the empirical model fits both trends and errors. Models with regularization may appear less calibrated but ultimately give better fits because the modeling of errors is attenuated by the induced shrinkage. Figure 3.7 shows simulated player-aggregate transition counts for these 300 simulations (gray) for the Cavaliers' starting lineup overlaid with the observed counts (black, dashed).

Figure 3.7: 300 simulated (gray) season-total transition counts over the shot clock overlaid with the corresponding observed counts (black) for the 2015–2016 season. Within each plot frame, the x-axis represents time on the shot clock, while the y-axis represents total transition counts. Across plot frames, the y-axis represents the origin state and the x-axis represents the destination state.

Our simulations capture the aggregate transition count trends over the shot clock with high integrity; however, they appear to be slightly biased low for some state pairs. On the other hand, simulated transition counts for low-usage players (not shown) are generally biased high. As noted previously, these phenomena are due to shrinkage in the hierarchical model, which we are quick to note is not a model deficiency. As evidenced in Table 3.3, this borrowing of information improves out-of-sample model fit, giving us more reliable calibration on macro-level features.

3.5 Altering policies

With confidence that the method accurately reproduces play sequences under the observed policy model, we now simulate team-specific plays under altered shot policies. However, before providing examples of altered policies we pause to discuss some relevant topics from game theory.

3.5.1 Game theory

Optimal policies

A general assumption of this paper is that teams are not operating on optimal shot policies. This is difficult to test, but there is research that supports this conclusion (Goldman & Rao 2014a, Skinner 2012). Regarding optimal stopping times (i.e., when a player shoots during a play relative to the shot clock), Goldman & Rao (2014a) show that, while the league as a whole closely follows the optimal curve on average, individual players are not perfect optimizers, often exhibiting a tendency to undershoot. Even under the assumption that a team is operating optimally, players and coaches could still gain utility by exploring adverse effects of changes to this policy.

Allocative efficiency

A player’s shot efficiency depends on the volume of opportunities he is allocated. The mathematical formulation of this concept originates in the work of Oliver (2004). Oliver defines the relationship between a player’s usage and his efficiency as a “skill curve” and suggests that it should generally exhibit a downward trend, meaning that players become less efficient as they carry more of the scoring load. This relationship is important in context of altering shot policies. As explained in Goldman & Rao (2014b), if a team changes its shot policy to take more 3-point shots, the team has to accept lower quality 3-point opportunities on the margin. This will lead to lower expected values for these additional shots but higher expected values for the 2-point shots that by consequence have a lower usage rate due to the increase in 3-point shots. There is a counter-balancing relationship for policy changes due to moving up (or down) the skill curve. For our purposes the simulation method should not bias the results of testing policy changes, as long as the changes are not drastic.

Defensive response

If a team makes a tactical change that gives them an advantage, it is reasonable to assume that the defense will attempt to eliminate the advantage. This defensive response brings up some important questions in context of our project—“How sensitive are defenses to policy alterations?” and “How long does it take for a defense to respond sufficiently to make a policy change ineffectual?” These questions depend on too many variables to suggest a single answer; however, we offer some observational evidence from the past two NBA regular seasons that suggests that, in some cases, the defensive response resulting from a team's altered shot policy does not render its strategy change ineffectual over the course of a season. In the 2016–17 NBA regular season, the Toronto Raptors took on average 30.5% of their shots from 3-point range, and they averaged 1.1 points per shot on these attempts.6 In the 2017–18 season, they took 39.6% of their shots from 3-point range. This represents a 30% increase in their team 3-point shot policy. Despite this increase, the Raptors' expected points per shot (EPPS) from beyond the arc decreased by less than 2% (from 1.1 in 2016–17 to 1.08 in 2017–18). Additionally, the Raptors' overall EPPS increased from 1.1 in 2016–17 to 1.14 in 2017–18. So while the policy change resulted in a small loss of 3-point efficiency (perhaps due to defensive adaptation), the defensive response was not strong enough to nullify the net benefit of the Raptors' policy change.

We acknowledge that this example is observational; these results could be due to season-to-season variability or the outcome of other variables, such as the addition of new players or the development/decline of returning players. Ultimately, predicting season outcomes for alternate policies is an extrapolation. As such, we believe that testing minor perturbations to a team's policy will yield more credible results and that proposed changes should be carefully crafted prior to testing.

3.5.2 Shot policy changes

We implement on-policy simulations by turning the crank of Algorithm 2 using a team’s

2015–16 collection of starting state and shot clock time pairs (s0, c0) for all of their regular season plays in tandem with posterior draws from the MDP model fits. This results in one simulation of a season. Simulating a season with a policy alteration follows the same process with one additional step—for each simulation we transform the posterior draw of the shot policy model according to our alteration specifications, then simulate seasons with these modified posterior draws. An example may help clarify exactly how we perform this computationally.

Suppose we want to test a policy change in which player x_i shoots in all court regions and at all intervals of the shot clock with a 10% increase in frequency. We first modify

6These statistics were gathered from stats.nba.com.

the posterior draws of the policy parameters θ̃ according to the specified alteration for all affected elements,

θ̃_{t_n}^{(x,y,z) alt} = 0.1 × θ̃_{t_n}^{(x,y,z)} + θ̃_{t_n}^{(x,y,z)},   ∀ y, z, t_n, and x such that x ≡ x_i,   (3.29)

where θ̃_{t_n}^{(x,y,z) alt} represents an altered element of θ̃ and all other symbols are as defined previously in the text.7 We now show two examples of policy changes that could be explored with our methods and compare the altered policy simulations to on-policy simulations. For each policy we simulate 300 seasons for the Cleveland Cavaliers. The results are shown in Figure 3.8.
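Computationally, the alteration in (3.29), together with the cap described in footnote 7, reduces to a vectorized transformation of the policy draw. The objects below (theta_draw, rows_for_player_i) are hypothetical stand-ins for one posterior draw and the rows of θ belonging to player x_i; they are not objects from the fitted models.

alter_policy <- function(theta_draw, rows_for_player_i, bump = 0.10, cap = 0.9) {
  theta_alt <- theta_draw
  theta_alt[rows_for_player_i, ] <-
    pmin((1 + bump) * theta_draw[rows_for_player_i, ], cap)   # eq. (3.29) with the cap
  theta_alt
}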

Alteration 1. Reduce the contested midrange shot policy by 20% for all players on the team while more than 10 seconds remain on the shot clock.

Alteration 2. Regardless of time on the shot clock, reduce all contested midrange shot policies by 70% while doubling all three-point shot policies.

Figure 3.8: Left to right: distributions of simulated contested midrange shots, 3-point shots, expected points per shot and expected points per play, under the on-policy simulations and alternate policies 1 and 2.

The most obvious distinction between the policies is the divergence between the contested midrange and 3-point shot distributions, which is not surprising since we directly altered these shot policies. However, in order to measure whether the policy yields a net positive result for a given team, we must quantify how the altered policy affects efficiency and production. To measure these effects, we restrict our attention to the differences in EPPS and expected points per play (EPPP). Under policy 2, shot efficiency increases (1.04 to 1.09 in EPPS) as does play production (0.92 to 0.97 in EPPP).

7We set a maximum threshold for altered shot policy parameters. If any of the parameters that would be altered end up with θ̃_{t_n}^{(x,y,z) alt} > 0.9, we cap the alteration at 0.9. For reasonable policy alterations this issue shouldn't arise.

distributions show no practical differences, largely due to only 7.5% of plays ending in a midrange shot with over 10 seconds on the shot clock, limiting the potential impact.

3.5.3 Passing policy changes

With a few modifications we can consider broader policy changes that encompass not only shooting but also passing and movement. This entails altering the probabilities of nonterminating state transitions via the TPT.8 We now explore two altered policies of this nature; the results are shown in Figure 3.9.

Alteration 3. Reduce the transitions from Irving to James by 90%.

Alteration 4. Triple the transition probabilities from all veterans to players on rookie contracts, while reducing the transition probabilities from rookie contract players to veterans by 75%.


Figure 3.9: Left to right: distributions of simulated transitions from Irving to James, Irving’s total shots, James’ total shots, and expected points per play.

Alteration 3 represents a pathological example in which Irving forces his way into being the dominant player on the team by almost never distributing the ball to James. The downstream effects of the dominant Irving policy lead to an 18% increase in his expected total shot count, while James' is reduced by 13%. Interestingly, though Irving's and James' total shot distributions change dramatically, our method predicts that the overall differences in production would be negligible. Alteration 4 could be described as a youth development policy, where veteran players are asked to take a back seat and players on rookie contracts are given the green light

8In addition to the game theoretic consequences mentioned previously, new complexities arise with altering passing/movement policies in the context of our model framework. Many state-transition pairs in the TPT are not physically possible (e.g., a player cannot transition directly from the backcourt to the restricted area). Also, any change where we increase player-to-player transition probabilities is potentially problematic. Passing more often to a player in a specific location hinges on the assumption that the other player is correspondingly available in that location which is something we do not control in our model.

on offense. This policy change has a much larger predicted impact on production. We estimate this policy change would cost the Cavaliers 0.02 EPPP, which could have significant consequences on win totals and playoff outcomes.

3.6 Conclusion

We have developed and implemented a method to test the impact of shot-clock dependent policy adjustments over the course of a season at an unprecedented level of detail while accounting for model uncertainty in every aspect of the system. These methods could have immediate practical impact across multiple levels of a basketball organization. Coaches could assess proposed strategy changes outside of games rather than risking poor results by testing them in games. Front offices could explore the performance of hypothetical rosters by leveraging the position-level transition probabilities in tandem with their player-specific shot policies and reward functions. These tools could prove useful in evaluating trades and in free-agency negotiations. Additionally, our methods could enable teams to gauge the effects of having to play second-string players if any starters suffered a long-term injury.

The examples we have shown in this paper are only the tip of the iceberg in terms of how these methods could be utilized. We have primarily considered shooting decisions in this introductory work, but, as shown in Section 3.5.3, our methodology naturally scales to include all different types of basketball decisions, allowing coaches and analysts to explore incredibly nuanced tactical changes. Additionally, with similar tracking data now available for most major sports including hockey, football and soccer, our methods could extend to testing decision policies in other sports.

In a broader statistical context we have provided and implemented a novel framework for modeling within-episode nonstationarity in MDPs through the use of policy and transition probability tensors. We have also shown how to combine multiple MDPs into a single weighted average process, which can enable solutions to problems that were previously impracticable to compute. Additionally, we've built a method to simulate from this type of MDP when the arrival times cannot be modeled parametrically. These contributions could be beneficial in many different areas such as traffic modeling, queuing applications and environmental processes.

Chapter 4

Inverse Decision Problems

In Chapters 2 and 3, we assumed that the agents (players and coaches) were not acting optimally. By contrast, in the final project we assume that the agents’ actions are optimal, but that the criteria over which they optimize are unknown. The goal of the analysis is to make inference on these latent optimization criteria. We term this type of problem—where decisions are assumed to be optimal and the goal is to estimate features of the optimization such that the optimality assumption holds—an inverse decision problem. Some inverse decision problems have been extensively studied, particularly in the subfield of inverse optimization within operations research (Ahuja & Orlin 2001, Heuberger 2004, Chan, Lee & Terekhov 2019). Here we provide a high-level overview of this subfield and explain how the problem we consider in Chapter 5 is unique. The concepts are illustrated through a synthetic example and an application of fourth-down decision-making in football.

4.1 Inverse optimization in operations research

In operations research, inverse optimization is often treated in reference to linear programming problems. As summarized in Gallego (2017), linear problems are composed of two pieces. The first piece is the objective function that the agent wants to maximize. The second piece is the constraints, a set of equations that define reality. Linear programs can be expressed as follows:

$$\underset{x}{\arg\max}\;\; c^\intercal x \tag{4.1}$$
$$\text{subject to}\quad Ax \le b, \tag{4.2}$$

where A, b, and c are known parameters and x is the decision variable. The agent's goal is to select x to maximize $c^\intercal x$ subject to the constraints of (4.2). For the inverse linear programming problem, we essentially have the exact same setup, only the decision variable $x^*$ is considered fixed and we are trying to solve for A, b, and c such that they render the given solution $x^*$ optimal.
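As a toy illustration (not drawn from any of the cited papers), the sketch below solves a small forward linear program with scipy and then recovers one cost vector that renders the observed solution optimal. At an optimal vertex, any nonnegative combination of the active constraint normals makes $x^*$ optimal, so the inverse problem generally admits a set of solutions rather than a unique one.

```python
import numpy as np
from scipy.optimize import linprog

# Forward problem: maximize c^T x subject to Ax <= b, x >= 0 (a tiny 2-D example).
A = np.array([[1.0, 1.0],   # x1 + x2 <= 4
              [1.0, 0.0],   # x1      <= 3
              [0.0, 1.0]])  #      x2 <= 3
b = np.array([4.0, 3.0, 3.0])
c_true = np.array([2.0, 1.0])

# linprog minimizes, so negate the objective to maximize.
res = linprog(-c_true, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
x_star = res.x  # the "observed" decision, here [3, 1]

# Inverse problem: treat x_star as fixed and look for a cost vector that makes
# it optimal.  At a vertex, c must lie in the cone spanned by the normals of
# the constraints active at x_star; an equal-weight combination is one choice.
active = np.isclose(A @ x_star, b)
c_recovered = A[active].sum(axis=0)
print(x_star, c_recovered)
```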

While the problem we consider in Chapter 5 also deals with the optimization of a linear program as in (4.1)-(4.2), there are some key differences. First, the constraint parameters (i.e. A and b) are known and fixed. The cost vector c is also fixed, but it is unknown to the optimizer and the uncertainty in c is represented by a statistical model. Finally, the optimization occurs sequentially. The goal of the analysis is to identify a function of the surrogate for $c^\intercal x$ that renders the next x chosen by the agent an optimal choice. Essentially this amounts to learning how people synthesize uncertainty when making choices.

4.2 Examples

Two examples are used to illustrate scenarios where the rewards of one's actions follow a distribution of outcomes. Due to the distributional nature of the rewards in these examples, the optimal actions depend on the optimizer's risk preferences. For this reason, many different actions can be framed as being optimal, despite being conditioned on the exact same state of the environment.

4.2.1 Multi-armed bandit

Consider a multi-arm bandit problem with six arms where the reward for each arm is determined by a probability distribution. While each arm's probability distribution governing its reward is unique, all the arms have the same expected value of 0. Therefore, over many pulls of the arms, the running averages for each option will converge to 0, but the distributions will all be different. The left-hand panel of Figure 4.1 plots the reward distributions for a six-armed bandit that follows this scenario. The reward distributions are displayed as violin plots (i.e. the densities are reflected vertically about the x-axis in each plot). Imagine an agent is tasked with selecting an optimal arm of this bandit. To do so, they must decide how to synthesize the reward distribution for each arm. If they use the expected value, then there is obviously no globally optimal action. However, if the agent makes decisions either optimistically (risk tolerant) or pessimistically (risk averse) then there are clearly actions that adhere to these preferences better than others. The colored lines overlaid on each violin plot show the .05 (blue), .25 (purple), .75 (red), and .95 (green) quantiles of each respective distribution.


When considering a single play of the bandit, an extreme risk seeker might prefer to pull the second arm, which would correspond to optimizing over the .95 quantile (among other upper quantiles as well). If the agent is extremely risk averse, the optimal decision would be to pull the first arm, which corresponds to optimizing with respect to the .05 quantile (among other quantiles). Interestingly, if the risk seeker plans to gamble for an extended period of time, then despite having the same preference of optimizing with respect to the .95 quantile, the optimal decision would instead be to pull the fourth arm. Since the reward variance is largest for the fourth option and sums of random variables converge to Gaussian distributions by the central limit theorem, in the long run (i.e. in terms of the cumulative sum of reward) this arm will have the highest value for all quantiles greater than .5.

The inverse decision problem would be to watch an agent pull the arms over an extended period of time, then make inference about their risk preferences by analyzing their observed decisions.

Figure 4.1: A multi-arm bandit with six arms. The expected value of the payout for each choice is 0; however, the distributions governing the reward for each arm are different, as shown by the violin plots for each respective action. The left plot shows the reward distributions for a single play, while the right plot shows the distributions of cumulative reward after 30 plays. The colored lines overlaid on the violin plots show the .05 (blue), .25 (purple), .75 (red), and .95 (green) quantiles of each distribution.
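A small simulation makes the single-play versus cumulative-reward distinction concrete. The reward distributions below are hypothetical stand-ins chosen to mimic the qualitative shape of Figure 4.1 (a right-skewed arm A2 and a high-variance, left-skewed arm A4); they are not the distributions used to generate the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sim, n_pulls = 20_000, 30

# Hypothetical mean-zero arms: A2 has a long right tail with modest variance,
# A4 has a long left tail with a much larger variance.
arms = {
    "A1": lambda n: rng.normal(0.0, 1.0, n),
    "A2": lambda n: rng.exponential(3.0, n) - 3.0,
    "A3": lambda n: rng.normal(0.0, 2.0, n),
    "A4": lambda n: 5.0 - rng.exponential(5.0, n),
    "A5": lambda n: rng.normal(0.0, 3.0, n),
    "A6": lambda n: rng.uniform(-4.0, 4.0, n),
}

single = {a: draw(n_sim) for a, draw in arms.items()}
cumulative = {a: draw(n_sim * n_pulls).reshape(n_sim, n_pulls).sum(axis=1)
              for a, draw in arms.items()}

# A risk seeker optimizing the .95 quantile prefers different arms depending on
# whether they face a single pull or the total reward over a 30-pull round.
q95_single = {a: np.quantile(x, 0.95) for a, x in single.items()}
q95_cumulative = {a: np.quantile(x, 0.95) for a, x in cumulative.items()}
print("single pull :", max(q95_single, key=q95_single.get))          # A2
print("30-pull sum :", max(q95_cumulative, key=q95_cumulative.get))  # A4
```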

4.2.2 Fourth down decisions

Fourth down decision-making in American football provides a straightforward example of a decision with probabilistic outcomes. On fourth down, the decision-maker (coach) has three options: 1) they can "go for it," meaning they can try to get a first down; 2) they can attempt a field goal; or 3) they can punt. In the first and second options, there is a stark risk/reward tradeoff. If they "go for it" and get the first down their subsequent position is much better; however, if they fail, the other team gains possession with fairly good field position. Similarly, if they attempt a field goal, a make corresponds to a more valuable subsequent state, but a miss puts them in a much worse subsequent state, since the opposing team gains possession from the spot of the kick. The third option, punting, doesn't yield any immediate potential for points, but it gives the other team poor field position when they take over possession. In order to analyze this decision, we must attach values to the potential game-states that can result from the three choices described above. In other words, we need an estimate for the value of each down/yard line combination (for scoring plays, the value is the points from the score minus the value of the other team's game-state after receiving the ball from the kickoff). Over the past five decades, many models have been proposed for estimating this value, spanning Carter & Machol (1971) to Chan, Fernandes & Puterman (2019). For most of these methods, the 'value' of a given down/yard line combination can be defined as the expectation over the distribution of eventual points a team could achieve or concede given possession of the ball in that initial state. Using publicly available NFL play-by-play data from 2009 to 2018, we estimated the game-state value of all down/yard line combinations following Chan, Fernandes & Puterman (2019). These value functions are plotted in Figure 4.2.

Figure 4.2: Estimated game-state value for each down/yard line combination. The x-axis shows the number of yards a team is from their own endzone and the y-axis denotes value. Each line plots the values for a different down as a function of yardline (red-1st down; green-2nd down; blue-3rd down; purple-4th down).

Given this context, consider the following situation. A team faces a 4th down from the 62 yard line (i.e. they have 38 yards to go for a touchdown) with 8 yards to go to the first down (i.e. s = (4th, 62 yard line, 8 to go)). The coach can "go for it", attempt a field goal, or punt. Using publicly available NFL play-by-play data, we can isolate every situation where this exact down/yard line/yards-to-go combination occurred from 2009 to 2018 and compute the empirical transition probability distribution of the next state, conditional on each of these three actions. Next, by attaching the value function in Figure 4.2 to the next state transition distribution, we can visualize the next state value distribution for each decision. These distributions are shown in Figure 4.3, smoothed via a kernel density estimator. In each panel, the overlaid colored lines denote different quantiles of each respective distribution (blue = 20th quantile, green = expected value, red = 80th quantile).
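The decision logic in this example reduces to the following: given a distribution of next-state values under each action, apply a risk criterion to each distribution and take the arg max over actions. The sketch below uses fabricated value samples (the bimodality is mimicked with simple Gaussian mixtures) purely to show the mechanics; the actual distributions in Figure 4.3 are estimated from the play-by-play data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

def two_mode(p_success, good, bad, sd=0.5):
    """Fabricated bimodal next-state values: a success mode and a failure mode."""
    success = rng.random(n) < p_success
    return np.where(success, rng.normal(good, sd, n), rng.normal(bad, sd, n))

next_state_values = {
    "go for it":  two_mode(p_success=0.30, good=3.5, bad=-1.8),
    "field goal": two_mode(p_success=0.55, good=1.6, bad=-2.0),
    "punt":       rng.normal(-0.6, 0.5, n),
}

criteria = {
    "risk averse (20th pct.)":  lambda v: np.quantile(v, 0.20),
    "risk neutral (mean)":      np.mean,
    "risk seeking (80th pct.)": lambda v: np.quantile(v, 0.80),
}

# Each criterion can point to a different optimal action for the same state.
for label, crit in criteria.items():
    scores = {action: crit(values) for action, values in next_state_values.items()}
    print(f"{label}: optimal action = {max(scores, key=scores.get)}")
```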


Figure 4.3: The next state value distribution for "go for it" (top), field goal attempt (middle), and punt (bottom) given initial state s = (4th, 62 yard line, 8 to go). In each panel, the x-axis shows possible values of the next state and the y-axis shows the likelihood of any given value. The colored lines denote different quantiles of each respective distribution (blue = 20th quantile, green = expected value, red = 80th quantile). Note that the value of a made field goal is not 3. This is because the team must immediately kick off to the opposing team, which diminishes the value of the made field goal.

Note that the "go for it" and field goal next state value distributions are bimodal, while the punt next state value distribution is unimodal. These features have intuitive explanations. The probability of making the first down with 8 yards to go is slim, hence the

upper mode of the "go for it" distribution is relatively small. There is a much higher chance that the team fails to make the first down, which corresponds to the higher density mode over negative values. A similar explanation can be made for the field goal distribution. As for punting, while this option guarantees the other team will gain possession, a punt in this state has a good chance of leaving the other team with poor field position. This is why the punt next state value distribution is slightly higher than the modes in the "go for it" and field goal distributions that correspond to turning over possession of the ball. As shown by the colored lines marking different functions of the distributions, the optimal decision in this state depends on the coach's risk preferences. If the coach bases their decision on the expectation of the next state value distribution, the optimal decision would be to attempt a field goal. This is because the green line (representing the mean) in the field goal distribution of next state value is higher than the green lines in the other two distributions. If the coach is facing this situation with their team down by 4 points late in the game, they likely would base their decision on an upper quantile of the next state distribution, knowing that the team needs a touchdown to win the game. Under this criterion, the optimal decision would be to go for it, as the red line representing the 80th quantile is highest for the "go for it" distribution in comparison to the red lines from the field goal and punt distributions. Finally, if the team is comfortably in the lead when the coach faces this situation, they might prefer a risk-averse strategy and base the decision on a low quantile of the next state distribution, in which case the optimal decision would be to punt (maximum of the three blue lines). As shown by this example, coaches routinely face situations where the optimal decision depends on their risk preferences. The inverse problem in this case would be to infer a coach's risk preferences based on their observed fourth down decisions. These examples set the stage for the final project in this thesis, which addresses the inverse problem of Bayesian optimization.

Chapter 5

Inverse Bayesian Optimization: Learning Human Acquisition Preferences in an Exploration vs. Exploitation Search Task

∗ We are preparing a version of Chapter 5 to submit to Bayesian Analysis. The paper is coauthored with Yohsuke Miyamoto, Luke Bornn, and Maurice Smith.

5.1 Introduction

The exploration vs. exploitation trade-off has been extensively studied across a multitude of disciplines (Berger-Tal et al. 2014). For decades, psychologists and social scientists have studied how humans balance exploration and exploitation when making decisions under uncertainty (March 1991, Cohen et al. 2007). Machine learning (ML) researchers have studied this topic as well; a number of recent papers investigate the correspondence between ML algorithms and actual human behavior for various decision-making processes (Wilson et al. 2015, Schulz et al. 2015, Plonsky et al. 2019, Gershman 2019). Within this context, Borji & Itti (2013) compare several ML techniques against observed human behavior in a set of sequential optimization tasks in one dimension. They find that Bayesian optimization algorithms best approximate human behavior in the experiments they conducted. A recent paper expands their work to 2D functions (Candelieri et al. 2020). Borji & Itti (2013) and Candelieri et al. (2020) can be viewed as solutions to the inverse problem of Bayesian optimization; given optimization sequences exhibited by an agent, they estimate the unknown acquisition function that generated the sequence. In more general terms, they estimate the agent's risk preferences based on their observed search paths. However, since neither of these papers approaches the problem in a probabilistic framework, they offer no uncertainty quantification in their predictions. Our work fills this gap by providing a probabilistic solution framework for the inverse problem.

5.1.1 Background

Bayesian optimization is a methodology for sequential optimization of a latent objective function with expensive trials (Jones et al. 1998, Shahriari et al. 2015).1 The core idea of the method is to characterize the uncertainty about the latent objective with a statistical model, then to strategically synthesize this uncertainty to determine promising new locations to sample (i.e. locations that have good chances of yielding values greater than the current maximum). The process of synthesizing the objective function uncertainty, as represented by the statistical model, is governed by the acquisition function. Conceptually, the acquisition function defines how the optimizer balances exploration and exploitation when searching for the maximum of the latent objective. Acquisition functions are typically constructed such that high values of the function correspond to potentially high values of the reward surface, either because the estimated mean is high (exploitation), the uncertainty is high (exploration), or both (Brochu et al. 2010).

5.1.2 Hotspot search task

We illustrate our methods by analyzing human decision-making behavior from an experiment designed to present subjects with a conflict between exploration and exploitation. In the experiment, subjects performed many rounds of a hotspot search task which was created in collaboration with the Neuro Motor Control lab at Harvard University. On each round, subjects searched for an invisible hotspot randomly placed on a 9-inch diameter circular task-region shown to them on a computer monitor. Each round consisted of 3 to 10 moves, where the number of moves was determined randomly according to a uniform distribution. A 'move' consists of sampling a location on the task region (i.e. clicking on it), after which a numerical score is immediately shown to the user proportional to the click location's proximity to the hotspot. On each round, the reward scale was randomized such that the maximum score at the hotspot varied uniformly between 0 and 1000. The task always began at the center of the task region and each subsequent move was constrained to be within a small circular region (0.2 inch radius) centered around the previous move. Across the 28 subjects who participated, the minimum number of rounds played was 228 and the maximum was 716. Figure 5.1 displays an example round of the experiment.

1Bayesian optimization is closely related to (and sometimes synonymous with) sequential design, active learning, and hyperparameter optimization, among other subfields of optimization.


Figure 5.1: An example round of the experiment. The red target shows the invisible hotspot location and the dots track the subject’s search path. The score at the hotspot is the score the subject would be given if they sampled the hotspot. Subjects always begin the search in the center of the click region (shown in blue) to minimize effects of the task-region borders guiding subject search behavior.

A second version of the experiment was conducted to study decision-making behavior in an environment more representative of real-life situations. When people make decisions, they usually don't have perfect knowledge of their environment nor of the consequences of their decisions, so they often must base their choices on an estimate of the underlying state. To simulate these features, we presented rewards to the user differently in this experiment. Rather than giving numerical feedback, scores were presented as clouds of dots, where higher density dot patterns represented higher scores. Due to this feedback mechanism, subjects had to estimate their score on each click rather than being given the exact value. Small changes in score were hard to detect and many subjects appeared to rely on heuristics when making decisions.

5.1.3 Identifying risk preferences

We are primarily interested in the second move of the task because this move provides the most information about a subject's risk preferences. The first move a subject makes is virtually

random since they have no information about the direction of the hotspot when they begin each round. After receiving feedback from their first move they have information about the objective function but only along a single direction.2 This poses a conflict between exploitation and exploration on the second move: continuing along the axis of their first move represents pure exploitation of their current knowledge, while any deviation from this axis represents some degree of exploration. Thus the combination of subjects' first and second moves provides direct insight into how people balance exploration and exploitation, as illustrated in Figure 5.2.3 After receiving feedback from move 2, subjects can approximate the objective function with high fidelity using the plane defined by moves 0, 1, and 2, hence the remaining moves provide little additional insight into subjects' risk preferences.


Figure 5.2: Left: initial state for an example round of the task. The starting score is 93. Moving in any direction represents exploration, as shown by the dashed green lines. Right: the subject moves to the edge of the move 1 click-boundary, receiving a score of 98. This creates a conflict between exploration and exploitation on move 2. Moving perpendicular to the direction of move 1 represents pure exploration (move 2a), while moving along the same direction of move 1 represents pure exploitation (move 2b). Any move between these extremes represents a combination of exploration and exploitation (move 2c).

The goal of our research is to characterize and estimate subjects’ latent risk preferences (i.e. how they balance exploration and exploitation) using their observed behavior in this

2If the reward gradient (i.e. a location’s sensitivity to hotspot distance) was identical from round to round, subjects could back solve the gradient information along the direction orthogonal to their first move despite only having two responses from the objective, thus giving them a near-complete characterization of the local reward surface on their second move. By randomizing the reward gradient on each round, we ensure that subjects have uncertainty about the gradient information orthogonal to their first move.

3This argument implicitly assumes that we can linearize the response surface in the local region defined by the move 1 boundary. We justify this assumption in Section 2.

sequential optimization task. As we more fully detail in Section 5.2, we characterize subject risk preferences via the Bayesian optimization paradigm. Within this framework, risk preferences are defined by the acquisition function. In Section 5.3, we introduce a probabilistic framework to estimate each subject's latent acquisition function based on their observed search paths. Figure 5.3 shows three subjects' data for move 2, the critical move for differentiating between acquisition preferences. Subject 17 prefers to exploit the information gained on move 1, while Subject 19 prefers an exploration heavy strategy. Subject 10 appears to have different preferences based on whether their first move produces a negative or positive change in score.


Figure 5.3: Move 2 behavior for three subjects in the experiment (Top: Subject 10; Middle: Subject 17; Bottom: Subject 19). The dots show the locations of the subjects' 2nd moves relative to their first moves. For visualization purposes, we display the first move along the horizontal axis. The color gradient denotes the degree of exploration/exploitation. Pure exploration (green) corresponds to moving directly perpendicular on move 2. Pure exploitation (blue) corresponds to moving along the horizontal axis on move 2 (i.e. the same direction as move 1). The left-hand plots show the subjects' moves after receiving a negative change in score on their first move, while the right-hand plots are conditional on a positive change in score from move 0 to move 1.

As suggested by the subject data shown in Figure 5.3, we find that subjects exhibit a wide range of acquisition preferences. Interestingly, for many subjects, none of the candidate

acquisition functions we consider provide a satisfactory model of their behavior. Guided by the model discrepancies for these subjects, we propose an augmentation to the acquisition functions to construct a more flexible model of human optimization behavior in this task.

5.2 Bayesian optimization

Two components make up the Bayesian optimization paradigm: 1) a surrogate function by which we approximate (and update, as more information is obtained) the latent objective, and 2) an acquisition function which defines how the uncertainty about the objective—as characterized by the surrogate—is synthesized when selecting a new location to sample. In this section we detail these steps in context of the search task.

5.2.1 Choosing a surrogate function

For a given round of the task, the reward $r_t$ received on move $t$ is a deterministic function of the distance $d_t$ from the click location to the hotspot:

$$r_t = f(d_t) = r_0 + k(d_0 - d_t). \tag{5.1}$$

In (5.1), r0 denotes the initial score, dt denotes the Euclidean distance (in pixels) of the newly sampled location to the hotspot, d0 denotes the distance of the starting location to the hotspot, and k denotes the reward gradient (i.e. the temperature sensitivity to hotspot distance). In order to prevent the player from gaining information about the objective function along the orthogonal direction of the first move, k is randomly generated for each new round of the task according to a Uniform[0, 5/3] points/pixel distribution. This ensures subjects are always faced with a tradeoff between exploration and exploitation on their second move. We assume subjects intuitively utilize a surrogate model of the objective function as they perform this task. Since the variables that actually govern the reward on each move

(i.e. k and dt) are not known to the subject, we model the surrogate as a function of a move’s (x, y) coordinates:

$$r_t = \hat f_t(m_t), \tag{5.2}$$

where $\hat f_t$ denotes the surrogate function at move $t$ and $m_t = (x_t, y_t)^\intercal$. A 'surrogate' is simply a statistical model; however, the term implies an emphasis on pragmatism and prediction rather than interpretability and identification (Gramacy 2020). In most Bayesian optimization applications $\hat f$ is assumed to be a Gaussian process. This is sensible when the latent objective is assumed to be non-linear and stationary. Our case is different. Subjects know

that there is an optimum somewhere on the circular task-region and that the surface decreases uniformly and monotonically from this hotspot with distance. Figure 5.4 shows the objective function f for an example round of the task. As illustrated in the left panel, globally the objective function is shaped like a cone. However, the experiments were designed to yield a surface that is approximately linear in the local region around each individual move, as evident in the right panel of the figure.


Figure 5.4: Left: the global objective function for an example round of the game. Right: the objective function in the local region in the click-region of the first move (i.e. zoomed in to the green circle in the left plot).

Since the response surface is approximately linear in the neighborhood of the click-region for each move, we assume that the subjects use a linear model as a surrogate of the response surface:

$$\hat f_t(m_t) = r_0 + m_t^\intercal \beta + \epsilon_t \tag{5.3}$$
$$\epsilon_t \sim \mathcal{N}(0, \sigma_s^2), \tag{5.4}$$

where $\beta = (\beta_x, \beta_y)^\intercal$ represents the reward gradient with respect to the Cartesian plane and $\epsilon_t$ represents deviations from the surrogate to the true objective. We fix $\sigma_s$ at a tiny value

($\sigma_s = 0.01$) since the deviations from the linear model to the objective are negligible in the local region about $m_t$.4 Since the hotspot is randomly placed over the task-region, we

4In order to have an analytic update for the surrogate after each move of a round, we use a Gaussian distribution to model the error term ϵt. While this greatly speeds up the inference when solving the inverse problem, it is actually a misspecification: the errors are not Gaussian. Conditional on feedback from the first move, the model given by (5.3)-(5.4) is effectively a tangent plane of the underlying conical objective, hence the ϵt are almost exclusively non-negative. The mean of ϵt across all rounds and all subjects is 0.006.

assume an isotropic, mean-zero Gaussian prior for β:

$$\beta \sim \mathcal{N}_2\!\left(\mu_0 = \begin{pmatrix}0\\0\end{pmatrix},\; \Sigma_0 = \sigma_\beta^2 I\right), \tag{5.5}$$

where $\sigma_\beta^2$ is a hyperparameter that we select to give a weakly-informative prior on the mean. The approximate linearity of the objective function in the neighborhood around each move is an important feature of the experiment, as it is what allows us to assume that exploration and exploitation are orthogonal on move 2. We note, however, that the fidelity of the linear approximation declines as the distance to the hotspot decreases. Fortunately, because each move was constrained to be within a small circular region about the current location, and only a small number of moves was allowed on each round, it was actually impossible for subjects to reach the hotspot in over 90% of all rounds. We omit the rounds where the hotspot was reachable in our analysis.

5.2.2 Updating the surrogate via Bayesian inference

As subjects sample new locations and receive additional feedback, we update the surrogate to reflect the additional knowledge they have about the objective function. We denote observed search paths as D0:t = {M0:t, r0:t}, where M0:t is a (t + 1) × 2 matrix with row t equal to mt, and r0:t is the column vector of observed rewards on moves 0 through t. Given

D0:t, we can update (5.3), yielding a posterior distribution on β. The prior is conjugate and the posterior can be computed analytically (see Banerjee (2008) for details). The posterior distribution of β is given by

$$\beta \mid \mathcal{D}_{1:t} \sim \mathcal{N}_2(\mu, \Sigma) \tag{5.6}$$
$$\Sigma = \left(\frac{1}{\sigma_s^2} M^\intercal M + \Sigma_0^{-1}\right)^{-1} \tag{5.7}$$
$$\mu = \Sigma\left(\frac{1}{\sigma_s^2} M^\intercal (r - r_0) + \Sigma_0^{-1}\mu_0\right). \tag{5.8}$$

The posterior predictive distribution of the surrogate is given by

$$r_{t+1} \mid \mathcal{D}_{0:t},\, m_{t+1} \sim \mathcal{N}\!\left(m_{t+1}^\intercal \mu + r_0,\;\; m_{t+1}^\intercal \Sigma\, m_{t+1} + \sigma_s^2\right), \tag{5.9}$$

where $r_{t+1}$ is the reward on the next move, $m_{t+1}$ is the corresponding sampled location, and all other terms are as defined previously. Figure 5.5 shows an example of the posterior predictive distribution when the starting score is 93 and the first move yields a reward of 98. The left and right plots show the posterior predictive mean and standard deviation surfaces respectively for move 2.
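The update in (5.6)-(5.9) is a standard conjugate Bayesian linear regression, so it amounts to a few lines of numpy. The sketch below uses the illustrative starting score of 93 and a first move of 20 pixels scoring 98, with an assumed prior scale of 1 point per pixel; it is not the analysis code used for the experiment.

```python
import numpy as np

sigma_s = 0.01      # surrogate error sd, eq. (5.4)
sigma_beta = 1.0    # assumed prior sd on the gradient, eq. (5.5)
mu0 = np.zeros(2)
Sigma0 = sigma_beta**2 * np.eye(2)

r0 = 93.0                       # starting score
M = np.array([[20.0, 0.0]])     # move 1 location relative to move 0 (pixels)
r = np.array([98.0])            # reward observed on move 1

# Posterior over the gradient beta, eqs. (5.6)-(5.8).
Sigma = np.linalg.inv(M.T @ M / sigma_s**2 + np.linalg.inv(Sigma0))
mu = Sigma @ (M.T @ (r - r0) / sigma_s**2 + np.linalg.inv(Sigma0) @ mu0)

# Posterior predictive mean and sd for a candidate move 2 location, eq. (5.9).
def predictive(m):
    return m @ mu + r0, np.sqrt(m @ Sigma @ m + sigma_s**2)

print(mu)                                  # ~[0.25, 0]: gradient learned along x only
print(predictive(np.array([40.0, 0.0])))   # exploitation: high mean, tiny sd
print(predictive(np.array([20.0, 20.0])))  # exploration: mean ~98, sd ~20
```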


Figure 5.5: Example posterior predictive mean (left) and standard deviation surfaces (right) for move 2, given an initial score of 93, a move 1 score of 98, and σs = 0.01, projected onto a fine grid over the click-region.

Note that while the subject learns about the reward gradient along the x-axis, the gradient along the y-axis is not informed by move 1. For this reason, the subject is faced with a decision on their second move to either explore the gradient along the axis perpendicular to their first move (y-axis) or exploit the gradient information they’ve inferred along the axis of their first move (x-axis).

5.2.3 Choosing an acquisition function

After characterizing the uncertainty about the objective function via the surrogate, we synthesize the uncertainty over potential new locations to sample through the acquisition function. The acquisition function u(·) is a function of a proposed location m and the surrogate $\hat f$ (which itself is a function of the locations and responses already observed, $\mathcal{D}_{0:t}$). Acquisition functions are typically constructed such that high values of the function correspond to potentially high values of the objective, either because the predicted mean is high, the uncertainty is high, or both (Brochu et al. 2010). The experimenter maximizes this function over $\mathcal{M}_{t+1}$, the space of potential locations for their next move, to obtain a new location to sample:

$$m_{t+1} = \underset{m \in \mathcal{M}_{t+1}}{\arg\max}\; u(m \mid \mathcal{D}_{1:t}, \hat f_t), \tag{5.10}$$

where $\hat f_t$ represents the updated surrogate on move t. After the objective function is sampled at the resulting location, the surrogate is updated and the process is repeated until some stopping criterion is reached.

A host of acquisition functions have been proposed in the literature. While any of these functions are viable candidates for the methods we introduce in this paper, we restrict our analysis to probability of improvement (PI), expected improvement (EI), and an upper confidence bound (UCB) criterion. We restrict our analysis to these functions since they have analytic solutions under Gaussian models (Jones et al. 1998) and since they are well-known among Bayesian optimization practitioners. Mathematically, these are defined by

$$u_{\text{PI}}(m \mid \mathcal{D}_{1:t}, \hat f_t, \xi_{\text{PI}}) = P\!\left(\hat f_t(m) \ge f(m^+) + \xi_{\text{PI}}\right) \tag{5.11}$$

$$u_{\text{EI}}(m \mid \mathcal{D}_{1:t}, \hat f_t, \xi_{\text{EI}}) = \mathbb{E}\!\left[\max\!\left(0,\, \hat f_t(m) - (f(m^+) + \xi_{\text{EI}})\right)\right] \tag{5.12}$$

$$u_{\text{UCB}}(m \mid \mathcal{D}_{1:t}, \hat f_t, p) = \inf\left\{m : p \le \hat F_t(m)\right\}, \quad \text{where } p > 0.5, \tag{5.13}$$

where $m^+$ is the best location $m_t \in M_{0:t}$ observed so far across the existing set of samples (i.e. $m^+ = \arg\max_{m_t \in M_{0:t}} \hat f(m_t)$) and $\hat F_t$ is the cumulative distribution function of $\hat f_t$. Each of the acquisition functions in (5.11)-(5.13) depends on an additional parameter which controls the premium on exploration. Acquiring via PI (Kushner 1964) results in the location that most confidently predicts an increase in the response but ignores improvements less than $\xi_{\text{PI}}$. EI (Mockus et al. 1978) considers the magnitude of improvement at a particular location, where $\xi_{\text{EI}}$ controls a trade-off between exploration (high $\xi_{\text{EI}}$) and exploitation (low $\xi_{\text{EI}}$) (Lizotte 2008). Intuitively, one can think of EI as related to PI in that the function weighs a location's probability of improvement by the amount of improvement associated with the location. Thus, compared to the PI acquisition function, EI is more likely to select a location associated with a low probability of improvement if the low probability is offset by the location's potential for high improvement. For UCB acquisition functions (Cox & John 1992, Srinivas et al. 2009), users select new locations to sample based on optimistic estimates of the resulting reward at each location. Higher values of p encourage more exploration.5 Figure 5.6 shows the acquisition surfaces for the PI, EI, and UCB functions defined in (5.11)-(5.13) in context of the surrogate illustrated in Figure 5.5. The arg max(s) of each surface is denoted by a green star.
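Because the surrogate's predictive distribution (5.9) is Gaussian, (5.11)-(5.13) reduce to closed forms in the predictive mean and standard deviation. The sketch below evaluates them at two stylized move 2 candidates; the numbers are loosely inspired by Figures 5.5-5.6 rather than computed from them, but they reproduce the qualitative behavior seen there: PI favors exploitation while EI and UCB favor exploration.

```python
import numpy as np
from scipy.stats import norm

def pi(mu, sd, f_best, xi):
    """Probability of improvement, eq. (5.11), for a Gaussian predictive."""
    return norm.cdf((mu - (f_best + xi)) / sd)

def ei(mu, sd, f_best, xi):
    """Expected improvement, eq. (5.12), for a Gaussian predictive."""
    z = (mu - (f_best + xi)) / sd
    return (mu - (f_best + xi)) * norm.cdf(z) + sd * norm.pdf(z)

def ucb(mu, sd, p):
    """Upper confidence bound, eq. (5.13): the p-quantile of the predictive."""
    return mu + sd * norm.ppf(p)

# Two stylized move 2 candidates: exploit along move 1 (high mean, almost no
# uncertainty) vs. explore perpendicular to it (mean near the last score, large sd).
exploit = dict(mu=103.0, sd=0.02)
explore = dict(mu=98.0, sd=20.0)
f_best = 98.0

print("PI :", pi(**exploit, f_best=f_best, xi=1.0), pi(**explore, f_best=f_best, xi=1.0))
print("EI :", ei(**exploit, f_best=f_best, xi=1.0), ei(**explore, f_best=f_best, xi=1.0))
print("UCB:", ucb(**exploit, p=0.95), ucb(**explore, p=0.95))
```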

5The UCB acquisition function is usually defined in terms of a mean function and covariance function since the surrogate is a Gaussian process in most Bayesian optimization applications. We define it via the quantile function in order to have an upper and lower bound on the parameter p, which simplifies the methodology in Sections 3 and 4.


Figure 5.6: Acquisition surfaces for the PI, EI, and UCB functions defined in (5.11)-(5.13) given r0 = 93 and r1 = 98. Each surface is on a different scale: the PI surface (left) is on the probability scale, EI (middle) is in terms of points over 99 (since ξEI = 1), and UCB (right) is in terms of the 95th percentile of the surrogate. The click-region boundary is shown by the black circle encompassing the colored surfaces. The arg max(s) of each surface is denoted by a green star.

Notice that each method prescribes different move 2 locations (i.e. green stars); PI recommends pure exploitation, while EI and UCB recommend almost pure exploration. Of course, this is not always the case. Depending on the change in score from move 0 to move 1, in addition to the exploration parameter values (ξPI, ξEI, and p), the optimal location to sample can vary substantially. Also note that each surface is symmetric across the exploitation axis. This is a feature of the subject having gained information exclusively about a single direction after the first move. Due to this phenomenon, the surfaces may be bimodal (EI and UCB) or unimodal (PI). Because we use a linear model as the surrogate for the response surface, the optimal location to sample on move 2 (i.e. the arg max as determined by (5.10)) always falls on the click-region boundary regardless of the change in score on move 1. This harmonizes with the subjects’ actual behavior; in 95% of the rounds, subjects moved to the click-region boundary on move 2.6 This effectively allows us to reduce the arg max search from two dimensions to one dimension. Instead of searching over the (x, y) Cartesian plane, by leveraging polar coordinates we can fix the radius of the optimal move 2 location at the click-region boundary and limit the arg max search to the single dimension of the angle of move 2 (relative to the direction of move 1):

$$\underset{m \in \mathcal{M}_2}{\arg\max}\; u(m \mid \Delta r_1, m_1, \hat f_1) \;\longrightarrow\; \underset{\theta \in [-\pi, \pi]}{\arg\max}\; u(\theta \mid \Delta r_1, \theta_1, \hat f_1), \tag{5.14}$$

6The search task was programmed to snap the cursor to the edge of the boundary if the mouse exceeded it, which made it easy for subjects to make moves along the boundary.

where $\mathcal{M}_2$ is the constrained click-region for move 2 and ∆r1 = r1 − r0. By reducing the dimensionality of the optimization we can illustrate acquisition values over the entire range of ∆r1, as shown in Figure 5.7. The three plots show acquisition values over a fine grid of (∆r1, θ2) pairs for three different acquisition functions (left: PI with ξPI = 1; middle: EI with ξEI = 1; right: UCB with p = .95). Note that the move 2 locations are assumed to be made on the click-region boundary. In each panel, the pink acquisition curves denote the angle that yields the maximum of the acquisition values as a function of ∆r1. Depending on the acquisition function, the prescribed move 2 angles (i.e. acquisition curves) can be quite different.
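Computationally, an acquisition curve is a one-dimensional arg max over θ repeated across a grid of ∆r1 values. The sketch below does this for the UCB criterion using the conjugate update from Section 5.2.2, with an assumed click-region radius and prior scale; the same loop applies to PI and EI.

```python
import numpy as np
from scipy.stats import norm

RADIUS = 20.0            # assumed click-region radius in pixels (illustrative)
sigma_s, sigma_beta = 0.01, 1.0
r0 = 93.0

def ucb_curve(delta_r1_grid, p=0.95, n_theta=361):
    """For each change in score on move 1, return the move 2 angle (relative to
    the move 1 direction) that maximizes the UCB criterion on the boundary."""
    thetas = np.linspace(-np.pi, np.pi, n_theta)
    curve = []
    for dr in delta_r1_grid:
        # Conjugate update (5.6)-(5.8) with the single observation m1 = (RADIUS, 0).
        M = np.array([[RADIUS, 0.0]])
        Sigma = np.linalg.inv(M.T @ M / sigma_s**2 + np.eye(2) / sigma_beta**2)
        mu = Sigma @ (M.T @ np.array([dr]) / sigma_s**2)
        # Candidate move 2 locations on the click-region boundary around m1.
        m2 = np.column_stack([RADIUS + RADIUS * np.cos(thetas),
                              RADIUS * np.sin(thetas)])
        mean = m2 @ mu + r0
        sd = np.sqrt(np.einsum("ij,jk,ik->i", m2, Sigma, m2) + sigma_s**2)
        curve.append(thetas[np.argmax(mean + sd * norm.ppf(p))])   # eq. (5.13)
    return np.array(curve)

print(ucb_curve(np.array([-10.0, 0.0, 10.0])))  # prescribed angles lean toward exploration
```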


Figure 5.7: Acquisition surfaces over the range of possible ∆r1 values for the PI, EI, and

UCB acquisition functions. The horizontal axes denote the change in reward from m0 to m1. The vertical axes show the angle of the second move relative to the first move. Move 2 is assumed to be made on the click-region boundary. Color indicates the acquisition function value for any given (∆r1, θ2) pair. High acquisition values are red while low values are shown in blue. In each panel, the pink curve denotes the angle that yields the maximum of the acquisition surface as a function of ∆r1. The vertical black lines at ∆r1 = 5 correspond to the circular black lines denoting the click-region boundaries in Figure 5.6. Similarly, the green stars correspond to the stars in Figure 5.6.

Our goal is not to find an optimal strategy, or acquisition function, for this task. Rather, assuming that subjects use Bayesian optimization intuitively as an optimization framework (Borji & Itti 2013), we want to estimate the latent acquisition functions employed by the subjects in the experiment. The following section formally introduces this problem both in general terms and in context of the search task.

5.3 Inverse Bayesian optimization

Each sequence of clicks in a given round of the task can be viewed as a subject's noisy optimization routine generated according to their latent acquisition function. By 'noisy', we mean that subjects optimize according to a distinct strategy but that they do so imperfectly.

Noise can also arise from other factors, such as variation in the subject's ability to click exactly where they intend to, and sloppiness from performing the task rapidly. This leads to move-to-move variability around the target mean implied by a subject's acquisition preferences. Our goal is to estimate each subject's latent acquisition function in a probabilistic framework. As such, the problem becomes one of inference rather than optimization. The optimization routine has already occurred; we want to infer how each subject optimized. We term this problem 'inverse Bayesian optimization' (IBO). Figure 5.8 illustrates the general idea behind IBO in context of the search task. The scatter plots in (a) show Subject 57's and Subject 60's click behavior on move 2 conditional on feedback from move 1. Each dot represents a separate round of the task. The plots in (b) show various acquisition curves from the three acquisition families described in Section 5.2 (PI, EI, and UCB). IBO can be viewed as finding the curves from (b) that best fit the subjects' data in (a).


Figure 5.8: (a): The left and right plots show the move 2 behavior for subjects 57 and 60 respectively. In each scatterplot, dots represent the subject’s move 2 angles given the changes in score on move 1 for each round of the task. The horizontal axis denotes ∆r1, the change in reward from the starting spot to the first sampled location, and the vertical axis shows θ2, the angle of the second move relative to the first move. (b): Each plot shows four candidate acquisition curves under various exploration parameter values for the corresponding family—PI (green), EI (blue), and UCB (red).

In this section we propose a general framework for solving the inverse Bayesian optimization problem. This can be approached from two different perspectives. In Section 5.3.1, we consider the problem under perfect acquisition and in Section 5.3.2 we consider the problem under imperfect acquisition, which is more relevant to our experiments. In Sections 5.3.3 and 5.3.4, we show how to apply the framework in context of the search task.

87 5.3.1 IBO under perfect acquisition

The first setting we consider is when the optimization is carried out perfectly, or when the objective function is sampled precisely according to a predetermined acquisition function u∗. The goal is to determine the function u∗ that exactly yields the observed trajectory

$\mathcal{D}_{0:T}$. A few conditions must exist in order to guarantee this outcome. First, we must know the surrogate function $\hat f_0$ used to initialize the optimization. Second, the set of candidate acquisition functions $\mathcal{U}$ we consider in the potential solution space must contain the true acquisition function $u^*$ employed in the optimization. Under these conditions, we can solve the inverse problem by sequentially updating the surrogate at each step t, then pruning the set of candidate acquisition functions $\mathcal{U}$ to those that yield $x_{t+1}$, until we reach the final step T of the observed search path. The result is $\mathcal{U}_{T-1}$, which ideally is a singleton set containing $u^*$, the true acquisition function employed by the agent. Algorithm 3 details this process in mathematical terms.

Algorithm 3: Inverse Bayesian Optimization under Perfect Acquisition

Input: $\mathcal{D}_{0:T}$, $\hat f_0$, $\mathcal{U}$
Output: $\mathcal{U}_{T-1} \subset \mathcal{U}$

1. $\mathcal{U}_0 \leftarrow \mathcal{U}$
2. for $t = 1, \ldots, T - 1$ do
3.   $\hat f_t = \hat f_{t-1} \mid \mathcal{D}_{0:t}$
4.   $\mathcal{U}_t = \{u \in \mathcal{U}_{t-1} \text{ s.t. } x_{t+1} = \arg\max_x u(x \mid \mathcal{D}_{0:t}, \hat f_t)\}$
5. end
6. return $\mathcal{U}_{T-1}$, where $u^* \in \mathcal{U}_{T-1}$
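In code, Algorithm 3 is a simple pruning loop. The sketch below assumes hypothetical interfaces (a `surrogate.update` method and a `u.argmax` method on each candidate acquisition function); it is meant only to show the structure of the procedure, not a usable implementation.

```python
def inverse_bo_perfect(search_path, surrogate, candidates):
    """Sketch of Algorithm 3.  `search_path` is a list of (location, reward)
    pairs D_{0:T}; `surrogate.update(x, r)` returns the surrogate conditioned
    on the new observation; `u.argmax(surrogate)` returns the location that
    acquisition function u would sample next.  All three are hypothetical."""
    consistent = set(candidates)                      # U_0 <- U
    for t in range(1, len(search_path) - 1):
        x_t, r_t = search_path[t]
        surrogate = surrogate.update(x_t, r_t)        # f_hat_t = f_hat_{t-1} | D_{0:t}
        x_next = search_path[t + 1][0]                # observed next move x_{t+1}
        consistent = {u for u in consistent
                      if u.argmax(surrogate) == x_next}
    return consistent                                 # ideally the singleton {u*}
```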

5.3.2 IBO under imperfect acquisition

While the procedure outlined above will yield the exact acquisition function employed by an agent (or the set of acquisition functions all producing the same search path), the required conditions are not likely to exist in many practical applications. IBO may be more useful as a way to characterize an agent's risk preferences when they are not only unknown, but perhaps can't even be explicitly defined. Consider the search task example. Subjects were not asked to define their strategy or acquisition preferences beforehand; they were simply instructed to get as close to the hotspot as possible. However, doing this implicitly required them to synthesize uncertainty on each round. In these types of settings, when people are intuitively balancing exploration and exploitation in a sequential optimization but the preferences governing their decisions exist not as words and equations, but as subconscious impulses and values, we propose IBO as a flexible and practical way to characterize these latent values.

Under scenarios of this nature, the optimization will likely be imperfectly performed, particularly if the objective is defined over a continuous space. In other words, while an acquisition function u∗ may underlie the optimization, the sampling is imperfectly carried out, making it impossible to reverse engineer the acquisition function using methods akin to Algorithm 3. We propose a probabilistic framework for IBO under this type of setting as laid out in the steps below. We will assume that the only available input is the observed data $\mathcal{D}_{0:T}$.

Probabilistic Inverse Bayesian Optimization

1. Choose a surrogate $\hat f$ to model the optimizer's beliefs about the objective.

2. Select a set of candidate acquisition functions $\mathcal{U}$ as potential characterizations of the optimizer's latent risk preferences.

3. For a given step t of the observed search path, assume a likelihood for $(x_{t+1} \mid \mathcal{D}_{0:t}, u^*)$ parameterized such that the mode of the distribution equals the arg max of the latent acquisition function $u^*$.

4. Select a prior distribution over $\mathcal{U}$.

5. For each step $t \in \{1, \ldots, T - 1\}$ in the optimization, define the posterior update of the surrogate as

$$\hat f_t = \hat f_{t-1} \mid \mathcal{D}_{0:t}. \tag{5.15}$$

6. At step t, the posterior probability of any given candidate acquisition function $u \in \mathcal{U}$ is

$$P(u \mid x_{t+1}, \mathcal{D}_{0:t}, \hat f_t) \propto P\!\left(x_{t+1} \,\Big|\, \underset{x}{\arg\max}\; u(x \mid \mathcal{D}_{0:t}, \hat f_t)\right) P(u). \tag{5.16}$$

We employ a Bayesian framework in our approach to IBO; however, the procedure could also be carried out using other estimators (e.g. maximum likelihood).

5.3.3 Search task implementation

We will now explain how we implement each of these steps in context of the hotspot search task. In order to estimate each subject's acquisition function, we follow the steps listed above, but we consider only the first and second moves of each round. For step 1, we model the subject's beliefs about the latent objective with the linear model defined in (5.3)-(5.4). For step 2, we use the PI, EI, and UCB acquisition families defined in (5.11)-(5.13) as the set of candidate acquisition functions $\mathcal{U}$. Note that our method does not preclude using other acquisition families; any acquisition function can be included in the candidate set $\mathcal{U}$. For step 3, we select a likelihood for $(\theta_2 \mid \Delta r_1, u^*, \hat f_1)$, the subject's move 2 angles conditional on 1) the change in score from Move 0 to Move 1, 2) their beliefs and uncertainty

about the objective function, and 3) their risk preferences. We require the likelihood to be parameterized such that the mode equals the arg max of the acquisition function (i.e. the mode must be an acquisition curve as in Figure 5.8b). In order to model the symmetric bimodal nature of these data, we assume θ2 follows a reflected wrapped normal distribution:

$$g(\theta_2 \mid \Delta r_1, u, \hat f_1, \sigma_u) = \frac{1}{2}\, h(\theta_2 \mid \Delta r_1, u, \hat f_1, \sigma_u) + \frac{1}{2}\, h(-\theta_2 \mid \Delta r_1, u, \hat f_1, \sigma_u), \tag{5.17}$$

$$h(\theta_2 \mid \Delta r_1, u, \hat f_1, \sigma_u) = \frac{1}{\sigma_u \sqrt{2\pi}} \sum_{k=-\infty}^{\infty} \exp\!\left[ \frac{-(\theta_2 - \theta_2^* + 2\pi k)^2}{2\sigma_u^2} \right] \tag{5.18}$$

$$\theta_2^* = \begin{cases} \underset{\theta \in (-\pi, 0]}{\arg\max}\; u(\theta \mid \Delta r_1, \hat f_1) & \text{if } \theta_2 \in (-\pi, 0] \\[4pt] \underset{\theta \in (0, \pi]}{\arg\max}\; u(\theta \mid \Delta r_1, \hat f_1) & \text{if } \theta_2 \in (0, \pi]. \end{cases} \tag{5.19}$$

Conceptually, (5.17) alters the standard wrapped normal distribution (Breitenberger 1963) by reflecting the distribution over the x-axis. Eq. (5.18) is the probability density function of the wrapped normal distribution, where σu is a measure of a subject's variation about their target acquisition curve, which itself is denoted by $\theta_2^*$. Eq. (5.19) ensures that the mode of the likelihood is given by the arg max of the latent acquisition function u∗. Together, (5.17)-(5.19) yield a likelihood that adheres to the defining features of our data; the likelihood is not only defined on the proper support for θ2 (i.e. (−π, π]), it is also guaranteed to be symmetric over the axis of exploitation (i.e. θ2 = 0). We utilize the circular R package (Agostinelli & Lund 2013) to compute (5.17)-(5.19).

Next, for step 4 we assume that each acquisition function u ∈ U has equal prior probability, thus we assume a uniform prior over the parameters indexing the acquisition function space: ξPI, ξEI, p. We also put a prior distribution over σu, the standard deviation of a subject's noisy optimization behavior. See Section A.1 in the appendix for details on prior specification for each parameter. For steps 5 and 6, we update the surrogate for t = 1, then compute (5.16) for each u ∈ U by multiplying its prior probability by the likelihood. Note that the existence of the arg max in the parameterization of (5.17)-(5.19) creates an additional optimization step when evaluating (5.16). To ease the computational burden posed by this additional optimization step, we precompute a fine grid of acquisition curves over a plausible range of values for each candidate family and restrict our inference to this discrete set.
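For illustration, the reflected wrapped normal density in (5.17)-(5.18) can also be computed directly by truncating the wrapping sum; the numpy sketch below does so (the thesis itself uses the circular R package). The argument theta_star is the acquisition-curve arg max from (5.19), assumed here to be supplied.

```python
import numpy as np

def wrapped_normal_pdf(theta, mode, sd, n_wrap=10):
    """Wrapped normal density on (-pi, pi], eq. (5.18), truncating the sum."""
    k = np.arange(-n_wrap, n_wrap + 1)
    devs = theta - mode + 2.0 * np.pi * k
    return np.sum(np.exp(-devs**2 / (2.0 * sd**2))) / (sd * np.sqrt(2.0 * np.pi))

def reflected_wrapped_normal_pdf(theta2, theta_star, sd):
    """Reflected wrapped normal, eq. (5.17): a symmetric mixture whose two
    modes sit at +/- theta_star, the acquisition-curve angle from (5.19)."""
    return 0.5 * (wrapped_normal_pdf(theta2, theta_star, sd)
                  + wrapped_normal_pdf(-theta2, theta_star, sd))

# Density of an observed move 2 angle when the acquisition curve prescribes a
# 60-degree move and the subject's noise sd is 0.4 radians.
print(reflected_wrapped_normal_pdf(np.pi / 3, theta_star=np.pi / 3, sd=0.4))
```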

Figure 5.9 shows the (∆r1, θ2) pairs for subjects 46, 54, and 57 overlaid with their corresponding maximum a posteriori (MAP) acquisition curves in color. The out-of-sample log likelihoods (computed using an 80%/20% training/test split of the data) are shown in the lower right hand corner of each panel.


Figure 5.9: (∆r1, θ2) pairs for subjects 46 (left), 54 (middle), and 57 (right). The horizontal axis shows the change in score from move 0 to move 1 while the vertical axis shows the angle of their second move relative to their first move. The subtitle of the plot lists the acquisition family and value of the exploration parameter within that family for the subject's MAP acquisition function. Each scatterplot is overlaid with the subject's corresponding MAP acquisition curve in color. Green curves denote PI, blue denotes EI (not shown), and red denotes UCB. Around each curve is the 95% highest density posterior prediction interval in light gray. The out-of-sample log likelihood of the fit is shown for each subject in the lower right corner of the plot.

While the MAP acquisition curve for subject 57 fits their data well, the best-fit curves for subjects 46 and 54 have significant biases. In the case of subject 46, there appear to be no candidates in U that prescribe such a drastic change in strategy from negative ∆r1 to positive ∆r1 values while maintaining highly exploratory preferences. The exploration tendencies of subject 54 are even more extreme: no candidate acquisition curve comes close to their exhibited mean trend. In both of these cases, the best-fit curve essentially compensates for the lack of fit to the mean by inflating the variance. If the acquisition curves fit the trends in the data better, the prediction intervals would not be as wide. In the next section we propose an augmentation to the candidate acquisition functions guided by these discrepancies.

5.3.4 Incorporating human tendencies

Many subjects in our experiment exhibit exploration tendencies beyond what the PI, EI, and UCB acquisition functions can adequately capture, particularly for highly negative and highly positive values of ∆r1. In order to account for this tendency, we propose augmenting the acquisition functions with an additional exploration threshold parameter τ. For a given acquisition function u, we define the augmented acquisition function

$\tilde{u}$ as

$$
\tilde{u}(\theta_2 \mid \Delta r_1, \hat{f}_1, \tau) =
\begin{cases}
u & \text{if } \left(\tfrac{\pi}{2} - \tau\right) < |\theta_2| < \left(\tfrac{\pi}{2} + \tau\right) \\[4pt]
-\infty & \text{otherwise,}
\end{cases}
\qquad (5.20)
$$

where τ ∈ [0, π/2]. Eq. (5.20) defines the acquisition value to be infinitely negative for all move 2 angles that are not at least within τ radians of π/2, the angle which represents maximal exploration, while leaving the acquisition value unchanged if θ2 is sufficiently exploratory. The practical effect of (5.20) is that it allows the optimization criterion to be based solely on exploration. While the exploration parameters in the PI, EI, and UCB acquisition families effectively allow the optimizer to specify different weights on exploration vs. exploitation when synthesizing the uncertainty in f̂, they do not enable the optimizer to let one preference completely dominate the other. Eq. (5.20) allows exploration to trump exploitation, regardless of f̂. This type of behavior is exhibited by many of the subjects in our study.

Another feature we desire to account for is the human tendency to react differently to positive versus negative feedback (Tversky & Kahneman 1979). Many subjects appear to exhibit different acquisition preferences depending on whether they get a negative or positive change in score on their first move. We therefore fit separate curves for positive and negative values of ∆r1.

We fit this piecewise augmented acquisition model to each subject's data following the same procedure as outlined previously, only here we estimate two separate values for each parameter: one for negative ∆r1 and one for positive ∆r1. We differentiate between negative- and positive-dependent parameters by a superscript, e.g. τ⁻ and τ⁺. As with the other exploration parameters, we set a uniform prior on τ over its support. The MAP augmented piecewise curves for the same subjects as shown previously (46, 54, and 57) are displayed in Figure 5.10. As indicated by the out-of-sample log likelihood values in comparison to the corresponding values in Figure 5.9, the piecewise augmented model provides a superior fit in each case.
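The thresholding in (5.20) is straightforward to apply to a discretized acquisition curve. The following sketch is illustrative only; acq_vals and the grid of angles are hypothetical placeholders standing in for a candidate acquisition curve evaluated over θ2.

```r
# Minimal sketch of the augmentation in (5.20): any angle not within tau
# radians of +/- pi/2 has its acquisition value set to -Inf.
augment_acq <- function(acq_vals, theta2_grid, tau) {
  exploratory <- abs(theta2_grid) > (pi / 2 - tau) &
    abs(theta2_grid) < (pi / 2 + tau)
  ifelse(exploratory, acq_vals, -Inf)
}

# Example: with tau = 0.8, only angles within 0.8 radians of +/- pi/2 keep
# their acquisition value
theta2_grid <- seq(-pi, pi, length.out = 361)
aug_vals <- augment_acq(rep(1, length(theta2_grid)), theta2_grid, tau = 0.8)
```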

[Figure 5.10 here. Panel headers: Subject 46, Augmented Acquisition Curve, Lower MAP: PI (ξPI = 3.45, τl = 0.81), Upper MAP: UCB (p = 0.5, τu = 0.6), out-of-sample log likelihood 6.14; Subject 54, Lower MAP: EI (ξEI = 0, τl = 0), Upper MAP: PI (ξPI = 0, τu = 0.97), out-of-sample log likelihood −48.22; Subject 57, Lower MAP: EI (ξEI = 0, τl = 0.97), Upper MAP: UCB (p = 0.79, τu = 1.25), out-of-sample log likelihood −29.68. Axes: ∆r1 (horizontal) vs. θ2 (vertical).]

Figure 5.10: All pairs of (∆r1, θ2) data for subjects 46 (left), 54 (middle), and 57 (right) as shown in Figure 5.9. Each scatterplot is overlaid with the subject's corresponding MAP piecewise augmented acquisition curve in color. Green curves denote PI, blue denotes EI, and red denotes UCB. Around each curve is the 95% highest density posterior prediction interval in light gray. The out-of-sample log likelihood of the fit is shown for each subject in the lower right corner of the plot.

5.4 Model Extensions

In this section we analyze the second version of the experiment, which introduces factors requiring alterations to the IBO framework described in Section 5.3. The motivation behind the second version of the experiment was to make the decision process closer to real-life situations by changing how feedback was presented to the user. In this experiment, feedback after each click was given not as a numeric score but as a cloud of dots centered around the subject's cursor. A larger number of dots indicated a smaller Euclidean distance to the hotspot: the closer a subject got to the hotspot, the more dots they would see. The number of dots shown upon starting each round (i.e. the move 0 score) was always fixed at 200. The reward gradient k from (5.1) was drawn from a Uniform[3/4, 15/4] dots-per-pixel distribution in order to ensure that subjects could not realistically count the number of dots from step to step in the optimization.

5.4.1 Perception error

The change in reward presentation in the second experiment requires a modification to the

IBO approach introduced in Section 5.3. Previously we fixed σs in (5.4) at a small value since ϵ1 only represented minute deviations between the surrogate and the true response surface. In this experiment, σs additionally incorporates the error with which each subject perceives or approximates the number of dots shown around their cursor on each click.

For the subjects in the second experiment, rather than fixing σs we estimate it as part of the probabilistic IBO framework laid out in Section 5.3.2.

5.4.2 Informative priors

Another differentiating feature of the second version of the experiment was that the reward gradient was guaranteed to be steep in each round. In the first version of the experiment, the reward gradient k was drawn from a Uniform[0, 5/3] distribution, hence subjects sometimes encountered rounds where the reward gradient was nearly 0, yielding a flat objective function. In this experiment, k ∼ Uniform[3/4, 15/4] dots per pixel (or Uniform[15, 75] dots per move in the direction of the hotspot, assuming the move is to the click-region boundary). This ensured that subjects never experienced rounds of the game for which the objective function was flat, or even approximately so.

This informs subject priors in a subtle yet significant way. After subjects have played a few rounds of the game, their average score improvement on each move significantly increases; scores increase by an average of 14% after a subject's first 20 rounds. We assume this is partly because their prior beliefs on the plausible range of values for the gradient parameter become more accurate as they spend more time in the game (Tsividis et al. 2017). Consequently, they learn that there is a bound on the minimum value that k can take, and that this value is significantly greater than zero. This leads to an intuitive awareness of an inverse relationship between βx and βy. By (5.1), the maximum change in reward that a subject can receive for a click-boundary-length move is 20k, and this occurs only if the move is made in the direction of the hotspot. By approximating the objective function with a linear model, the gradient information contained in k is effectively split into two components, βx and βy. The directional derivative of a function of two variables (x, y) in the direction d can be expressed as

$$
\operatorname{grad} \delta(x, y) \cdot \mathbf{d} = |\operatorname{grad} \delta(x, y)|\, |\mathbf{d}| \cos\theta, \qquad (5.21)
$$

where d is assumed to be a unit vector and θ is the angle between the gradient vector and d. Since the directional derivative takes on its greatest positive value when θ = 0, the direction of greatest increase of δ(x, y) is the same direction as the gradient vector. In our case, this is ⟨βx, βy⟩, or, converted to a unit vector,

$$
\left\langle \frac{\beta_x}{\sqrt{\beta_x^2 + \beta_y^2}},\; \frac{\beta_y}{\sqrt{\beta_x^2 + \beta_y^2}} \right\rangle.
$$

The corresponding directional derivative is

$$
\langle \beta_x, \beta_y \rangle \cdot \left\langle \frac{\beta_x}{\sqrt{\beta_x^2 + \beta_y^2}},\; \frac{\beta_y}{\sqrt{\beta_x^2 + \beta_y^2}} \right\rangle
= \frac{\beta_x^2}{\sqrt{\beta_x^2 + \beta_y^2}} + \frac{\beta_y^2}{\sqrt{\beta_x^2 + \beta_y^2}}
= \sqrt{\beta_x^2 + \beta_y^2}. \qquad (5.22)
$$

Therefore,

$$
k \approx \sqrt{\beta_x^2 + \beta_y^2}. \qquad (5.23)
$$
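As a quick numeric check of (5.22)-(5.23), consider a hypothetical gradient β = (3, 4): the directional derivative along the unit gradient direction recovers the gradient magnitude.

```r
# Numeric check of (5.22)-(5.23) for a hypothetical gradient beta = (3, 4)
beta <- c(3, 4)
d <- beta / sqrt(sum(beta^2))  # unit vector in the direction of the gradient
sum(beta * d)                  # equals sqrt(sum(beta^2)) = 5
```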

Consequently, the following inequality holds in the local region defined by the click region for a single move:

$$
\min(k) \le \sqrt{\beta_x^2 + \beta_y^2} \le \max(k). \qquad (5.24)
$$

Since 20 × min(k) = 15 is always well above 0, if a subject receives a change in reward of approximately 0 on their first move (i.e. estimates that βx ≈ 0), they can deduce that the absolute value of the gradient in the orthogonal direction, |βy|, must be significantly greater than 0. In other words, when subjects experience negligible gains (or losses) along one axis, this corresponds to comparatively large gains in the orthogonal direction. By a similar argument, the converse relationship also holds: when subjects experience large gains (or losses) along one axis, they may deduce that any gains/losses in the orthogonal direction must be comparatively moderate.

Due to this relationship, we want to select a prior for β that ensures that the transformed variable $\sqrt{\beta_x^2 + \beta_y^2}$ has most of its mass in the support given by the uniform distribution from which k is drawn. Since subjects are unaware of the exact minimum and maximum values which govern the uniform distribution on k, we model their prior belief as a Gamma(a, b) distribution. Let $g(\mathbf{x}) = \sqrt{x_1^2 + x_2^2}$ for a bivariate vector $\mathbf{x}$, and let

$$
h(y \mid a, b) = \frac{b^a}{\Gamma(a)}\, y^{a-1} e^{-b y}. \qquad (5.25)
$$

Note that g(·) and h(·) define our transformation of interest and the Gamma(a, b) density function, respectively. We desire a prior distribution on β, π(β), such that

$$
\pi(\beta) \propto h\big(g(\beta)\big). \qquad (5.26)
$$

The resulting distribution is on an infinite circular ridge (see the left panel of Figure 5.11). While we do not have an analytic solution for the distribution induced by (5.26), since the

definite integral of the Gamma distribution kernel on g(x) over the Cartesian plane is finite, we know that the distribution exists.

The prior defined by (5.26) characterizes the information we assume that the subjects learn as they become familiar with the task. Taken independently, subjects do not know much about either βx or βy a priori, but when considered jointly they know an important feature about the objective. This is illustrated in Figure 5.11. The left plot of Figure 5.11 shows the joint prior density of β when $\sqrt{\beta_x^2 + \beta_y^2} \sim$ Gamma(5.5, 2). We project the prior distribution over a fine grid, using a Riemann sum to approximate the normalizing constant. The donut hole in the neighborhood of β = (0, 0) corresponds to subjects' prior knowledge that the response surface will not be flat, while the donut itself (i.e. the circular area of high density) corresponds to the aforementioned inverse relationship between βx and βy. This relationship is illustrated in the right-hand plot, which shows the posterior joint distribution of β given ∆r1 = 0. Given no change in score after their first move (rotated to fall along the horizontal axis), the subject knows that the gradient along the y-axis must be steep, despite not having explored along this axis at all. However, since they have not explored along the y-axis, they cannot tell whether this gradient is highly negative or positive, as shown by the two modes of the posterior. In both plots, the dashed pink circles represent the boundaries of the uniform distribution from which the reward gradient k is drawn for each new round of the task (i.e. $\sqrt{\beta_x^2 + \beta_y^2} = 3/4$ and $\sqrt{\beta_x^2 + \beta_y^2} = 15/4$).
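The grid projection described above can be sketched as follows. This is an illustrative reconstruction rather than the thesis code; the grid bounds and spacing are arbitrary choices, while the Gamma(5.5, 2) shape and rate follow Figure 5.11.

```r
# Induced prior of (5.25)-(5.26): evaluate the Gamma(a, b) density at the norm
# of beta over a fine grid and normalize with a Riemann sum.
a <- 5.5
b <- 2
step <- 0.05
grid <- expand.grid(beta_x = seq(-6, 6, by = step),
                    beta_y = seq(-6, 6, by = step))
dens <- dgamma(sqrt(grid$beta_x^2 + grid$beta_y^2), shape = a, rate = b)
dens <- dens / (sum(dens) * step^2)  # approximate normalizing constant
```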

[Figure 5.11 here. Left panel: prior density of β; right panel: posterior density of β | ∆r1 = 0. Axes: βx (horizontal) vs. βy (vertical), with density color scales.]

Figure 5.11: Left: the joint prior density of β in experiment 2 as given by (5.25)-(5.26).

Right: the joint posterior density of β given ∆r1 = 0 after the first move. In both plots, the dashed pink circles represent the boundaries of the uniform distribution from which the reward gradient k is drawn for each new round of the task (i.e. $\sqrt{\beta_x^2 + \beta_y^2} = 3/4$ and $\sqrt{\beta_x^2 + \beta_y^2} = 15/4$).

We approximate the posterior distribution of β using samples from a Metropolis algorithm. We constrain the sampler to sample only βy ∈ (0, ∞) and then reflect the samples over the x-axis, yielding a symmetric sample from the full posterior. In order to ensure that subjects have gained the prior knowledge implied by our model, we omit early rounds (1–20) for each subject in our analysis.
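A minimal sketch of this constrained sampler is given below. Here log_post() is a hypothetical stand-in for the log of the un-normalized posterior of β (the prior in (5.26) combined with the surrogate likelihood for the observed first move); the step size and starting value are arbitrary.

```r
# Random-walk Metropolis constrained to beta_y > 0, with the draws reflected
# over the x-axis afterwards to recover the symmetric full posterior.
metropolis_beta <- function(log_post, n_iter = 5000, step = 0.3) {
  beta <- c(0, 1)                      # start in the upper half-plane
  draws <- matrix(NA_real_, n_iter, 2)
  for (i in seq_len(n_iter)) {
    prop <- beta + rnorm(2, sd = step)
    # reject proposals with beta_y <= 0; otherwise standard Metropolis accept
    if (prop[2] > 0 && log(runif(1)) < log_post(prop) - log_post(beta)) {
      beta <- prop
    }
    draws[i, ] <- beta
  }
  rbind(draws, draws %*% diag(c(1, -1)))  # reflect samples over the x-axis
}
```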

5.4.3 Search task implementation (version 2)

We now perform IBO on experiment 2 incorporating the features detailed in Sections 5.4.1 and 5.4.2. To do so, we first generate samples from the posterior predictive distribution of the surrogate, since this distribution no longer has an analytic solution given the prior in (5.26).

We do this across a range of σs values since this parameter is also latent in this version of the experiment. We then approximate the values of the candidate acquisition functions using these variates and take the maxima with respect to θ2 of the resulting surfaces over a plausible range of the exploration parameters. This yields a 3-dimensional arg max array where the additional dimension corresponds to σs. Following the same IBO steps detailed in Section 5.3, we estimate the posterior distribution over U for each subject using the same acquisition candidate set as in the first experiment. The top row of Figure 5.12 shows the data for subjects 10, 17, and 19 with their respective MAP acquisition curves overlaid. The bottom row of Figure 5.12 shows each subject's data overlaid with the MAP estimate over P(Ũ | D1:2), the posterior distribution over the augmented set of candidate acquisition functions.
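The Monte Carlo approximation of the candidate acquisition values described above can be sketched as follows, using the EI family as an example. Here post_pred is a hypothetical matrix of posterior predictive draws of the surrogate at candidate second-move angles; in practice these draws would come from the non-analytic posterior described above.

```r
# Monte Carlo estimate of an EI-style acquisition curve from posterior
# predictive draws, followed by the arg max over theta2.
mc_ei <- function(post_pred, best_so_far, xi = 0) {
  colMeans(pmax(post_pred - best_so_far - xi, 0))
}

theta2_grid <- seq(-pi, pi, length.out = 181)
post_pred <- matrix(rnorm(500 * length(theta2_grid)), nrow = 500)  # placeholder draws
ei_vals <- mc_ei(post_pred, best_so_far = 0, xi = 0)
theta2_grid[which.max(ei_vals)]  # arg max stored in the 3-d array over sigma_s
```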

[Figure 5.12 here. Top row panel headers: Subject 10, MAP: PI (ξPI = 50), σ̂surr = 30, out-of-sample log likelihood −53.52; Subject 17, MAP: UCB (p = 0.57), σ̂surr = 30, out-of-sample log likelihood −43.15; Subject 19, MAP: UCB (p = 0.87), σ̂surr = 20, out-of-sample log likelihood −52.56. Bottom row (Augmented Acquisition Curve) panel headers: Subject 10, Lower MAP: EI (ξEI = 15, τl = 0.73), Upper MAP: EI (ξEI = 0, τu = 1.26), out-of-sample log likelihood −48.01; Subject 17, Lower MAP: PI (ξPI = 5, τl = 1.26), Upper MAP: PI (ξPI = 5, τu = 1.26), out-of-sample log likelihood −36.17; Subject 19, Lower MAP: UCB (p = 0.8, τl = 0.73), Upper MAP: PI (ξPI = 20, τu = 0.94), out-of-sample log likelihood −45.26. Axes: ∆r1 (horizontal) vs. θ2 (vertical).]

Figure 5.12: All pairs of (∆r1, θ2) data for subjects 10 (left), 17 (middle), and 19 (right) in the second experiment. The top row shows each subject's data overlaid with their MAP acquisition curve when considering the original candidate set U. In the bottom row the scatter plots are overlaid with the piecewise MAP acquisition curves over the augmented set Ũ. Green curves denote PI, blue denotes EI, and red denotes UCB. Around each curve is the 95% highest density posterior prediction interval in light gray. The out-of-sample log likelihood of the fit is shown for each subject in the lower right corner of the plot.

While the improvement using the augmented acquisition candidate functions is less pronounced in this second experiment, the augmented functions do provide a better out-of-sample fit. We note that the jagged acquisition curves do not represent overfitting. Rather, the jaggedness comes from Monte Carlo error due to estimating the arg max over the acquisition function surfaces created using variates from the posterior predictive distribution of f̂.

5.5 Conclusion

In conclusion, we find that subjects exhibit a wide array of acquisition preferences, but that nearly all of them exhibit exploration tendencies beyond the ability of standard acquisition functions to capture. We propose an augmentation to the candidate acquisition functions that yields a better explanation of human optimization behavior in this task. We also find evidence that humans respond differently to positive versus negative rewards. From a methodological perspective, we introduce a probabilistic framework for inverse Bayesian optimization and show how this procedure can be carried out in a Bayesian paradigm through the hotspot experiment. More broadly, this work contributes to the field of inverse optimization, which has only very recently begun incorporating a probabilistic framework when estimating unknown model parameters of an optimization problem (Aswani et al. 2018).

Chapter 6

Conclusion

6.1 Summary

This thesis introduces methodology for basketball analytics that could have an impact on strategic decisions at multiple levels of a basketball organization. In the first project, the problem was to evaluate player shooting decisions in the context of their teammates. We provided a novel framework to estimate the efficiency of a player's shooting tendencies in this context while explicitly accounting for spatial information. The metric we developed to measure this type of efficiency, LPL, was found to correlate with game outcomes. LPL could be useful to coaches in strategic planning, to players in their development, and to front offices in player acquisition decisions.

In the second basketball application, our goal was to create a method to estimate the effects of changing a player's shot policy. To do this, we created a basketball play simulator based on estimated team-specific Markov decision processes. In order to model the nonstationarity introduced by the shot clock, we used a unique tensor framework to model the dynamics as a function of time. We showed that the simulator was calibrated with high fidelity and we illustrated how the simulator could be used to test alternate decision policies. While we did not do this in the thesis, the LPL metrics from the first project could be used to inform player-specific shot policy changes, which could then be tested using the methods from this project. Of course, this relies on the assumption that the transition dynamics of the system would not change under the altered policies, which represents a significant limitation of the method.

In the final project, we study human decision-making behavior in a much less complicated environment than basketball. The data come from a simple computer game designed to shed light on the exploration/exploitation tradeoff that characterizes many human decisions. Compared to the basketball players in the first two projects, we take a fundamentally different perspective on the decisions of the subjects in this game. A tacit assumption of the first two projects is that the players are performing suboptimally. The hope is that the methodology we introduce can help change behavior for the better

(i.e. more optimal). By contrast, in the final project we assume that the agents' actions are optimal, but that the criteria over which they optimize are unknown. The goal of the analysis is to make inference on these latent optimization criteria.

We introduce a probabilistic solution framework for inverse Bayesian optimization, and we illustrate our method on the subjects' data from the hotspot computer game. We find that subjects exhibit a wide range of acquisition preferences; however, some subjects' behavior does not map well to any of the candidate acquisition functions we consider. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task. From a methodological perspective, this work builds on the existing literature by expanding inverse optimization to the field of Bayesian optimization. More importantly, we show how a probabilistic framework can be incorporated in the inference of an inverse decision problem, which, to our knowledge, we are the first to consider.

6.2 Future work

In the first basketball application, a significant limitation is that we do not account for defensive information in the FG% and FGA rate models. With access to proprietary data, defensive information (and many other contextual variables) could be included in these models, which could make the shot distribution proposed by LPL a more reliable optimum to seek. Even without access to these data, it may be possible to recreate some features that are not explicitly provided by the NBA's public-facing API. For instance, shot clock times could be estimated using game clock times given in the play-by-play data.

Another area of future work that could improve the allocative efficiency metrics we propose is the incorporation of player usage curves. A usage curve defines the relationship between a player's usage (i.e. how much of a team's scoring load a player is asked to carry) and his efficiency. LPL is constructed by reallocating shots among players in a lineup and estimating the number of points that would be generated by this hypothetical redistribution. The theory behind the usage curve suggests that each player's efficiency would change when redistributing shots, which would change the number of points expected from the redistribution. These same issues apply to the second basketball project, where we test changes to a player's shot policy and estimate the resulting changes to point production and efficiency.

A clear next step in the second basketball application would be to estimate the action-value function of the MDP for the on-policy setting. Given the MDP's massive state space, this would be a significant computational challenge, but doing so could potentially provide new insights about states and players that are over- or under-valued. Unfortunately, our MDP framework could not be used to solve for an optimal policy because we have no

way of estimating the causal effects of changes to shot policies incurred by the defensive response. This represents perhaps the biggest challenge to prescriptive analytics in multi-agent team sports. However, significant steps are being made in this direction. Recently, a 3-dimensional, high-fidelity, physics-based reinforcement learning environment for soccer was developed at Google (Kurach et al. 2020). This type of simulated environment could allow researchers to test changes to strategy and see how the defensive agents respond. Developing this type of environment for basketball represents an exciting open problem.

Finally, there are many promising directions of future work stemming from the inverse Bayesian optimization project. From a methodological perspective, we would like to incorporate statistical models into existing methods of inverse optimization in operations research, which would enable solutions to be properly contextualized with respect to uncertainty. We also see promising connections between the inverse Bayesian optimization methods we introduce in Chapter 5 and inverse reinforcement learning, which we are developing in an ongoing project related to the football example discussed in Chapter 4.

Bibliography

Agostinelli, C. & Lund, U. (2013), 'R package circular: Circular statistics (version 0.4-7)'. URL: https://r-forge.r-project.org/projects/circular

Ahuja, R. K. & Orlin, J. B. (2001), ‘Inverse optimization’, Operations Research 49(5), 771– 783.

Aswani, A., Shen, Z.-J. & Siddiq, A. (2018), ‘Inverse optimization with noisy data’, Oper- ations Research 66(3), 870–892.

Banerjee, S. (2008), 'Bayesian linear model: Gory details'. Downloaded from http://www.biostat.umn.edu/~ph7440

Banerjee, S., Carlin, B. P. & Gelfand, A. E. (2015), Hierarchical Modeling and Analysis for Spatial Data, 2nd edn, CRC Press, Boca Raton, FL.

Berger-Tal, O., Nathan, J., Meron, E. & Saltz, D. (2014), ‘The exploration-exploitation dilemma: a multidisciplinary framework’, PloS one 9(4).

Besag, J. (1974), ‘Spatial interaction and the statistical analysis of lattice systems’, Journal of the Royal Statistical Society. Series B. 36(2), 192–236.

Borji, A. & Itti, L. (2013), Bayesian optimization explains human active search, in ‘Advances in neural information processing systems’, pp. 55–63.

Breitenberger, E. (1963), ‘Analogues of the normal distribution on the circle and the sphere’, Biometrika 50(1/2), 81–88.

Brochu, E., Cora, V. M. & De Freitas, N. (2010), ‘A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning’, arXiv preprint arXiv:1012.2599 .

Candelieri, A., Perego, R., Giordani, I., Ponti, A. & Archetti, F. (2020), ‘Modelling human active search in optimizing black-box functions’, arXiv preprint arXiv:2003.04275 .

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. & Riddell, A. (2017), ‘Stan: A probabilistic programming language’, Journal of Statistical Software 76(1).

Carter, V. & Machol, R. E. (1971), ‘Operations research on football’, Operations Research 19(2), 541–544.

Cervone, D., D'Amour, A., Bornn, L. & Goldsberry, K. (2016), 'A multiresolution stochastic process model for predicting basketball possession outcomes', Journal of the American Statistical Association 111(514), 585–599.

Chan, T. C. Y., Lee, T. & Terekhov, D. (2019), 'Inverse optimization: Closed-form solutions, geometry, and goodness of fit', Management Science 65(3), 1115–1135. URL: http://dx.doi.org/10.1287/mnsc.2017.2992

Chan, T., Fernandes, C. & Puterman, M. (2019), ‘Value functions and points gained in football: theory and applications’, Sauder working paper .

Chang, Y.-H., Maheswaran, R., Su, J., Kwok, S., Levy, T., Wexler, A. & Squire, K. (2014), Quantifying shot quality in the NBA, in ‘The 8th Annual MIT Sloan Sports Analytics Conference, Boston, MA’.

Cohen, J. D., McClure, S. M. & Yu, A. J. (2007), 'Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration', Philosophical Transactions of the Royal Society B: Biological Sciences 362(1481), 933–942.

Cox, D. D. & John, S. (1992), A statistical method for global optimization, in ‘[Proceedings] 1992 IEEE International Conference on Systems, Man, and Cybernetics’, IEEE, pp. 1241– 1246.

D’Amour, A., Cervone, D., Bornn, L. & Goldsberry, K. (2015), ‘Move or die: How ball movement creates open shots in the nba’, Sloan Sports Analytics Conference .

Diggle, P. (1985), ‘A Kernel Method for Smoothing Point Process Data’, Journal of the Royal Statistical Society. Series C (Applied Statistics) 34(2), 138–147.

Dixon, M. J. & Coles, S. G. (1997), ‘Modelling association football scores and inefficiencies in the football betting market’, Journal of the Royal Statistical Society: Series C (Applied Statistics) 46(2), 265–280.

Efron, B. & Morris, C. (1975), ‘Data analysis using stein’s estimator and its generalizations’, Journal of the American Statistical Association 70(350), 311–319.

Franks, A., Miller, A., Bornn, L., Goldsberry, K. et al. (2015), ‘Characterizing the spatial structure of defensive skill in professional basketball’, The Annals of Applied Statistics 9(1), 94–121.

Gallego, J. S. (2017), Introduction to inverse optimization. URL: http://jsaezgallego.com/tutorial/2017/07/16/Inverse-opitmization.html

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A. & Rubin, D. (2013), Bayesian Data Analysis, Third Edition, Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis. URL: https://books.google.ca/books?id=ZXL6AQAAQBAJ

Gershman, S. J. (2019), ‘Uncertainty and exploration.’, Decision 6(3), 277.

Goldman, M. & Rao, J. (2014a), ‘Optimal stopping in the nba: An empirical model of the miami heat’, Available at SSRN 2363709 . URL: https://ssrn.com/abstract=2363709

Goldman, M. & Rao, J. M. (2011), Allocative and dynamic efficiency in nba decision making, in 'Proceedings of the MIT Sloan Sports Analytics Conference', pp. 4–5.

Goldman, M. & Rao, J. M. (2014b), ‘Misperception of risk and incentives by experienced agents’, Available at SSRN 2435551 .

Goldner, K. (2012), ‘A markov model of football: Using stochastic processes to model a football drive’, Journal of Quantitative Analysis in Sports 8(1).

Gramacy, R. B. (2020), Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences, CRC Press.

Grinstead, C. & Snell, J. (1997), Introduction to Probability: Second Revised Edition, American Mathematical Society.

Gudmundsson, J. & Horton, M. (2017), ‘Spatio-temporal analysis of team sports’, ACM Computing Surveys (CSUR) 50(2), 22.

Heuberger, C. (2004), ‘Inverse combinatorial optimization: A survey on problems, methods, and results’, Journal of combinatorial optimization 8(3), 329–361.

Hirotsu, N. & Wright, M. (2002), ‘Using a markov process model of an association football match to determine the optimal timing of substitution and tactical decisions’, Journal of the Operational Research Society 53(1), 88–96.

Hurwicz, L. (1951), ‘The generalized bayes minimax principle: a criterion for decision making under uncertainty’, Cowles Comm. Discuss. Paper Stat 335, 1950.

Jones, D. R., Schonlau, M. & Welch, W. J. (1998), ‘Efficient global optimization of expensive black-box functions’, Journal of Global optimization 13(4), 455–492.

Kubatko, J., Oliver, D., Pelton, K. & Rosenbaum, D. T. (2007), ‘A starting point for analyzing basketball statistics’, Journal of Quantitative Analysis in Sports 3(3).

Kurach, K., Raichuk, A., Stańczyk, P., Zając, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O. & et al. (2020), 'Google research football: A novel reinforcement learning environment', Proceedings of the AAAI Conference on Artificial Intelligence 34(04), 4501–4510. URL: http://dx.doi.org/10.1609/aaai.v34i04.5878

Kushner, H. J. (1964), ‘A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise’, Journal of Basic Engineering pp. 97–106.

Lee, D. D. & Seung, H. S. (1999), 'Learning the parts of objects by non-negative matrix factorization', Nature 401, 788. URL: https://doi.org/10.1038/44565

Li, W. & Ng, M. K. (2014), ‘On the limiting probability distribution of a transition proba- bility tensor’, Linear and Multilinear Algebra 62(3), 362–385. URL: https://doi.org/10.1080/03081087.2013.777436

Lindgren, F., Rue, H. & Lindström, J. (2011), 'An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach', Journal of the Royal Statistical Society, Series B 73(4), 423–498. URL: http://dx.doi.org/10.1111/j.1467-9868.2011.00777.x

Lizotte, D. J. (2008), Practical bayesian optimization, University of Alberta.

March, J. G. (1991), ‘Exploration and exploitation in organizational learning’, Organization science 2(1), 71–87.

Miller, A., Bornn, L., Adams, R. & Goldsberry, K. (2014), 'Factorized point process intensities: A spatial analysis of professional basketball', 31st International Conference on Machine Learning, ICML 2014 1, 398–414.

Mockus, J., Tiesis, V. & Zilinskas, A. (1978), ‘Toward global optimization, volume 2, chapter bayesian methods for seeking the extremum’.

Neiman, T. & Loewenstein, Y. (2011), ‘Reinforcement learning in professional basketball players’, Nature Communications 2, 569.

Oliver, D. (2004), Basketball on Paper: Rules and Tools for Performance Analysis, Potomac Books, Inc.

Plonsky, O., Apel, R., Ert, E., Tennenholtz, M., Bourgin, D., Peterson, J. C., Reichman, D., Griffiths, T. L., Russell, S. J., Carter, E. C. et al. (2019), ‘Predicting human decisions with behavioral theories and machine learning’, arXiv preprint arXiv:1904.06866 .

Polson, N. G., Scott, J. G. et al. (2012), ‘On the half-cauchy prior for a global scale param- eter’, Bayesian Analysis 7(4), 887–902.

Puterman, M. L. (2014), Markov Decision Processes: Discrete Stochastic Dynamic Program- ming, John Wiley & Sons.

Routley, K. D. (2015), A markov game model for valuing player actions in ice hockey, in ‘Conference on Uncertainty in Artificial Intelligence (UAI)’, pp. 782–791.

Sandholtz, N. & Bornn, L. (2020), ‘Supplement to “markov decision processes with dynamic transition probabilities: An analysis of shooting strategies in basketball”’.

Schulz, E., Tenenbaum, J. B., Reshef, D. N., Speekenbrink, M. & Gershman, S. (2015), Assessing the perceived predictability of functions., in ‘CogSci’, Citeseer.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & De Freitas, N. (2015), ‘Taking the human out of the loop: A review of bayesian optimization’, Proceedings of the IEEE 104(1), 148–175.

Simpson, D., Illian, J. B., Lindgren, F., Sørbye, S. H. & Rue, H. (2015), ‘Going off grid: Com- putationally efficient inference for log-Gaussian Cox processes’, Biometrika 103(1), 49–70.

Skinner, B. (2012), ‘The problem of shot selection in basketball’, PLoS ONE 7(1), e30776.

Skinner, B. & Goldman, M. (2019), Optimal strategy in basketball, CRC Press, chapter 11.

Sports Reference LLC (n.d.), 'Calculating PER'. URL: https://www.basketball-reference.com/about/per.html

Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. (2009), ‘Gaussian process optimization in the bandit setting: No regret and experimental design’, arXiv preprint arXiv:0912.3995 .

Štrumbelj, E. & Vračar, P. (2012), ‘Simulating a basketball match with a homoge- neous markov model and forecasting the outcome’, International Journal of Forecasting 28(2), 532–542.

Sutton, R. S. & Barto, A. G. (2018), Reinforcement Learning: An Introduction, MIT press.

Terner, Z. & Franks, A. (2020), ‘Modeling player and team performance in basketball’.

Thomas, A. C., Ventura, S. L., Jensen, S. T. & Ma, S. (2013), ‘Competing process haz- ard function models for player ratings in ice hockey’, The Annals of Applied Statistics 7(3), 1497–1524. URL: http://www.jstor.org/stable/23566482

Tsividis, P. A., Pouncy, T., Xu, J. L., Tenenbaum, J. B. & Gershman, S. J. (2017), Human learning in atari, in ‘2017 AAAI Spring Symposium Series’.

Tversky, A. & Kahneman, D. (1979), ‘Prospect theory: An analysis of decision under risk’, Econometrica 47(2), 263–291.

Von Neumann, J. & Morgenstern, O. (2007), Theory of games and economic behavior (com- memorative edition), Princeton university press.

Vračar, P., Štrumbelj, E. & Kononenko, I. (2016), ‘Modeling basketball play-by-play data’, Expert Systems with Applications 44, 58–66.

Wald, A. (1939), ‘Contributions to the theory of statistical estimation and testing hypothe- ses’, The Annals of Mathematical Statistics 10(4), 299–326.

Ward Jr, J. H. (1963), ‘Hierarchical grouping to optimize an objective function’, Journal of the American Statistical Association 58(301), 236–244.

Wilson, A. G., Dann, C., Lucas, C. & Xing, E. P. (2015), The human kernel, in ‘Advances in neural information processing systems’, pp. 2854–2862.

Appendix A

Additional Details

A.1 IBO Prior Specification

Experiment 1

• Prior distributions:

ξPI ∼ Uniform[0, 33.33]    (A.1)
ξEI ∼ Uniform[0, 33.33]    (A.2)
p ∼ Uniform[0.5, 0.999]    (A.3)
σu ∼ Uniform(0, π]    (A.4)
τ ∼ Uniform(0, π/2]    (A.5)

Experiment 2

• Prior distributions:

ξPI ∼ Uniform[0, 75]    (A.6)
ξEI ∼ Uniform[0, 75]    (A.7)
p ∼ Uniform[0.5, 0.999]    (A.8)
σu ∼ Uniform(0, π]    (A.9)
τ ∼ Uniform(0, π/2]    (A.10)
