Predictive Bayesian selection of multistep Markov chains, applied to the detection of the hot hand and other statistical dependencies in free throws

Joshua C. Chang

Epidemiology and Biostatistics Section, Rehabilitation Medicine Department, The National Institutes of Health Clinical Center, Bethesda, Maryland 20892, U.S.A.

Author for correspondence: Joshua C. Chang, e-mail: [email protected]

Consider the problem of modeling memory effects in discrete-state random walks using higher-order Markov chains. This paper explores cross validation and information criteria as proxies for a model's predictive accuracy. Our objective is to select, from data, the number of prior states of recent history upon which a trajectory is statistically dependent. Through simulations, I evaluate these criteria in the case where data are drawn from systems with fixed orders of history, noting trends in the relative performance of the criteria. As a real-world illustrative example of these methods, this manuscript evaluates the problem of detecting statistical dependencies in shot outcomes in basketball shooting. Over the three NBA seasons analyzed, several players exhibited statistical dependencies in free throw hitting probability of various types – hot handedness, cold handedness, and error correction. For the 2013–2014 through 2015–2016 NBA seasons, I detected statistical dependencies in 23% of all player-seasons. Focusing on a single player, in two of these three seasons, LeBron James shot a better percentage after an immediate miss than otherwise. In those seasons, conditioning on the previous outcome makes for a more-predictive model than treating free throw makes as independent. When extended to data from the 2016–2017 NBA season specifically for LeBron James, a model depending on the previous shot (single-step Markovian) is approximately as predictive as a model with independent outcomes. An error-correcting variable-length model of two parameters, where James shoots a higher percentage after a missed free throw than otherwise, is more predictive than either model.

1. Introduction

Multistep Markov chains (also known as N-step or higher-order Markov chains) are flexible models that are useful for quantifying any discrete-state discrete-time phenomenon. They have appeared in a wide range of contexts, such as the analysis of text [1], human digital trails [2], DNA sequences, protein folding [3], eye movements [4], and queueing theory [5]. In these models, transition probabilities between states depend on the recent history of states visited. In order to learn these models from data, a choice for the number of states of history to retain must be made. In this manuscript we evaluate contemporary Bayesian methods for making this choice, from the perspective of predictive accuracy.

Bayesian model selection is an active field with many recent theoretical and computational advancements. In the broad setting of Markov chain Monte Carlo (MCMC) inference, a significant advance has been the approximation of leave-one-out (LOO) cross validation through the use of Pareto smoothed importance sampling [6], which has made LOO computationally feasible for a large class of problems. As explored in Gelman et al. [7], many methods similar to LOO exist. This manuscript adapts these methods to the context of selection for multistep Markovian models, providing closed-form expressions for computing the information criteria that summarize the evaluation made by each method.

As a concrete illustration of these methods, I examine the problem of detecting statistical dependencies using free throw data from the National Basketball Association (NBA). A particular type of statistical dependency is known as the hot hand phenomenon, in which recent success is indicative of success in the immediate future. While controversial in analytical circles, belief in the hot hand phenomenon is certainly widespread among both the general public and athletes [8–10]. Empirically, the phenomenon has proven to be elusive [11]. In the 1980s, examinations of the phenomenon in basketball based on analysis of shooting streaks yielded negative results [8,12], failing to reject null hypotheses of statistically independent shot outcomes. Based on these early analyses, some studies have dismissed the widespread belief in the hot hand by relating it to the gambler's fallacy [13], the seemingly mistaken belief that "random" events such as roulette spins exhibit autocorrelation [14]. In the context of the hot hand, an autocorrelation would involve an increased probability of making a shot when one is in a "hot" state. Follow-up studies have examined the effects of belief in the hot hand under the supposition that it is a fallacy [15].

Ignoring the fact that statistical dependencies can exist even in intended games of chance [16], one might reasonably suspect that various latent factors can affect the accuracy of an individual, where the outcome is the result of physical processes. These latent factors, modeled for instance by hidden Markov models [17,18], would manifest as statistical dependencies in outcomes. Additionally, there were weaknesses in the prior research efforts that failed to find the hot hand effect. Recent analyses, using multivariate methods that can account for factors such as shot difficulty [19–21], have supported the phenomenon, finding the original studies to be underpowered [20–22], or to suffer from methodological issues regarding the weighting of expectation values [21].
A statistical testing approach that did not share this methodological issue [23] found evidence of the hot hand in aggregate game data but also raised the question of whether the observed patterns were a result of the hot/cold hand or of other individual-level states that imply statistical dependencies. In this manuscript I focus on detecting individual-level effects.

As an illustration of model selection for multistep Markov chains, this manuscript re-examines the hot hand phenomenon from a different analytical philosophy. Presently, a broad class of analyses of the phenomenon [8,12] is rooted in null hypothesis statistical testing. Rather than follow this approach, which requires a subjective choice of a cut-off p-value to assess "significance," this manuscript frames the problem as a model selection task. We are interested in whether models that encompass statistical dependencies like hot handedness are better at predicting free throw outcomes for individual players than models without such effects.

2. Quantitative Methods
Figure 1. (a) Multi-step finite state Markovian processes parameterized by degree of memory h, demonstrated on a three-state network. For h = 1, the statistics of the next state depend solely on the current state and the stochastic system is parameterized by transition probabilities indexed by 2-tuples. For h = 2 and h = 3 the statistics depend on the history. Shown are the possible single-step transitions from state b to state c. For h = 2, transition probabilities depend on the current state and the previous state, and all transition probabilities are indexed by 3-tuples. For h = 3, all transition probabilities depend on the current state and two previous states and are indexed by 4-tuples. (b) Venn diagram of multistep models. Lower-order models can be represented using higher-order transition matrices under equality of rows.
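To make the tuple indexing of Figure 1 concrete, the following is a minimal Python sketch (an illustration added here, not code from the paper): an h = 2 chain on the three states {a, b, c} is stored as a dictionary mapping each ordered pair of recent states to a vector of next-state probabilities, here drawn as random placeholders from a flat Dirichlet distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
states = ["a", "b", "c"]

# h = 2: one probability vector p_x over next states for every ordered pair x of
# recent states, e.g. p[("a", "b")] = (p_{ab,a}, p_{ab,b}, p_{ab,c}).
p = {(s1, s2): rng.dirichlet(np.ones(len(states)))
     for s1 in states for s2 in states}

def step(history):
    """Sample the next state given the two most recent states visited."""
    return rng.choice(states, p=p[tuple(history[-2:])])

print(step(["a", "b"]))  # 'c' with probability p_{ab,c}, etc.
```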

(a) Probabilistic modeling

Multistep Markov chains (also known as n-step Markov chains) are factorized probability models for discrete-state trajectories, where the probability of a particular trajectory is the product of conditional transition probabilities between possible states. The conditions pertain to the prior locations that a trajectory has visited, or its recent history. In our model of free throw shooting there are two states (make and miss); however, let us consider the more general problem of a model with any number M of states. Assume that a trajectory ξ consists of steps ξ_l, where each step takes a value x_l from the set {1, 2, ..., M}. We are interested in representations for the trajectory probability of the form

\[
\Pr(\xi) = \prod_{l=1}^{L} \Pr(\xi_l = x_l \mid \text{previous } h \text{ states}) = \prod_{l=1}^{L} \Pr(\xi_l = x_l \mid \xi_{l-1} = x_{l-1}, \ldots, \xi_{l-h} = x_{l-h}) = \prod_{l=1}^{L} p_{x_{l-h}, x_{l-h+1}, \ldots, x_l}, \tag{2.1}
\]

where h, a non-negative counting number, represents the number of states' worth of memory needed to predict the next state, with appropriate boundary conditions for the beginning of the trajectory. In the context of the hot hand effect, models with h ≥ 1 encompass statistical dependencies between shot outcomes. A hot hand would correspond to higher make probabilities after recent makes, and cold hands correspond to lower probabilities after misses.

Mathematically, the stochastic process underlying discrete-time Markov chains (implicitly h = 1) is represented by a transition matrix, where each entry is a conditional probability of a transition from a state (row) to a new state (column). Multi-step Markov chains are no different in this respect. Each row corresponds to a given history of states and the corresponding matrix entries provide conditional probabilities of transitioning at the next step to a new state (column).

In the case of absolutely no memory (h = 0), the path probability is simply the product of the probabilities of being in each of the separate states in a path, p_{x_1} p_{x_2} ... p_{x_L}, and there are essentially M − 1 free model parameters, where M is the number of states. The memoryless property of Markov chains refers to h = 1. It should be noted that h = 0 is a special sub-case of h = 1, where the associated transition matrix has identical rows. If h = 1, the model is single-step Markovian (memoryless) in that only the current state is relevant in determining the next state. These models involve M(M − 1) free parameters. Knowledge of prior states beyond the current state is considered "memory." Generally, if h states of history are required, then the model is h-step Markovian, and M^h(M − 1) parameters are needed (see Fig. 1). Hence, the size of the parameter space grows exponentially with memory. Our objective is to determine, based on observational evidence, an appropriate value for h.

Note that multistep Markovian models are nested. Lower-order models (smaller h) can be represented by higher-order models (larger h) but not vice versa. Variable-length models [24] also fit into this paradigm, as pictured in Fig. 1. A model might have an effective order of 1 < h < 2, for instance, if many of its parameter vectors p_x are identical.

For a fixed degree of memory h, we may look at possible history vectors x = [x_1, x_2, ..., x_h] of length h taken from the set X_h = {1, 2, ..., M}^h. For each x, denote the vector p_x = [p_{x,1}, p_{x,2}, ..., p_{x,M}], where p_{x,m} is the probability that a trajectory goes next to state m given that x represents its most recent history. For convenience, we denote the collection of all p_x as p (see below for an example of the notation). Generally one has available J ∈ Z+ trajectories. Assuming independence between trajectories, one may write the joint probability, or likelihood, of observing these trajectories as

\[
\Pr(\{\xi^{(j)}\}_{j=1}^{J} \mid p) = \prod_{j=1}^{J} \Pr(\xi^{(j)} \mid p) = \prod_{j=1}^{J} \prod_{x \in X_h} \prod_{m=1}^{M} p_{x,m}^{N^{(j)}_{x,m}} = \prod_{x \in X_h} \prod_{m=1}^{M} p_{x,m}^{N_{x,m}}, \tag{2.2}
\]

where N^(j)_{x,m} is the number of times that the transition x → m occurs in trajectory ξ^(j), and N_{x,m} = Σ_j N^(j)_{x,m} is the total number of times the transition is seen.

For convenience, denote N_x = Σ_m N_{x,m}, N_x = [N_{x,1}, N_{x,2}, ..., N_{x,M}], and the collection {{N_x}_x}_j as N. The sufficient statistics of the likelihood are the counts, so we will refer to the likelihood as Pr(N | p). The maximum likelihood estimator for each parameter vector p_x is found by maximizing the probability in Eq. 2.2, and can be written easily as p̂_x^MLE = N_x/N_x.

Example: The outcomes of free throws for a player in a particular game can be represented as a string or trajectory of states (miss or make). For example, using "+" to denote makes and "−" to denote misses, a trajectory of "+ + − +" corresponds to a game where a player makes the first two free throws, misses the third, and makes the fourth. To clarify our notation, consider a model for free throw shooting informed using J = 2 observed trajectories (games): ξ^(1) = "+ + − + − + +" and ξ^(2) = "+ − + + + − − + + + −". Suppose that we set h = 1 in this model. This choice implies that we need the counts N^(1)_{++} = 2, N^(1)_{+−} = 2, N^(1)_{−−} = 0, N^(1)_{−+} = 2, N^(2)_{++} = 4, N^(2)_{+−} = 3, N^(2)_{−−} = 1, N^(2)_{−+} = 2, coinciding with the number of makes following makes, misses following makes, misses following misses, and makes following misses, respectively, for each trajectory (game). In addition, since the first state in each trajectory is stochastic as well, we add special counts for the outcome of the initial free throw: N_{·−} = 0 and N_{·+} = 2. Aggregating the counts across the trajectories, in the vector notation above, we have N_+ = [N_{+−}, N_{++}] = [5, 6], N_− = [N_{−−}, N_{−+}] = [1, 4], N_· = [0, 2], where the indices are all of length one since our choice of h = 1 means that we only consider history vectors of length one. Using maximum likelihood, we arrive at the parameter estimates p̂_−^MLE = [p̂_{−−}, p̂_{−+}] = [1/5, 4/5], p̂_+^MLE = [5/11, 6/11], p̂_·^MLE = [0, 1]. It is notable that our estimate of the probability of missing the first free throw in a game is null – this is probably an unrealistic inference.

Fundamentally, the maximum likelihood estimator precludes the existence of unobserved transitions – a property that is problematic if the sample size J is small, as already seen in this example. This problem amplifies when increasing h. It is desirable to regularize the problem by allowing a nonzero probability that transitions that have not yet been observed will occur. This manuscript's approach to rectifying these issues is Bayesian.

(b) Bayesian modeling

A natural Bayesian formulation of the problem of determining the transition probabilities is to use the Dirichlet conjugate prior on each parameter vector,

\[
p_x \sim \mathrm{Dirichlet}(\alpha), \tag{2.3}
\]

hyper-parameterized by α, a vector of size M. The Dirichlet probability distribution is a distribution over finite-dimensional probability distributions. It has the probability density function

\[
\pi(p_x) = \frac{1}{B(\alpha)} \prod_{m=1}^{M} p_{x,m}^{\alpha_m - 1}, \tag{2.4}
\]

where B: R^M → R refers to the multivariate beta function [25],

\[
B(x) = \frac{\prod_{m=1}^{M} \Gamma(x_m)}{\Gamma\!\left(\sum_{m=1}^{M} x_m\right)}, \tag{2.5}
\]

and Γ refers to the gamma function. This manuscript assumes that α = 1, corresponding to a uniform prior. This prior, paired with the likelihood of Eq. 2.2, yields the posterior distribution on the probabilities,

\[
p_x \mid N_x \sim \mathrm{Dirichlet}(\alpha + N_x). \tag{2.6}
\]

In effect, one is assigning a mean probability of α_m/(Σ_m α_m + N_x) to any unobserved transition, where |α| can be made small if it is expected that the transition matrix should be sparse. Other values of α are possible; for instance, α = 1/2 corresponds to the Jeffreys prior. Note that other Bayesian treatments of Markov chain inference have used priors within the Dirichlet family [4,26]. In the large-sample limit, as long as the components of α are bounded, the posterior distribution is not sensitive to the choice of α, as the posterior density of Eq. 2.6 becomes tightly concentrated about the maximum likelihood estimates. This fact is evident by observing that in Eq. 2.6, α_m + N_x ≈ N_x as N_x → ∞.

(c) Model selection criteria

The parameter h controls the trade-off between complexity and fitting error. From a statistical viewpoint, complexity results in less-precise determination of model parameters, leading to larger prediction errors (overfitting). Conversely, a simple model may not capture the true probability space where paths reside, and may fail to catch patterns in the real process (underfitting). There are various existing generalized methods for evaluating how well models predict. Each of these methods summarizes a model using a single quantity. To facilitate comparison between the methods themselves, this manuscript scales the output of all methods to the deviance scale as used in the AIC. The deviance is a measure of the information lost when going from a full model to an alternative model [27].

The Akaike Information Criterion (AIC) [27–29], defined through the formula AIC = −2 Σ_x log Pr(N_x | p̂^MLE) + 2k, where k is the number of parameters in the model, is an estimate of the deviance between an unknown true model and a given model. For the selection of h, it may be computed exactly in closed form,

\[
\mathrm{AIC} = -2 \sum_{x} \sum_{m=1}^{M} N_{x,m} \log\!\left(\frac{N_{x,m}}{N_x}\right) + 2 M^h (M - 1), \tag{2.7}
\]

where in this context we define 0 × log(0) ≡ 0. Rooted in information theory, the AIC is an asymptotic approximation of the deviance [30]. The model with the smallest AIC is chosen. A limitation of the AIC is inaccuracy for small datasets. A correction to the AIC known as the AICc exists [31]; however, its exact form is problem specific [30].

The Bayesian Information Criterion (BIC) is closely related to the AIC but differs in the form of the complexity penalty, taking sample size into account. The BIC, obeying the general formula BIC = −2 Σ_x log Pr(N_x | p̂^MLE) + log(N) k, is also available in closed form,

\[
\mathrm{BIC} = -2 \sum_{x} \sum_{m=1}^{M} N_{x,m} \log\!\left(\frac{N_{x,m}}{N_x}\right) + \log\!\left(\sum_x N_x\right) M^h (M - 1). \tag{2.8}
\]

Despite its name, the formulation of the BIC is not Bayesian, using neither the prior nor the posterior distributions. Under some conditions, however, the BIC can be seen as an asymptotic approximation of a Bayes factor [32].

Many of the Bayesian evaluation criteria feature the multivariate beta function (Eq. 2.5), as found in the normalization constant of the Dirichlet distribution (Eq. 2.4). To understand the large-sample properties of these methods, asymptotic expansion of the multivariate beta function can help.
In the case where |x| → ∞, assuming that all components of x become unbounded, Stirling's approximation gives the log multivariate beta function the behavior

\[
\begin{aligned}
\log B(x) &= \sum_{m=1}^{M}\left[ x_m \log(x_m) - x_m - \frac{1}{2}\log\!\left(\frac{x_m}{2\pi}\right) + \frac{1}{12 x_m} + O(x_m^{-3}) \right] \\
&\quad - \left[ \left(\sum_m x_m\right)\log\!\left(\sum_m x_m\right) - \sum_m x_m - \frac{1}{2}\log\!\left(\frac{\sum_m x_m}{2\pi}\right) + \frac{1}{12\sum_m x_m} + O\!\left(\Big(\sum_m x_m\Big)^{-3}\right) \right] \\
&= \sum_m x_m \log\!\left(\frac{x_m}{\sum_m x_m}\right) - \frac{1}{2}\left[ \sum_m \log\!\left(\frac{x_m}{2\pi}\right) - \log\!\left(\frac{\sum_m x_m}{2\pi}\right) \right] + \frac{1}{12}\left[ \sum_m \frac{1}{x_m} - \frac{1}{\sum_m x_m} \right] + O\!\left(\Big(\sum_m x_m\Big)^{-3}\right).
\end{aligned} \tag{2.9}
\]

Bayes factors are ratios of the probability of the dataset given two models, averaged over their corresponding prior parameter distributions [33,34]. In the case of Markov chains, the likelihood completely factorizes into a product of transition probabilities, and each model's corresponding term in a Bayes factor is the exponential of its log marginal likelihood (LML),

\[
\mathrm{LML} = \sum_{x} \log\!\left[\frac{B(N_x + \alpha)}{B(\alpha)}\right]. \tag{2.10}
\]

If the expectation is computed instead against a posterior distribution p | N, one arrives at the expected log predictive density (LPD),

\[
\mathrm{LPD} = \sum_{x} \log\!\left[\frac{B(2 N_x + \alpha)}{B(N_x + \alpha)}\right]. \tag{2.11}
\]

Related to the LPD is the expected log pointwise predictive density (LPPD), where the expectation in the LPD is broken down point-wise. For our application, we will consider trajectories to be points and write the LPPD as

\[
\mathrm{LPPD} = \sum_{j} \sum_{x} \log\!\left[\frac{B(N_x + N_x^{(j)} + \alpha)}{B(N_x + \alpha)}\right]. \tag{2.12}
\]

The LPPD features in alternatives to Bayes factors and the AIC [7]. The Widely Applicable Information Criterion (WAIC) [35,36] is a Bayesian information criterion with two variants, each featuring the LPPD but differing in how they compute model complexity. The WAIC is defined as

\[
\mathrm{WAIC} = -2\,\mathrm{LPPD} + 2 k_{\mathrm{WAIC}}, \tag{2.13}
\]

where the effective model sizes are computed exactly as

\[
k_{\mathrm{WAIC1}} = 2\,\mathrm{LPPD} - 2 \sum_{x} \sum_{m=1}^{M} N_{x,m}\left[ \psi(N_{x,m} + \alpha_m) - \psi\!\left(N_x + \sum_m \alpha_m\right) \right], \tag{2.14}
\]

and

\[
k_{\mathrm{WAIC2}} = \sum_{j} \sum_{x} \left[ \sum_{m=1}^{M} \left[N^{(j)}_{x,m}\right]^2 \psi'(\alpha_m + N_{x,m}) - \left[N^{(j)}_x\right]^2 \psi'\!\left(N_x + \sum_m \alpha_m\right) \right]. \tag{2.15}
\]

In each of these expressions, ψ refers to the digamma function, the logarithmic derivative of the gamma function, and ψ' to its derivative, the trigamma function [25]. The WAIC, unlike the AIC, is applicable to singular statistical models and is asymptotically equivalent to Bayesian leave-one-out cross-validation [35]. The two effective model size estimates for the WAIC are posterior expectations of equivalent estimates used in the Deviance Information Criterion (DIC). The DIC,

\[
\mathrm{DIC} = -2 \sum_{x} \log \Pr\!\left(N_x \,\middle|\, \bar{p}_x = \mathbb{E}_{p_x \mid N_x}[p_x]\right) + 2 k_{\mathrm{DIC}}, \tag{2.16}
\]

also resembles the WAIC. It consists of two variants in the computation of model complexity,

\[
k_{\mathrm{DIC1}} = 2\left\{ \sum_{x} \sum_{m=1}^{M} N_{x,m} \log\!\left(\frac{N_{x,m} + \alpha_m}{N_x + \sum_m \alpha_m}\right) - \sum_{x} \sum_{m=1}^{M} N_{x,m}\left[ \psi(\alpha_m + N_{x,m}) - \psi\!\left(N_x + \sum_m \alpha_m\right) \right] \right\}, \tag{2.17}
\]

and k_DIC2 = 2 var_{p|N}[log Pr(N | p)], which may be computed as

\[
k_{\mathrm{DIC2}} = 2 \sum_{x} \left[ \sum_m N_{x,m}^2\, \psi'(\alpha_m + N_{x,m}) - N_x^2\, \psi'\!\left(\sum_m \alpha_m + N_x\right) \right]. \tag{2.18}
\]

The two estimates of complexity can be derived asymptotically from the LPD [7], both reducing exactly to the number of predictors in the case of linear regression models using uniform priors.

Finally, Bayesian variants of cross-validation have recently been proposed as alternatives to information criteria [7]. In our problem, k-fold CV, where the data are divided into k partitions, can be evaluated in closed form without repeated model fitting. Using −2 × LPPD as a metric, this manuscript also evaluates two variants of k-fold CV: leave-one-out cross validation (LOO),

\[
\mathrm{LOO} = -2 \sum_{j} \sum_{x} \log\!\left[\frac{B(N_x + \alpha)}{B(N_x - N_x^{(j)} + \alpha)}\right], \tag{2.19}
\]

and two-fold (leave-half-out, LHO) cross validation,

\[
\mathrm{LHO} = -2 \sum_{j=1}^{J/2} \sum_{x} \log\!\left[\frac{B(N_x^{+} + N_x^{(j)} + \alpha)}{B(N_x^{+} + \alpha)}\right] - 2 \sum_{j=J/2+1}^{J} \sum_{x} \log\!\left[\frac{B(N_x^{-} + N_x^{(j)} + \alpha)}{B(N_x^{-} + \alpha)}\right], \tag{2.20}
\]

where N_x^± constitute the transition counts of the last J/2 trajectories and the first J/2 trajectories, respectively, so that N_x^− + N_x^+ = N_x, and B refers to the multivariate beta function.
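All of the criteria above reduce to sums of log multivariate beta functions of transition counts, i.e., to sums of log gamma functions. The following is a minimal Python sketch (an illustration, not the author's code) of the closed-form LOO criterion of Eq. 2.19; the integer state coding and the treatment of the first steps of a trajectory (keyed by their shorter histories, mirroring the special initial states of the worked example above) are assumptions of this sketch.

```python
import numpy as np
from collections import defaultdict
from scipy.special import gammaln

def log_multivariate_beta(x):
    """log B(x) = sum_m log Gamma(x_m) - log Gamma(sum_m x_m)  (Eq. 2.5)."""
    x = np.asarray(x, dtype=float)
    return gammaln(x).sum() - gammaln(x.sum())

def transition_counts(trajectories, h, M):
    """Per-trajectory counts N^(j)_{x,m} for histories of length h.

    States are integers 0..M-1.  The first steps of a trajectory, which have
    fewer than h predecessors, are keyed by their shorter history."""
    counts = []
    for traj in trajectories:
        c = defaultdict(lambda: np.zeros(M))
        for l, state in enumerate(traj):
            x = tuple(traj[max(0, l - h):l])
            c[x][state] += 1
        counts.append(dict(c))
    return counts

def loo(trajectories, h, M, alpha=1.0):
    """Closed-form leave-one-out criterion of Eq. 2.19 (lower is better)."""
    per_traj = transition_counts(trajectories, h, M)
    totals = defaultdict(lambda: np.zeros(M))      # aggregated counts N_x
    for c in per_traj:
        for x, n in c.items():
            totals[x] += n
    a = np.full(M, alpha)
    crit = 0.0
    for c in per_traj:                             # held-out trajectory j
        for x, n_j in c.items():
            crit += (log_multivariate_beta(totals[x] + a)
                     - log_multivariate_beta(totals[x] - n_j + a))
    return -2.0 * crit

# toy usage with two-state (0 = miss, 1 = make) trajectories
games = [[1, 1, 0, 1, 0, 1, 1], [1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]]
print({h: round(loo(games, h, M=2), 2) for h in (0, 1, 2)})
```

The same log_multivariate_beta helper can be reused for the LML, LPD, LPPD, and LHO expressions, which differ only in which counts enter the ratio.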

3. Evaluation of selection criteria

Simulations provided a means for testing how the criteria mentioned perform in finite-sample settings typical of most learning tasks. For the large-sample characteristics of cross-validation, I refer the reader to Stone [37]. As a test system, consider a system of M = 8 states, with designated start and absorbing states. For each given value of h_true, I generated, for each x ∈ X_h, a single set of fixed true transition probabilities {p_x : x ∈ X_h} drawn from Dirichlet(1) distributions. For each of these random networks of a fixed h, I randomly sampled sets of J trajectories, 10^4 times – each trajectory terminating when hitting a designated absorbing state. Note that the number of steps in a given trajectory is itself stochastic and determined by the statistics of the first-passage time to an absorbing state given the true transition probabilities. Then, for each sample of J trajectories, I computed all the criteria mentioned in the previous section for various values of h.

Fig. 2 provides the frequency with which each of six models (h = 0, 1, ..., 5) was chosen based on the selection criteria compared. Each row corresponds to a given true degree of memory h_true ∈ {0, 1, 2, 3, 4, 5} and sample sizes increase along columns when viewed from left to right. Generally, as the number of samples increases, all selection criteria except for the LML (Bayes factors) and LPPD improve in their ability to select the true model. The LML consistently selects a more-complex (higher-h) model.
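A minimal sketch of this simulation setup is given below (my own illustration rather than the author's code; the choice of state 0 as the start state, state M − 1 as the absorbing state, and the padding of short histories with the start state are assumptions about details the text leaves unspecified).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chain(M, h):
    """Random h-step transition probabilities p_x ~ Dirichlet(1) for every
    history x in {0, ..., M-1}^h, as in the Section 3 test systems."""
    if h == 0:
        return {(): rng.dirichlet(np.ones(M))}
    histories = np.indices((M,) * h).reshape(h, -1).T.tolist()
    return {tuple(x): rng.dirichlet(np.ones(M)) for x in histories}

def sample_trajectory(p, M, h, start=0, absorbing=None, max_len=10_000):
    """Walk from the start state until the absorbing state is hit.
    Short histories at the beginning are padded with the start state
    (an assumption; the text does not specify the boundary rule)."""
    absorbing = M - 1 if absorbing is None else absorbing
    traj = [start]
    while traj[-1] != absorbing and len(traj) < max_len:
        history = tuple(traj[-h:]) if h > 0 else ()
        history = (start,) * (h - len(history)) + history
        traj.append(int(rng.choice(M, p=p[history])))
    return traj

# one random h_true = 2 network and J = 64 trajectories drawn from it
M, h_true, J = 8, 2, 64
p_true = random_chain(M, h_true)
trajectories = [sample_trajectory(p_true, M, h_true) for _ in range(J)]
```

Evaluating the criteria from the previous sketch, e.g. loo(trajectories, h, M) for a range of h over many such replicates, produces the kind of selection-frequency summary reported in Fig. 2.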


Figure 2. Chosen degree of memory h in simulations for varying true degrees of memory h_true and numbers of observed trajectories J. Choices are made on the basis of the model with the lowest value of the given criterion. Rows correspond to model selection under a given degree of memory. Columns correspond to the number of trajectories. Depicted are the percentages of simulations in which each degree of memory is selected using the different model evaluation criteria (percentages of at least 20% are labeled). Colors are coded by degree of memory (0: black, 1: red, 2: blue, 3: green, 4: purple, 5: orange).

Example: For h_true = 1 and J = 4, the WAIC1 criterion selected h = 1 approximately 86% of the time.


Figure 3. Distributions of computed selection criteria relative to a true model (h_true = 2), ∆Criterion(h) = Criterion(h) − Criterion(h_true). Density plots with the minimum, maximum, and mean of the selection criteria for each model relative to that of the true model are shown at various sample sizes J. Values above zero mean that the true model is favored over a particular model. Ideally, mass should be above zero for accurate selection of the true model (zero drawn as a dashed line).

The AIC does well if h_true is small, but requires more data than many of the competing methods in order to resolve larger degrees of memory. The BIC behaves like a more-conservative version of the AIC, requiring more data to select the more-complex but true generating model than the other methods. LOO, the two variants of the WAIC, and DIC1 perform roughly on par.

Since one uses each criterion by choosing the model of lowest value, it is desirable that ∆Criterion(h) = Criterion(h) − Criterion(h_true) > 0 for h ≠ h_true. Fig. 3 explores the distributions of these quantities in the case where h_true = 2. As sample size J increases, there is clearer separation of these quantities from zero. By J = 64, no models where h = 1 are selected using any of the criteria. The WAIC2 and LOO criteria perform about the same, whereas the WAIC1 and DIC1 criteria lag behind in separating themselves from zero. In the Supplemental Materials, the aforementioned experiment is repeated for M = 4 state systems, finding consistent results. This consistency is evident when comparing Supplemental Fig. S2 to Fig. 2, where the same trends are present in both sets of results.

Informed by these tests, this manuscript recommends the use of leave-one-out cross validation (LOO). LOO performed slightly better than WAIC2 in the included tests, while being somewhat simpler to compute. Eq. 2.19 decomposes completely into a sum of logarithms of gamma functions, and is hence easy to implement in standard scientific software packages.
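For instance, with the loo() helper sketched in Section 2, selecting h amounts to a few lines (the trajectories below are hypothetical placeholders):

```python
# hypothetical two-state trajectories (0 = miss, 1 = make)
games = [[1, 1, 0, 1], [1, 0, 0, 1, 1], [0, 1, 1, 1]]
scores = {h: loo(games, h, M=2) for h in (0, 1, 2, 3)}
best_h = min(scores, key=scores.get)   # the model with the lowest LOO is selected
```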

4. The hot hand phenomenon

I used the methodology in this manuscript to evaluate the hot hand effect in the controlled context of free throws. The free throw data were manually scraped from game logs publicly available on the ESPN website.

In Fig. 4, LOO is evaluated for h ∈ {0, 1, 2, 3, 4}, for all player-seasons between 2013–2014 and 2015–2016. LOO favored a model where h > 0 for approximately 23% of all player-seasons tested. Restricted to player-seasons where at least 82 free throws were attempted, ∆LOO = LOO(h) − LOO(0) is presented in Fig. 4. The finding that h > 0 does not by itself present information about the nature of statistical dependencies. To examine their nature, one must examine the inferred probabilities. In Fig. 5, conditional hitting probabilities are presented for the same player-seasons represented in Fig. 4, where h = 1 is favored over h = 0. It is evident that many of the players exhibit what one might call a "hot hand" by shooting a better percentage after a previous make than otherwise. Likewise, many players exhibit cold hands, shooting worse after misses, or shooting a worse free throw percentage on the first attempt in a game. Notably, some players, such as LeBron James, seem to shoot better after a miss than otherwise, which would represent a form of error correction.

LeBron James is a volume free throw shooter who appears in Fig. 4, so I singled him out for further analysis. Looking at Fig. 4, James appears in the first two seasons but is not present in the third season (2015–2016). Examining his shooting splits from Fig. 5, two trends stand out. First, he shoots a lower percentage on the first free throw of the game than otherwise. Second, he appears to consistently hit a higher percentage after a miss than otherwise. In the 2015–2016 season, those patterns do not survive and LOO does not favor h = 1 over h = 0.

During the 2016–2017 season, LeBron James attempted at least a single free throw in 91 games, hitting 471 of 693 overall (Fig. 6). Conditioning the hit probabilities on the outcome of the preceding free throw in the same game, James shot a slightly better percentage after missing a free throw than otherwise. However, the h = 0 model is favored slightly over h = 1 (model selection pane of Fig. 6).

As in Fig. 2 for the M = 8 test system, we can evaluate the performance of the model selection criteria using simulations. Assuming that the h = 1 model is true, sets of 91 strings of free throw outcomes were simulated. The length of each string was chosen by drawing from a Poisson distribution whose expectation matched the mean number of free throws attempted by James per game (≈ 7.6). The overall hitting percentage in these simulations was matched to 68%, as found in the original game data, and the transition probabilities were matched overall to those in Fig. 6. Despite the fact that h = 1 was the underlying true model, it was chosen slightly under half the time (Fig. 6). Another variant of the simulated power analysis is given in Fig. S1, where within each game James' free throw outcomes are resampled from the actual game data, in effect scrambling the order of makes and misses. Shuffling of shot outcomes destroys the correlations between consecutive shots. It is seen in Fig. S1 that the information criteria match up both qualitatively and quantitatively in their model choice with the h = 0 simulated game data presented in Fig. 6.
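A sketch of this style of power analysis is given below (again an illustration rather than the author's code, reusing the loo() helper sketched in Section 2). The conditional make probabilities are taken from the observed splits in Fig. 6, and requiring at least one attempt per game is an added assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

N_GAMES, MEAN_ATTEMPTS = 91, 7.6
P_FIRST, P_AFTER_MAKE, P_AFTER_MISS = 0.659, 0.659, 0.735  # observed splits, Fig. 6

def simulate_game():
    """One game of free throws from the assumed h = 1 'true' model."""
    n = max(1, rng.poisson(MEAN_ATTEMPTS))     # at least one attempt per game
    shots = [int(rng.random() < P_FIRST)]
    for _ in range(n - 1):
        p = P_AFTER_MAKE if shots[-1] == 1 else P_AFTER_MISS
        shots.append(int(rng.random() < p))
    return shots

def selection_frequencies(n_sims=1000, candidate_h=(0, 1, 2)):
    """Fraction of simulated seasons in which LOO picks each candidate h."""
    picks = np.zeros(len(candidate_h))
    for _ in range(n_sims):
        season = [simulate_game() for _ in range(N_GAMES)]
        scores = [loo(season, h, M=2) for h in candidate_h]
        picks[int(np.argmin(scores))] += 1
    return dict(zip(candidate_h, picks / n_sims))
```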
Examining the model parameters in the case of h = 1, one sees that the hitting probabilities are similar in all cases except after a miss (Fig. 6). This observation suggests a model with jagged, variable-length memory: independence of outcomes except immediately after a miss. Having one fewer parameter than the full h = 1 model, this variable-length model is favorable to both the h = 0 and h = 1 models (Fig. 7). Hence, at least for this season and the two present in Fig. 4, the most predictive model of James' free throw shooting tells a story of error correction rather than a story of hot hand.
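The two-parameter variable-length model pools the first shot of a game together with shots taken after a make, and keeps a separate make probability for shots taken after a miss. Its LOO criterion follows from the same closed form as Eq. 2.19 applied to the two pooled histories; the sketch below is an illustrative implementation under that reading of the model.

```python
import numpy as np
from scipy.special import gammaln

def log_beta(x):
    x = np.asarray(x, dtype=float)
    return gammaln(x).sum() - gammaln(x.sum())

def loo_jagged(games, alpha=1.0):
    """LOO (as in Eq. 2.19) for the two-parameter model: one make probability
    after a miss, one shared by all other situations (first shot of a game
    and shots after a make).  games: lists of 0/1 outcomes, 1 = make."""
    a = np.full(2, alpha)
    keys = ("after_miss", "otherwise")
    per_game = []
    for g in games:
        c = {k: np.zeros(2) for k in keys}       # counts [misses, makes]
        for l, shot in enumerate(g):
            k = "after_miss" if (l > 0 and g[l - 1] == 0) else "otherwise"
            c[k][shot] += 1
        per_game.append(c)
    totals = {k: sum(c[k] for c in per_game) for k in keys}
    crit = 0.0
    for c in per_game:                           # hold out one game at a time
        for k in keys:
            crit += log_beta(totals[k] + a) - log_beta(totals[k] - c[k] + a)
    return -2.0 * crit
```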

5. Discussion and Summary

This manuscript addressed general methods of degree selection for multistep Markov chain models.


Figure 4. LOO(h) − LOO(h = 0) for player-seasons with evidence of memory for the given seasons (minimum 82 attempts in a season). Negative (olive) values of this quantity mean that the associated model is more predictive than the h = 0 model.

Figure 5. In-game conditional free throw hitting probabilities, for player-seasons shown in Fig. 4, where the h = 1 model is more predictive than the h = 0 model.

Figure 6. LeBron James' free throw accuracy for the 2016–2017 season and evaluation of the hot hand phenomenon. Model selection criteria for degree h ∈ {0, 1, 2, 3}; four criteria are compared. Lower is better, and h = 0 is slightly favored over h = 1 using all criteria. Simulated power analyses show the frequency with which each value of h is chosen for simulated sets of free throw trajectories.

Hybrid jagged model, observed free throw accuracy: after miss 139/189 (73.5%); otherwise 332/504 (65.9%).

Figure 7. Hybrid variable-length memory model for free throw outcome where shots are independent except immediately after a miss. AIC: 869.40, WAIC1: 869.45, WAIC2: 869.52, LOO: 869.52. For reference, all selection criteria for the fully independent (h = 0) model are approximately 871 (Fig.6).

While the example provided was related to the evaluation of the hot hand phenomenon, the class of models where the included methodology can be used is broad. Notably, multistep Markov chains have applications in queueing models, which feature heavily in operations research.

The simulations yielded insight on the performance of the criteria in the typical small-sample setting. This manuscript provided simulations for M = 2, 4, 8 state systems, finding consistent results throughout. Importantly, both the AIC and the LML (Bayes factors) are biased, in different situations and in opposite directions. For small datasets, the AIC tends to sparsity, which runs counter to the typical situation in linear regression problems, where the AIC can favor complexity with too few data, a situation ameliorated by the more-stringent AICc [38]. The BIC is the most conservative of the methods tested; it requires more data than the other methods to accept the veracity of any given higher-order model.

Bayes factors with flat model priors (via the AIC-scaled log marginal likelihood), as investigated here, on the other hand, consistently select a higher value of h given more data. Due to the nested nature of these models (depicted in Fig. 1), such behavior may not be undesirable. One may still learn an effective lower-order model within a higher-order model, finding that the higher-order model makes effectively the same predictions. Notably, alternative Bayes factor methods for selecting the degree of memory also include model-level priors that behave like the penalty term in the AIC [2,39]. Since the upper bound of the LML is the logarithm of the likelihood found from the MLE procedure, this selection method is more stringent in the low sample-size regime than the pure AIC and hence will suffer from the same bias towards selecting models with less memory. Additionally, it is known that Bayes factors are sensitive to the choice of prior [7], since they involve an expectation relative to the prior distribution. Examining Eq. 2.10, the denominator of the term within the logarithm is invariant to observations. In contrast, Bayesian alternatives where expectations are taken with respect to the posterior should not be as sensitive to the choice of prior. Looking at the LOO of Eq. 2.19, the prior enters the formulation only by incrementing the overall count of a given pattern. Asymptotically, N_x quickly overwhelms α. Hence, LOO is not as sensitive to the exact choice of α.

(a) Limitations and extensions

This manuscript addressed only a limited aspect of the overall model selection task – the evaluation of competing models on the basis of predictive accuracy. This manuscript does not tackle the parallel task of model searching, outside of the context of fixed-order multistep models. For fixed-order models, search is easy: one fits models by order, sequentially. We have seen, however, that at times variable-length histories are appropriate. Notably, in LeBron James' 2016–2017 free throws, a variable-length model is favored over a larger encompassing model, which is itself disfavored relative to a smaller fixed-length model. In that example, with the small number of states, one could easily detect the variable-length model directly. However, when the number of states increases, the number of variable-length models also increases exponentially. While out of the intended scope of this manuscript, I note that projective search methods [40] may have promise for adaptation to the search for variable-length Markov chains. In these methods, one searches for submodels nested within a larger encompassing model. As a baseline for such a procedure, one may choose to begin with a model of slightly higher order than that selected by LOO.

As pertains to the hot hand and related phenomena, fundamentally, these phenomena manifest as observable correlations in shot outcomes. However, the "generating distributions" for free throw outcomes are likely not in the class of multistep Markov models. From a modeling standpoint, a hidden Markov model with "hot" and "cold" states, as implemented by Wetzels et al. [41], may be more mechanistically valid. Hidden Markov models map to multistep Markov models of perhaps infinite order at arbitrarily high precision. Hence, this manuscript focused on the detection of any statistical dependency in free throws. However, one could argue, as I have done, that the various patterns of statistical dependencies detected – cold first shot, error correction, and so on – have real-world physical interpretations. Such an argument may not hold in generality for other processes modeled using multistep Markov chains. Hence, I would like to stress that the focus of the manuscript is on finding predictive models, rather than on mechanistic certainty.

(b) The hot hand phenomenon

From the modeling perspective, for a given player, there can be large fluctuations in free throw shooting percentage between seasons. It is common, particularly early in careers, for players to drastically improve their accuracy after an offseason of training. However, changes often occur in the other direction as well. Hence, one either needs to model non-stationarity in the percentages or restrict the time interval of the applicability of any given model. In this manuscript I have chosen to restrict modeling to the interval of a single season, ignoring non-stationarity within the season, a trade-off common to other analyses [8]. The downside of such restrictions is that they limit the volume of data that may be used in detecting effects.

LeBron James, a volume free throw shooter who does not miss many games and plays well into the playoffs, is perhaps a best-case scenario for the detection of statistical dependencies in free throws using such models. Even for James, the detection of these effects can be difficult. Judging from simulations (Fig. 6), it appears that the dataset is underpowered for the selection of a pure model where h = 1. On the other hand, in simulations where h_true = 0, h = 0 is correctly chosen approximately 83% of the time. The h = 0 and h = 1 models are both approximately as predictive. In fact, from the model averaging perspective, they would be weighted approximately the same, as weights are exponential in the gap between the selection criteria [7].

For James' 2016–2017 attempts, the variable-length model (Fig. 7) is more predictive than either of the pure h = 0 and h = 1 models. While his 2013–2014 and 2014–2015 seasons do favor h = 1 over h = 0, the effective models have the same error-correcting behavior as the variable-length model of 2016–2017. Hence, based on finding the model with the best prediction, one would predict that LeBron James is often more likely to make a free throw after a miss than otherwise.


Data Accessibility. All data have been deposited on Dryad.

Competing Interests. The author declares no competing interests.

Disclaimer. This section does not apply.

Authors’ Contributions. The sole author is responsible for the entirety of the work.

Funding. This work is supported by the Intramural Research Program of the National Institutes of Health Clinical Center and the US Social Security Administration.

Acknowledgements. This manuscript is dedicated to the memory of Dr. Robert M. Miura. He is survived by his loving family, his numerous mathematical contributions, and the many generations of researchers that he has mentored and advised throughout his decades of generous service. I also thank members of the Biostatistics and Rehabilitation Section in the Rehabilitation Medicine Department at NIH, John P. Collins in particular, and also Carson Chow at NIDDK for helpful discussions.

References

1. SS Melnyk, OV Usatenko, and VA Yampol'skii. Memory functions of the additive Markov chains: applications to complex dynamic systems. Physica A: Statistical Mechanics and its Applications, 361(2):405–415, 2006.
2. Philipp Singer, Denis Helic, Behnam Taraghi, and Markus Strohmaier. Detecting memory and structure in human navigation patterns using Markov chain models of varying order. PloS one, 9(7):e102070, 2014.
3. Zheng Yuan. Prediction of protein subcellular locations using Markov chain models. FEBS letters, 451(1):23–26, 1999.
4. Mario Bettenbühl, Marco Rusconi, Ralf Engbert, and Matthias Holschneider. Bayesian selection of Markov models for symbol sequences: Application to microsaccadic eye movements. PloS one, 7(9):e43388, 2012.
5. Jyotiprasad Medhi. Stochastic models in queueing theory. Elsevier, 2002.
6. Aki Vehtari, Andrew Gelman, and Jonah Gabry. Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646, 2015.
7. Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6):997–1016, 2014.
8. Thomas Gilovich, Robert Vallone, and Amos Tversky. The hot hand in basketball: On the misperception of random sequences. Cognitive psychology, 17(3):295–314, 1985.
9. Colin F Camerer. Does the basketball market believe in the hot hand? The American Economic Review, 79(5):1257–1261, 1989.
10. Justin M Rao. Experts' perceptions of autocorrelation: The hot hand fallacy among professional basketball players. Unpublished technical manuscript, San Diego, California. Downloaded from http://www.justinmrao.com/playersbeliefs.pdf (July 11th 2012), 2009.
11. Michael Bar-Eli, Simcha Avugos, and Markus Raab. Twenty years of 'hot hand' research: Review and critique. Psychology of Sport and Exercise, 7(6):525–553, 2006.
12. Jonathan J Koehler and Caryn A Conley. The 'hot hand' myth in professional basketball. Journal of Sport and Exercise Psychology, 25(2):253–259, 2003.
13. Peter Ayton and Ilan Fischer. The hot hand fallacy and the gambler's fallacy: Two faces of subjective randomness? Memory & cognition, 32(8):1369–1378, 2004.
14. James Sundali and Rachel Croson. Biases in casino betting: The hot hand and the gambler's fallacy. Judgement and Decision Making, 1(1):1, 2006.
15. Bruce D Burns. The hot hand in basketball: Fallacy or adaptive thinking. In Proceedings of the twenty-third annual meeting of the cognitive science society, pages 152–157, 2001.
16. Daniel E. Slotnik. Richard Jarecki, Doctor who conquered roulette, dies at 86, Aug 2018. URL https://www.nytimes.com/2018/08/08/obituaries/richard-jarecki-doctor-who-conquered-roulette-dies-at-86.html.
17. Marius Ötting, Roland Langrock, Christian Deutscher, and Vianey Leos-Barajas. The hot hand in professional darts. arXiv preprint arXiv:1803.05673, 2018.
18. Brett Green and Jeffrey Zwiebel. The hot-hand fallacy: Cognitive mistakes or equilibrium adjustments? Evidence from Major League Baseball. Management Science, 2017.
19. Andrew Bocskocsky, John Ezekowitz, and Carolyn Stein. The hot hand: A new approach to an old 'fallacy'. In 8th Annual MIT Sloan Sports Analytics Conference, 2014.
20. Jeremy Arkes. Revisiting the hot hand theory with free throw data in a multivariate framework. Journal of Quantitative Analysis in Sports, 6(1):2, 2010.
21. Joshua B Miller and Adam Sanjurjo. Surprised by the hot hand fallacy? A truth in the law of small numbers. Econometrica, forthcoming, 2018.
22. Jeremy Arkes. Misses in 'hot hand' research. Journal of Sports Economics, 14(4):401–410, 2013.
23. Gur Yaari and Shmuel Eisenmann. The hot (invisible?) hand: can time sequence patterns of success/failure in sports be modeled as repeated random independent trials? PloS one, 6(10):e24532, 2011.
24. Peter Bühlmann and Abraham J Wyner. Variable length Markov chains. The Annals of Statistics, 27(2):480–513, 1999.
25. Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964.
26. Tsai-Hung Fan and Chen-An Tsai. A Bayesian method in determining the order of a finite state Markov chain. Communications in Statistics-Theory and Methods, 28(7):1711–1730, 1999.
27. Hirotugu Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
28. Howell Tong. Determination of the order of a Markov chain by Akaike's information criterion. Journal of Applied Probability, 12(03):488–497, 1975.
29. Richard W Katz. On some criteria for estimating the order of a Markov chain. Technometrics, 23(3):243–249, 1981.
30. Kenneth P Burnham and David R Anderson. Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media, 2003.
31. Clifford M Hurvich and Chih-Ling Tsai. Regression and time series model selection in small samples. Biometrika, pages 297–307, 1989.
32. Harish S Bhat and Nitesh Kumar. On the derivation of the Bayesian Information Criterion. School of Natural Sciences, University of California, 2010.
33. Michael Lavine and Mark J Schervish. Bayes factors: what they are and what they are not. The American Statistician, 53(2):119–122, 1999.
34. David Posada and Thomas R Buckley. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic biology, 53(5):793–808, 2004.
35. Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(Dec):3571–3594, 2010.
36. Sumio Watanabe. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867–897, 2013.
37. Mervyn Stone. Asymptotics for and against cross-validation. Biometrika, pages 29–35, 1977.
38. Gerda Claeskens, Nils Lid Hjort, et al. Model selection and model averaging, volume 330. Cambridge University Press, Cambridge, 2008.
39. Christopher C Strelioff, James P Crutchfield, and Alfred W Hübler. Inferring Markov chains: Bayesian estimation, model comparison, entropy rate, and out-of-class modeling. Physical Review E, 76(1):011106, 2007.
40. Juho Piironen, Markus Paasiniemi, and Aki Vehtari. Projective inference in high-dimensional problems: Prediction and feature selection. arXiv preprint arXiv:1810.02406, 2018.
41. Ruud Wetzels, Darja Tutschkow, Conor Dolan, Sophie van der Sluis, Gilles Dutilh, and Eric-Jan Wagenmakers. A Bayesian test for the hot hand phenomenon. Journal of Mathematical Psychology, 72:200–209, 2016.

A. Derivations

The marginal likelihood, ratios of which define the Bayes factor, is the expectation of the likelihood under the prior distribution. The logarithm of this quantity (LML) is

\begin{align}
\mathrm{LML} &= \log \mathbb{E}_{p}\left[\Pr(N \mid p)\right]
= \log \mathbb{E}_{p}\left[\prod_{x} \prod_{m=1}^{M} p_{x,m}^{N_{x,m}}\right]
= \sum_{x} \log\left(\frac{B(N_x + \alpha)}{B(\alpha)}\right). \tag{A 1}
\end{align}
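For concreteness, the following is a minimal numerical sketch of Eq. (A 1); it is illustrative only and not code from the original analysis. It assumes the pooled transition counts are stored as an (X, M) NumPy array `N`, one row per history state x, with a length-M Dirichlet hyperparameter `alpha`, and evaluates the multivariate Beta function on the log scale via `scipy.special.gammaln`.

```python
import numpy as np
from scipy.special import gammaln

def log_mv_beta(a):
    """log B(a) = sum_m log Gamma(a_m) - log Gamma(sum_m a_m), taken along the last axis."""
    a = np.asarray(a, float)
    return gammaln(a).sum(-1) - gammaln(a.sum(-1))

def lml(N, alpha):
    """Eq. (A 1): sum_x log[ B(N_x + alpha) / B(alpha) ]."""
    N = np.asarray(N, float)
    alpha = np.asarray(alpha, float)
    return float(np.sum(log_mv_beta(N + alpha) - log_mv_beta(alpha)))
```

Working entirely on the log scale keeps the computation stable for the large counts that accumulate when many trajectories are pooled.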

The log predictive density (LPD), given a model defined by an inferred posterior distribution $p \mid N$, may be computed as

\begin{align}
\mathrm{LPD} &= \log \mathbb{E}_{p \mid N}\left[\Pr(N \mid p)\right]
= \log \mathbb{E}_{p \mid N}\left[\prod_{x} \prod_{m=1}^{M} p_{x,m}^{N_{x,m}}\right]
= \sum_{x} \log\left(\frac{B(2N_x + \alpha)}{B(N_x + \alpha)}\right). \tag{A 2}
\end{align}
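The LPD has the same ratio-of-Beta-functions form, so a correspondingly small sketch suffices (same assumed count layout and helper as above; again illustrative, not the original code).

```python
import numpy as np
from scipy.special import gammaln

def log_mv_beta(a):
    a = np.asarray(a, float)
    return gammaln(a).sum(-1) - gammaln(a.sum(-1))

def lpd(N, alpha):
    """Eq. (A 2): sum_x log[ B(2 N_x + alpha) / B(N_x + alpha) ]."""
    N = np.asarray(N, float)
    alpha = np.asarray(alpha, float)
    return float(np.sum(log_mv_beta(2.0 * N + alpha) - log_mv_beta(N + alpha)))
```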

The log pointwise predictive density (LPPD) requires partitioning the data into disjoint “points.” Treating trajectories as the points yields

\begin{align}
\mathrm{LPPD} &= \sum_{j}\sum_{x} \log \mathbb{E}_{p_x \mid N_x}\left[\Pr\left(N_x^{(j)} \mid p_x\right)\right]
= \sum_{j}\sum_{x} \log \mathbb{E}_{p_x \mid N_x}\left[\prod_{m=1}^{M} p_{x,m}^{N_{x,m}^{(j)}}\right] \nonumber\\
&= \sum_{j}\sum_{x} \log\left(\frac{B(N_x + N_x^{(j)} + \alpha)}{B(N_x + \alpha)}\right). \tag{A 3}
\end{align}
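A sketch of Eq. (A 3), assuming the per-trajectory counts are kept as a (J, X, M) array `Nj` (a hypothetical layout, not necessarily that of the original code); the pooled counts are recovered by summing over trajectories.

```python
import numpy as np
from scipy.special import gammaln

def log_mv_beta(a):
    a = np.asarray(a, float)
    return gammaln(a).sum(-1) - gammaln(a.sum(-1))

def lppd(Nj, alpha):
    """Eq. (A 3): sum_{j,x} log[ B(N_x + N_x^(j) + alpha) / B(N_x + alpha) ]."""
    Nj = np.asarray(Nj, float)
    alpha = np.asarray(alpha, float)
    N = Nj.sum(axis=0)                                           # pooled counts N_{x,m}
    terms = log_mv_beta(N + Nj + alpha) - log_mv_beta(N + alpha)  # shape (J, X)
    return float(terms.sum())
```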

The LOO, as defined in this manuscript, is similar to the LPPD. For the LOO, each of the pointwise posterior distributions is computed after leaving out the corresponding trajectory. Hence,

\begin{align}
\mathrm{LOO} &= -2\sum_{j}\sum_{x} \log \mathbb{E}_{p_x \mid N_x \setminus N_x^{(j)}}\left[\Pr\left(N_x^{(j)} \mid p_x\right)\right]
= -2\sum_{j}\sum_{x} \log \mathbb{E}_{p_x \mid N_x \setminus N_x^{(j)}}\left[\prod_{m=1}^{M} p_{x,m}^{N_{x,m}^{(j)}}\right] \nonumber\\
&= -2\sum_{j}\sum_{x} \log\left(\frac{B(N_x - N_x^{(j)} + N_x^{(j)} + \alpha)}{B(N_x - N_x^{(j)} + \alpha)}\right). \tag{A 4}
\end{align}
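Because the Dirichlet posterior is conjugate, the leave-one-trajectory-out score requires no refitting: the held-out posterior for trajectory j simply uses the counts with that trajectory removed. A sketch under the same assumed (J, X, M) count layout follows.

```python
import numpy as np
from scipy.special import gammaln

def log_mv_beta(a):
    a = np.asarray(a, float)
    return gammaln(a).sum(-1) - gammaln(a.sum(-1))

def loo(Nj, alpha):
    """Eq. (A 4): -2 sum_{j,x} log[ B(N_x - N_x^(j) + N_x^(j) + alpha) / B(N_x - N_x^(j) + alpha) ]."""
    Nj = np.asarray(Nj, float)
    alpha = np.asarray(alpha, float)
    N = Nj.sum(axis=0)
    held_out = N - Nj + alpha                              # posterior parameters with trajectory j removed
    terms = log_mv_beta(held_out + Nj) - log_mv_beta(held_out)  # shape (J, X)
    return float(-2.0 * terms.sum())
```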

For the WAIC, the two variants of complexity parameters are
\begin{align}
k_{\mathrm{WAIC1}} &= 2\,\mathrm{LPPD} - 2\sum_{j}\sum_{x} \mathbb{E}_{p_x \mid N_x}\left[\log \Pr\left(N_x^{(j)} \mid p_x\right)\right] \nonumber\\
&= 2\,\mathrm{LPPD} - 2\sum_{j}\sum_{x}\sum_{m=1}^{M} N_{x,m}^{(j)}\, \mathbb{E}_{p_x \mid N_x}\left[\log p_{x,m}\right] \nonumber\\
&= 2\,\mathrm{LPPD} - 2\sum_{j}\sum_{x}\sum_{m=1}^{M} N_{x,m}^{(j)}\left[\psi(N_{x,m} + \alpha_m) - \psi\left(N_x + \sum_m \alpha_m\right)\right] \nonumber\\
&= 2\,\mathrm{LPPD} - 2\sum_{x}\sum_{m=1}^{M} N_{x,m}\left[\psi(N_{x,m} + \alpha_m) - \psi\left(N_x + \sum_m \alpha_m\right)\right], \tag{A 5}
\end{align}
and
\begin{align}
k_{\mathrm{WAIC2}} &= \sum_{j}\sum_{x} \mathrm{var}_{p_x \mid N_x}\left[\log \Pr\left(N_x^{(j)} \mid p_x\right)\right]
= \sum_{j}\sum_{x} \mathrm{var}_{p_x \mid N_x}\left[\log \prod_{m=1}^{M} p_{x,m}^{N_{x,m}^{(j)}}\right] \nonumber\\
&= \sum_{j}\sum_{x} \mathrm{var}_{p_x \mid N_x}\left[\sum_{m=1}^{M} N_{x,m}^{(j)} \log p_{x,m}\right]
= \sum_{j}\sum_{x}\sum_{m=1}^{M}\sum_{n=1}^{M} N_{x,m}^{(j)} N_{x,n}^{(j)}\, \mathrm{cov}\left(\log p_{x,m}, \log p_{x,n}\right) \nonumber\\
&= \sum_{j}\sum_{x}\sum_{m=1}^{M}\sum_{n=1}^{M} N_{x,m}^{(j)} N_{x,n}^{(j)}\left[\psi'(\alpha_n + N_{x,n})\,\delta_{nm} - \psi'\left(\sum_m \alpha_m + N_x\right)\right] \nonumber\\
&= \sum_{j}\sum_{x}\left[\sum_{m=1}^{M} \left(N_{x,m}^{(j)}\right)^{2} \psi'(\alpha_m + N_{x,m}) - \left(N_x^{(j)}\right)^{2} \psi'\left(\sum_m \alpha_m + N_x\right)\right]. \tag{A 6}
\end{align}

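The digamma and trigamma identities above translate directly into code. The sketch below is illustrative only (same assumed count arrays as in the previous sketches; `lppd_value` would come from Eq. (A 3)) and uses `scipy.special.digamma` and `scipy.special.polygamma` for ψ and ψ′.

```python
import numpy as np
from scipy.special import digamma, polygamma

def k_waic1(Nj, alpha, lppd_value):
    """Eq. (A 5): 2 LPPD - 2 sum_{x,m} N_{x,m} [psi(N_{x,m}+alpha_m) - psi(N_x + sum_m alpha_m)]."""
    Nj = np.asarray(Nj, float)
    alpha = np.asarray(alpha, float)
    N = Nj.sum(axis=0)                                      # pooled counts, shape (X, M)
    a = N + alpha                                           # posterior Dirichlet parameters
    e_logp = digamma(a) - digamma(a.sum(-1, keepdims=True))  # E[log p_{x,m}] under the posterior
    return 2.0 * lppd_value - 2.0 * float((N * e_logp).sum())

def k_waic2(Nj, alpha):
    """Eq. (A 6): summed posterior variance of each trajectory's log-probability."""
    Nj = np.asarray(Nj, float)
    alpha = np.asarray(alpha, float)
    N = Nj.sum(axis=0)
    a = N + alpha
    per_traj = (Nj**2 * polygamma(1, a)).sum(-1) - Nj.sum(-1)**2 * polygamma(1, a.sum(-1))
    return float(per_traj.sum())
```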
The commonly used Deviance Information Criterion (DIC)

\begin{align}
\mathrm{DIC} = -2\sum_{x} \log \Pr\left(N_x \,\middle|\, p_x = \mathbb{E}_{p_x \mid N_x}[p_x]\right) + 2 k_{\mathrm{DIC}} \tag{A 7}
\end{align}
also resembles the WAIC, admitting two variants of the model-complexity term,

\begin{align}
k_{\mathrm{DIC1}} &= 2\left\{\sum_{x}\sum_{m=1}^{M} N_{x,m}\log\left(\frac{N_{x,m}+\alpha_m}{N_x+\sum_m \alpha_m}\right) - \sum_{j}\sum_{x}\mathbb{E}_{p_x \mid N_x}\left[\log\Pr\left(N_x^{(j)} \mid p_x\right)\right]\right\} \nonumber\\
&= 2\left\{\sum_{x}\sum_{m=1}^{M} N_{x,m}\log\left(\frac{N_{x,m}+\alpha_m}{N_x+\sum_m \alpha_m}\right) - \sum_{j}\sum_{x}\sum_{m=1}^{M} N_{x,m}^{(j)}\left[\psi(\alpha_m+N_{x,m}) - \psi\left(\sum_m \alpha_m + N_x\right)\right]\right\} \nonumber\\
&= 2\left\{\sum_{x}\sum_{m=1}^{M} N_{x,m}\log\left(\frac{N_{x,m}+\alpha_m}{N_x+\sum_m \alpha_m}\right) - \sum_{x}\sum_{m=1}^{M} N_{x,m}\left[\psi(\alpha_m+N_{x,m}) - \psi\left(\sum_m \alpha_m + N_x\right)\right]\right\}, \tag{A 8}
\end{align}

Figure S1. Permuting LeBron James’ 2016–2017 free throws. (a) Distributions of ∆Criterion(h) = Criterion(h) − Criterion(h = 0), i.e., evaluations of each information criterion relative to h = 0, for resamplings without replacement of James’ 2016–2017 free throws. (b) Frequency of choosing h (0: black, 1: red, 2: blue, 3: green).

and $k_{\mathrm{DIC2}} = 2\,\mathrm{var}_{p \mid N}\left[\log \Pr(N \mid p)\right]$, which may be computed as
\begin{align}
k_{\mathrm{DIC2}} &= 2\,\mathrm{var}_{p}\left[\sum_{x}\sum_{m} N_{x,m} \log p_{x,m}\right]
= 2\sum_{x} \mathrm{var}_{p_x}\left[\sum_{m} N_{x,m} \log p_{x,m}\right] \nonumber\\
&= 2\sum_{x}\sum_{m}\sum_{n} N_{x,m} N_{x,n}\, \mathrm{cov}\left(\log p_{x,m}, \log p_{x,n}\right)
= 2\sum_{x}\sum_{m}\sum_{n} N_{x,m} N_{x,n}\left[\psi'(\alpha_m + N_{x,m})\,\delta_{mn} - \psi'\left(\sum_m \alpha_m + N_x\right)\right] \nonumber\\
&= 2\sum_{x}\left[\sum_{m} N_{x,m}^{2}\, \psi'(\alpha_m + N_{x,m}) - N_x^{2}\, \psi'\left(\sum_m \alpha_m + N_x\right)\right], \tag{A 9}
\end{align}
where $\delta_{mn}$ denotes the Kronecker delta.
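A sketch of Eqs. (A 7) and (A 9) together, again under the assumed pooled-count layout and not the original code; the plug-in term uses the posterior-mean transition probabilities.

```python
import numpy as np
from scipy.special import polygamma

def dic2(N, alpha):
    """Eq. (A 7) with the variance-based complexity k_DIC2 of Eq. (A 9)."""
    N = np.asarray(N, float)
    alpha = np.asarray(alpha, float)
    a = N + alpha                                          # posterior Dirichlet parameters
    p_bar = a / a.sum(-1, keepdims=True)                   # posterior-mean transition probabilities
    plug_in_loglik = float((N * np.log(p_bar)).sum())
    k_dic2 = 2.0 * float((N**2 * polygamma(1, a)).sum()
                         - (N.sum(-1)**2 * polygamma(1, a.sum(-1))).sum())
    return -2.0 * plug_in_loglik + 2.0 * k_dic2
```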

B. Supplemental results

In Fig. 6, it is apparent that models with h = 0 and h = 1 have comparable predictive power. As a form of permutation test, we consider resamplings without replacement of James' 2016–2017 free throws, in which the order of his shot outcomes is scrambled within each game (a sketch of this shuffling appears below). The results here are similar to those found in Fig. 6, where the statistical power of the information criteria is evaluated using simulated data. Fig. S2 presents a version of the same tests performed in the main manuscript (where an eight-state system is used) on a four-state system. Comparing Fig. 2 to Fig. S2, one finds consistent results. Note that, on average, trajectory lengths are shorter in the four-state system than in the eight-state system because there are fewer possible states relative to the number of absorbing states.
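The within-game shuffle behind Figure S1 amounts to a few lines of bookkeeping. The sketch below is illustrative only (the names `outcomes` and `game_ids` are assumptions, not the original pipeline): permuting outcomes without replacement inside each game preserves every game's make/miss totals while destroying any within-game sequential dependence, which is the null being probed.

```python
import numpy as np

def shuffle_within_games(outcomes, game_ids, rng=None):
    """Permute 0/1 free-throw outcomes without replacement within each game."""
    rng = np.random.default_rng() if rng is None else rng
    outcomes = np.asarray(outcomes).copy()
    game_ids = np.asarray(game_ids)
    for g in np.unique(game_ids):
        idx = np.flatnonzero(game_ids == g)
        outcomes[idx] = rng.permutation(outcomes[idx])
    return outcomes
```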

WAIC2 84% 86% 86% 100% 86% 93% 100% WAIC2 ssryloitpbihn.r .Sc pnsi 0000000 sci. open Soc. R. rsos.royalsocietypublishing.org LOO 91% 93% 93% 100% 86% 93% ...... 100% ...... LOO ...... LHO 84% 85% 79% 100% 93% 86% 100% LHO = 0 DIC1 63% 71% 65% 86% 86% 93% 100% DIC1

true DIC2 93% 100% 100% 100% 93% 93% 100% DIC2 h AIC 100% 100% 100% 100% 86% 93% 100% AIC BIC 100% 100% 100% 100% 100% 100% 100% BIC LML 55% 78% 84% 77% 100% 100% 100% LML

WAIC1 50% 43% 35% 93% 93% 100% 93% 100% WAIC1

WAIC2 68% 71% 93% 100% 100% 100% 100% WAIC2 LOO 61% 71% 100% 100% 100% 100% 100% LOO LHO 68% 71% 93% 86% 100% 100% 93% LHO = 1 DIC1 46% 36% 35% 93% 93% 100% 100% 100% DIC1

true DIC2 36% 64% 100% 100% 100% 100% 100% 100% DIC2 h AIC 78% 93% 100% 100% 100% 100% 100% AIC BIC 75% 100% 100% 100% 100% 100% 100% BIC LML 35% 50% 86% 100% 100% 93% 100% LML

WAIC1 56% 79% 93% 100% 100% 100% 100% WAIC1

WAIC2 56% 93% 100% 100% 100% 100% 100% WAIC2 LOO 56% 100% 100% 100% 100% 100% 100% LOO LHO 44% 72% 98% 100% 100% 100% 100% LHO = 2 DIC1 50% 65% 93% 100% 100% 100% 100% DIC1

true DIC2 69% 31% 43% 57% 100% 100% 100% 100% 100% DIC2 h AIC 75% 42% 58% 100% 100% 100% 100% 100% AIC BIC 88% 99% 58% 42% 100% 100% 100% 100% BIC LML 85% 100% 100% 100% 100% 100% LML

WAIC1 36% 42% 56% 100% 100% 100% 100% WAIC1

WAIC2 35% 42% 72% 100% 100% 100% 100% WAIC2 LOO 35% 56% 72% 100% 100% 100% 100% LOO LHO 35% 43% 79% 100% 100% 100% 100% LHO = 3 DIC1 35% 58% 100% 100% 100% 100% DIC1

true DIC2 64% 36% 58% 42% 49% 37% 100% 100% 100% 100% DIC2 h AIC 49% 51% 79% 51% 35% 100% 100% 100% 100% AIC BIC 86% 86% 100% 65% 93% 100% 100% BIC LML 50% 86% 100% 100% 100% 100% LML

WAIC1 36% 36% 36% 70% 98% 100% 100% 100% WAIC1

WAIC2 43% 35% 36% 77% 100% 100% 100% 100% WAIC2 LOO 35% 35% 36% 84% 100% 100% 100% 100% LOO LHO 35% 44% 70% 91% 100% 100% 100% LHO = 4 DIC1 43% 36% 36% 70% 98% 100% 100% 100% DIC1

true DIC2 63% 37% 78% 58% 42% 84% 93% 100% 100% DIC2 h AIC 56% 44% 78% 77% 70% 93% 100% 100% AIC BIC 92% 93% 100% 44% 56% 86% 86% 100% BIC LML 35% 43% 100% 100% 100% 100% 100% LML

WAIC1 50% 42% 44% 49% 86% 100% 100% 100% WAIC1

WAIC2 43% 44% 42% 93% 100% 100% 100% WAIC2 LOO 43% 42% 58% 100% 100% 100% 100% LOO LHO 50% 44% 72% 100% 100% 100% LHO = 5 DIC1 64% 42% 37% 56% 93% 100% 100% 100% DIC1

true DIC2 63% 37% 50% 50% 42% 72% 44% 56% 93% 100% DIC2 h AIC 63% 37% 72% 35% 58% 86% 65% 35% 93% 100% AIC BIC 93% 71% 93% 51% 49% 58% 42% 51% 79% BIC LML 50% 56% 100% 100% 100% 100% 100% LML

Figure S2. Chosen degree of memory h for M = 4 system in simulations for varying true degrees of memory htrue and number of observed trajectories J. Choices made on basis of model with lowest value of given criterion. Rows correspond to model selection under a given degree of memory. Columns correspond to the number of trajectories. Depicted are the percent of simulations in which each degree of memory is selected using the different model evaluation criteria (percents of at least 20 are labeled). Colors coded based on degree of memory: (0: black, 1: red, 2: blue, 3: green, 4: purple, 5: orange). Example: For htrue = 1 and J = 4, the WAIC1 criteria selected h = 1 approximately 68% of the time.