
How do judges learn from persuasive precedent?

Scott Baker and Anup Malani1

Introduction

Federal appellate judges cite cases decided by sister circuits. Why is that? Common wisdom holds that judges look to out-of-circuit cases for credible legal arguments -- as persuasive precedent. But hard-core adherents of the attitudinal model (Segal and Spaeth 2002) – and certain cynical realists and scholars2 – might argue that judges decide in accordance with their policy or political preferences and cite cases only to cover up that fact or to legitimize acting out their preferences.

In this paper we address that dispute in the process of asking some more basic questions about how judges learn from prior, persuasive precedent. We inquire whether judges put any weight on non-binding precedent or make decisions based only on the case before them and/or their own preferences; whether judges weight all precedents the same or make adjustments based on inferences about how confident prior judges were in their opinions or about the quality of those judges; and whether judges’ opinions convey to future jurists all the information that their authors gleaned from the case before them, i.e., whether judges have full information about everything prior judges learned.

These are important questions because the manner in which judges learn not only affects how we model them and predict their behavior, but also affects their legitimacy and how much authority we ought to give them. For example, if judges place weight on prior non-binding precedent, then it becomes less plausible that they are the purely political actors that the strongest version of the attitudinal model suggests. Alternatively, suppose we find that judges are subject to information cascades: they rely on prior opinions that do not convey all the information those opinions’ authors had. In that case, we may want either to limit the authority of judges or to do the opposite and empower them. Specifically, we may want to allow judges even more leeway not to publish. In so doing, judges can self-censor and avoid sparking a cascade when they think a decision rests on a weak evidentiary basis – i.e., it could easily have been decided the other way.

Our approach first presents a number of models of judicial learning – many taken from the existing literature – that offer different answers to these questions. We then test their conflicting predictions in order to discriminate between them. For our tests, we employ nearly 1000 sequences of federal appellate cases that address a common legal issue, e.g., whether the Family Educational Rights and Privacy Act allows private rights of action or whether the Coal Act permits successor liability. Each sequence contains at least one circuit split. We code the different decisions made by the circuits in a sequence as A, B, C, etc., depending on whether each circuit agreed or disagreed with the circuits that considered the issue

1 Washington University and University of Chicago, respectively. We thank Kaushik Vasudevan, Kevin Jiang, Sarah Wilbanks, Bridget Widdowson, and Ray Mao for outstanding research assistance. We also thank Charles Barzun, William Hubbard, Richard McAdams, Tom Miles, Bruce Owen, workshop participants at Washington University School of Law, the Max Planck Institute in Bonn, Northwestern Law School, and ZTH, and participants at the Harvard Conference on Blinding, the L&E Theory Conference at Yale, the American Law & Economics Association, and the Center for Empirical Legal Studies for helpful comments. 2 For a nice overview of legal realism and a response to its crudest characterization, see Leiter 2003.

previously. We end up with sequences like ABA, AAABB, ABBC, etc. A typical sequence might look as follows:

Circuit            Decision
Ninth Circuit      A
Eleventh Circuit   A
Fifth Circuit      B
Seventh Circuit    B
Fourth Circuit     A

For each legal question the dataset contains the decision reached by each circuit and the order in which the circuits reached those decisions. In the example above, the dataset reveals that the Fourth Circuit decided the issue as an “A.” We also know that the Fourth Circuit had access to four other decisions. Those prior decisions split: 2 As (circuits that agreed with the Fourth Circuit) and 2 Bs (circuits that reached the opposite result). Finally, we know the order of the past decisions was 2 As followed by 2 Bs. Depending on how, if at all, judges learn, they might pay attention to both the number of decisions on each side of an issue and the order of those decisions.

These data are ideal for our purpose. They take advantage of the sequencing of cases to understand how judges learn from prior cases. Moreover, because they contain splits, they present judges with the flexibility to weight prior cases in different ways, giving us a lot of leverage to rule in or out different models of judicial learning.

Our first model posits that judges do not rely at all on prior precedents. This model replicates the hard-core attitudinal model, where judges decide based on their personal preferences, with no deference to the opinions or decisions of other courts. This assumption implies that decisions on the same legal issue should be independent of one another. We test this assumption by looking for runs – sequences of two or more consecutive identical decisions (e.g., two or more As in a row) – in sequences of cases addressing a common legal issue. We use bootstrap methods to generate an empirical distribution of runs assuming that decision order is random, and compare the actual number and length of runs in our data to this empirical distribution. We are able to strongly reject that the decisions in our data are independent of one another.

After rejecting the strong political model, we next consider models of judges who learn from prior opinions. Such models vary on two dimensions. The first is whether the quality of the new information made available to each judge during adjudication – including information from lower court decisions, the factual record, and the litigants and their lawyers – and the quality of each judge herself are constant or vary across cases, i.e., whether cases are of “variable quality.” The second dimension is whether judicial opinions reveal all the new, relevant information that their authors gleaned during adjudication, i.e., whether judges consulting prior cases have “full information” about the underlying reasons the prior judges decided the way they did.

The structure of our analysis replicates models by Daughety and Reinganum (1999), who suggest that judges may not have full information and may thus be subject to information cascades, and Talley (1999), who suggests that judges do have full information and are thus protected from cascades. Variable quality adds a twist to the analysis because it suggests that judges can mitigate the impact of cascades, to the extent they exist, by self-censoring – not publishing – opinions that are of low quality and that conform to the cascade (Baker and Malani 2014).

Turning to empirics, suppose that case quality does indeed vary from judge to judge and from circuit to circuit. What should we expect to see in the circuit split data? Take a judge consulting prior precedent that contains a balanced history – a number of pairs of opposite decisions, like AB or ABAB. If case quality varies, this judge should be more likely than not to agree with the last case in the sequence. The reason lies in the inference about what the prior judges must have known when they decided their own cases. If the immediate predecessor judge disagreed with the majority of earlier cases, she must have had a great deal of confidence in the information from her case – otherwise, why create a split? Understanding as much, the judge examining a balanced history would weigh the decision of her immediate predecessor more heavily. Of course, such inferences are ruled out if, by assumption, all prior cases are of the same quality. When we test this non-parametrically in our sequences, we find that judges are indeed more likely to follow the last case in a balanced history, suggesting case quality is variable.

We find it more challenging to test whether prior opinions convey to judges full information from prior cases. The reason is that information can be conveyed in a number of ways. At one extreme, a judicial opinion might simply report all the relevant information the judge gleaned from her own case, the litigation at hand. At the other extreme, an opinion might report a posterior belief about what the correct answer is. That belief would reflect all the information the judge gleaned from her own case, combined with all the information that judge culled from the prior cases. In the latter setting, we obtain a potentially testable prediction: judges’ opinions should depend on the immediately prior opinion in a sequence, but not on earlier ones. I.e., opinions should follow an AR(1) process, but not an AR(k), k > 1, process. Even this prediction is a challenge to test because, while it is possible to code decisions, it is harder to code – quantify – opinions, or the value of the posterior belief each future judge could have extracted from those opinions. Nevertheless, we implement empirical tests using vector autoregressions on both decisions and dissents and with conditional tetrachoric correlations amongst decisions. The first of these tests suggests that judges do indeed have full information. The second test is still under construction.

The remainder of our paper is organized into two halves. The first presents our various models of judicial learning and the testable predictions from each. The second presents our empirical tests of each prediction.

I. Models of judicial learning

This section lays out a general model of judicial behavior. The model embeds, as special cases, different specific models of how judges learn from prior courts addressing the same legal question, i.e., from non-binding precedents. The models will not take a position on what was learned but focus on whether anything was learned and on the nature of the information conveyed by prior courts.

As noted, we always start with a legal question. Legal questions include, for example, whether the Affordable Care Act permits payment of health insurance premium subsidies to individuals who buy insurance on a federally-run exchange rather than a state-run exchange or whether Food and Drug

Administration approval of a drug as safe and effective preempts state products liability suits against the drug. We assume legal questions have a binary answer, 퐴 or 퐵 -- “yes” or “no.”

The legal question is presented to an arbitrary number of federal appellate courts. The issue is always one of first impression in that circuit: there is no prior precedent from the circuit itself or from the Supreme Court. Whenever the circuit is not in the first position (i.e., is not the first to consider the issue), its judges have access to persuasive precedent from sister circuits. Judges observe the opinions and decisions of those circuits. The question is what impact, if any, prior decisions and opinions have on the judge’s decision. If two sister circuits decided the legal question the same way, does that increase the chance that the third circuit to consider the question will follow the trend? If two sister circuits gave conflicting answers, is the third circuit in line more likely to follow the first or the second mover?

We construct a model of judicial learning in which we assume that the order in which the courts hear the case is random. While this may not be the case -- e.g., litigants may choose to target some courts before others, or certain courts may be more likely to sidestep novel questions than others -- we test the assumption in our data and find it to be reasonable (more on this later). The position or slot in which a court appears in a sequence is indexed by 푚. Thus, the circuit that decides after two sister circuits have already considered the issue sits in position “3”, so 푚 = 3.

Entire circuits do not hear cases (unless sitting en banc); cases are decided by three-judge panels. However, we conflate the panel and the circuit in our analysis. Random assignment of cases to panels within a circuit3 supports our assumption. While there may be important within-panel effects (e.g., Sunstein, Schkade & Ellman 2004), we also ignore those in our models to keep them tractable.4

After laying out the models from the existing literature, we ask what predictions those models would have for the data on sequential decisionmaking by judges. After that, we test the predictions from each model against the data.

A. No learning

We start with a crude model of judicial behavior: one where policy preferences alone drive decisions. Legal scholars and some political scientists are skeptical of this model for lots of good reasons (Lax 2011; Cross & Tiller 2006; Epstein & Knight 2013; Epstein, Landes, and Posner 2013). In this context, however, it could be an accurate account of decisionmaking. Our data only include cases of first impression in each circuit. The question we ask is whether the judge looks at and learns from the prior persuasive precedent or instead decides based on her own instincts and the record in the case at hand. In other words, are the decisions in a circuit split sequence on the same legal question correlated over time or independent? If judges only consider their preferences, then we straightforwardly get the following prediction.

3 While Chilton & Levy (2014) show that assignment is not perfectly random, it is largely random. 4 That said, in some of our empirical analysis, we will use the existence of dissents from majority opinions to indicate a less persuasive opinion. Moreover, in regression analyses we conducted but do not report, we include the party of the President who appointed the judge authoring the majority opinion as a regressor.

Prediction 1: When presented with a case of first impression, if judges are guided solely by policy preferences, the history of prior case dispositions should have no effect on the judge’s resolution of the issue.

Of course, it is possible that decisions could be independent even if judges do not simply vote their preferences. For example, a judge might look at the facts and arguments in the case before her and, thinking she is a better judge than anyone else, make a decision that is independent of prior decisions. We cannot rule this out. But even if the driver of the decision is not preferences, we would still have uncovered that the judge did not dig into the prior cases for information, i.e., did not respect persuasive precedent.

B. Herding with constant quality cases

Given a sequence of past decisions reaching the same result, might judges be tempted to ignore their own instincts and follow the trend? As is well-known, such herding can lead to clustering on the wrong outcome (Banerjee 1992, Bikhchandani et al. 1992). As noted in the introduction, other scholars have suggested that the federal circuit courts ruling on the same legal issue also might herd (Daughety and Reinganum 1999, Talley 1999).

To see why, consider a typical model of this sort. As noted, each legal issue has two possible answers. The judges provide a decision -- A or B. Assume that there is a correct legal answer (perhaps the one the Supreme Court would give). The true state or correct legal answer is either “훼” or “훽.” Greek letters, here, distinguish the true state of the world (the right answer) from the decisions rendered by the judges (A or B).

As for payoffs, suppose that the judge decides 퐴 if she believes that the correct legal answer is more likely 훼 than 훽 and 퐵 otherwise. Judges start with an uninformative prior about the correct rule (i.e., 푃푟(훼) = 푃푟(훽) = 1/2).

Before issuing a decision, the judge hears the argument, reads the briefs, and consults with the other panel members. These actions are “private,” meaning they are not easily and costlessly observed by future judges. Assume that, taken together, these actions lead to a random draw of a “private” signal, suggesting what to do -- which decision to make. The information contained in the private signal is probative, but not conclusive: the signal might be right or might be wrong.

The possible values of the signal are 푠 ∈ {푎, 푏}. The signal draws are (conditionally) independent. The probability the signal identifies the correct outcome is 휋 = Pr(푠 = 푎|훼) = Pr(푠 = 푏|훽) ∈ (0.5, 1]. For now, assume judges receive signals with identical quality or precision – i.e., the value of 휋 is the same for each judge on every circuit.

In addition to their own private signal, judges also observe the opinions of prior judges – the persuasive precedent. The prior opinions contain information only about the prior decision rendered – A or B – and not the signal that the prior judge received.

With this framework in hand, take a sequence of three decisions by different circuits. Given that judges start with uninformative priors, the first judge always follows her own private signal. If the signal is 푎, she decides 퐴. If the signal is 푏, she decides 퐵.

The second judge observes the decision of the first judge. From this observation, the second judge can back out the private signal the first judge received. If the first judge decided A, she must have received a signal “a.” If she decided B, she must have received a signal “b.” Suppose that the first judge decides 퐴. The second judge will follow if she receives a private signal 푎. On the other hand, the second judge will split and issue a decision 퐵 upon receiving signal 푏.5

Move now to the third judge. For this judge, the inference is more complicated. Suppose that the third judge observes the prior two circuits decide the case as 퐴, i.e., she sees a history 퐻 = 퐴퐴. This judge knows that both judge 1 and judge 2 obtained private signals 푎. As a result, even if the third judge’s private signal suggests 퐵 is more likely correct, she will rationally ignore her own conflicting private signal. She will go with the herd, deciding 퐴 as well.6 The reason: the information contained in two 푎 signals outweighs the information contained in one 푏 signal.
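To make this inference concrete, here is a minimal Python sketch of the Bayesian updating in the constant-quality model; the precision value π = 0.6 is purely illustrative, not an estimate from our data.

```python
def posterior_alpha(inferred_signals, own_signal, pi=0.6):
    """Posterior Pr(alpha) under a flat prior, given the signals inferred from
    prior decisions plus the judge's own private signal ('a' or 'b')."""
    like_alpha = like_beta = 1.0
    for s in inferred_signals + [own_signal]:
        like_alpha *= pi if s == 'a' else 1 - pi        # Pr(s | alpha)
        like_beta *= (1 - pi) if s == 'a' else pi       # Pr(s | beta)
    return like_alpha / (like_alpha + like_beta)

# Judge 3 sees H = AA (inferred signals a, a) but draws a conflicting b signal:
print(posterior_alpha(['a', 'a'], 'b'))   # 0.6 > 0.5, so she decides A anyway
```

With π = 0.6, the posterior on 훼 equals 0.6 despite the conflicting signal, matching the algebra in footnote 6.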

In this setting, the fourth judge in line learns nothing from the third judge’s decision. The fourth judge knows that the third judge decides 퐴 whether she received an 푎 or a 푏 private signal. So the fourth judge sits in the same position as the third judge in terms of the information available in the history. She therefore makes the same choice, no matter the value of her own private signal: she also decides 퐴. In this framework, consecutive identical decisions spark an information cascade. The prediction from the herding model follows:

Prediction 2: In a herding model, the history of decisions generally matters. For example, the probability of observing an outcome A following two A’s, i.e., 퐻 = 퐴퐴, is one.

Prediction 2 shows that any history with two A decisions in a row determines all future decisions. But what about splits, one A decision followed by a B decision? To analyze that circumstance, suppose that judge 3 observes her sister circuits split over the issue; e.g., 퐻 = 퐴퐵. From the decisions, the third judge can make two inferences: (1) the prior judge deciding “A” received an “a” signal and (2) the prior judge deciding “B” received a “b” signal. The third judge also knows that these signals were of identical quality. As a result, the conflicting prior decisions cancel out and, in effect, provide the third judge with no information whatsoever. The third judge is in the same position as judge 1. As a result, she decides

5 We assume that if the judge believes state 훼 and state 훽 are equally likely after receiving her signal, she issues the decision that accords with the value of the private signal. 6 An application of Bayes rule confirms the claim in the text. Suppose that the first two judges decided 퐴 and the third judge received signal 푠 = 푏. In that case, Bayes rule implies that $\Pr(\alpha \mid s=b, H=AA)$ equals

$$\frac{\Pr(s=b, H=AA \mid \alpha)\Pr(\alpha)}{\Pr(s=b, H=AA \mid \alpha)\Pr(\alpha) + \Pr(s=b, H=AA \mid \beta)\Pr(\beta)}.$$

Because the signals are conditionally independent and the prior is uninformative, this expression is

$$\Pr(\alpha \mid s=b, H=AA) = \frac{(1-\pi)\pi^2}{(1-\pi)\pi^2 + (1-\pi)^2\pi}.$$

By contrast, $\Pr(\beta \mid s=b, H=AA) = \dfrac{(1-\pi)^2\pi}{(1-\pi)\pi^2 + (1-\pi)^2\pi}$. Since $\pi > \tfrac{1}{2}$, it follows that $\Pr(\alpha \mid s=b, H=AA) > \Pr(\beta \mid s=b, H=AA)$.

based solely on her private signal. This logic applies to any judge who consults a set of prior precedents that is (1) balanced and (2) does not contain two consecutive identical decisions. This yields the following prediction.

Prediction 3: In a herding model, a judge consulting a balanced history that does not contain two consecutive identical decisions will choose A if 푠 = 푎.

To test this prediction, we need to know the probability that a judge receives a private signal 푎. But that requires knowing the correct answer to the legal question – something we don’t know and are not going to figure out. All is not lost, however. Given that the order in which judges hear a case is random, the decision that comes first in a sequence identifies which signal has expected probability 0.5 or greater. Thus, any signal that differs from the first decision should on average have probability 0.5 or lower. We can therefore test the following modified version of the preceding prediction.

Prediction 3’: In the herding model, the probability that a judge facing a balanced history will choose the last decision in the history is at most 0.5.

C. Herding with variable quality cases

It seems unrealistic to suppose that all judges receive the same quality of information from the underlying litigation. Some sets of facts are more informative than others. Lawyers vary in quality. Some judges make better inferences than others. To capture these differences, we make two changes to the model of the preceding section. First, we assume that the precision of the private signal can take one of two values. With probability $p$ the private signal is barely informative: it has probative value $\pi = \pi_L = 0.5 + \varepsilon$, where $\varepsilon > 0$ is small. With probability $1-p$, the signal is highly informative, taking value $\pi = \pi_H > \pi_L$. Second, we assume that judges do not know the quality of the signal (high or low) received by prior judges.

The analysis proceeds in much the same way as before. Take three circuits deciding a legal question in turn. The first judge receives her signal. Because the signal is informative (even if only slightly), she decides the case in accordance with her signal. If the signal says “a”, she decides A.

The second judge receives a new private signal. If that signal matches the decision rendered by the first judge, he decides the case the same way. If, however, the signal differs from the decision, the second judge might or might not create a split.

Suppose that the first judge decided the issue as 퐴 and the second judge obtained a conflicting signal 푏, but the quality of that signal was high, $\pi_H$. The second judge will follow her own conflicting high quality signal rather than defer to the decision of the first judge. The second judge knows that the first judge received an 푎 signal because the first judge’s decision reveals as much. The second judge is unsure, however, whether the first judge’s decision was based on a good or a bad signal – on a good or bad record, strong or flimsy reasons. The expected quality of the first judge’s signal is $E[\pi] = p\pi_L + (1-p)\pi_H$. By contrast, the second judge knows for sure that her own conflicting signal has good quality, $\pi_H > E[\pi]$. As a result, she will follow her own conflicting and higher quality signal.

More interestingly, if the second judge’s conflicting signal is of low quality, she prefers to follow the first judge. In that case, the second judge knows her information is lousy; she also knows the first judge’s information could be lousy or good. Rather than rely on lousy information for sure, the second judge defers to the first judge’s decision.

As before, the third judge is the one who must make the most difficult inferences from prior precedent. Consider the decision of a third judge following a split, e.g., following 퐻 = 퐴퐵. In creating the split, the second judge signaled that she must have made the decision based on a strong record or strong reasons. Otherwise, she wouldn't have been so bold. The first judge, by contrast, could have made her decision on a weak or a strong record. The order of the decisions allows the third judge to unwind which prior precedent is based on better information. Before considering her own private signal, the third judge is more likely to believe that judge 2 got the decision right as opposed to judge 1.

Now suppose that the third judge draws the 푠 = 푎 signal, but that signal is not terribly informative. Maybe the lawyers did not present a compelling case for deciding the case as A. What should the third judge do? Should he follow his own signal rendered from a lousy record or defer to the decision-making of the second judge -- a decision he knows was made on the best possible information? The following result identifies when a judge following a split will ignore the information contained in his own private signal and defer to the judge who created the split.

Remark 1: If $p > 2 - 1/\pi_H$, a judge who receives a weak private signal that conflicts with the decision of the second judge will ignore his own private signal and defer to the decision made by the second judge.
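A few lines of code restate the remark’s condition and the expected-quality comparison from the text; the parameter values are illustrative only, not estimates.

```python
def expected_quality(p, pi_L, pi_H):
    """E[pi] = p*pi_L + (1-p)*pi_H: expected precision of a prior judge's
    signal when its quality (high or low) is unobserved."""
    return p * pi_L + (1 - p) * pi_H

def defers_after_split(p, pi_H):
    """Remark 1: a judge whose weak signal conflicts with the split-creating
    decision defers whenever p > 2 - 1/pi_H."""
    return p > 2 - 1 / pi_H

# Illustrative values: pi_L just above 0.5, pi_H = 0.8.
print(expected_quality(0.9, 0.51, 0.8))   # 0.539 < pi_H = 0.8
print(defers_after_split(0.9, 0.8))       # True:  0.9 > 0.75
print(defers_after_split(0.6, 0.8))       # False: 0.6 < 0.75
```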

If the third judge draws a high quality signal, he always follows that signal. Combined with Remark 1, this insight allows us to compute the probability of observing the sequence of decisions 퐴퐵퐵 – the decision sequence where the first two judges split and the third judge follows the second judge's position. We can also compute the probability of observing the sequence 퐴퐵퐴 -- the path where the first two judges split and the third judge follows the first judge's position. Because the second judge must be quite confident to create a split, the third judge will find that decision more persuasive, so the first path (퐴퐵퐵) is more likely to occur. Formally, we have:

Prediction 4: A judge consulting a precedent history 퐻 = 퐴퐵 will be more likely to decide 퐵 than 퐴 if $p > 2 - 1/\pi_H$.

D. Judges with perfect information

Up until this point, we assumed that judges observe only prior decisions, not the underlying information – the private signal – that influenced those decisions. Suppose instead that judges observe everything from the past. This is the sort of judge Talley (1999) has in mind: judges report their private signals along with any decision. The signals are more informative than the decisions alone. Assume again that the signals are of constant quality.

Does such a model provide any different predictions from the ones outlined above? Take judge 3. Suppose he received signal 푏, judge 1 revealed that he received signal 푎, and judge 2 revealed that he received signal 푎. Judge 3’s belief that the true state is 훽 – and, as a result, that he should decide 퐵 – is given by:

$$\Pr(\beta \mid b, a, a) = \frac{(1-\pi)^2\pi}{(1-\pi)^2\pi + (1-\pi)\pi^2}$$

Consider next judge 4. Suppose he observed signal 푎. His belief that the true state is 훽 and, as a result, that he should decide 퐵 is given by

$$\Pr(\beta \mid a, b, a, a) = \frac{(1-\pi)^3\pi}{(1-\pi)^3\pi + \pi^3(1-\pi)}$$

or, equivalently,

$$\Pr(\beta \mid a, b, a, a) = \frac{(1-\pi)\Pr(\beta \mid b, a, a)}{(1-\pi)\Pr(\beta \mid b, a, a) + \pi\Pr(\alpha \mid b, a, a)}.$$

If information is fully revealed, judge 4’s estimate of the correct legal answer depends only on what judge 3 reports (his posterior) and the information judge 4 receives from the litigation in his own circuit.

All this, of course, is a well-understood application of Bayes rule. We rehearse it here because it yields another – and different – prediction for the empirical results.

Prediction 5: If opinions convey information fully, only the immediately prior decision should influence the subsequent decision.

In terms of the example, judge 4 need not look any further than the decision and opinion of judge 3. He need not consult the decisions of judges 1 and 2, because judge 3’s decision already embeds that information. As a result, decisions should follow an AR(1) process.7
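The recursion is easy to verify numerically. The sketch below (again with an illustrative π = 0.6) updates the posterior one signal at a time; because each step needs only the previous posterior and the new signal, it returns the same answer as conditioning on all four signals at once.

```python
def update(prior_beta, signal, pi=0.6):
    """One-step Bayesian update of Pr(beta) given a new signal ('a' or 'b')."""
    like_beta = pi if signal == 'b' else 1 - pi
    like_alpha = pi if signal == 'a' else 1 - pi
    return like_beta * prior_beta / (
        like_beta * prior_beta + like_alpha * (1 - prior_beta))

p_beta = 0.5                        # uninformative prior
for s in ['a', 'a', 'b', 'a']:      # signals of judges 1-4 in the example
    p_beta = update(p_beta, s)
print(round(p_beta, 4))             # 0.3077 = (1-pi)^3*pi / [(1-pi)^3*pi + pi^3*(1-pi)]
```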

II. Tests of models

We now test the several alternative models presented in the last section. The models are defined by their assumptions about information and learning, so we test implications of those assumptions to discriminate between the models. The following table lists the predictions from each of these assumptions.

Table 1. Predictions from judicial learning models.

Model/Assumption: Judges do not learn from prior cases.
Prediction 1: Decisions within a sequence are independent of one another.

Model/Assumption: Judges do not have full information on signals in prior cases, and all signals have the same quality.
Prediction 2: There are no splits observed after the second slot in the sequence of cases.

Model/Assumption: The quality of private signals varies across cases, so judges weight prior cases differently.
Predictions 3’ and 4: In cases with a balanced history, decisions are more likely than not to follow the immediately prior decision.

Model/Assumption: Judges have full information on signals in prior cases.
Prediction 5: Conditional on the opinion in the immediately prior case, judicial opinions do not rely on opinions from earlier cases.

7 Other, more complicated models of judicial learning also imply that judges need only consult a single prior opinion – the one that is most informative about the location of the efficient legal rule. See Baker & Mezzetti 2012.


A. Data

As noted, the data we employ for our tests are outcomes from nearly 1000 sequences of federal appellate cases that answer a common legal question and result in a circuit split, defined as conflicting answers to that question amongst the circuit courts.

The sequences were located by searching backwards from the January 2015 issue of U.S. Law Week (USLW).8 We read each edition’s “Circuit Split Roundup,” if present. Each roundup lists several cases that generate or contribute to circuit splits. The roundup also includes the question addressed by the case, the other cases in the sequence that also addressed that question, and the answer provided by at least one case in the sequence. Usually the roundup references another article in USLW that elaborates on a split. This referenced article lists the answer given by each case in a sequence of cases addressing a common question. In situations where there is no referenced article, we had a research assistant read each case in the sequence and hand-code the answer it provided. Overall, we coded 977 sequences;9 of these, we borrowed USLW coding for decisions in 713 sequences and hand-coded outcomes in the other 264 sequences. Because USLW only reports on splits, we do not have any sequences in which all courts make the same decision about a legal question.

While decisions varied depending on the question asked, we relabeled the answer given by the first court in a sequence to address a question “A.” We labeled the first court that gives an answer that differs from the prior one “B,” the second decision different from all prior ones “C,” and so on.10 Thus, as described above, our data take the form of sequences like AAB, ABBAB, AAAABC, etc.11
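For concreteness, the relabeling scheme can be written in a few lines of Python; the raw answers below are hypothetical placeholders.

```python
def relabel(decisions):
    """Map raw answers, in decision order, to 'A', 'B', 'C', ... by first appearance."""
    labels = {}
    for d in decisions:
        if d not in labels:
            labels[d] = chr(ord('A') + len(labels))
    return ''.join(labels[d] for d in decisions)

print(relabel(['yes', 'yes', 'no', 'yes', 'in part']))   # 'AABAC'
```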

We gathered a number of covariates for each of the cases in our data. We obtained these from Google Scholar searches on each case. From Google Scholar, we are able to scrape, for each case, the circuit and the date of decision. A very small number of cases in the data (37 of 3841 total cases) are from state supreme courts addressing federal questions. We leave these cases in the data, but the results are robust to their exclusion since there are so few. For most cases we are also able to obtain the names of each judge on a panel, and we are in the process of scraping whether the decision was unanimous or accompanied by a dissent. We use the judges’ names to extract biographical data, such as the appointing

8 USLW is available online after mid-2007. Prior to that it is only available in paper form. However, the formats of the online and paper versions are similar, so the data should not vary depending on whether USLW was online or paper-only. We verified this by comparing summary statistics for the online and paper-only data subsamples. Sequence lengths are slightly shorter (3.736 v. 4.0002) and the fraction of splits truncated because of a C decision is lower (0.0116 v. 0.015) for splits reported in online versions, with both differences significant at the 95% confidence level. Differences in other covariates are not significant. 9 Specifically, we gathered 1001 splits, but one of them started with a Supreme Court decision and was followed by a divergent case; we dropped that observation. We have 23 sequences for which we are still recording decisions. 10 For certain statistical tests, it is helpful to randomly assign A or B with equal probability to the first decision. When we assign B to the first decision, the first different decision is assigned A. We will clarify when we do that. 11 Because some of our tests work better when there are only two states of the world, we sometimes artificially terminate each sequence just before any C decision appears. Thus AAAABC, for example, would become AAAAB. We will clarify when this is the case.

President (and his party), from the Federal Judicial Center (FJC) database available at http://www.fjc.gov/history/home.nsf/page/export.html.12

Table 2 provides a statistical description of the case sequences in our data. To check whether the manner in which we code decisions had an effect on the data, we provide summary statistics separately by whether decisions were coded by USLW or by our research assistant.13 Our own hand-coding generates longer sequences – by number of decisions – but shorter sequences – by duration – and a smaller fraction that are truncated because of a C decision. In any case, only the difference in the fraction truncated is statistically significant. To test whether differences in data-gathering methodology have any effect on outcomes, we run each empirical test once on the whole sample and once on just the subsample of decisions coded by USLW. We report any instances in which the two samples give qualitatively different results.

Sequences can (and do) have different lengths. This can happen for multiple reasons. First, we chose to stop observing a sequence on the date it was reported in USLW, because we wanted the method for coding a decision to be consistent within sequences. Second, a sequence may be terminated because the Supreme Court took up the legal question addressed in the sequence and resolved it, though we do not include the Supreme Court decision in the sequence.14 Although these truncations could potentially cause selection in the distributions of A’s and B’s over positions in a sequence, we find little evidence of this. First, we regressed an indicator for A decisions on sequence length and found that the coefficients on sequence length (coef. 0.019, s.e. 1.21, p 0.772) and length squared (-0.000, 0.14) were small and insignificant. Second, as a precaution, we either include sequence fixed or random effects or include sequence length and length squared as regressors in our regression analysis where possible.

B. Random order of courts in a sequence

A critical assumption of all our models is that courts or circuits are chosen in random order to answer a legal question. If this assumption were violated, then it could be that judges do not consider outcomes in prior cases but decisions are not independently distributed over time because, e.g., litigants seek out courts where outcomes are likely to be similar to those in prior cases. Likewise, it is possible that judges do consider outcomes in prior cases but that decisions look independently distributed because, e.g., litigants seek out courts where outcomes are likely to differ from prior cases. In other words, random

12 This matching is imperfect, in some cases because our records from Google Scholar are imperfect and in other cases because data on the deciding judges are not in the FJC data. For example, some cases are decided by state supreme courts, and those judges are not in the FJC data, which only contain data on federal judges. Where the matching is incomplete, our regression sample size may be lower than the total number of sequences or cases in the data. 13 Appendix Table 8 and Table 9 do the same for the distribution of cases across circuits and time. 14 There are two exceptions, both fully reported by USLW. One sequence started with a Supreme Court case and had 1 divergent circuit court case; we dropped this observation. Another sequence ended with a Supreme Court case; we kept the sequence but dropped the Supreme Court case because the latter resolved the circuit split. Many other sequences with splits may have ultimately been resolved by the Supreme Court, but we purposely did not observe that resolution to maintain a simple and consistent sampling strategy.

court ordering makes it more likely that correlation in case outcomes identifies judicial learning rather than other mechanisms that may influence judicial decisions.

If court order in each sequence were random, then the probability of observing court 푘 in slot 푚 in a sequence would be the same as the probability of observing court 푘 across all slots – anywhere – in that sequence. In other words, if court order were random, you would be no more likely to see the 9th Circuit in the first slot in a sequence than in the third or sixth slot. To test our assumption of random order, we conduct a non-parametric, bootstrap-derived test of whether the probability of observing a decision from court 푘 in slot 푚 differs significantly from the probability of observing court 푘 in any slot.

To implement the test, we note that each sequence of cases is actually a sequence of courts that hear a legal question. For each of these sequences of courts, we draw a new sequence of courts (without replacement), i.e., we randomly reorder the courts in each sequence.15 We do this 5000 times for each sequence, resulting in 5000 drawn court sequences for each actual sequence. We assemble the n-th draw on each sequence into a data set representing the n-th draw on the set of all sequences, resulting in 5000 draws on the entire set of sequences. Then, for each court and slot combination, we look at the fraction of draws on the set of all sequences that have fewer cases from that court in that slot. If the fraction is below 0.025 or above 0.975, then we can reject random order with 95% confidence in a two-sided test.
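The following Python sketch outlines the bootstrap; the integer court identifiers and list-of-lists data structure are placeholders for our actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def court_slot_test(sequences, n_draws=5000):
    """sequences: one list of integer court ids per legal question, in decision
    order. Returns frac[k, m], the share of draws with fewer court-k cases in
    slot m than the actual data; values below 0.025 or above 0.975 reject
    random ordering with 95% confidence (two-sided)."""
    n_courts = 1 + max(max(s) for s in sequences)
    n_slots = max(len(s) for s in sequences)

    def counts(seqs):
        c = np.zeros((n_courts, n_slots))
        for s in seqs:
            for m, k in enumerate(s):
                c[k, m] += 1
        return c

    actual = counts(sequences)
    below = np.zeros_like(actual)
    for _ in range(n_draws):
        # reorder courts within each sequence; never across sequences
        below += counts([rng.permutation(s) for s in sequences]) < actual
    return below / n_draws
```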

Table 3 reports the fraction of draws that have fewer cases from court k in slot m than the actual data. No court-slot combination in the actual data is significantly more or less likely than in randomly drawn data.16 Thus we cannot reject random ordering with 95% confidence. This implies that it is reasonable to use our data to test our models of learning.

C. Prediction 1: Decisions are independent of one another

We first test the prediction that decisions are independent. Following Smeeton & Cox (2003), our test employs a non-parametric, bootstrap-derived runs test much like the test for random ordering of courts.17 Runs are sequences of two or more identical consecutive decisions. E.g., the sequence AAABA has 1 run – AAA – and the length of this run is 3. Independent decisions produce some runs by chance; dependent decisions, however, may produce more or fewer runs and generally produce longer runs. One can test whether decisions are independent or

15 E.g., if the actual sequence of courts that hears a legal question is 3-6-8-9-11, where each element is the circuit name, one would draw 5 courts. There would be an equal probability of drawing 3 or 6 or 8 or 9 or 11 for the first court. If one drew, e.g., 8, then there would be an equal probability of drawing 3 or 6 or 9 or 11 for the second court. And so on, until the last court was drawn. No court other than 3, 6, 8, 9 or 11 would ever be drawn, and no circuit would be drawn twice – so this is like sampling without replacement. 16 The combination closest to significance is the 9th and Federal Circuits in the 9th slot, which is more likely in the actual data than in 10.7% of draws, implying only 21% significance in a 2-sided test. Moreover, with a multiple testing adjustment, we cannot even reject that the 9th and Federal Circuits in the 9th slot are random at 21% confidence. 17 Because we have a large number of small sequences of variable length, we cannot employ, e.g., the Wald-Wolfowitz approximation (Wald and Wolfowitz 1940) for the distribution of runs in a sequence. Even exact distributions are not helpful because many of the sequences are so small that there are fewer than 20 combinations, and so we can never achieve 95% confidence that the data are not independent.

not by testing whether data on actual decision sequences have significantly more runs than data with independent decision sequences.

In order to implement our runs test, we randomly reorder the decisions (A, B, C, D, etc.) in each sequence in the actual data.18 After reordering, we count the total number of runs and the average length of runs across the reordered sequences. We repeat this 5000 times to generate the empirical distribution of the number and average length of runs if decisions were independent of one another. The mean and standard deviation of the number of runs in the empirical distribution are 745.9 and 13.4, respectively. We are able to reject that the actual data, with 785 runs, are independent with a p-value of 0.0002. The mean and standard deviation of the average length of runs in the empirical distribution are 2.60 and 0.023, respectively. We are able to reject that the actual data, with runs of average length 2.67, are independent with a p-value of <0.0001. Thus, we are able to reject in our data that the decisions in a sequence are independent of one another.19
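A sketch of the runs test; decision strings like 'AABAB' stand in for our coded sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_stats(seq):
    """(number of runs, total run length), where a run is two or more
    consecutive identical decisions; e.g. 'AAABA' -> (1, 3)."""
    n_runs = total = i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= 2:
            n_runs, total = n_runs + 1, total + (j - i)
        i = j
    return n_runs, total

def runs_test(sequences, n_draws=5000):
    """Compare the actual number of runs to its bootstrap distribution under
    within-sequence reshuffling (i.e., under independent decisions)."""
    actual = sum(run_stats(s)[0] for s in sequences)
    draws = np.array([
        sum(run_stats(''.join(rng.permutation(list(s))))[0] for s in sequences)
        for _ in range(n_draws)
    ])
    # one-sided p-value: share of independence draws with at least as many runs
    return actual, float((draws >= actual).mean())
```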

D. Prediction 2: No splits after slot 2.

18 Random reordering is the same as sampling without replacement. E.g., if the actual sequence is AABAB, one would draw 5 decisions; there would be a 3/5 probability the first decision was an A; if it was, then there would be a 1/2 probability the second would be an A; if instead it was a B, there would be a 2/3 probability that the next decision would be an A; and so on, until one drew a reordered sequence of 5 cases. Notice that we only reorder cases within each sequence, not across sequences. Because the legal question that each sequence addresses is different, the probability of the different possible answers to each question varies, i.e., the probability of observing the answer arbitrarily labeled A varies across sequences. Reordering decisions only within sequences preserves the number of A’s (and thus the probability of A decisions) in each sequence, while allowing the probability of A decisions to vary across sequences. 19 Our test is unconditional on any covariates. Because we only reorder cases within a sequence, this approach allows the probability of A decisions to vary across sequences, equivalent to a random effects model. However, we do not allow correlations across sequence-level means (i.e., fixed effects), let alone condition on any other covariate, e.g., sequence length, circuit dummies, party of the authoring judge, etc. Conducting a runs test controlling for covariates, e.g., by regressing the slot 푚 decision on decisions in previous slots and covariates as in Klaasen & Magnus (2001), is challenging because of the dynamic nature of the model we wish to test and the need to include random or fixed effects to allow the probability of A decisions to vary across sequences. With short panel length 푇 (in our case the average sequence length is 3.97 in the censored data), the random or fixed effect generates endogeneity (Nickell 1981). One can use IVs as in Anderson & Hsiao (1981) and GMM methods as in Holtz-Eakin, Newey and Rosen (1988) and Arellano & Bond (1991) to address the problem, but the challenge is that the error term has to have an AR(푝) structure with finite 푝. Because the private information model implies that a decision can depend on all prior decisions in a sequence, it is not possible to assume 푝 is finite. Moreover, even if 푝 were finite, the longer it is, the fewer the IVs that are available, since they must extend beyond the range of the error term’s AR structure. Given the short average length of our sequences, this too is a critical problem. We have devised a way to address the lack-of-IVs problem with a double difference method. First, we difference over slots within a sequence to eliminate fixed or random effects. Second, and more creatively, we match sequences with identical histories and then difference across matched sequences, which eliminates the lagged decisions from the regression equation. Only differences in covariates across matched sequences remain as regressors, and we can estimate the coefficients on those regressors. We do not present these results because they do not address our core concern – whether current decisions depend on past decisions – as those regressors were differenced out.

The learning model with private information and constant quality signals predicts that there should be no first splits observed in slot 3 or later (i.e., following any two consecutive A decisions, every subsequent decision should be an A). In our data we do observe first splits in slot 3 or later. To determine whether this is just random error, we employ Fisher exact tests on the probability of first observing a split in slot 푚 or later, with 푚 > 2. As the first panel in Table 6 shows, the probability of a first split is significantly different from zero for 푚 from 3 to 8.

An issue with this test is that, because we identify our sequences from USLW’s circuit split roundup, our sample only includes sequences with a split. This means that the probability of a first split in slot 푚 or later is lower in the set of all sequences than in a set of sequences with splits at some point. That said, if we sampled cases without splits, we would have a larger sample size, so our estimates of first-split probabilities would be more precise, and thus perhaps still significantly different from zero. One way to address this concern is to sample actual sequences without splits. That is quite costly, and there is no obvious, non-arbitrary method of selecting non-split sequences. Our solution is to generate pseudo-sequences without splits, since we know the decisions in all cases in those sequences. First, we randomly draw a number for sequence length from the set of actual sequence lengths in our data. Second, we add to the data a sequence of that length with only A decisions in each slot. We repeat this enough times to generate augmented samples in which sequences with splits are 50%, 20%, 10%, 5% and 1% of the total sample of real plus pseudo sequences. For each augmented sample, we again conduct Fisher tests of the probability of observing a first split in slot 푚 or later, with 푚 > 2. As Table 6 shows, the results do not change: there are a statistically significant number of splits after slot 2.
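A sketch of the augmentation procedure. The 2×2 table handed to the Fisher test is one plausible construction – observed late first splits against the model’s prediction of none – and its details are assumptions of this sketch rather than a transcription of our exact procedure.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)

def augment_with_no_split(sequences, split_share):
    """Pad the sample with all-'A' pseudo-sequences, with lengths drawn from
    the actual length distribution, until real (split) sequences make up
    split_share of the augmented sample."""
    lengths = [len(s) for s in sequences]
    n_pseudo = int(len(sequences) * (1 - split_share) / split_share)
    return list(sequences) + ['A' * int(rng.choice(lengths)) for _ in range(n_pseudo)]

def first_split_slot(seq):
    """Slot of the first non-'A' decision, or None if the sequence never splits."""
    for m, d in enumerate(seq, start=1):
        if d != 'A':
            return m
    return None

def late_split_test(sequences, split_share, m=3):
    sample = augment_with_no_split(sequences, split_share)
    late = sum(1 for s in sample
               if (fs := first_split_slot(s)) is not None and fs >= m)
    # observed counts vs. the constant-quality model's prediction of zero
    table = [[late, len(sample) - late], [0, len(sample)]]
    return fisher_exact(table)   # (odds ratio, p-value)
```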

E. Predictions 3’ v. 4: Balanced history test

Our next test examines the prediction that, if the private signals that courts receive vary in quality, judges in cases with a balanced history will place more weight on the opinion in the immediately prior case than on those in earlier cases. The econometrician can test this prediction with data on decisions by odd-slot courts so long as the odd-slot judge has equal probability of receiving an a or b signal. If we select a large enough sample of cases with only two possible outcomes and with balanced histories, the law of large numbers gives us confidence that this is the case in our sample. So we begin by assembling samples of 312, 80 and 24 sequences that never had a C decision and that have balanced histories of length 2, 4 and 6, respectively.20

In these samples we conduct unconditional (Fisher) exact tests of the hypothesis that the probability of a court with a balanced history issuing a decision that is the same as the decision in the immediately prior case is 0.5. As reported in Table 4, we find that the probability of following is greater than 0.5 for each of these balanced history lengths. However, only cases with balanced histories of length 2 have a probability of following that is significantly different from 0.5. Those with balanced histories of length 4

20 This is meant to focus the sample on sequences that answer questions with only two possible answers. Of course, it is possible that even these sequences contain cases where judges receive a 푐 signal. But that simply means our test is conservative. If the probability of receiving an 푎 or 푏 signal is less than one, say 푞 < 1, then the law of large numbers implies the expected probability of an 푎 or 푏 signal in our balanced history sample is 푞/2 < ½. So if we find that courts in fact follow the immediately prior case more than ½ the time (which we do), then it is certainly the case that they do so more than 푞/2 the time.

are marginally insignificant at the 90% level. The insignificance of results for cases with balanced-history lengths greater than 2 may be due either to sampling error with small sample sizes or to less confidence that there was an equal probability of the court receiving an a or b signal in these small samples.

To address the sample-size problem and to check robustness, we also conducted regression analysis to test the balanced-history prediction. This allows us to pool the length-2, -4, and -6 balanced-history sequences to improve sample size. To address the fact that some sequences appear in the regression sample multiple times – because they were balanced after 2, then 4, and then perhaps even 6 cases – we weighted cases in inverse proportion to how often their sequence appears in our regression sample. We included a constant, indicators for cases with length-2, -4, and -6 balanced histories, and relevant covariates. As Table 5 shows, we find that the probability of following is significantly greater than 0.5 for each history length when we include no covariates. The results are robust to including different sets of covariates, including party affiliation, dissents, and largely even circuit dummies.
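A sketch of the pooled, weighted regression; the record fields are placeholders, and testing each coefficient against 0.5 is one natural way to implement the hypothesis.

```python
import numpy as np
import statsmodels.api as sm

def balanced_history_test(cases):
    """cases: records with 'follow' (1 if the decision matches the immediately
    prior decision), 'hist_len' (2, 4 or 6) and 'weight' (inverse of how often
    the case's sequence appears in the sample)."""
    y = np.array([c['follow'] for c in cases], dtype=float)
    X = np.column_stack(
        [[c['hist_len'] == L for c in cases] for L in (2, 4, 6)]
    ).astype(float)
    w = np.array([c['weight'] for c in cases], dtype=float)
    res = sm.WLS(y, X, weights=w).fit()
    # H0: the follow probability equals 0.5 in each history-length bin
    return res.t_test((np.eye(3), np.full(3, 0.5)))
```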

F. Prediction 5: AR(p) structure of opinions.

We test whether judicial opinions convey all the information from the case they decide, as well as from all prior cases on the same legal question, by testing whether opinions are autocorrelated. If opinions convey full information in the manner just mentioned, then opinions should have an AR(1) structure and no more. Specifically, it must be that the opinion in slot 푚 is uncorrelated with the opinion in slot 푚 − 2 conditional on the opinion in slot 푚 − 1.

The challenge with executing this test is that, while we can certainly read each opinion, it is difficult to quantify the information in an opinion – specifically the judge’s posterior belief that the right answer to a question is, e.g., A. All we observe is whether the decision is, e.g., A, whether there was a dissent, and how many decisions preceded the current one. This creates multiple problems. First, we may observe something with fewer dimensions than the information (moments of the judge’s posterior) in opinions. E.g., maybe the decision measures the mean of the posterior, while the existence of a dissent tells us about the mean (relative to 0.5) and the variance of the posterior. Moreover, even when we have measures of certain moments, there is measurement error. E.g., the decision only tells us whether the mean of the posterior on A is greater than 0.5, not what the actual mean is.

We propose to address the problem in two ways. First, we estimate vector autoregressions (VARs) on a vector that includes binary variables indicating whether the decision in a case is A or not and whether there was a dissent or not. (To keep matters simple, we employ a sample in which sequences are censored just before a C is observed.) VARs partly address the dimensionality problem because they include multiple observed proxies for the posterior beliefs reflected in opinions. However, there remains a risk of measurement error, which prevents identification.21 Therefore, instead of focusing on coefficients and their significance, we focus on goodness of fit. Although there is no natural goodness-of-fit measure for VARs, there are indirect measures: various information criteria built upon the J-statistic (a test of the null that the number of moment restrictions is correct) and used for model selection. Table 7 presents the model selection statistics. The various information criteria (Bayesian,

21 Indeed, our analysis of the probability limit of the coefficient from an AR(1) regression involving just decisions implies that we cannot even sign the bias of the OLS coefficient.

Akaike, and Hannan-Quinn) suggest the 1-lag model is the most economical fit. The conclusion is reinforced by the J-statistic, which fails to reject the null of correct moment restrictions even for the 1-lag model, though it also fails to do so for higher-lag models. Thus, we tentatively conclude that the VAR analysis does not reject the hypothesis that opinions convey full information.
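For illustration, a much cruder analogue of the lag-selection step can be run with statsmodels’ OLS-based information criteria (the estimates behind Table 7 rest on GMM-based criteria, and stacking sequences end-to-end, as here, ignores the breaks between sequences):

```python
import numpy as np
from statsmodels.tsa.api import VAR

def select_var_lags(panel, maxlags=4):
    """panel: list of (T_i x 2) arrays of (A-decision, dissent) indicators,
    one per sequence. Stacks them into one series and selects a lag order."""
    stacked = np.vstack(panel).astype(float)
    order = VAR(stacked).select_order(maxlags=maxlags)
    return order.selected_orders   # e.g. {'aic': 1, 'bic': 1, 'hqic': 1, 'fpe': 1}
```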

Our conclusion is tentative because it is possible that measurement error corrupts the model selection criteria. That would not be the case only if, e.g., measurement error in the slot 푚 decision were uncorrelated with measurement error in the slot 푚 − 푘 decisions, 푘 > 0. Although that does not appear to be an unreasonable assumption, we have no way to validate it.

To address measurement error more directly, we approach the AR test a second way, with tetrachoric correlations between the slot 푚 and slot 푚 − 2 decisions, conditioned on the slot 푚 − 1 decision. Note that whether a judge decides A depends on her unobserved posterior: if the mean of her posterior on A is greater than 0.5, she decides A; otherwise she does not. In other words, we have a latent variable model with a 0.5 cutoff. The tetrachoric correlation between two indicators based on latent variables is the correlation between the two latent variables on which they are based (Pearson, 1901). Thus, the tetrachoric correlation between two decisions, even in the same sequence, is the correlation between the two posterior means on which they are based.

The problem with simply calculating the tetrachoric correlation between the decision in slot 푚 and the decision in slot 푚 − 2 is that it is not conditioned on the decision in slot 푚 − 1, as is required for a test of the full information model. To address this, we add two steps. First, we demean the decision data within each sequence to account for the possibility that the probability of an A decision varies across sequences. E.g., if a sequence is AABAB, which we code as 11010, we subtract the mean of the sequence (0.6) from each element, making it 0.4, 0.4, −0.6, 0.4, −0.6. Second, we regress the demeaned decision in slot 푚 on the demeaned decision in slot 푚 − 1. (We can also add whether there was a dissent in decision 푚 − 1 as a regressor.) We then calculate the residual, which captures all information in the demeaned decision at 푚 that is orthogonal to the observed decision (and perhaps the existence of a dissent) at time 푚 − 1. Finally, we calculate the tetrachoric correlation between the residual and the decision at time 푚 − 2, which in turn reveals the correlation between the opinion at 푚 (controlling for the decision and perhaps the existence of a dissent at time 푚 − 1) and the opinion at time 푚 − 2. The one weakness of this approach is that the residual contains information from the opinion at time 푚 − 1 that was not captured by the decision and dissent at time 푚 − 1. We are still in the process of estimating this conditional tetrachoric correlation.
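A sketch of the two added steps. Since the estimation is still in progress, the code below substitutes two simplifications that should be flagged: it dichotomizes the residual at zero, and it uses Pearson’s cosine approximation to the tetrachoric correlation rather than the maximum likelihood estimate.

```python
import numpy as np
import statsmodels.api as sm

def tetrachoric_approx(x, y):
    """Pearson's cosine approximation to the tetrachoric correlation of two
    binary variables; exact ML would integrate a bivariate normal."""
    a = np.sum((x == 1) & (y == 1)); d = np.sum((x == 0) & (y == 0))
    b = np.sum((x == 1) & (y == 0)); c = np.sum((x == 0) & (y == 1))
    if b * c == 0:
        return 1.0 if a * d > 0 else -1.0
    return float(np.cos(np.pi / (1 + np.sqrt((a * d) / (b * c)))))

def conditional_tetrachoric(y_dm, lag1_dm, lag2):
    """y_dm, lag1_dm: within-sequence demeaned decisions at slots m and m-1;
    lag2: raw 0/1 decisions at slot m-2; all pooled across sequences."""
    resid = sm.OLS(y_dm, sm.add_constant(lag1_dm)).fit().resid
    return tetrachoric_approx((resid > 0).astype(int), np.asarray(lag2).astype(int))
```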

Conclusion

[To be completed.]

References

Baker, Scott and Claudio Mezzetti. 2012. “A Theory of Rational Jurisprudence.” Journal of Political Economy, vol. 120, pp. 513-551.

Banerjee, Abhijit. 1992. "A Simple Model of Herd Behavior." Quarterly Journal of Economics, vol. 107, pp. 797-817.

Bikhchandani, Sushil, Hirshleifer, David and Ivo Welch. 1992. “A Theory of Fads, Fashion, Custom, and Cultural Change as Informational Cascades.” Journal of Political Economy, vol. 100, pp. 992-1026

Daughety, Andrew and Jennifer Reinganum. 1999. “Stampede to Judgment: Persuasive Influence and Herding Behavior by Courts.” American Law and Economics Review, vol. 1, pp. 158-89.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. 2011. “Extraneous Factors in Judicial Decisions.” Proceedings of the National Academy of Sciences of the United States of America.

Epstein, Lee and Jack Knight. 2013. “Reconsidering Judicial Preferences.” Annual Review of Political Science, vol. 16, pp. 19.1-19.21.

Epstein, Lee, William M. Landes, and Richard Posner. 2013. The Behavior of Federal Judges: A Theoretical and Empirical Study of Rational Choice. Cambridge, Mass.: Harvard University Press.

Lax, Jeffrey R. 2011. “The New Judicial Politics of Legal Doctrine.” Annual Review of Political Science, vol. 14, pp. 131-157.

Leiter, Brian. 2003. “American Legal Realism.” The Blackwell Guide to the Philosophy of Law and Legal Theory; W. Edmundson and M. Golding, eds. Oxford: Blackwell.

Pritchett, C. Herman. 1948. The Roosevelt Court. New York: Macmillan.

Pritchett, C. Herman. 1941. “Divisions of Opinion Among Justices of the U.S. Supreme Court, 1939-1941.” American Political Science Review, vol. 35, pp. 890-___.

Segal, Jeffrey A. and Harold J. Spaeth. 2002. The Supreme Court and the Attitudinal Model Revisited. Cambridge: Cambridge University Press.

Spitzer, Matthew and Eric Talley. 2013. “Left, Right, and Center: Strategic Information Acquisition and Diversity in Judicial Panels.” Journal of Law, Economics, and Organization, vol. 29, pp. 638-680.

Sunstein, Cass, David Schkade, Lisa M. Ellman, and Andres Sawicki. 2006. Are Judges Political? An Empirical Analysis of the Federal Judiciary. Washington, D.C.: Brookings Institution Press.

Talley, Eric. 1999. “Precedential Cascades: An Appraisal.” Southern California Law Review, vol. 73, pp. 87-__.

Tiller, Emerson and Frank Cross. 2006. “What is Legal Doctrine?” Northwestern University Law Review, vol. 100, pp. 517-534.

TABLES

Table 2. Summary statistics on various covariates for sequences of federal appellate cases, by how decisions in the sequence were coded.

                                     USLW-coded   Hand-coded
                                     decisions    decisions    Difference   P-value   All
Sequence length, all decisions       3.875        4.114        0.239        0.240     3.970
  (no. of decisions)                 (1.878)      (2.075)                             (1.945)
Sequence length, A and B             3.704        4.083        0.379        0.032     3.848
  decisions (no.)                    (1.800)      (2.069)                             (1.896)
Sequences truncated because of       0.094        0.015        0.079        0.000*    0.072
  C decisions (fraction)             (0.292)      (0.122)                             (0.259)
Duration (years)                     12.310       10.290       2.020        0.184     11.750
                                     (10.87)      (8.05)                              (10.21)
"A" decisions (fraction of cases)    0.554        0.536        0.018        0.889     0.549
                                     (0.161)      (0.182)                             (0.167)
Opinion by Republican judges         0.591        0.575        0.016        0.449     0.586
  (fraction)                         (0.306)      (0.309)                             (0.307)
Observations (no. of sequences)      713          264                                 977
Note: All sequences are identified by USLW, but decisions may be coded by USLW or a research assistant (hand-coded). Table provides summary statistics by how decisions are coded. Cells contain means with standard deviations in parentheses. P-values are calculated using a Wilcoxon rank sum test for continuous variables and a Fisher exact test for binary variables. * indicates significance with 95% confidence in a 2-sided test.
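The tests named in the note are standard; as a sketch, the comparisons could be run as follows, with made-up inputs standing in for the per-sequence data.

```python
from scipy.stats import ranksums, fisher_exact

# Continuous covariate (e.g., sequence length): Wilcoxon rank sum test.
uslw_lengths = [3, 4, 5, 2, 6]      # hypothetical per-sequence values
hand_lengths = [4, 5, 6, 3, 7]
print(ranksums(uslw_lengths, hand_lengths))

# Binary covariate (e.g., truncated by a C decision): Fisher exact test
# on the 2x2 table of counts (counts here are illustrative only).
table = [[67, 646],   # USLW-coded: truncated, not truncated
         [4, 260]]    # hand-coded: truncated, not truncated
print(fisher_exact(table))
```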

Table 3. Non-parametric, bootstrap-based test of the hypothesis that courts are chosen in random order.

               Slot (cells contain probabilities)
Court          1      2      3      4      5      6      7      8      9
Circuit 1      0.278  0.256  0.554  0.645  0.706  0.710  0.744  0.540  0.200
Circuit 2      0.730  0.723  0.440  0.254  0.371  0.357  0.288  0.369  0.208
Circuit 3      0.507  0.492  0.480  0.478  0.501  0.442  0.465  0.432  0.113
Circuit 4      0.202  0.206  0.463  0.673  0.751  0.744  0.744  0.668  0.379
Circuit 5      0.535  0.562  0.501  0.442  0.418  0.427  0.457  0.457  0.451
Circuit 6      0.475  0.468  0.501  0.476  0.632  0.545  0.346  0.520  0.432
Circuit 7      0.495  0.491  0.538  0.518  0.545  0.385  0.437  0.339  0.444
Circuit 8      0.431  0.426  0.540  0.572  0.509  0.341  0.524  0.590  0.455
Circuit 9      0.854  0.863  0.539  0.276  0.140  0.133  0.152  0.165  0.107
Circuit 10     0.292  0.301  0.465  0.625  0.629  0.763  0.701  0.725  0.301
Circuit 11     0.432  0.434  0.455  0.593  0.538  0.484  0.577  0.614  0.444
DC Circuit     0.425  0.421  0.458  0.534  0.505  0.714  0.481  0.551  0.447
Fed Circuit    0.557  0.553  0.647  0.279  0.448  0.444  0.572  0.322  0.107
State Court    0.308  0.298  0.366  0.493  0.520  0.821  0.753  0.958  0.108
Note: Table reports the fraction of draws in which the number of times a given circuit court appears in a given slot (across all sequences) is less than the number of times that court appears in that slot in the actual data (across all sequences). We made 5000 draws. In each draw, we reorder the courts within each actual sequence at random. Draws do not reallocate courts across sequences.
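As a sketch of the randomization procedure described in the note (the function and variable names are ours): for each draw we reorder the courts within every sequence, tally how often each court lands in each slot, and record how often the shuffled count falls below the actual one.

```python
import numpy as np

def slot_randomization_test(sequences, courts, n_draws=5000, seed=0):
    """Fraction of draws in which a court's shuffled slot count is below
    its actual slot count. `sequences` is a list of court-label lists in
    actual decision order."""
    rng = np.random.default_rng(seed)
    idx = {c: i for i, c in enumerate(courts)}
    max_slot = max(len(s) for s in sequences)

    def slot_counts(seqs):
        out = np.zeros((len(courts), max_slot))
        for s in seqs:
            for slot, court in enumerate(s):
                out[idx[court], slot] += 1
        return out

    actual = slot_counts(sequences)
    below = np.zeros_like(actual)
    for _ in range(n_draws):
        # Reorder courts within each sequence; never across sequences.
        shuffled = [list(rng.permutation(s)) for s in sequences]
        below += slot_counts(shuffled) < actual
    return below / n_draws
```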

Table 4. Unconditional, exact tests of balanced history prediction.

                              Balanced history length
                              2        4        6
Follow last case (fraction)   0.599    0.600    0.583
Std error                     0.0015   0.006    0.021
P-value v. 1/2                0.001    0.090    0.540
Obs.                          312      80       24
Notes: Sequences tested are those that never had a C, D, E, or F decision. Table calculates the probability that a case follows the decision in the case immediately prior to it, given that the history of cases prior to the case is balanced. If the balanced history length is 2/4/6, then the first 2/4/6 cases in the sequence consist of pairs of cases with equal numbers of A and B decisions, and we examine whether the case in position 3/5/7 follows that in position 2/4/6.
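The "P-value v. 1/2" row can be reproduced with an exact binomial test. A sketch, with the count of "follows" reconstructed from the reported fraction (our reconstruction, not the underlying data):

```python
from scipy.stats import binomtest

# Balanced-history length 2: a follow rate of 0.599 over 312 cases implies
# roughly 187 "follows"; test against the null probability of 1/2.
print(binomtest(k=187, n=312, p=0.5).pvalue)
```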

Table 5. Pooled balanced history test with covariates.

Dependent variable: Decision follows prior case
                                   (1)        (2)        (3)        (4)        (5)        (6)
History length 2                   -0.154*    -0.227***  -0.227***  -0.228***  -0.229***  -0.209**
                                   (0.082)    (0.086)    (0.086)    (0.086)    (0.086)    (0.087)
History length 4                   -0.093*    -0.106*    -0.106*    -0.106*    -0.106*    -0.119**
                                   (0.054)    (0.058)    (0.059)    (0.059)    (0.059)    (0.059)
History length 6                   -0.097     -0.106     -0.098     -0.099     -0.100     -0.114
                                   (0.067)    (0.072)    (0.073)    (0.073)    (0.073)    (0.073)
Constant                           0.798***   0.875***   0.873***   0.872***   0.870***   0.995***
                                   (0.086)    (0.098)    (0.103)    (0.103)    (0.103)    (0.179)
F-test results
History length 2                   0.000      0.001      0.005      0.006      0.008      0.063
History length 4                   0.007      0.002      0.004      0.005      0.005      0.037
History length 6                   0.046      0.019      0.022      0.024      0.026      0.046
Joint test                         0.000      0.001      0.010      0.012      0.014      0.206
Controlled covariates
Party                                         x          x          x          x          x
Avg. vote of prior judges                                x          x          x          x
  from same party
Fraction of prior A decisions                                       x          x          x
  with dissent
Fraction of prior B decisions                                                  x          x
  with dissent
Circuit dummy                                                                             x
Observations                       443        387        379        379        379        379
R2                                 0.018      0.027      0.026      0.027      0.027      0.092
Notes: Table tests the probability that a case follows the decision in the case immediately prior to it, given that the history of cases prior to the case is balanced. Sample only includes cases with balanced histories, i.e., each pair of cases in the history contains an equal number of A's and B's, e.g., the history is AB, ABAB, ABBA, ABABAB, ABBAAB, etc. Sample excludes sequences with any C, D, etc. decisions. If the balanced history length is 2/4/6, then the first 2/4/6 cases in the sequence consist of pairs of cases with equal numbers of A and B decisions, and we regress whether the case in position 3/5/7 follows that in position 2/4/6 on the indicated covariates. Observations are cases with balanced histories, weighted in inverse proportion to the number of times the sequence from which the observation is drawn appears in the sample. ***/**/* indicates p<0.01/0.05/0.1.
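A sketch of the weighting described in the notes, assuming a hypothetical DataFrame df with one row per balanced-history case; the column names are ours.

```python
import pandas as pd
import statsmodels.api as sm

def pooled_balanced_history(df: pd.DataFrame):
    """Regress "follows prior case" on history-length dummies, weighting
    each case in inverse proportion to the number of observations its
    sequence contributes to the sample."""
    X = sm.add_constant(df[["hist_len_2", "hist_len_4", "hist_len_6"]])
    weights = 1.0 / df.groupby("sequence_id")["sequence_id"].transform("size")
    return sm.WLS(df["follows_prior"], X, weights=weights).fit()
```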

Table 6. Test for timing of first split on split-sequence-only and augmented samples.

                                  Slot
                                  3 or     4 or     5 or     6 or     7 or     8 or     9 or
                                  higher   higher   higher   higher   higher   higher   higher
Split Sequences Only
First split (fraction of cases)   0.187    0.113    0.090    0.083    0.072    0.060    0.120
Std err                           0.000    0.000    0.000    0.001    0.001    0.003    0.013
Fisher's test against 0           0.000    0.000    0.000    0.000    0.000    0.029    0.117
Total number of sequences         663      446      283      179      110      59       18
Split Sequences 50% of Sample
First split (fraction of cases)   0.095    0.058    0.047    0.043    0.039    0.032    0.057
Std err                           0.000    0.000    0.000    0.000    0.001    0.001    0.004
Fisher's test against 0           0.000    0.000    0.000    0.000    0.000    0.030    0.121
Total number of sequences         1316     881      549      353      203      105      37
Split Sequences 20% of Sample
First split (fraction of cases)   0.037    0.022    0.017    0.016    0.014    0.011    0.024
Std err                           0.000    0.000    0.000    0.000    0.000    0.000    0.001
Fisher's test against 0           0.000    0.000    0.000    0.000    0.000    0.031    0.124
Total number of sequences         3322     2247     1434     929      577      327      91
Split Sequences 10% of Sample
First split (fraction of cases)   0.018    0.011    0.009    0.008    0.007    0.006    0.013
Std err                           0.000    0.000    0.000    0.000    0.000    0.000    0.000
Fisher's test against 0           0.000    0.000    0.000    0.000    0.000    0.031    0.124
Total number of sequences         6670     4518     2871     1831     1125     607      169
Split Sequences 5% of Sample
First split (fraction of cases)   0.009    0.006    0.004    0.004    0.004    0.003    0.006
Std err                           0.000    0.000    0.000    0.000    0.000    0.000    0.000
Fisher's test against 0           0.000    0.000    0.000    0.000    0.000    0.031    0.125
Total number of sequences         13266    8889     5673     3629     2238     1179     365
Split Sequences 1% of Sample
First split (fraction of cases)   0.002    0.001    0.001    0.001    0.001    0.001    0.001
Std err                           0.000    0.000    0.000    0.000    0.000    0.000    0.000
Fisher's test against 0           0.000    0.000    0.000    0.000    0.000    0.031    0.125
Total number of sequences         66363    44846    28505    18082    10989    5915     1785
Notes: Table reports the probability of a first split in a case, by case position in a sequence. Sequences other than split sequences are all-A-decision sequences; each is generated by randomly selecting a split sequence and creating an all-A sequence of the same length. We use Fisher's exact test against a null of 0. We do not report numbers for positions above 9 because there is no first-split case in any of those positions.
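A sketch of the augmentation described in the notes (coding A as 1; the names are ours): the sample is padded with all-A sequences, each copying the length of a randomly drawn split sequence, until split sequences make up the target share.

```python
import numpy as np

def augment_with_all_A(split_seqs, target_share, seed=0):
    """Pad `split_seqs` with all-A sequences until split sequences are a
    `target_share` fraction of the sample (e.g., 0.5, 0.2, 0.1, ...)."""
    rng = np.random.default_rng(seed)
    n_split = len(split_seqs)
    n_total = round(n_split / target_share)
    synthetic = []
    for _ in range(n_total - n_split):
        template = split_seqs[rng.integers(n_split)]
        synthetic.append([1] * len(template))   # all-A sequence
    return list(split_seqs) + synthetic
```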

Table 7. Model selection results for VAR models for data on decisions and dissents.

Lag   CD      J       J p-value   MBIC      MAIC      MQIC
1     0.732   7.109   0.850       -63.950   -16.891   -35.578
2     0.743   3.175   0.923       -44.198   -12.825   -25.283
3     0.579   1.433   0.838       -22.253   -6.567    -12.796
Notes: Table presents the J-statistic and various information criteria to discriminate between different VAR models (1, 2, or 3 lags) for a vector that includes the decision in a case and whether the case has a dissent.

APPENDIX TABLES

Table 8. Distribution of cases in sequences across circuit courts, by how decisions in the sequence were coded.

         No USLW     With USLW
Court    citation    citation     All
1        5.68        5.20         5.34
2        7.92        9.61         9.12
3        6.34        8.09         7.58
4        8.57        7.86         8.07
5        8.67        10.36        9.87
6        8.67        8.24         8.36
7        10.90       10.29        10.47
8        8.95        7.14         7.66
9        12.40       14.01        13.54
10       8.85        6.19         6.96
11       8.29        7.93         8.04
Fed      0.00        0.95         0.67
DC       3.91        3.19         3.40
State    0.84        0.95         0.92
Total    100         100          100
Note: Table lists the percent of cases in the data from each circuit or state court.

Table 9. Distribution of sequences by date of last observed case in sequence.

         USLW-coded   Hand-coded
         decisions    decisions
Year     (percent)    (percent)    All
1933     0.00         0.15         0.11
1984     0.00         0.08         0.05
1989     0.19         0.19         0.19
1990     0.00         0.19         0.13
1991     0.00         0.42         0.29
1992     0.00         0.15         0.11
1993     0.19         0.00         0.05
1994     0.00         0.11         0.08
1995     0.56         0.11         0.24
1996     0.00         0.23         0.16
1997     0.00         0.15         0.11
1998     0.00         0.72         0.51
1999     1.76         0.15         0.62
2000     0.46         0.30         0.35
2001     29.32        3.47         10.96
2002     27.75        4.08         10.94
2003     31.54        4.98         12.68
2004     8.23         1.55         3.48
2005     0.00         1.06         0.75
2006     0.00         6.72         4.77
2007     0.00         8.38         5.95
2008     0.00         6.38         4.53
2009     0.00         6.98         4.96
2010     0.00         8.30         5.90
2011     0.00         11.85        8.42
2012     0.00         7.21         5.12
2013     0.00         9.66         6.86
2014     0.00         11.74        8.34
2015     0.00         4.72         3.35
Total    100.00       100.00       100.00
Note: All sequences are identified by USLW, but decisions may be coded by USLW or a research assistant (hand-coded). Table lists the percent of sequences in the data by year of the last observed case in the sequence.
