HOW TO PICK FIRST ROUND UPSETS IN MARCH MADNESS

A THESIS

Presented to

The Faculty of the Department of Economics and Business

The Colorado College

In Partial Fulfillment of the Requirements for the Degree

Bachelor of Arts

By

Sam Block

May 2018

HOW TO PICK FIRST ROUND UPSETS IN MARCH MADNESS

Sam Block

May 2018

Economics

Abstract

This paper examines first round wins in the NCAA Division I men’s college basketball tournament for teams seeded 12 or lower in each region, with the exception of 16 seeds. The study uses information on 96 matchups from the 2012-2016 and 2018 tournaments. A probit regression is used to identify significant variables and to predict the chance of an upset in each game observed. Differences in adjusted offensive efficiency, adjusted defensive efficiency, pace, average height, offensive free throw rate, defensive effective field goal percentage, and defensive offensive rebounding rate between the teams involved were shown to be statistically significant in an upset result. The spread of the contest was also found to be significant.

KEYWORDS: (College Basketball, Upsets, Sports Economics) JEL CODES: (Z20, L83)

ON MY HONOR, I HAVE NEITHER GIVEN NOR RECEIVED UNAUTHORIZED AID ON THIS THESIS

Signature

TABLE OF CONTENTS

ABSTRACT
1 INTRODUCTION
2 LITERATURE REVIEW
3 THEORY
4 DATA
5 REGRESSION AND VARIABLES
6 RESULTS
7 DISCUSSION
8 CONCLUSION
9 REFERENCES

Introduction

On March 27, 1939, the Oregon Ducks defeated the Ohio State Buckeyes 46-33 in the first NCAA basketball championship game in history. That inaugural tournament had only featured eight teams - four in the east and four in the west, with the winners meeting in Evanston, Illinois to battle for the title. The event had been organized in response to the creation of the National Invitational Tournament (NIT) just a year prior. The Buckeyes’ coach at the time, Harold Olsen, was skeptical of a tournament conceived by the New York Telegram and the Metropolitan Basketball Writers Association, not to mention the venue of Madison Square Garden for every game. Olsen spearheaded efforts to establish an alternative system of determining a national champion, one beyond the reach of eastern influence. The NCAA sympathized with his concerns, and in the spirit of unbiased competition, a new tournament was born.

Over the decades to follow, the NCAA Men’s Division I Basketball Tournament would grow significantly. The field of entrants was doubled to sixteen in 1950, then again to thirty-two in 1975 after the NCAA ruled that more than one team per conference could participate in the tournament. For the 1985 season the field expanded to the sixty-four team format that has (basically) lasted since. In the championship game that year, #8 seeded Villanova shocked #1 seeded Georgetown, winning 66-64 in a contest considered one of the greatest upsets in basketball history and often cited as “The Perfect Game”.

The tournament, affectionately known as “March Madness” or the “Big Dance”, has developed into a full-fledged cultural phenomenon in the United States. Its unique place in society is demonstrated in how it is talked about and consumed by fans. In addition to the aforementioned nicknames, the event has given rise to its own terms that permeate the American lexicon, such as “Final Four”, “Cinderella”, and “Bracketology”. During the Obama Administration, the president would fill out his picks live on ESPN on the eve of each tournament, while Warren Buffett has famously offered a $1 billion prize to anyone who picks a perfect bracket in his pool. These sorts of gimmicks are not witnessed surrounding other major sporting events and generate watercooler buzz nationwide.

In recent years, mobile streaming has enabled fans to easily access games and highlights, while improved apps have provided user friendly ways to set up pools, fill out brackets, and digest tournament material. The rising craze has prompted investigation into the cost to American businesses due to employees being preoccupied with March Madness. A 2016 report by Challenger, Gray, and Christmas, Inc. calculated that the tournament’s opening weekend would drain workplace productivity by nearly $4 billion. Ultimately, despite the findings, the firm advised against workplace restrictions on March Madness consumption, citing concerns over long-term morale and loyalty.

Many fans engage with the frenzy of March Madness not only by tuning in, but also by gambling on outcomes. In 2017 the American Gaming Association (AGA) estimated that $295 million would be legally wagered on the NCAA tournament that year through Nevada sportsbooks. In comparison, Super Bowl LI drew $138.4 million to Nevada’s books. The AGA reported, however, that an additional $10.1 billion would be gambled illegally, or ‘off the books’. This eye-popping amount comprises wagers through off-shore and local books, as well as increasingly popular office pools. The same report expected Americans to fill out more than 70 million brackets for the 2017 tournament, up 13% from the year prior.

To this day, those Villanova Wildcats remain the lowest seeded team to ever cut down the nets. Yet tournament lore is filled with numerous underdogs who have captured hearts and busted brackets around the nation. With the attention and money on games at an all-time high, it is more profitable than ever to understand how to correctly predict outcomes. First round upsets are particularly difficult, yet imperative to a differentiated and successful bracket. This paper examines what factors are important to consider when predicting a first round upset in March Madness.

Literature Review

Several papers have found that betting markets across various sports leagues are not entirely efficient and that these inefficiencies can be exploited for profit. In a study spanning several seasons of NFL betting, Badarinathi and Kochman (1996) concluded that bettors may be able to beat the bookie, and more broadly, that betting lines may not discount all available information in swift fashion. In another NFL study, Gray and Gray (1997) found that the market overreacts to a team’s recent performance and discounts overall performance of the team over the season to date. Woodland and Woodland (2001) showed that NHL bettors are inclined to over bet favorites and identified wagering strategies that resulted in profitable returns.

Homing in on basketball, strategies that consistently produce returns become murkier. Schnytzer and Weinberg (2004) determined that lines for each NBA game are set appropriately and that the market is essentially efficient. Paul and Weinbach (2005) examined the betting market for NCAA college basketball over the 1996-2004 seasons and found that market efficiency cannot be rejected. However, several studies do point towards findings that could be of use in predicting the outcome of games. In an examination of team efficiency in the NBA, Lee and Berri (2008) found that big men have a greater impact on wins than small forwards or guards. A study of whether “prime time players” exist by Berri and Eschker (2005) revealed there were no players who statistically outperformed their regular season level of play in the postseason.

Despite the popularity of predicting sporting event outcomes and of the NCAA basketball tournament, there have been few academic studies published on the topic. Shi, Moorthy, and Zimmerman (2013) examined the predictive power of different NCAA basketball ranking methods and concluded that a “glass ceiling” of 75% predictive capability exists for current models. They then raised the question of whether this limit was due to inherent randomness in the game of college basketball or to the wrong variables being included in the models observed.

The relationship between assigned seeds and game results was examined by Schwertman, McCready, and Howard (1991). The study assumes a normal distribution of national team strength, with the 64 tournament teams comprising the upper tail of this distribution. Using seed value as the predictor, they derive win probabilities by seed. The model predicted that the higher seed would win each game, although with varying probabilities of victory.

Carlin (1996) extended this approach, using seeds and Jeff Sagarin’s pre-tournament team strength ratings to model likely point spreads for all possible second, third, and fourth round match-ups of the 1994 tournament. He then combined the actual first round betting lines with his predicted spreads (assuming a standard deviation of 10 points for both) to estimate the probability of each team winning its regional championship. His model correctly predicted one of the four regional champions.

Coleman and Lynch (2009) examine the relationship between “nitty-gritty” reports and tournament game results. “Nitty-gritty” reports are released by the Selection Committee prior to each tournament and contain numerous team performance statistics. They use a stepwise binary logit regression to build a model that includes 8 of these 32 factors. They use the difference in these metrics between two teams matched up in the tournament as a predictor for game outcome. The study finds that the reports do contain information that is representative of team strengths.

Kvam and Sokol (2006) presented a logistic regression / Markov chain (LRMC) model designed to predict the results of single tournament games. The model included only basic input data but accounted for the location of the game and the margin of victory in the estimation of win probability. Kvam and Sokol compare the performance of the model against benchmarks such as RPI, seeding, and various team strength ratings, and find their model outperformed those predictive tools over the 2000-2005 seasons.

Kaplan and Garstka (2001) evaluated office pools. They designed a prediction strategy expected to maximize the number of points earned within a pool. Then they considered various Markov probability models for predicting game winners based on regular season performance, rankings, and Las Vegas betting odds. They compared the success of their own model with those predictors at selecting winners in NCAA tournament games. The study finds that none of the models considered performed better than the 59% accuracy achieved by merely picking the higher seed, but that in more complicated scoring systems of pools their model outperformed this strategy.

Theory

Current literature does not address the issue of predicting first round upsets in the NCAA Division I men’s basketball tournament. The model presented in this paper attempts to expand on the use of team performance statistics as a predictive tool as laid out by Coleman and Lynch and apply the same principles to a probit model.

A probit model is a statistical model relating the probability of a discrete random event that takes the values 0 or 1, such as winning or losing, to a set of explanatory variables. It yields an estimate of the probability that the event will occur when the explanatory variables take specified values. Specifically, let $Y_i$ represent the outcome of the $i$th game, with $Y_i = 1$ indicating a win by the lower seeded team and $Y_i = 0$ otherwise. Then the probit can be specified as:

\[
\Pr[Y_i = 1] = \int_{-\infty}^{\beta' x_i} \phi(t)\, dt = \Phi(\beta' x_i)
\]

where $\phi(t)$ is the standard normal density, $\Phi$ is the corresponding cumulative distribution function, $x_i$ is a set of explanatory variables for the $i$th observation, and $\beta$ is a set of parameters to be estimated.
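As an illustration of the specification above, the following minimal Python sketch evaluates the probit probability for one observation. The function names and the two-element coefficient vector are hypothetical and chosen only for illustration; they are not the estimates reported later in this paper.

```python
import math


def standard_normal_cdf(z: float) -> float:
    """Integral of the standard normal density from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_probability(beta, x) -> float:
    """Return Prob[Y = 1] for one observation under a probit model.

    beta and x are equal-length sequences; the index is their inner product.
    """
    if len(beta) != len(x):
        raise ValueError("beta and x must have the same length")
    index = sum(b * v for b, v in zip(beta, x))
    return standard_normal_cdf(index)


# Hypothetical example: an intercept plus one difference variable.
example_beta = [-1.0, 0.5]
example_x = [1.0, 2.0]
example_probability = probit_probability(example_beta, example_x)
print(example_probability)
```

In this made-up example the index is zero, so the predicted probability is one half; larger positive index values push the probability toward one.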

Data

The data set comprises information on 96 first round matchups from the NCAA Division I men’s basketball tournament over the years 2012 through 2016, as well as 2018. Every first round contest played by a team seeded 12th or lower over these years is included, with the exception of 1 vs. 16 matchups. Regular season advanced statistics on the 192 teams involved in these games were accessed via www.kenpom.com. Closing point spreads for each matchup were collected from www.sportsoddshistory.com. Final scores were collected from box scores on www.espn.com.

Contests in the data set that resulted in upsets were given an upset dummy variable of 1, while all games won by favorites were given an upset dummy variable of 0. Games in which the lower seed played at a substantially slower or faster tempo than the favorite were assigned a slow or fast dummy variable of 1, with other contests given slow and fast dummy variables of 0.

Regression and Variables

To model first round upsets in March Madness, a probit regression is used. The dependent variable is UPSET, equal to 1 if the underdog prevailed in the matchup or 0 otherwise.

The model is as follows:

UPSET = f(OFF_eFG, OFF_TO, OFF_OR, OFF_FT, DEF_eFG, DEF_TO, DEF_OR, DEF_FT, HEIGHT, EXPERIENCE, TEMPO, FAST, SLOW, adjOE, adjDE, SEED, CONF_STRENGTH, POINTSPREAD)

Table 1: Variable Definitions and Descriptive Statistics

Variable       Definition                                               M       SD     Min     Max
UPSET          Dummy variable (1=underdog won matchup)                  0.25    0.44   0       1
OFF_eFG        Difference in offensive effective field goal percentage  0.92    3.77   -6.5    8.6
OFF_TO         Difference in offensive turnover rate                    -0.83   2.79   -10.4   4.6
OFF_OR         Difference in offensive rebounding rate                  2.05    5.57   -9      14.8
OFF_FT         Difference in offensive free throw rate                  -1.69   5.95   -14.3   18.3
DEF_eFG        Difference in defensive effective field goal percentage  -1.56   2.88   -9      5.3
DEF_TO         Difference in defensive turnover rate                    -0.37   3.33   -8.7    8.3
DEF_OR         Difference in defensive offensive rebounding rate        -0.31   4.11   -12.2   9.3
DEF_FT         Difference in defensive free throw rate                  -3.42   8.13   -29.9   21.2
HEIGHT         Difference in average height                             0.71    1.11   -2.4    3.2
EXPERIENCE     Difference in average experience                         -0.19   0.59   -1.79   1.45
TEMPO          Difference in adjusted tempo                             -0.46   4.2    -10     8.8
FAST           Dummy variable (1=underdog plays substantially faster)   0.18    0.38   0       1
SLOW           Dummy variable (1=underdog plays substantially slower)   0.13    0.33   0       1
adjOE          Difference in adjusted offensive efficiency              7.76    6.62   -8.3    24.2
adjDE          Difference in adjusted defensive efficiency              -6.69   5.48   -22.4   8.1
SEED           Difference in seed                                       10      2.25   7       13
CONF_STRENGTH  Difference in respective conference strength ratings     15.78   7.01   -5.1    29.36
POINTSPREAD    Line of game                                             -10.44  4.87   -23.5   -2

Four factors were included to capture the offensive and defensive abilities of teams involved in the matchups observed. Offensive effective field goal percentage is a measure of shooting, calculated as (field goals made + 0.5 x 3-pointers made) / field goals attempted. It differs from conventional field goal percentage by taking into account the extra value of a made 3-point shot. On the defensive end, it reflects the clip a given team allows opponents to shoot over the course of the season. Offensive turnover rate is turnovers committed / possessions, defensive turnover rate is turnovers forced / possessions. Offensive rebounding rate measures how hard a team crashes the offensive glass, calculated as offensive rebounds / total rebounds. On the other side of the ball, defensive offensive rebounding rate tracks how well teams prevent opposing sides from securing offensive boards and generating valuable extra possessions. Free throw rate is calculated as free throws attempted / field goals attempted. It measures how often a team gets to the line and the frequency with which they allow free throw attempts on the other end of the court. For all games observed, the underdog’s statistics were subtracted from the favorite’s statistics in each category, resulting in figures that measure teams’ skill levels relative to one another.
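The formulas in this section can be written out as a short Python sketch. The function names, the season totals in the example, and the scaling to percentage points are assumptions made for illustration; the statistics used in the model were taken as published rather than recomputed.

```python
def effective_fg_pct(fg_made: float, fg3_made: float, fg_attempted: float) -> float:
    """(field goals made + 0.5 x 3-pointers made) / field goals attempted, in percent."""
    return 100.0 * (fg_made + 0.5 * fg3_made) / fg_attempted


def turnover_rate(turnovers: float, possessions: float) -> float:
    """Turnovers per possession, in percent."""
    return 100.0 * turnovers / possessions


def offensive_rebound_rate(offensive_rebounds: float, total_rebounds: float) -> float:
    """Offensive rebounds / total rebounds, in percent, as defined in the text."""
    return 100.0 * offensive_rebounds / total_rebounds


def free_throw_rate(ft_attempted: float, fg_attempted: float) -> float:
    """Free throws attempted / field goals attempted, in percent."""
    return 100.0 * ft_attempted / fg_attempted


def matchup_difference(favorite_value: float, underdog_value: float) -> float:
    """Favorite's statistic minus the underdog's, the form used for each model variable."""
    return favorite_value - underdog_value


# Hypothetical season totals for one matchup (not taken from the data set).
favorite_efg = effective_fg_pct(fg_made=900, fg3_made=250, fg_attempted=1900)
underdog_efg = effective_fg_pct(fg_made=800, fg3_made=200, fg_attempted=1800)
off_efg = matchup_difference(favorite_efg, underdog_efg)
print(round(off_efg, 2))
```

A positive difference means the favorite holds the edge in that category; a negative one means the underdog does.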

Difference in the average height and experience of players on each team was also accounted for in the model. Pace of play is measured by adjusted tempo, which reports how many possessions a team plays per 40 minutes, adjusted for the speed at which their competition plays. SLOW and FAST are dummy variables to indicate matchups in which the underdog plays significantly slower or faster than the favorite (by a margin of at least 5 possessions per game).
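A minimal sketch of how the SLOW and FAST dummies could be assigned from the two teams' adjusted tempos, using the 5-possession margin stated above. The function name, the tempo values in the example, and the treatment of a gap of exactly 5 as qualifying are assumptions for illustration.

```python
TEMPO_MARGIN = 5.0  # possessions per 40 minutes, from the definition in the text


def tempo_dummies(favorite_tempo: float, underdog_tempo: float,
                  margin: float = TEMPO_MARGIN) -> tuple:
    """Return (FAST, SLOW) for one matchup.

    FAST is 1 when the underdog plays at least `margin` possessions faster
    than the favorite, SLOW is 1 when it plays at least `margin` slower.
    """
    gap = underdog_tempo - favorite_tempo
    fast = 1 if gap >= margin else 0
    slow = 1 if gap <= -margin else 0
    return fast, slow


# Hypothetical tempos: the underdog plays six possessions slower.
print(tempo_dummies(favorite_tempo=70.0, underdog_tempo=64.0))
```

Both dummies are 0 for matchups in which the teams play at similar speeds, so at most one of them is ever set for a given game.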

Adjusted offensive efficiency measures how many points a team scores per 100 possessions, adjusted for the quality of the defense they face. Adjusted defensive efficiency measures how many points a team allows per 100 possessions, adjusted for the quality of the opposing offense.

The SEED variable captures the difference in seeding between the teams matched up. CONF_STRENGTH is a measure of the relative strength of each Division I men’s basketball conference in a given season. It is calculated by averaging the offensive and defensive efficiencies of teams within each conference. It is included in the model to weight teams’ performances across other statistical categories with quality of opponents. POINTSPREAD is the closing line that Las Vegas sports books set for the contest.

Results

Observations: 96

Log likelihood: -20.271

Pseudo R-squared: 0.6245
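The reported fit statistics can be cross-checked with a short calculation. The sketch below assumes the figure is McFadden's pseudo R-squared, defined as one minus the ratio of the fitted log likelihood to the log likelihood of an intercept-only model; the text does not name the definition used. The count of 24 upsets is implied by the UPSET mean of 0.25 over 96 games in Table 1.

```python
import math

n_games = 96
n_upsets = 24              # 0.25 x 96, implied by the UPSET mean in Table 1
log_likelihood = -20.271   # fitted model, as reported above

# Log likelihood of a model that assigns every game the sample upset rate.
upset_rate = n_upsets / n_games
null_log_likelihood = (
    n_upsets * math.log(upset_rate)
    + (n_games - n_upsets) * math.log(1.0 - upset_rate)
)

# McFadden's pseudo R-squared (assumed definition).
pseudo_r2 = 1.0 - log_likelihood / null_log_likelihood
print(round(null_log_likelihood, 3), round(pseudo_r2, 4))
```

Under that assumption the calculation agrees with the reported value of 0.6245.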

Table 2: Probit Regression Results

Explanatory variable  Coefficient  (Std. Error)  Z-stat     Marginal Effect  Z-stat
OFF_eFG               -0.269       (0.252)       -1.07      -0.008           -0.81
OFF_TO                -0.191       (0.223)       -0.86      -0.005           -0.55
OFF_OR                0.082        (0.11)        0.75       0.002            0.51
OFF_FT                0.109        (0.061)       1.79*      0.003            0.71
DEF_eFG               -0.45        (0.235)       -1.91*     -0.013           -0.87
DEF_TO                0.119        (0.199)       0.60       0.003            0.56
DEF_OR                -0.442       (0.161)       -2.73***   -0.013           -0.85
DEF_FT                -0.089       (0.059)       -1.5       -0.003           -0.87
HEIGHT                -0.836       (0.361)       -2.32**    -0.024           -0.84
EXPERIENCE            -0.091       (0.577)       -0.16      -0.002           -0.17
TEMPO                 -0.229       (0.183)       -1.26      -0.006           -0.67
FAST                  -0.722       (1.204)       -0.6       -0.013           -0.55
SLOW                  4.736        (2.017)       2.35**     0.966            8.65***
adjOE                 -0.829       (0.259)       -3.2***    -0.024           -0.83
adjDE                 1.264        (0.388)       3.25***    0.036            0.87
SEED                  -0.144       (0.231)       -0.62      -0.004           -0.48
CONF_STRENGTH         0.113        (0.077)       1.47       0.003            0.71
POINTSPREAD           -1.071       (0.349)       -3.06***   -0.030           -0.86

* Statistically significant at the 10% level, ** at the 5% level, *** at the 1% level
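For a continuous variable in a probit model, the marginal effect on the predicted probability equals the standard normal density at the index multiplied by the coefficient. The sketch below applies that standard formula and backs out the density implied by several coefficient and marginal effect pairs from Table 2. It assumes the marginal effects were all evaluated at a single point, such as the sample means; the text does not state the evaluation point, and the function names are illustrative.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def probit_marginal_effect(coefficient: float, index: float) -> float:
    """Change in Prob[Y = 1] per unit change in a continuous regressor."""
    return normal_pdf(index) * coefficient


# (coefficient, reported marginal effect) pairs copied from Table 2.
reported = {
    "HEIGHT": (-0.836, -0.024),
    "adjOE": (-0.829, -0.024),
    "adjDE": (1.264, 0.036),
    "POINTSPREAD": (-1.071, -0.030),
}

# If every effect shares one evaluation point, effect / coefficient should
# give roughly the same density value for each variable.
implied_density = {
    name: effect / coefficient for name, (coefficient, effect) in reported.items()
}
for name, density in implied_density.items():
    print(name, round(density, 4))
```

The implied densities cluster tightly, which is consistent with a single evaluation point far from an index of zero. Dummy variables such as SLOW are usually reported as a discrete change in probability instead, so they are left out of this check.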

Several variables are shown to have a statistically significant influence on whether a matchup results in an upset. Adjusted offensive and defensive efficiencies were both significant at the 1% level. The negative coefficient for adjusted offensive efficiency suggests that an underdog’s chances improve as the gap between its own offense and its opponent’s offense narrows, since each variable enters the model as the favorite’s rating minus the underdog’s, and a higher rated underdog offense therefore produces a negative value. Similarly, the positive coefficient for adjusted defensive efficiency means that an underdog that allows fewer points per possession than the higher seed has a greater chance of victory.

Also significant at the 1% level is the point spread of the matchup. Every higher seed in the data set was favored, so the line for every game observed was negative. Of note is the negative coefficient of the variable. In theory, the sign should be positive: a small spread (a line closer to zero) indicates anticipation of a tight game, while a large spread means bettors expect a less competitive affair. Given the small sample size of the data set, it is possible that several major upsets involving large spreads threw off the model.

Several team strength characteristics are shown to be important in upset outcomes. Offensive free throw rate, defensive effective field goal percentage, and defensive offensive rebounding rate all tested as statistically significant in the model. Once again, the signs of these variables’ coefficients are not what we would expect to see. A positive value in the model for offensive free throw rate would indicate an advantage in that area for the favorite, yet the positive coefficient indicates that having a lower offensive free throw rate is in an underdog’s best interest. The same reasoning applies in reverse to the negative coefficients of defensive effective field goal percentage and defensive offensive rebounding rate. One explanation is that because most matchups in the data set featured underdogs weaker than favorites in most statistical categories, the sample size was not large enough to capture these expected advantages and disadvantages in the coefficients. The significance of the variables suggests that these statistical categories could be especially critical to an upset bid.

The difference in average height between the teams in a matchup was shown to be significant at the 5% level. The negative coefficient indicates that a lower seeded team with taller players than its opponent has a higher chance at an upset. Finally, whether an underdog plays at a substantially slower pace than the favorite (at least 5 fewer possessions per 40 minutes) was shown to be significant at the 5% level as well. The variable’s positive coefficient means a low seed that plays considerably slower than its opponent stands a better chance at pulling off an upset than one that plays at a tempo closer to or faster than the other team.

Discussion

After running the probit regression, “Chances” were generated by predicting the probability of each observation resulting in 1. The figures represent the likelihood that the given matchup will result in an upset based off the information included in the model.

Table 3: Predicted Chance Values

Matchup Chances Upset Final spread

2 Gonzaga vs 15 N Dakota St 3.92e-24 No -10

2 Villanova vs 15 UNC Asheville 1.03e-20 No -30

2 Kansas vs 15 Detroit 3.72e-17 No -15

2 Michigan vs 15 Wofford 1.83e-16 No -17

4 Syracuse vs 13 Montana 1.85e-16 No -47

4 Michigan vs 13 South Dakota St 2.23e-15 No -15

2 Arizona vs 15 Texas Southern 2.34e-15 No -21

5 N Iowa vs 12 Wyoming 3.55e-15 No -17

4 UCLA vs 13 Tulsa 5.21e-13 No -17

2 Villanova vs 15 Milwaukee 6.12e-12 No -20

3 Tennessee vs 14 Wright St 1.55e-11 No -26

2 Kansas vs 15 New Mexico St 4.88e-11 No -19

4 Louisville vs 13 Davidson 5.15e-11 No -7

3 Notre Dame vs 14 Northeastern 1.78e-10 No -4

4 Iowa St vs 13 Iona 1.90e-09 No -13

2 Virginia vs 15 Belmont 8.69e-09 No -12

2 Ohio St vs 15 Loyola Maryland 1.18e-08 No -19

5 Utah vs 12 SF Austin 2.81e-08 No -7

4 Georgetown vs 13 E Washington 3.07e-08 No -10

4 Indiana vs 13 New Mexico St 3.10e-08 No -13

5 Indiana vs 12 Chattanooga 2.12e-07 No -25

5 Vanderbilt vs 12 Harvard 4.89e-07 No -9

3 Syracuse vs 14 Western Michigan 1.66e-06 No -24

3 Florida St vs 14 St. Bonaventure 1.78e-06 No -3

5 Maryland vs 12 South Dakota St 1.91e-06 No -5

3 Florida vs 14 Northwestern St 4.85e-06 No -32

2 Duke vs 15 Iona .0000101 No -22

3 Oklahoma vs 14 Albany .0000582 No -9

5 Clemson vs 12 New Mexico St .0000758 No -11

2 Oklahoma vs 15 Cal St Bakersfield .0002229 No -14

2 Miami vs 15 Pacific .0002279 No -29

3 Texas A&M vs 14 Green Bay .0003421 No -27

4 Louisville vs 13 UC Irvine .0003719 No -2

5 Ohio St vs 12 South Dakota St .0004245 No -8

3 Baylor vs 14 South Dakota St .000436 No -8

3 Marquette vs 14 Davidson .0005344 No -1

3 Miami vs 14 Buffalo .000696 No -7

4 Saint Louis vs 13 New Mexico St .0007111 No -20

5 VCU vs 12 Akron .0023432 No -46

2 Cincinnati vs 15 Georgia St .0034279 No -15

3 Utah vs 14 Fresno St .0055685 No -11

3 Michigan St vs 14 Valparaiso .005729 No -11

2 Kansas vs 15 Eastern Kentucky .0095367 No -11

2 Duke vs 15 Albany .011583 No -12

4 Gonzaga vs 13 UNC Greensboro .0172747 No -4

4 Duke vs 13 UNC Wilmington .026742 No -8

3 Texas Tech vs 14 SF Austin .0367496 No -10

4 Michigan St vs 13 Delaware .0399754 No -15

4 Auburn vs 13 Charleston .0439166 No -4

4 Louisville vs 13 Manhattan .0440544 No -7

2 UNC vs 15 Lipscomb .0528428 No -18

4 Wisconsin vs 13 Montana .0562341 No -24

3 Michigan St vs 14 Bucknell .0738075 No -4

2 Purdue vs 15 Cal St Fullerton .0786359 No -26

4 San Diego St vs 13 New Mexico St .0869893 No -4

4 UNC vs 13 Harvard .1109877 No -2

5 Saint Louis vs 12 NC St .1115736 No -3

5 Arkansas vs 12 Wofford .116641 No -3

5 VCU vs 12 SF Austin .1223405 Yes 2

3 Creighton vs 14 LA Lafayette .123745 No -10

4 Kentucky vs 13 Stony Brook .1262951 No -28

2 Michigan St vs 15 Middle Tennessee .131836 Yes 9

5 West Virginia vs 12 Buffalo .1377277 No -6

3 Georgetown vs 14 Belmont .1439884 No -15

3 Michigan vs 14 Montana .1638178 No -14

2 Ohio St vs 15 Iona .1968627 No -25

3 Marquette vs 14 BYU .2595831 No -20

5 Cincinnati vs 12 Harvard .305821 Yes 4

3 Duke vs 14 Mercer .327083 Yes 7

3 Iowa St vs 14 NC Central .3493558 No -18

2 Xavier vs 15 Weber St .3574236 No -18

5 Kentucky vs 12 Davidson .366196 No -5

2 Wisconsin vs 15 American .3767715 No -40

2 Missouri vs 15 Norfolk St .4188824 Yes 2

4 Maryland vs 13 Valparaiso .4225907 No -3

5 New Mexico vs 12 Long Beach St .4256163 No -7

5 West Virginia vs 12 Murray St .4654232 No -17

2 Georgetown vs 15 Florida Gulf Coast .5516843 Yes 10

5 Oklahoma St vs 12 Oregon .5684676 Yes 13

3 West Virginia vs 14 SF Austin .598263 Yes 14

5 UNLV vs 12 California .6173376 Yes 3

4 California vs 13 Hawaii .6738304 Yes 11

5 Baylor vs 12 Yale .7666628 Yes 4

4 Arizona vs 13 Buffalo .8256401 Yes 21

3 New Mexico vs 14 Harvard .9356968 Yes 6

3 Baylor vs 14 Georgia St .9363964 Yes 1

2 Duke vs 15 Lehigh .9415215 Yes 5

4 Michigan vs 13 Ohio .9750902 Yes 5

5 Wichita St vs 12 VCU .9776726 Yes 3

3 Iowa St vs 14 UAB .9795631 Yes 1

5 Wisconsin vs 12 Mississippi .9806218 Yes 11

4 Kansas St vs 13 La Salle .9985316 Yes 2

5 Oklahoma vs 12 N Dakota St .9985459 Yes 5

4 Wichita St vs 13 Marshall .9987819 Yes 6

5 Temple vs 12 South Florida .999999 Yes 14

5 Purdue vs 12 Arkansas Little Rock .999999 Yes 2

From the sample of 96 matchups, 19 were identified as having better than 50-50 odds of resulting in an upset. All 19 of these games were in fact won by the lower seed. The 5 other upsets that occurred over this time were given chances ranging from approximately 12% to 42%. Three of these matchups (Missouri vs. Norfolk State in 2012, Duke vs. Mercer in 2014, and Michigan State vs. Middle Tennessee in 2016) are among the most stunning tournament results in recent memory. It might be a good sign, then, that the model was skeptical of Cinderella outcomes in these contests.
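The in-sample hit rate described here can be tallied with a simple 50% cutoff rule. The sketch below uses a handful of rows copied from Table 3 to show the mechanics, then computes overall accuracy from the counts stated in the text (19 upsets flagged, 5 upsets missed, and no favorite win flagged as an upset). The function and label names are illustrative assumptions.

```python
def confusion_counts(rows, cutoff: float = 0.5) -> dict:
    """Tally predictions for (predicted chance, upset occurred) pairs."""
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_pass": 0}
    for chance, upset in rows:
        predicted_upset = chance > cutoff
        if predicted_upset and upset:
            counts["hit"] += 1
        elif upset:
            counts["miss"] += 1
        elif predicted_upset:
            counts["false_alarm"] += 1
        else:
            counts["correct_pass"] += 1
    return counts


# A few rows from Table 3: (Chances, Upset).
sample_rows = [
    (0.1223405, True),    # 5 VCU vs 12 SF Austin
    (0.131836, True),     # 2 Michigan St vs 15 Middle Tennessee
    (0.305821, True),     # 5 Cincinnati vs 12 Harvard
    (0.327083, True),     # 3 Duke vs 14 Mercer
    (0.4188824, True),    # 2 Missouri vs 15 Norfolk St
    (0.5516843, True),    # 2 Georgetown vs 15 Florida Gulf Coast
    (0.3767715, False),   # 2 Wisconsin vs 15 American
    (0.4654232, False),   # 5 West Virginia vs 12 Murray St
]
sample_counts = confusion_counts(sample_rows)

# Full-sample counts as stated in the text: 96 games, 24 upsets.
full_counts = {"hit": 19, "miss": 5, "false_alarm": 0, "correct_pass": 72}
accuracy = (full_counts["hit"] + full_counts["correct_pass"]) / sum(full_counts.values())
upsets_caught = full_counts["hit"] / (full_counts["hit"] + full_counts["miss"])
print(sample_counts, round(accuracy, 3), round(upsets_caught, 3))
```

These figures describe the same 96 games used to estimate the model, so they measure in-sample fit and not out-of-sample predictive accuracy.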

One alarming aspect of the predicted chance values is the degree of certainty. The 2016 contest between Purdue and Arkansas Little Rock is calculated as a virtual lock to result in an upset. And while the plucky 12th seeded Trojans did indeed shock the Boilermakers that day, they needed two overtime periods and some miraculous late game shooting to do so. One tiny break and that game easily could have gone the other way.

The same holds true on the other end of the spectrum. In addition to two outright upsets, there are twelve tilts assigned less than a 15% chance of an upset in which the underdog finished within five points of the favorite. When the margin is that tight, it is absurd to completely discount the possibility that the game could swing the other direction. The predicted chances should reflect these realities and align closer with the final spread.

Also problematic is the near 38% chance assigned to a 2014 game between 2 seed Wisconsin and 15 seed American University. American’s solid statistics that season make the matchup appear like a possible upset candidate to the model. The issue is those numbers mask the disparity in talent level each team competed against throughout the season. That year, Wisconsin played in the highest rated conference (the Big Ten), consistently battling top teams. American played a schedule against small conference, middling competition. The model fails to pick up on this, and the potential Cinderella on paper instead lost by 40 points. This suggests that the model can be improved with more explanatory variables to account for the talent level of competition faced. While the relative conference strength metric addresses part of this concern, another variable such as strength of schedule could help increase the overall accuracy of predictions.

Conclusion

This paper aims to model first round wins in the NCAA Division I men’s college basketball tournament for teams seeded 12-15. Using data from 2012-2016 as well as 2018, successful upset bids were found to correlate with the difference in adjusted offensive efficiency, adjusted defensive efficiency, pace, average height, offensive free throw rate, defensive effective field goal percentage, and defensive offensive rebounding rate of the teams involved. The point spread of the contest was also found to be significant.

These results suggest a rough profile to look for in a Cinderella pick – a tall, (relatively) slow tempo, efficient team that excels at holding opponents to poor shooting percentages, getting to the line frequently on offense, and securing defensive rebounds so as to limit second chance opportunities.

There is plenty of room for further and improved study on this subject. A larger sample size might reveal more telling results as outliers get smoothed out. More advanced statistics such as rate, block rate, rate, usage percentage, and true shooting percentage might add predictive power to the model. Explanatory variables to quantify coaching experience, distance traveled to venue, and in-season momentum could be added as well. The model could also be expanded to include 11 vs. 6 and even 10 vs. 7 first round matchups. Information on recruiting class ranks over previous years could be incorporated as well to help measure raw talent.

There will always be a degree of randomness when filling out a bracket. No tournament outcome is ever certain; just ask the University of Virginia or UMBC. However, this paper shows that certain conditions of first round matchups correlate to higher odds of an upset. In increasingly crowded pools and with growing monetary and social rewards for correct predictions, these insights provide a competitive advantage in picking a winning bracket.

References

Badarinathi, R., & Kochman, L. (1996). Football Betting and the Efficient Market Hypothesis. The American Economist, 40(2), 52-55.

Berri, D., & Eschker, E. (2005). Performance When It Counts? The Myth of the Prime Time Performer in Professional Basketball. Journal of Economic Issues, 39(3), 798-807.

Carlin, B. (1996). Improved NCAA Basketball Tournament Modeling via Point Spread and Team Strength Information. The American Statistician, 50(1), 39-43.

Coleman, B., & Lynch, A. (2001). Identifying the NCAA Tournament "Dance Card". Interfaces, 31(3), 76-86.

Coleman, B., & Lynch, A. (2009). NCAA Tournament Games: The Real Nitty-Gritty. Journal of Quantitative Analysis in Sports, 5(3), 61-84.

Gray, P., & Gray, S. (1997). Testing Market Efficiency: Evidence from the NFL Sports Betting Market. The Journal of Finance, 52(4), 1725-1737.

Kaplan, E., & Garstka, S. (2001). March Madness and the Office Pool. Management Science, 47(3), 369-382.

Kvam, P., & Sokol, J. (2006). A Logistic Regression/Markov Chain Model for NCAA Basketball. Naval Research Logistics, 53(8), 788-803.

Lee, Y., & Berri, D. (2008). A Re-Examination of Production Functions and Efficiency Estimates for the NBA. Scottish Journal of Political Economy, 55(1), 51-66.

Paul, R., & Weinbach, A. (2005). Market Efficiency and NCAA College Basketball Gambling. Journal of Economics and Finance, 29(3), 403-408.

Schwertman, N., Schenk, K., & Holbrook, B. (1996). More Probability Models for the NCAA Regional Basketball Tournaments. The American Statistician, 50(1), 34-38.

Smith, T., & Schwertman, N. (1999). Can the NCAA Basketball Tournament Seeding be Used to Predict Margin of Victory? The American Statistician, 53(2), 94-98.

Woodland, L., & Woodland, B. (2001). Market Efficiency and Profitable Wagering in the National Hockey League: Can Bettors Score on Longshots? Southern Economic Journal, 67(4), 983-995.