Application of Pagerank Algorithm to Division I NCAA Men's Basketball As Bracket Formation and Outcome Predictive Utility
Total Page:16
File Type:pdf, Size:1020Kb
Journal of Sports Analytics 7 (2021) 1–9 1 DOI 10.3233/JSA-200425 IOS Press Application of PageRank Algorithm to Division I NCAA men’s basketball as bracket formation and outcome predictive utility Nicole R. Matthewsa, Andrew McClaina, Chase M.L. Smithb and Adam G. Tennantc,∗ aBusiness and Engineering Center, University of Southern Indiana, University Boulevard, Evansville, IN, USA bAssistant Professor of Sport Management, Health Professions Center, University of Southern Indiana, University Boulevard, Evansville, IN, USA cAssistant Professor Engineering, Business and Engineering Center, University of Southern Indiana, University Boulevard, Evansville, IN, USA Abstract. This article examines the use of the PageRank algorithm to rank the teams and predict team performance in the tournament. This method has the potential to be utilized as an alternative method to choose tournament participants, as opposed to the traditional ranking and seeding process currently employed by the NCAA. PageRank allows for the consideration of all games played during the regular season (average of 5832 games per season) and for customizable performance weights in the prediction. The PageRank algorithm is a viable tool in predicting tournament outcomes due to depth and extensiveness of the data. The PageRank analysis helped to predict over half of the tournament game outcomes correctly in 2014-2018 and produced an average bracket score of 81.8 points out of 192 possible over the same 5 years. Keywords: Basketball, pagerank, ranking, NCAA 1. Introduction to watch every game every team plays within the NCAA D1 (National Collegiate Athletic Association, Each year, thousands of people fill out a bracket Division-1) season. attempting to predict the outcome of the NCAA Divi- In NCAA basketball, D1 teams compete in regular sion I Men’s Basketball tournament. Most individuals season games to be given a chance to play in the final simply guess winning teams based on a very lim- tournament of the season, the NCAA Men’s Division ited knowledge of the regular season performances of 1 Basketball Tournament (i.e., March Madness, Big each team. With an average of 5832 games per sea- Dance), hereafter referred to as the tournament. The son (Table 1), it is nearly impossible for any person NCAA is divided into 32 conferences. The winners from each conference tournament are automatically ∗ given a spot (i.e., bid) within the tournament bracket. Corresponding author: Adam G. Tennant, Assistant Professor The remaining 32 positions are based on the selec- Engineering, 2030 Business and Engineering Center, University of Southern Indiana, 8600 University Boulevard, Evansville, IN tion committee’s decision after careful deliberations. 47712, USA. E-mail: [email protected]. These decisions can be made based on a team’s ISSN 2215-020X © 2021 – The authors. Published by IOS Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (CC BY-NC 4.0). 2 N.R. Matthews et al. / PageRank NCAA men’s basketball bracket Table 1 2. Contextual background Number of teams and games by year Year Number of Teams Number of Games 2.1. Choosing tournament participants 2014 640 5800 2015 640 5789 Currently, the NCAA utilizes a team of individu- 2016 640 5832 als (i.e., selection committee) to choose and seed the 2017 643 5829 2018 656 5912 tournament teams. The winner from each conference tournament receives an automatic bid to compete within the tournament, and several other teams will earn an at-large bid partially based on their season performance, their location, and even the historic performance. These at-large teams are selected based popularity of the school to drive viewership ratings. on a variety of undisclosed factors used by the selec- Another important aspect to these deliberations is that tion committee. they are behind closed doors. Committee members have access to many While it is known that the selection commit- resources to be employed in the decision-making pro- tee typically utilized the Ratings Percentage Index cess. In addition to watching games during the regular (RPI) as a basis for their decisions in the past season, they also have access to conference monitor- before moving to the NCAA Evaluation Tool (NET) ing calls, National Association of Basketball Coaches they now utilize, there is still room for improve- regional advisory rankings, along with various com- ment. Each year, the committee receives scrutiny for puter metrics. The entire season’s statistics, including their approaches when constructing the tournament wins, losses, and point margins are considered. Addi- bracket of 68 teams. The goal of a committee is to tional factors that may be considered are road record, determine which team appears to be more deserving strength of schedule, player and coach availability, than another. The scrutiny arrives when the commit- and quality of wins and losses (NCAA.COM, 2019). tee appears to become subjective by not considering Several computer-based methods, including the objective measures (e.g., computerized rankings). Saragin method and Rating Percentage Index (RPI) Lunardi (2018) believes that the committee should scores, can be utilized to help choose or seed the have a model that optimizes “performance results teams, but do not seem to be any more effective than (e.g., “most deserving” teams) and predictive data the current method (Gray & Schwertman, 2012). (an objective version of the so-called “eye test”) to With half of the tournament participants being rank a widely disparate Division I more accurately.” chosen by a selection committee of ten school and The PageRank method gives a ranking that is based conference administrators, there should be a more solely on past and predictive performance. The cur- objective method for choosing teams. During selec- rent selection methods by the selection committee, tion, teams are seeded by the committee through a which may include popularity of a team and expected series of confidential polling. Members of the com- attendance at the tournament venue, are more mittee are not permitted to vote for or against any team subjective. with which a conflict of interest may occur. Although The investigators will explore whether Google’s these fail safes have been put in place, the commit- PageRank algorithm is sufficient for two purposes. tee members are humans and prone to bias either The first purpose is to provide an unbiased alterna- purposely or unintentionally. Utilizing an algorithm- tive (or additional) method for deciding which teams based selection process could remove some potential will participate in the tournament. The second will bias. be to find an accurate way to predict outcomes in the NCAA tournament. 2.2. Predicting tournament outcomes To rank the teams, the investigators utilized the PageRank algorithm. PageRank is an algorithm that Several analytical methods, including Winning was developed by Larry Page in 1996 for ranking aca- Percentage, Colley Method, Massey Method, and demic papers. It is a probability distribution that uses NCAA seed are currently utilized to predict the out- a weighted network to optimize rankings, and the past comes of the tournament. Winning percentage is a success of a team produces a cumulative advantage score calculated from the number of games a team that continues to grow as the season continues (Page, won out of the number of games the team played. et al., 1999). This method does not consider the caliber of teams N.R. Matthews et al. / PageRank NCAA men’s basketball bracket 3 playing, the venue, date in the season, or point mar- 3.2. Network creation gin. The Colley method takes things a step further and computes a rating for each team that depends on A method for building a particular season for anal- the ratings of all the team’s opponents. The Colley ysis can be done by creating a network of nodes and method does not take point margin or when the game edges. Nodes were created for each team utilizing occurred in the season into account. their unique numerical identifier and containing the The Massey method utilizes transitivity to include team name for eventual identification and analysis. game scores in ranking but does not consider when Edges were created for each game linking the los- in the season the games are played (Chartier, et al., ing team and winning team. This network can more 2010). Some also predict outcomes by choosing the specifically be labeled a directed graph where an edge teams with the higher NCAA seed. This method flows from a losing team to winning team. is only useful in early rounds, as there is no basis for choosing which team would win from different 3.3. Directed graph regions (Stekler & Klein, 2012). With the unpre- dictable nature of sporting events, different methods A directed graph allows for a visual representa- will achieve better results some years than others. The tion of all the nodes and edges and essentially the PageRank method can include individual game point entire season of NCAA Men’s Division 1 Basket- margins as well as the date and venue of the game ball. For illustrative purposes, a subgraph of the entire that could possibly perform at a superior level than directed graph for the 2017-2018 season can be seen the existing ranking tools. in Fig. 1. This subgraph is made up of entirely Big 10 conference games. The individual nodes shown as rectangles have edges leaving and entering them. 3. Method For example, the rectangular box labeled Indiana is the node that represents Indiana University’s men’s 3.1. Data set basketball team, a green edge can be seen entering this node but originating from the Illinois node. This A publicly available data source was utilized, edge represents a single game played between the Spreadsheet Sports: Sports Analytics and Projection University of Illinois and Indiana University.