Using Experts' Opinions in Machine Learning Tasks

Jafar Habibi*^, Amir Fazelinia*, Issa Annamoradnejad*
* Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
^ Corresponding Author, Email: [email protected]

Abstract— In machine learning tasks, especially in the tasks of prediction, scientists tend to rely solely on available historical data and disregard unproven insights, such as experts' opinions, polls, and betting odds. In this paper, we propose a general three-step framework for utilizing experts' insights in machine learning tasks and build four concrete models for a sports game prediction case study. For the case study, we have chosen the task of predicting NCAA Men's games, which has been the focus of a group of Kaggle competitions in recent years. Results highly suggest that the good performance and high scores of the past models are a result of chance, and not of a good-performing and stable model. Furthermore, our proposed models can achieve steadier results with a lower log loss average (best at 0.489) compared to the top solutions of the 2019 competition (>0.503), and reach the top 1%, 10% and 1% in the 2017, 2018 and 2019 leaderboards, respectively.

Keywords— Experts' opinions; machine learning; sports game prediction; March Madness

I. INTRODUCTION

People engage in some method of predicting the future in order to meet it more successfully [1], and become experts in a subject by possessing sufficient knowledge, skill, experience, training or education. Many examples from daily world events have demonstrated the high accuracy of experts' opinions in various aspects of life, from weather forecasts and sports result predictions to market and political analysis.

Using experts' opinions in Machine Learning (ML) tasks has been neglected by researchers, especially in the tasks of prediction. This is largely due to a lack of aggregated expert opinions in an objective format (e.g., numerical rather than textual). To overcome this, data scientists usually have to make sense of textual comments aggregated from different sources. However, with the latest advancements in Natural Language Processing (NLP) and classification algorithms, and the ever-increasing existence of online historical data, it is becoming easier to overcome this challenge. For example, it is possible to crawl film critics' reviews and accurately classify these subjective texts with little training and model adjustment using pre-trained models ([2], [3]) or sentence embeddings ([4], [5]).

This being said, the problems of data collection are out of the scope of this paper, and our focus is to present a framework for building strong ML models using experts' opinions. The main contributions of this study are: 1) to propose a general three-step framework to employ experts' viewpoints, 2) to apply that framework in a case study and build models exclusively based on experts' opinions, and 3) to compare the accuracy of the proposed models with some baseline models.

For the case study analysis, we selected the task of predicting the results of the NCAA Men's Division I Basketball Tournament, widely known as NCAA March Madness. Started in 1939, it is one of the biggest annual sporting events in the world, consisting of 67 single-elimination games across the United States [6]. The main reason for selecting this was the instability of the previously proposed machine learning models.

The structure of this article is as follows: Section 2 describes the related topics of this paper and dives into the past works of using experts' opinions. Section 3 explains the general framework of the methodology with its concrete applications to the selected case study. In Section 4, we discuss our experiments, data, baseline models, and results. Section 5 presents the concluding remarks.

II. BACKGROUND

In this section, first, we review the literature on using expert insight, and then we move on to discussing the case study.

A. Experts' Opinions

The best-known method for using experts' opinions is the Delphi Technique. Developed for the US Army, the Delphi Technique is an iterative process to collect and distill anonymous judgments of respondents within their domain of expertise, interspersed with feedback [7], [8]. The main advantage of the Delphi is reported to be the achievement of consensus in a given area of uncertainty or lack of empirical evidence [9]. There is a large corpus of research papers incorporating the Delphi method, mostly in research areas that appreciate experts' insights because of uncertainty, such as healthcare [10]–[12], business [13], [14], law [15], and education [16], [17]. A recent literature survey [18] shows the vast impact of the Delphi Technique by identifying more than 2600 published scholarly papers that applied the technique.

Experts' opinions have been partially applied or acknowledged in the field of machine learning. For example, [19] used the opinions of more than 100 experts to determine the most suitable method for nine machine learning tasks. Another study [20] considered using the opinions of multiple experts for classifying clinical data using machine learning. They used three expert reviewers and one meta-reviewer to build a consensus model, which showed improved learning.

B. Case study: NCAA March Madness Games

The NCAA (National Collegiate Athletic Association) hosts a college basketball tournament each year, which is the focus of many betting websites. Started in 1939, it is one of the biggest annual sporting events in the world. Completing the tournament requires 67 single-elimination games to be played at 14 different sites spread out across the United States, all of which take place over a two-and-a-half-week span between mid-March and early April [21].

Traditionally, the tournament included 64 teams competing in five rounds of games; however, in 2011 the NCAA expanded it from 64 to 68 teams, where the last eight teams play for four spots, bringing the field to 64 before the first round [22]. In the current format, the so-called March Madness tournament holds 67 single-elimination games: 4 in the First Four round, 32 in round one, 16 in round two, 8 in round three, 4 in round four, 2 in round five, and 1 for the championship.

These games are preceded by more than four months (133 days) of regular season games that are played between more than 330 teams of Division I to determine the winners of the 32 conferences, the other qualified teams, and their seeding for the next rounds. Each team's seed (from 1 to 16; 1 being the strongest) is assigned by the selection committee and has been considered a major predictor of success by past researchers [6].

Several ranking and rating systems present weekly estimations for the NCAA basketball league by taking the latest performance of teams into account. They are monitored and presented by experts in the field and are widely considered in people's bracket predictions. Large news agencies, such as the Associated Press, USA Today, NCAA, and ESPN, display a designed or supported ranking system.

The Pomeroy system is the most popular ranking system for NCAA tournaments and is based on efficiency metrics calculated per possession [23], [24]. Pomeroy rates both offense and defense on each possession, using efficiency metrics (such as the number of scores, rebounds, and turnovers) that he adjusts for each game location and opponent. Sagarin ratings [25] is another popular ranking system that appears in USA Today for a few popular sports leagues, including college basketball. The algorithm takes into account a team's results, its opponents' strength, and home-away advantage; however, the exact algorithm of the system is unpublished. To make predictions using Sagarin ratings, it is officially suggested to add four additional points for the home team, since the ranks' differences reflect the predicted margin of victory in a game played on a neutral court [6].

The earliest works in predicting the NCAA results (such as [26], [27]) tried to find a linear relationship between rank differences and the probability of winning a game. The winning team of Kaggle's 2014 March Madness competition applied team-based possession metrics in a logistic regression model [28]. [29] applied a classical soccer-forecasting model to make predictions for the college basketball tournament.

[30] designed a new predictive model using matrix completion to predict the results of the NCAA tournament. Their method is based on formulating performance details from regular season games of the same year into matrix form and applying matrix completion to forecast the potential performance accomplishments by teams in tournament games. They utilized their work in the 2016 Kaggle competition and reached 341st out of 596 teams in the leaderboard [31].

[6] based their work on two famous ranking systems and proposed a combined model of two methods: the least squares pairwise comparison model and isotonic regression. They were able to reach the top 5% in the 2016 and 2017 Kaggle competitions. Other related prediction research papers include [32], which applied three variations of linear/polynomial models on Pomeroy ratings to test their accuracy in the 2016-2018 tournaments, and [33], which used a dual-proportion probability model on a ranking system to produce bracket predictions.

Finally, some studies focused on the problem from a different angle by trying to predict the first-round upsets. [24] used a Balance Optimization Subset Selection model to determine games that are statistically similar to historical upsets. They applied their model and achieved a 38.4% accuracy. [34] is another study that tried to analyze the apparent anomalies of the previous tournaments.

III. METHODOLOGY

In this section, we explain the conceptual framework for using experts' opinions, alongside a concrete application to a case study.

A. Conceptual framework

Our framework to use experts' opinions is composed of three general steps:

1. The first step is to rank experts based on the accuracy of their previous predictions, or of predictions based on their opinions/ranking/rating. By doing this, we reach a new ordered list of experts, ER_1, ..., ER_i, ..., ER_n, where n is the number of experts, and i (1<=i<=n) is the rank of an expert based on its success on previous predictions.

2. Since our final prediction is an ensemble of several experts' opinions and some experts may have a negative impact on the final prediction, we try to select the top experts with positive influence. To reach this goal, we evaluate the accuracy of the top-k experts for all possible values of k (1<=k<=n) and select the optimal number (N). In short, the second step is to find the optimal N based on the accuracy of expert merging in previous predictions. This step is formulated as Formula 1:

   N = argmax_{1<=k<=n} ACC_PRE(MERGE[ER_1, ..., ER_k])    (1)

   , where MERGE is a function to build a novel prediction by combining multiple models, and ACC_PRE is a function to evaluate the accuracy of a prediction in a previous round of predictions (such as a year/season).

3. The final step is to meld the predictions of the top-N experts for the target games (final rounds) and evaluate the resulting predictions. It can be formulated as Formula 2:

   ACC = ACC_CUR(MERGE[ER_1, ..., ER_N])    (2)

   , in which ACC_CUR is a function to assess the overall accuracy of a prediction for the target/current time range (year/season). This means that the output of this step can show the success of the whole method. It should be noted that it is not possible to perform ACC_CUR for a future or ongoing event, as the ground truth has not been revealed; however, the output of the MERGE function can be used for generating final predictions.

B. Application to case study

In this section, we present an application of the framework to predict the outcomes of the NCAA tournaments in recent years. In total, we propose four models that use experts' opinions in the form of ranking/rating systems to predict the outcome of games of the final rounds, commonly known as March Madness. Each of these four models follows the three general steps of the conceptual framework:

1. Rank experts based on previous predictions.
2. Choose the optimal N based on previous results.
3. Merge the predictions of the top N experts for the current season.

As we briefly noted in Section 2, the NCAA tournament has two stages: the first stage is the regular season and the second stage is the final knockout rounds. While the first two models rank experts based on the results of the previous season, the third and fourth models use regular season results of the same year as a basis for ranking experts.

Here are the descriptions of the four proposed models for the case study:

E1 model: In this model, we rank experts based on the accuracy of their predictions in the final rounds of the previous year's tournament. For step three, we merge the predictions of the top experts using a simple equal-weight average. Therefore, the optimal N is selected in a way that the arithmetic mean of the top N experts' predictions in the previous year achieves the highest accuracy in that year.

E2 model: The model is similar to the E1 model, including the first step; therefore, the ranking of experts remains the same. The difference is in the third step, where we use a weighted average with exponentially decreasing weights to merge the predictions of experts.

E3 model: In this model, we rank experts based on the accuracy of their predictions in regular season games of the current year. There are around 130 days of regular season games during which the ranking systems update their rankings in a weekly manner. To be exact, we use the experts' rankings around the 100th day to predict the remaining games of the regular season. By evaluating the accuracy of each system, we reach a new ranking for experts. Using this new ranking, we select the top N experts and merge them using a simple average method.

E4 model: This is very similar to the E3 model, except that it uses a weighted average with exponentially decreasing weights to merge experts' predictions.

IV. EXPERIMENTS

We selected Kaggle's 2017 to 2019 March Madness competitions¹ as the basis for our evaluation. These competitions have attracted hundreds of data science teams since 2014, with the aim of predicting the outcome of every possible game between the teams of the final rounds. In the current version of the tournament, 68 teams continue to the final rounds, which creates 2278 unique possible games.

In this section, first, we take a brief look at the data structure and evaluation metric. Then, we introduce the baseline models used for comparison and discuss the results of our experiments.

A. Data

We use the data provided by Kaggle for the machine learning competitions. These data are separated into six sections with more than 20 tables. The first section provides basic metadata about teams, such as team names, seeds, and final scores of all games since 1984-85. The second section includes game-by-game detailed statistics (such as free throws attempted, defensive rebounds, turnovers, etc.) at a team level for all regular season, conference tournament, and NCAA® tournament games since the 2002-03 season. The third section is about geographical data and locations of previous games since the 2009-10 season. The fourth section, which is the focus of this article, provides weekly team rankings for dozens of top rating systems since the 2002-03 season. Section five includes detailed events of games alongside game-player relations since the 2009-10 season, and finally, section six provides supporting data and metadata, such as coaches, conference affiliations, and bracket structure [35].

While competitors build models using tables from all sections of the dataset and are encouraged to use external datasets, our proposed models only use the data table from section four that contains rankings of teams based on different ranking systems, such as Pomeroy, Sagarin, RPI, ESPN, etc.

The dataset for section four contains the following columns: season/year, team ID, system name, ranking day number relative to the start of the season (from 0 to 133, with 133 being the final pre-tournament rankings), and the overall ranking of the target team in the underlying system.

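The three steps of the framework, together with the MERGE and ACC_PRE functions of Formulas 1 and 2, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: all names are ours, and we use the negative of the competition's log loss as the higher-is-better accuracy measure.

```python
import numpy as np

def neg_log_loss(p, y, eps=1e-15):
    """Higher-is-better wrapper around the competition's log loss (Formula 3)."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def rank_experts(expert_preds, outcomes, score):
    """Step 1: order experts by the score of their past predictions, best first.

    expert_preds: dict mapping expert name -> array of win probabilities
    outcomes:     binary array of realized results for the same games
    score:        higher-is-better scoring function
    """
    return sorted(expert_preds,
                  key=lambda e: score(expert_preds[e], outcomes),
                  reverse=True)

def merge(preds_list, weights=None):
    """MERGE: combine several experts' probability vectors into one prediction.

    weights=None gives the simple average (E1/E3 variant); passing
    exponentially decreasing weights gives the E2/E4 variant."""
    return np.average(np.stack(preds_list), axis=0, weights=weights)

def select_n(ranked, expert_preds, outcomes, score):
    """Step 2 (Formula 1): pick N maximizing the score of the merged top-k experts."""
    best_k, best_score = 1, -np.inf
    for k in range(1, len(ranked) + 1):
        s = score(merge([expert_preds[e] for e in ranked[:k]]), outcomes)
        if s > best_score:
            best_k, best_score = k, s
    return best_k
```

Step 3 is then simply `merge([expert_preds[e] for e in ranked[:N]])` applied to the target games and evaluated against their outcomes once the ground truth is available.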
Most systems provide a complete ranking from #1 through #350+, but sometimes they include ties and provide rankings for a smaller set of teams, as with the Associated Press's top 25 [35].

B. Evaluation

In Kaggle's March Madness competitions, each team can submit up to two files, judged using the logarithmic loss function (the predictive binomial deviance):

LogLoss = -(1/n) * sum_{i=1..n} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]    (3)

, where n is the number of games, p_i is the winning probability of team 1 playing against team 2, and y_i equals 1 if team 1 wins over team 2 and 0 otherwise. A smaller value of LogLoss indicates better performance of the predictive model [36].

Contrary to several previous research articles on predicting NCAA games that were evaluated on only one or two years of competitions, we evaluate our proposed method on the three most recent consecutive seasons of competitions². This step is taken to prevent over-fitted or by-chance models, as was the case for many, if not all, winning solutions of Kaggle's March Madness competitions. For example, a submission exactly based on the published model of the first-rank solution of the 2019 competition would rank 349th out of 934 participants and 88th out of 441 participants in the 2018 and 2017 competitions, respectively.

C. Baselines

In order to evaluate the efficiency and stability of our proposed models, we compare them with the top three models of Kaggle's 2019 March Madness competition. The winner of the competition achieved a log loss of 0.41477, the lowest value among 866 teams of data scientists. The next two teams in line scored 0.42012 and 0.42698 log loss values, respectively (Table 1).

Since the winners of the competition were obliged to share their winning solutions, we used their code to create submissions for the 2017 and 2018 competitions. In addition, we used ensemble learning to blend the three models and generate new submissions. As a result, the ensemble submission of the baselines for 2018 was stronger than the original baseline models. Finally, we used a Kaggle feature that allows users to submit and evaluate their code after the competition deadline.

D. Results and discussion

In Table 1, we display the log loss for each method in the three most recent seasons of NCAA college basketball games. Furthermore, we include the leaderboard rank for participating in the corresponding Kaggle competition. The table shows three groups of methods: 1) the first group contains the baseline models, i.e., the top-3 models of the 2019 Kaggle competition and an ensemble of the three, 2) the second group contains our four proposed models that only depend on experts' opinions, as described in Section III.B, and 3) the third group contains two ensemble models that merge models of the first two groups.

Table 1. Results for evaluation of the proposed models on the three most recent competitions and a comparison with the baseline models. Notes: each cell in the 2019, 2018, and 2017 columns contains the log loss value of the prediction and, in parentheses, its rank in the leaderboard of Kaggle's March Madness competition.

NAME     | DESCRIPTION                                                   | 2019                | 2018                | 2017                | MEAN
GROUP 1: BASELINE MODELS - KAGGLE 2019'S TOP RANKED TEAMS
B1       | 2019 first rank                                               | 0.41477 (1st/866)   | 0.60213 (349th/934) | 0.49082 (88th/441)  | 0.503
B2       | 2019 second rank                                              | 0.42012 (2nd/866)   | 0.85021 (808th/934) | 1.40911 (435th/441) | 0.886
B3       | 2019 third rank                                               | 0.42698 (3rd/866)   | 0.60685 (395th/934) | 0.50452 (150th/441) | 0.511
B1+B2+B3 | Ensemble of top three models of 2019                          | 0.48548 (265th/866) | 0.58755 (154th/934) | 0.59708 (346th/441) | 0.556
GROUP 2: PROPOSED MODELS SOLELY BASED ON EXPERTS' OPINIONS
E1       | Mean of top experts based on previous season results          | 0.49388 (329th/866) | 0.58088 (83rd/934)  | 0.49734 (121st/441) | 0.523
E2       | Weighted mean of top experts based on previous season results | 0.47376 (191st/866) | 0.58750 (155th/934) | 0.48506 (61st/441)  | 0.515
E3       | Mean of top experts based on regular season games of the same year | 0.45355 (56th/866)  | 0.58894 (172nd/934) | 0.46441 (6th/441)   | 0.502
E4       | Weighted mean of top experts based on regular season games of the same year | 0.45334 (55th/866)  | 0.57293 (32nd/934)  | 0.44496 (2nd/441)   | 0.491
GROUP 3: BLEND OF THE TOP MODELS OF THE FIRST TWO GROUPS
B1+E2    | Ensemble of 2019's first rank with E2                         | 0.43231 (7th/866)   | 0.58858 (171st/934) | 0.47956 (46th/441)  | 0.500
B1+E4    | Ensemble of 2019's first rank with E4                         | 0.41622 (2nd/866)   | 0.58546 (130th/934) | 0.46505 (7th/441)   | 0.489

We observe from the results of the baseline models that the good performance and high scores of the competition winners are most likely a result of chance, and not of a good-performing and stable model. For example, the first baseline model (B1), which has the lowest mean among the four baseline models, achieved much higher log loss values in the 2017 and 2018 seasons than in the 2019 competition. In addition, no member of the top-10 teams of 2019 had been awarded any Kaggle competition medal before; a strange and uncommon occurrence that strengthens the premise of chance involvement.

The second group of models shows promising results in terms of stability and cohesion over the years. We can observe that their log loss scores have little variance compared to the baseline models. Even in the case of the 2018 season, which involved a few upsets, the loss values for the proposed models are closer to the rest. The stability of the proposed models, as evident in Figure 1, is a great outcome considering that the scores of the previous methods vary from season to season, meaning that some seasons are more predictable than others [6].

1 The official title has changed a few times over the years.

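Formula 3 above maps directly to a few lines of code. The following is a generic sketch; the clipping of probabilities away from 0 and 1 is a standard guard against infinite loss, not something specified in the paper.

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Formula 3: predictive binomial deviance over n games.

    y: array of outcomes (1 if team 1 won, else 0)
    p: array of predicted win probabilities for team 1
    Probabilities are clipped to avoid log(0) for extreme predictions.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

A constant 0.5 prediction for every game scores log(2) ≈ 0.693, which is the natural reference point against which the values in Table 1 can be read.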
2 Due to the spread of COVID-19, the NCAA canceled the 2020 tournament.

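The paper does not publish the exact decay schedule behind the "exponentially decreasing weights" of the E2 and E4 models, so the following is only one plausible reading; in particular, the decay factor of 0.5 per rank is our assumption.

```python
import numpy as np

def exp_weighted_merge(preds_list, decay=0.5):
    """Merge ranked experts' predictions (best expert first) with weights
    decay**0, decay**1, ..., normalized to sum to 1.

    decay is a hypothetical parameter; its value is not reported in the paper.
    """
    w = decay ** np.arange(len(preds_list))
    w = w / w.sum()
    return np.average(np.stack(preds_list), axis=0, weights=w)
```

For example, with two experts and decay=0.5, the merged prediction weighs the top expert 2/3 and the second expert 1/3.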
Figure 1. Results of evaluation for the three groups of models

Since the first group of methods used data other than experts' opinions and showed a great loss in some cases, we considered merging them with the models of the second group. This step is taken, first, to keep the stability of the second group of models, and second, to use the knowledge from other data sources. To this end, we used the first baseline model, since it scored better than the rest in all three years, and merged it with one model using the previous season results and one model using the regular season games. The idea of blending was a success, and we achieved lower loss values while keeping some stability over the years. For example, the ensemble of 2019's first-rank model with the fourth proposed model would rank 2nd, 130th and 7th in the 2019, 2018 and 2017 Kaggle competitions, respectively.

This experiment shows that by incorporating experts' opinions, whether by ranking experts based on the previous season results or the regular season of the same year, we can improve the predictive accuracy of the baseline models. This result can be applied to other areas of machine learning where previous knowledge of experts exists.

V. CONCLUSION

The theoretical framework established in the current study can be used in machine learning tasks to build stable and well-performing models based on experts' viewpoints. Our experimental results on the selected case study show that by incorporating experts' opinions in the specified methods, we can improve the predictive accuracy of existing models. The framework can be applied to all areas of machine learning where previous knowledge of experts exists.

REFERENCES

[1] P. Twitchell, The Eck-Vidya: Ancient Science of Prophecy. Illuminated Way Press, 1972.
[2] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 2019.
[3] I. Annamoradnejad, M. Fazli, and J. Habibi, "Predicting Subjective Features from Questions on QA Websites using BERT," in 2020 6th International Conference on Web Research, ICWR 2020, 2020, doi: 10.1109/icwr49608.2020.9122318.
[4] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using siamese BERT-networks," in EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2020, doi: 10.18653/v1/d19-1410.
[5] I. Annamoradnejad, "ColBERT: Using BERT Sentence Embedding for Humor Detection," arXiv 2004.12765, 2020.
[6] A. Neudorfer and S. Rosset, "Predicting the NCAA basketball tournament using isotonic least squares pairwise comparison model," J. Quant. Anal. Sport., vol. 14, no. 4, pp. 173–183, 2018, doi: 10.1515/jqas-2018-0039.
[7] N. Rescher, Predicting the Future: An Introduction to the Theory of Forecasting. 1998.
[8] C. C. Hsu and B. A. Sandford, "The Delphi technique: Making sense of consensus," Pract. Assessment, Res. Eval., vol. 12, no. 10, pp. 1–8, 2007.
[9] P. Catherine, "The Delphi technique: myths and realities," J. Adv. Nurs., vol. 41, no. 4, pp. 376–382, 2003. [Online]. Available: http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L36478234.
[10] K. Fujimoto, M. Cao, L. M. Kuhns, D. Li, and J. A. Schneider, "Statistical adjustment of network degree in respondent-driven sampling estimators: Venue attendance as a proxy for network size among young MSM," Social Networks, vol. 54, pp. 118–131, 2018, doi: 10.1016/j.socnet.2018.01.003.
[11] Y. Geng, L. Zhao, Y. Wang, Y. Jiang, K. Meng, and D. Zheng, "Competency model for dentists in China: Results of a Delphi study," PLoS One, 2018, doi: 10.1371/journal.pone.0194411.
[12] A. Bessa et al., "Consensus in Bladder Cancer Research Priorities Between Patients and Healthcare Professionals Using a Four-stage Modified Delphi Method," European Urology, 2019, doi: 10.1016/j.eururo.2019.01.031.
[13] E. Hurmekoski, M. Lovrić, N. Lovrić, L. Hetemäki, and G. Winkel, "Frontiers of the forest-based bioeconomy – A European Delphi study," For. Policy Econ., 2019, doi: 10.1016/j.forpol.2019.03.008.
[14] P. Rikkonen, P. Tapio, and H. Rintamäki, "Visions for small-scale renewable energy production on Finnish farms – A Delphi study on the opportunities for new business," Energy Policy, 2019, doi: 10.1016/j.enpol.2019.03.004.
[15] R. Bernal, L. San-Jose, and J. L. Retolaza, "Improvement actions for a more social and sustainable public procurement: A Delphi analysis," Sustain., vol. 11, no. 15, pp. 1–15, 2019, doi: 10.3390/su11154069.
[16] J. M. Swank and A. Houseknecht, "Teaching Competencies in Counselor Education: A Delphi Study," Couns. Educ. Superv., 2019, doi: 10.1002/ceas.12148.
[17] I. Mat Nashir, A. Yusoff, M. Khairudin, M. R. Idris, and N. N. Imam Ma'arof, "Delphi Method: the Development of Robotic Learning Survey in Tertiary Education," J. Vocat. Educ. Stud., vol. 2, no. 1, p. 13, 2019, doi: 10.12928/joves.v2i1.761.
[18] A. Flostrand, L. Pitt, and S. Bridson, "The Delphi technique in forecasting – A 42-year bibliographic analysis (1975–2017)," Technol. Forecast. Soc. Change, vol. 150, p. 119773, 2020, doi: 10.1016/j.techfore.2019.119773.
[19] V. Moustakis, M. Lehto, and G. Salvendy, "Survey of Expert Opinion: Which Machine Learning Method May Be Used for Which Task?," vol. 8, no. 3, pp. 221–236, 1996, doi: 10.1080/10447319609526150.
[20] H. Valizadegan, Q. Nguyen, and M. Hauskrecht, "Learning classification models from multiple experts," J. Biomed. Inform., vol. 46, no. 6, pp. 1125–1135, 2013, doi: 10.1016/j.jbi.2013.08.007.
[21] T. A. McFall, The (Peculiar) Economics of NCAA Basketball. 2014.
[22] D. Wilco, "When did March Madness expand to 68 teams?," NCAA, 2019. https://www.ncaa.com/news/basketball-men/article/2019-01-09/when-did-march-madness-expand-68-teams (accessed Aug. 10, 2020).
[23] K. Pomeroy, "KenPom Ratings methodology update," 2016. https://kenpom.com/blog/ratings-methodology-update/ (accessed Aug. 10, 2020).
[24] S. Dutta, S. H. Jacobson, and J. J. Sauppe, "Identifying NCAA tournament upsets using Balance Optimization Subset Selection," J. Quant. Anal. Sport., vol. 13, no. 2, pp. 79–93, 2017, doi: 10.1515/jqas-2016-0062.
[25] J. Sagarin, "Jeff Sagarin Ratings," USA Today, 2020. https://www.usatoday.com/sports/ncaab/sagarin/ (accessed Aug. 10, 2020).
[26] B. P. Carlin, "Improved NCAA Basketball Tournament Modeling via Point Spread and Team Strength Information," Am. Stat., 1996, doi: 10.2307/2685042.
[27] N. C. Schwertman, K. L. Schenk, and B. C. Holbrook, "More Probability Models for the NCAA Regional Basketball Tournaments," Am. Stat., 1996, doi: 10.1080/00031305.1996.10473539.
[28] M. J. Lopez and G. J. Matthews, "Building an NCAA men's basketball predictive model and quantifying its success," J. Quant. Anal. Sport., 2015, doi: 10.1515/jqas-2014-0058.
[29] F. J. R. Ruiz and F. Perez-Cruz, "A generative model for predicting outcomes in college basketball," J. Quant. Anal. Sport., 2015, doi: 10.1515/jqas-2014-0055.
[30] H. Ji, E. O. Saben, A. Boudion, and Y. Li, "March Madness Prediction: A Matrix Completion Approach," 2014.
[31] H. Ji, E. O. Saben, R. Lambi, and Y. Li, "Matrix Completion Based Model V2.0: Predicting the Winning Probabilities of March Madness Matches."
[32] S. Downs, "Using Statistics to Create the Perfect March Madness Bracket," pp. 0–33, 2019.
[33] A. A. Gupta, "A new approach to bracket prediction in the NCAA Men's Basketball Tournament based on a dual-proportion likelihood," J. Quant. Anal. Sport., vol. 11, no. 1, pp. 53–67, 2015, doi: 10.1515/jqas-2014-0047.
[34] D. L. Zimmerman, N. D. Zimmerman, and J. T. Zimmerman, "March Madness 'Anomalies': Are They Real, and If So, Can They Be Explained?," Am. Stat., vol. 0, no. 0, pp. 1–10, 2020, doi: 10.1080/00031305.2020.1720814.
[35] Kaggle, "Google Cloud & NCAA® ML Competition 2018-Men's," 2018. https://www.kaggle.com/c/mens-machine-learning-competition-2018/ (accessed Aug. 10, 2020).
[36] "Pattern Recognition and Machine Learning," J. Electron. Imaging, 2007, doi: 10.1117/1.2819119.