Classification of Chess Games: an Exploration of Classifiers for Anomaly Detection in Chess
Minnesota State University, Mankato
Cornerstone: A Collection of Scholarly and Creative Works for Minnesota State University, Mankato

All Graduate Theses, Dissertations, and Other Capstone Projects
2021

Classification of Chess Games: An Exploration of Classifiers for Anomaly Detection in Chess
Masudul Hoque
Minnesota State University, Mankato

Follow this and additional works at: https://cornerstone.lib.mnsu.edu/etds
Part of the Applied Statistics Commons, and the Artificial Intelligence and Robotics Commons

Recommended Citation
Hoque, M. (2021). Classification of chess games: An exploration of classifiers for anomaly detection in chess [Master’s thesis, Minnesota State University, Mankato]. Cornerstone: A Collection of Scholarly and Creative Works for Minnesota State University, Mankato. https://cornerstone.lib.mnsu.edu/etds/1119

This Thesis is brought to you for free and open access by the Graduate Theses, Dissertations, and Other Capstone Projects at Cornerstone: A Collection of Scholarly and Creative Works for Minnesota State University, Mankato. It has been accepted for inclusion in All Graduate Theses, Dissertations, and Other Capstone Projects by an authorized administrator of Cornerstone: A Collection of Scholarly and Creative Works for Minnesota State University, Mankato.

CLASSIFICATION OF CHESS GAMES
An exploration of classifiers for anomaly detection in chess

By
Masudul Hoque

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Applied Statistics, MS
In
Master’s in Applied Statistics

Minnesota State University, Mankato
Mankato, Minnesota
May 2021

April 2nd 2021

This thesis has been examined and approved by the following members of the student’s committee.
________________________________ Advisor
________________________________ Committee Member
________________________________ Committee Member

Table of Contents

Chapter 1 Introduction
  1.1 Introduction
  1.2 Literature Review
  1.3 Research Problem Statement
Chapter 2 Data
  2.1 Data Source
  2.2 Data Description
  2.3 Data Preprocessing
  2.4 Data Manipulations
Chapter 3 Methodology
  3.1 Classification
    3.1.1 Discriminant Analysis
      3.1.1.a Linear Discriminant Analysis
      3.1.1.b Quadratic Discriminant Analysis
    3.1.2 K Nearest Neighbour
    3.1.3 Multinomial Logistic Regression
  3.2 Model Validation
    3.2.1 Confusion Matrix
    3.2.2 K Fold Cross-Validation
    3.2.3 Leave One Out Cross-Validation
  3.3 Technology Use
Chapter 4 Results
  4.1 Phase I
    4.1.1 Linear Discriminant Analysis
    4.1.2 Quadratic Discriminant Analysis
    4.1.3 Multinomial Logistic Regression
    4.1.4 K Nearest Neighbour
    4.1.5 Phase I Summaries
  4.2 Phase II
    4.2.1 Linear Discriminant Analysis
    4.2.2 Quadratic Discriminant Analysis
    4.2.3 Multinomial Logistic Regression
    4.2.4 K Nearest Neighbour
    4.2.5 Phase II Summaries
  4.3 Phase III
Chapter 5 Conclusions
  5.1 Conclusion
  5.2 Future Work
References
Appendix
  Python Codes
  R Codes

ABSTRACT

Chess is a strategy board game with its inception dating back to the 15th century. The Covid-19 pandemic has led to an online chess boom, with 95,853,038 chess games played on lichess.org during January 2021. Along with the chess boom, instances of cheating have also become more rampant. Classification has been used for anomaly detection in different fields, so it is a natural idea to develop classifiers to detect cheating in chess. However, there are no specific examples of this, and it is difficult to obtain data where cheating has occurred. So, in this paper, we develop four machine learning classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis, Multinomial Logistic Regression, and K-Nearest Neighbour) to predict chess game results and explore the predictors that produce the best accuracy. We use the Confusion Matrix, K Fold Cross-Validation, and Leave-One-Out Cross-Validation methods to obtain the accuracy metrics. There are three phases of analysis. In phase I, we train the classifiers using 1.94 million over-the-board games as training data and 20 thousand online games as testing data, and obtain accuracy metrics. In phase II, we select a smaller pool of 212 games, derive additional predictor variables from chess engine evaluations of the moves played in those games, and check whether including these variables improves performance.
Finally, in phase III, we look for patterns in the misclassified cases to define anomalies. In phase I, the models do not reach a usable level of accuracy (44-63%). For all classifiers, it is no better than deciding the class with a coin toss. K-Nearest Neighbour with K = 7 was the best model. In phase II, adding the new predictors improved the performance of all the classifiers significantly across all validation methods. In fact, using only the significant variables as predictors produced highly accurate classifiers. Finally, in phase III, we could not find any patterns or significant differences in the predictors between correct classifications and misclassifications. In conclusion, machine learning classification is only one useful tool for spotting instances that indicate anomalies. However, we cannot judge games to be anomalous using this method alone.

Acknowledgements

I would first like to thank my thesis advisor, Dr Iresha Premarathna of the Department of Mathematics & Statistics at Minnesota State University-Mankato. Dr Premarathna was always available to provide invaluable insight and guidance, and to answer any question about my research or writing. She consistently allowed this paper to be my own work but steered me in the right direction whenever I felt lost.

I would also like to acknowledge Dr. Mezbahur Rahman and Dr. Deepak Sanjel of the Department of Mathematics & Statistics at Minnesota State University-Mankato, esteemed members of the thesis committee who were wonderful teachers in many of my courses. I am gratefully indebted to them for their teaching and their valuable comments on this thesis.

I must express my very profound gratitude to my mother, for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis.