The Development of a Predictive Model of On-Time High School Graduation in British Columbia

May 1, 2019

Ross Finnie, Eda Suleymanoglu, Ashley Pullman, Michael Dubois

Table of Contents

Executive Summary ...... 1
1. Introduction ...... 4
2. Literature Review ...... 5
3. Data ...... 7
3.1 Outcome Variables of Interest ...... 8
3.2 Predictor Variables ...... 8
3.3 Sample Selection ...... 9
4. Methodology ...... 10
4.1 Predictive Model ...... 10
4.2 Predictive Accuracy ...... 11
4.3 Cross-Validation Method ...... 11
4.4 Modelling Approaches ...... 14
4.5 Evaluation Methodology ...... 18
5. Results for Grade 8 ...... 21
5.1 Comparison of the Modelling Approaches ...... 22
5.2 External Validation of the Selected Approach ...... 27
5.3 Risk Scores: Predicted Probability of Not Graduating on Time ...... 29
5.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach ...... 31
5.5 Importance of the FSA Scores for Predictive Accuracy ...... 34
6. Results for Grade 5 ...... 37
6.1 Comparison of the Modelling Approaches ...... 37
6.2 External Validation of the Selected Approach ...... 41
6.3 Risk Scores: Predicted Probability of Not Graduating on Time ...... 43
6.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach ...... 45
6.5 Importance of the FSA Scores for Predictive Accuracy ...... 46
7. Discussion ...... 49
References ...... 57
Glossary ...... 60


Table of Figures

Figure 1: 5-Fold Cross-Validation ...... 12
Figure 2: Nested 5-Fold Cross-Validation ...... 13
Figure 3: Nested 5-Fold CV and the External Validation Set ...... 14
Figure 4: Decision Tree Example ...... 17
Figure 5: Scenarios when Comparing Actual vs. Predicted Outcomes ...... 18
Figure 6: Example ROC Curves ...... 20
Figure 7: Average AUC by Approach, Grade 8 ...... 22
Figure 8: Distribution of AUCs by Approach, Grade 8 ...... 23
Figure 9: ROC Curves by Approach, Grade 8 ...... 23
Figure 10: Average P@10 by Approach, Grade 8 ...... 25
Figure 11: Distribution of P@10 by Approach, Grade 8 ...... 26
Figure 12: AUC and P@10 for CV and External Validation Set, Grade 8 ...... 28
Figure 13: ROC Curves for CV and External Validation Set, Grade 8 ...... 28
Figure 14: Distribution of Risk Scores, Grade 8 ...... 29
Figure 15: Cumulative Distribution of Risk Scores, Grade 8 ...... 30
Figure 16: Empirical Risk Curve, Grade 8 ...... 31
Figure 17: Confusion Matrix using a 0.21 Predicted Probability Threshold, Grade 8 ...... 32
Figure 18: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 8 ...... 33
Figure 19: AUC and P@10 with and without the FSA Scores, Grade 8 ...... 34
Figure 20: ROC Curves with and without the FSA Scores, Grade 8 ...... 36
Figure 21: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 8 ...... 36
Figure 22: Average AUC by Approach, Grade 5 ...... 37
Figure 23: Distribution of AUCs by Approach, Grade 5 ...... 37
Figure 24: ROC Curves by Approach, Grade 5 ...... 38
Figure 25: Average P@10 by Approach, Grade 5 ...... 39
Figure 26: Distribution of P@10 by Approach, Grade 5 ...... 40
Figure 27: AUC and P@10 for CV and External Validation Set, Grade 5 ...... 41
Figure 28: ROC Curves for CV and External Validation Set, Grade 5 ...... 42
Figure 29: Distribution of Risk Scores, Grade 5 ...... 43
Figure 30: Cumulative Distribution of Risk Scores, Grade 5 ...... 44
Figure 31: Empirical Risk Curve, Grade 5 ...... 45
Figure 32: Confusion Matrix using a 0.23 Predicted Probability Threshold, Grade 5 ...... 45
Figure 33: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 5 ...... 46
Figure 34: AUC and P@10 with and without the FSA Scores, Grade 5 ...... 47


Figure 35: ROC Curves with and without the FSA Scores, Grade 5 ...... 48
Figure 36: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 5 ...... 48


List of Acronyms

AUC: Area Under the Curve
CV: Cross-Validation
FPR: False Positive Rate
FSA: Foundation Skills Assessment
P@10: Precision at the Top 10%
PEN: Personal Education Numbers
RF: Random Forest
ROC: Receiver Operating Characteristic
TPR: True Positive Rate
XgBoost: Extreme Gradient Boosting


Executive Summary

The work presented in this report is part of a broader research project undertaken by the Education Policy Research Initiative for the BC Ministry of Education. The project is intended to improve policy makers’ understanding of on-time high school graduation and to develop tools that could potentially be used in policy initiatives that would ultimately lead to improved student outcomes.

The project is based on the BC PEN data, which represent an extraordinarily rich data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the British Columbia (BC) school system until they leave, as well as province-wide Foundation Skills Assessment (FSA) scores in reading, writing, and numeracy administered in Grade 4 and Grade 7, all linked by students’ Personal Education Numbers (PEN).

The first phase of the project involved an analysis of the relationships between on-time graduation and a range of student characteristics, the Grade 4 and Grade 7 FSA scores, and school district information available in the PEN data.

The second phase of the project, covered in this report, focuses on the development of models that predict on-time graduation at the individual student level, which could be used by the Ministry to target student success initiatives toward at-risk students to improve their on-time graduation rates and possibly other outcomes.

This work draws upon recent advancements in predictive modelling to develop models using five established approaches to predict the probability of on-time graduation at the individual student level.

Predictive models are developed for students in Grade 5 and Grade 8, a choice guided by the timing of the Grade 4 and Grade 7 FSAs. These two models provide predictions that could be used to implement student success interventions at both an earlier point in time (Grade 5) and at a later point (Grade 8).

The project addresses the following research questions:

1. Which approach provides the most accurate predictive models of on-time high school graduation in BC based on the PEN data available, including the Grade 4 and Grade 7 FSA scores?
2. How well do the Grade 5 and Grade 8 predictive models perform and how does the accuracy of the two models compare?
3. To what extent do the Grade 4 and Grade 7 FSA scores improve the accuracy of the predictions of on-time graduation?


The first main finding is that more complex modelling approaches tend to bring limited gains in predictive accuracy. Therefore, a logistic regression modelling approach, the simplicity of which has potential advantages in terms of interpretability and implementation in practice, is selected to assess the predictive accuracy of the models developed.

Second, the results suggest that the models would provide good predictions for new cohorts of Grade 5 and Grade 8 students, which is the ultimate goal of the development of these predictive models. In particular, the predicted probabilities of not graduating on time (i.e., “risk scores”) generated by the models do a very good job of ordering students by their actual leaving rates, and the risk scores are very close to the actual rates of not graduating on time.

The models also perform quite well in terms of true positive rates (TPRs), which represent students correctly predicted to not graduate on time, and false positive rates (FPRs), which represent students predicted to not graduate on time who in fact do graduate, in a context where a good model is one that generates higher TPRs and lower FPRs.

Third, the Grade 8 models yield more accurate predictions of not graduating on time compared to the Grade 5 models, at least in part due to the availability of the later FSA scores.

Fourth, the Grade 4 and Grade 7 FSA scores substantially improve the predictive accuracy of the models.

A predictive model of on-time graduation of the type developed in this project could be used to target students in two main ways. First, if an initiative aimed at improving on-time graduation is intended to support a specific number of students, presumably due to a limit in the resources available (i.e., budgetary restrictions), the risk scores produced by the predictive models could be used to target students with the highest estimated probabilities of not graduating on time. This would be done by ordering students by their risk scores and counting down until the designated number of students is identified. In this way, a policy maker could be assured that the available resources are targeted at the students in greatest need.

Second, true and false positive detections could be used to inform the targeting approach adopted. A true positive detection represents a case where a student is correctly predicted to not graduate on time, whereas a false positive detection represents a case where a student is predicted to not graduate on time when they actually do. If an initiative is directed to those students who are predicted to not graduate on time, the former (i.e., true positives) represents a case where resources are directed to a student who needed assistance, whereas the latter represents resources being spent on a student who did not need assistance in the first place.

The predictive models developed in this project could inform these trade-offs by guiding the selection of the threshold to be used to target students. Choosing the preferred trade-off between true and false positives ultimately represents a policy decision, since it essentially reflects whether the policy maker prefers to err on the side of making sure as many of those who may need the assistance receive it at the cost of also helping some who do not need it, or vice versa (i.e., avoiding using resources to help students who do not need it). Presumably, this choice will be guided at least in part by considerations of the costs of any initiative(s) and the associated expected benefits in terms of improved on-time high school graduation rates.


One avenue for future research could involve bringing additional information on students related to their academic engagement or other aspects of their schooling experiences and outcomes, to their situation outside of school, or to their families into the development of the predictive models to improve their accuracy. The PEN data are extremely rich and of remarkable depth and quality when placed not only in the Canadian context but even at the international level, and one source of additional information could include making more of the PEN data that exist available for the purposes of developing these predictive models.

At a broader level, bringing the PEN data into Statistics Canada’s Social Data Linkage Environment (SDLE), as has recently been done, may open up some extremely promising data opportunities for the development of predictive models of on-time high school graduation, and for other purposes.

Another future avenue of research could involve the design, implementation, and evaluation of student success initiatives aimed at improving on-time graduation.

A further area for new work could be to examine the relationships between the risk scores generated by the Grade 5 and Grade 8 models and other outcomes, including access to post-secondary education and students’ post-schooling labour market earnings now that the PEN data have been linked to tax data, among others.

Finally, predictive models of access to post-secondary education, students’ post-schooling labour market earnings, and possibly other student outcomes could be developed using the PEN data and methods similar to those employed here.

The PEN data represent a remarkable resource for improving our understanding of a range of schooling and post-schooling outcomes and for developing predictive models of a comparable range of outcomes for which the current project represents an excellent starting point.


1. Introduction

This report presents the development and assessment of predictive models of on-time high school graduation for students in the British Columbia (BC) school system based on a very rich longitudinal administrative data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the school system until they leave, as well as province-wide Foundation Skills Assessment (FSA) scores in reading, writing and numeracy administered in Grade 4 and Grade 7, linked by students’ Personal Education Numbers (PEN).

Previous research in this area has largely been based on identifying individual indicators associated with high school completion and other related outcomes using relatively simple analytical approaches. In more recent years, the area of predictive modelling has become much more sophisticated and has become more focused on generating individual risk scores (i.e., an estimate for each student that characterizes their individual risk of not graduating on time) based on a larger set of predictors using advanced algorithms (i.e., different predictive approaches). This report draws upon these recent advancements in predictive modelling and, combined with the richness of the PEN data, provides a unique approach to predicting which students are at risk of not graduating on time, all placed in the BC context and intended to be of practical use to policy makers.

This work builds on a previous analysis that uses descriptive and regression modelling approaches to investigate the relationships between on-time graduation and a range of student characteristics and province-wide Foundation Skills Assessment (FSA) scores in reading, writing, and numeracy administered in Grade 4 and Grade 7. That earlier work not only provides a detailed profile of on-time graduation, but also points to the various factors that are likely to represent good predictors of on-time graduation in the predictive models developed here.

Predictive models are developed for students at two points in time, Grade 5 and Grade 8, each based on the information available at that time. The choice of these years was based principally on the timing of the Grade 4 and Grade 7 FSAs, which represent strong early predictors of on-time graduation.

Furthermore, the Grade 5 model generates very early predictions of on-time graduation, which could potentially be used to guide commensurately early interventions aimed at improving student success for those students predicted to be at greater risk of not graduating on time at that point in their studies. In comparison, the Grade 8 model generates more accurate predictions of on-time graduation due to the additional information available (including the Grade 7 FSA scores) and, therefore, better targeting of interventions, but at a later point in time.

Both the Grade 5 and Grade 8 predictive models estimate the probability that a student will not graduate from high school “on time.” On-time graduation is defined as 1) graduating within six years of starting Grade 8, and 2) graduating within nine years of starting Grade 5. The Grade 8 measure represents the standard definition of on-time graduation used by the BC Ministry of Education, while the Grade 5 measure developed for this project provides earlier predictions of on-time graduation which are consistent with the standard Grade 8 measure.


The analysis involves the development and comparison of predictive models of on-time graduation based on a variety of approaches established in the predictive analytics literature, using information on a range of student characteristics, the Grade 4 and Grade 7 FSA scores, and school district. It then further tests the accuracy of the predictions of on-time graduation that would be expected for new cohorts of students entering Grade 5 and Grade 8 using one particular modelling approach selected for its accuracy and its relative ease of implementation, and finds that this expected accuracy is very good. The report then shows how the FSA scores contribute to the overall accuracy of the model.

This report should inform policy makers and others on a range of questions, including:

1. Which approach provides the most accurate predictive models of on-time high school graduation in BC based on the PEN data available, including the Grade 4 and Grade 7 FSA scores?
2. How well do the Grade 5 and Grade 8 predictive models perform and how does the accuracy of the two models compare?
3. To what extent do the Grade 4 and Grade 7 FSA scores improve the accuracy of the predictions of on-time graduation?

The report starts with a review of the related predictive modelling literature with a focus on previous work pertaining most closely to the on-time graduation models developed here. It then describes the PEN data, the predictor variables used in the analysis, the samples over which the models are developed, and the methodology employed. The results sections then present and discuss the findings for the Grade 8 and Grade 5 models. The discussion section concludes the paper by summarizing the main findings, outlining the policy purposes to which these predictive models could be put and how they would be implemented in practice, and discussing the limitations of this project as well as potential directions for further research.

2. Literature Review

In educational research, predictive modelling encompasses scholarship on educational data mining (Baker, 2011; Baker & Yacef, 2009) and academic and learning analytics (Arnold & Pistilli, 2012; Brooks & Thompson, 2017; Campbell, deBlois, & Oblinger, 2007; Gašević et al., 2016). The main intent of using predictive modelling to measure high school graduation is to identify groups of students at risk of non-completion. Predicting high school completion is different than understanding the range of factors associated with graduation (e.g., the focus of the first report). Rather, predictive modelling provides insight into how well information from a set of predictor variables (e.g., demographic, academic, or other indicators) can be used to explain a given outcome.

Often with the aim of providing early interventions for students at risk of not graduating high school before they even approach their senior years, predictive modelling can ascertain the degree to which a set of predictor variables can be applied to new data (Aguiar et al., 2015; Gleason & Dynarski, 2002). As predictive modelling aims to generate inferences regarding


uncertain outcomes like high school graduation, it is defined as “the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations” (Shmueli, 2010, p. 291).

With increasing use of machine-learning techniques and other advanced algorithms, predictive modelling is a rapidly expanding field. Many forms of dropout identification use simpler models that aim to find the “best” predictor variables that “accurately identifies students who will ultimately dropout of school” (Bowers, 2010, p. 12). Jurisdictional programs with the explicit aim of dropout identification exist, such as the Wisconsin Dropout Early Warning System (Knowles, 2015) and The Chicago On-Track system (Allensworth & Easton, 2005). Researchers also use similar approaches to estimate outcomes that are linked to high school graduation, such as test performance (Sullivan, Marr, & Hu, 2017).

Predictive modelling approaches to dropout identification also use all of the information from a set of predictor variables, rather than identifying the best indicators, and incorporate more computationally complex algorithms, from linear probability models (Adelman, Haimovich, Ham, & Vazquez, 2018) to random forest (Chung & Lee, 2019; Lakkaraju et al., 2015), logit post-LASSO (Sansone, 2018), and XgBoost (Hlosta, Zdrahal, & Zendulka, 2017) approaches. Although the intent of using these methods is to make better predictions, it is often necessary to balance the information gained from a more complex algorithm with a simpler approach that is easier to understand and implement at the school and district level.

Along with variation in approaches, there are also major differences in what information researchers use to predict high school graduation—differences that are often due to data availability. For example, a predictive model may have the explicit intention of only using course enrollment and completion information (Allensworth & Easton, 2005). Other studies include factors outside of school, such as coming from a single-parent household (Croninger & Lee, 2001; Pagani et al., 2008). For these reasons, predictive models measuring high school graduation vary widely across jurisdictions and data sources and no single model is applicable across all contexts.

Although there are major differences across studies, prior research does indicate there are key predictor variables that provide information that can predict not completing high school, such as failing key courses, attendance records, and socioeconomic status (Chung & Lee, 2019; Lakkaraju et al., 2015; Sansone, 2018). Academic achievement indicators—such as those measured through standardized tests or single or multiple course grades—are often one of the most influential indicators in a model predicting high school completion (Bowers, Sprott, & Taff, 2012; Gleason & Dynarski, 2002).

Prior research also demonstrates that certain predictor variables can correctly classify students who are likely to not complete high school; yet, these same predictor variables may also misidentify students who do go on to successfully complete the credential. For example, in Mahoney and Cairns’ (1997) early study, a predictor variable representing participation in one or no extracurricular activities identified 95% of students who did not complete high school. Yet, the same variable also captured 82% of students who did go on to graduate. Thus, while most non-completers do not participate in any or many extracurricular activities, many completers are


also not involved. While a predictor variable of extracurricular involvement can correctly identify students who will not graduate on time, it ultimately is a weak predictor of high school graduation for all students and thus would perform poorly in a predictive model.

Even with more advanced approaches and a greater number of predictor variables, researchers have not yet developed a model that can correctly classify all students who will not graduate. Predictive models of high school completion also include “false alarm” students (i.e., students who are incorrectly classified as not likely to graduate) (Bowers, Sprott, & Taff, 2012). A main aim of predicting high school graduation is finding which model has a high true-positive rate (i.e., correctly identifies at-risk students) and a low false-positive rate (i.e., minimizes the number of “false alarm” students).

As a key concern in predictive modelling is accuracy (e.g., how well it predicts high school completion), prior research also examines how predictions change at different grade levels (Aguiar et al., 2015). Although models using information from higher grade levels are typically more informative and can better predict who will not graduate, early predictions may still offer useful information that is necessary for early interventions (Balfanz, Herzog, & Mac Iver, 2007; Johnson et al., 2015). Further, measures associated with high school graduation are often longitudinal in nature. For example, marks are strongly related to experiences in prior years (e.g., low marks in Grade 9 will be associated with low marks in Grade 8 or earlier). Thus, some predictive studies model longitudinal trajectories through cumulative information or year-by-year change (Bowers & Sprott, 2012; Janosz, Archambault, Morizot & Pagani, 2008). Although often offering high true-positive and low false-positive rates, these longitudinal models may be difficult to replicate from year-to-year and for all students.

Although the strength of predictive modelling is its ability to identify students at higher risk of not graduating, it is important from the outset to highlight two aspects of predictive modelling and what it attempts to achieve. First, predictive models have varying levels of misidentification that result in either a failure to identify and help at-risk students or the targeting of resources towards those who would graduate on time without intervention (i.e., “false alarm” students) (Gleason & Dynarski, 2002). Second, predictive modelling is just the first step in establishing an early warning system, and it is necessary to generate customized prevention policy at the district and school level, where administrators and teachers are responsible for identifying at-risk students and implementing preventative measures (O'Cummings & Therriault, 2015). The end of the report will address these issues in detail.

3. Data

This study uses administrative data on students who attended primary or secondary schools in BC at any point during the 1991/1992 to 2016/2017 school years. These administrative records, provided by the BC Ministry of Education, capture information for all students in all grades (i.e., from kindergarten to Grade 12). The complete dataset is therefore longitudinal in nature and captures year-by-year enrollment information for each student, matched using the PEN.


3.1 Outcome Variables of Interest

Two binary outcome variables capture on-time graduation (i.e., 1=yes, 0=no): one for Grade 5 and the other for Grade 8.

The Grade 8 on-time completion measure is derived directly from the definition used by the BC Ministry of Education: graduating with a Certificate of Graduation (i.e., the “Dogwood Diploma”) within six years of beginning Grade 8. For example, if a student began Grade 8 in the 2009/2010 school year, they graduated “on time” by receiving a Dogwood Diploma by the 2014/2015 school year.1

Following the logic underlying the Grade 8 on-time graduation measure, for this project a Grade 5 outcome variable defines on-time graduation as graduating with a Dogwood Diploma within nine years of beginning Grade 5. For example, a student who began Grade 5 in the 2008/2009 school year graduated “on time” if they received a diploma by the 2016/2017 school year.

3.2 Predictor Variables

Student Characteristics and Related Variables

The PEN-based administrative dataset includes a range of predictor variables capturing a student’s personal and program characteristics, as well as geographic information that can be used to add other variables to the analysis.

The individual characteristics considered in the analysis are gender, self-reported Indigenous ancestry,2 a “special needs” designation in Grade 5 or 8, ESL during Grade 5 or Grade 8, ESL prior to Grade 5 or Grade 8, French immersion during Grade 5 or 8, enrollment in a “gifted” stream during Grade 5 or Grade 8, and if a student repeated a grade prior to Grade 5 or Grade 8.3

In terms of student geographical information, a student’s forward sortation area (i.e., the first three digits of their postal code) in Grade 5 and Grade 8 is used to construct an indicator

1 A limitation of the data is that they cannot differentiate between a student who does not complete a Dogwood Diploma and one who leaves the province entirely. For both on-time graduation measures, this limitation will result in the underestimation of on-time completion for the entire sample, and particularly for sub-groups that are more transitory.
2 In the administrative data, the Indigenous predictor variable captures whether a student is ever identified as Indigenous.
3 The analysis estimates grade repetition based on the number of years between the grade in which a student first appeared in the data and Grade 5 or Grade 8. For example, if a student is first observed in Grade 1 and takes more than four years to start Grade 5, they are flagged as a repeater. Similarly, if Grade 2 is the first grade in which they appear in the data, a student is flagged as a repeater if they take more than three years to start Grade 5.


representing area size (i.e., rural or urban area).4 Additionally, the same code is also matched to neighbourhood median family income as measured by the 2006 Census.5

The analysis also considers where a student went to school in Grade 5 and Grade 8 by including a set of binary predictor variables representing a student’s school district. In the data, 60 public school districts across BC are identified.

Another set of binary predictor variables represents cohort, defined as the year in which a student first started Grade 5 or Grade 8; these variables capture students who entered Grade 5 between 2000 and 2008 and students who entered Grade 8 between 2003 and 2011.

Foundation Skills Assessment (FSA) Scores

In the 1999/2000 school year, BC introduced the FSA in three domains: numeracy, reading, and writing. There are two assessment periods, one in Grade 4 and another in Grade 7, for all students enrolled in public schools or schools that receive provincial subsidies.6

The FSA scores processed by the Ministry are manipulated to create categorical predictor variables representing the level of achievement in each domain. These ordinal variables capture reading, writing, and numeracy percentage scores that fall into eight categories: 1-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, and 90+.7 Using these categorical predictor variables allows the relationships between the FSA scores and on-time graduation to be non-linear.
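As an illustration of this coding scheme, the short sketch below bins percentage scores into the eight categories using pandas; the column name fsa_reading_pct and the example values are hypothetical rather than taken from the PEN data.

```python
import pandas as pd

# Hypothetical FSA reading percentage scores; a missing value stands in for "no attempt".
df = pd.DataFrame({"fsa_reading_pct": [12, 35, 47, 58, 63, 71, 88, 92, None]})

# Bin the scores into the eight ordinal categories used in the analysis.
bins = [1, 30, 40, 50, 60, 70, 80, 90, 101]
labels = ["1-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+"]
df["fsa_reading_cat"] = pd.cut(df["fsa_reading_pct"], bins=bins, labels=labels, right=False)

# Missing scores remain NaN and can be coded as a separate "no attempt" category.
print(df)
```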

3.3 Sample Selection

Given the outcome variables of interest, the analysis is separated into two samples: Grade 5 and Grade 8 students. Each sample captures student-level information available in the year after the Grade 4 and Grade 7 FSAs.8 The data cleaning phase removed a small number of students whose age appeared to be extremely atypical relative to others at their grade level (e.g., over the age of 18 in

4 A postal code with a “0” in the second character is classified as a rural location by Canada Post. Rather than based on a strict population size, these postal locations are serviced by rural route postal drivers and/or outlets.
5 A limitation of this approximation is that some forward sortation areas can be fairly large and contain households with very different socio-economic characteristics.
6 All students are expected to participate; however, students with a language proficiency level or type of disability which would prevent their successful completion of the assessment are exempted and categorized as “no attempt” in this study.
7 The underlying reading and numeracy scores are continuous, whereas the writing scores are heavily clustered. Additionally, the writing scores do not evenly range from 1 to 100. For example, individuals who score in the 50-59 category have a mark of 50, 53, 57 or 58, while students in the 90+ category have a mark of 92 or 100.
8 A small number of individuals are missing grade level information but appear as enrolled in the data (n = 255 in the Grade 5 sample and n = 545 in the Grade 8 sample). For this small group, a student’s age and the timing of their FSAs serve as a proxy to estimate their grade level, a technique which is in accordance with the Ministry of Education.


Grade 8) (n = 38 in the Grade 5 sample and n = 49 in the Grade 8 sample) or were missing geographical-related information (n = 1,065 in the Grade 5 sample and n = 1,262 in the Grade 8 sample).

Sample restrictions also account for when the FSAs began in BC and the date range necessary to measure on-time completion (e.g., within nine years of starting Grade 5 and within six years of starting Grade 8). Because the FSAs began in the 1999/2000 school year, the analysis does not include students who would have taken the assessments prior to this period. With these exclusions, the Grade 5 samples cover the 1999/2000 to 2007/2008 school years and the Grade 8 samples cover the 2002/2003 to 2010/2011 school years.

4. Methodology

4.1 Predictive Model

Whether a student does not graduate on time represents a binary outcome; that is, only two outcomes are observed in the data: a student either does not graduate on time or graduates on time (as defined in Section 3.1). Determining which outcome is expected for a student is called a classification problem.9

A predictive model is a rule or formula that produces predictions for the classification problem. In other words, the model produces predictions on the outcomes of new observations (i.e., new students) using observable characteristics (e.g., gender, special needs, ESL, etc.). These observable characteristics are called features or predictor variables.

Predictive models could be simple mathematical formulas or more complex mathematical structures. There are several approaches that could be used to address classification problems in the machine learning literature, ranging from basic extensions of linear regression models, such as logistic regressions, to complicated approaches such as random forests and deep neural networks.

Regardless of its complexity, a predictive model is developed (i.e., its parameters and structures are determined) using historical data. The process of determining the exact parameters and the structure of a predictive model is called training, meaning that the model is trained to produce predictions based on the historical data.

9 Machine learning is a fairly new and emerging field and the terminology changes depending on the field of study in which it is used (statistics, computer science, business, etc.). Refer to the Glossary for the terminology used in this report.


4.2 Predictive Accuracy

The predictive accuracy of a model refers to the accuracy of the predictions when they are produced using new observations. Different metrics could be used to measure the predictive accuracy, which will be discussed in detail later in this section.

Since the accuracy of the predictions for new students cannot be known before observing their outcomes, the historical data is used to assess the predictive accuracy of different models and to build a predictive model that is expected to produce the most accurate predictions among all the candidate models.

However, there are basic concerns with predictive modelling regarding how well a predictive model explains the relationships between the predictors and the outcome, which is called fit. The most basic concern is underfitting: the inability of a model to capture the fundamental relationship between the predictors and the outcome of interest. Ultimately, flexible and rich predictive models are more likely to be able to explain complex relationships and to fit the historical data better.

This creates an incentive to employ richer and more sophisticated models to fit the historical data as well as possible, which may lead to another problem called overfitting. A predictive model that fits the historical data too well will not necessarily produce accurate predictions. This happens when an unnecessarily rich or complex model captures even the random patterns specific to the historical data, rather than the core relationship between the predictors and the outcome. Given the recent advances in modern machine learning and computation, which allow researchers to develop increasingly complex models, overfitting, rather than underfitting, is arguably the more serious concern.

The best method to objectively assess the predictive accuracy of a model and assess whether overfitting is a problem is to treat a segment of the historical data as new information. This is done by splitting the data into two sets, one of which is used to train the model and the rest to measure the accuracy of predictions of the trained model. This is called external validation, which is used extensively in predictive modelling. A special case of external validation and how it is used is explained in detail below.

4.3 Cross-Validation Method

A key aspect of predictive modelling is tuning. Some predictive models have additional parameters called tuning parameters and they determine the complexity, size, or flexibility of the model to be trained. The tuning parameters need to be extensively tested as these parameters cannot be inferred directly from the information on the outcomes or the predictors. External validation methods are used to choose these parameters based on their implications for the out-of-sample prediction accuracy. The most popular of these is called cross-validation (CV).

To find the right set of tuning parameters, this analysis uses a 5-fold CV approach. In this approach, the dataset is separated into five random, non-overlapping parts called folds (see Figure 1).10


Then, four of the folds are used as a training set to train the model with a given set of tuning parameters and the fifth fold is used as a validation set to estimate the performance of the model when it uses this specific set of tuning parameters. This is repeated five times (once for each alternative validation fold), and then the performance metric is averaged to create a value reflecting the predictive performance of the tuning parameters selected. This procedure is repeated for different sets of tuning parameters to identify the optimal tuning parameters. Then, these parameters are used on the entire data to train the final model.

Figure 1: 5-Fold Cross-Validation
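The fold-by-fold logic of Figure 1 can be sketched in a few lines. The snippet below is illustrative only: it uses simulated data in place of the PEN-based predictors, scikit-learn in place of the project's actual software, and a logistic regression as the model being evaluated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder data standing in for the predictors and the binary
# "did not graduate on time" outcome.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Stratified folds keep the proportion of positive outcomes roughly equal
# across folds, as described in footnote 10.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

aucs = []
for train_idx, valid_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # train on four folds
    scores = model.predict_proba(X[valid_idx])[:, 1]   # predict on the fifth fold
    aucs.append(roc_auc_score(y[valid_idx], scores))

print("Average AUC across the 5 folds:", np.mean(aucs))
```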

Another aspect of predictive modelling is feature selection. It is not immediately clear which subset of predictors should be used in the model. Using too many predictors may lead to overfitting. Models could be fitted with different subsets of predictors, one-by-one, and the set that provides better predictive performance, as measured through the CV process described above, could be selected. Alternatively, some machine learning approaches have built-in feature selection, offering automatic removal of unhelpful predictors with little additional computational cost.

Cross-validation could also be used to compare competing modelling approaches, which are explained in the next subsection, based on their predictive accuracy, which is an important objective of this report. If one or more modelling approaches require CV to tune model parameters or select features, it is best to nest another CV process inside each training set of the CV. This leads to a structure called nested CV (see Figure 2).

This procedure involves applying a 5-fold CV to the entire data to compare the predictive accuracy of the modelling approaches, called outer CV. Within each training set (or subsample)

10 Random separation is done in a way that ensures all parts have roughly the same proportion of students not graduating on time. The choice of the number of folds is not guided by any rule, but popular choices for the number of folds in a CV are 5 and 10 (e.g., see Sansone (2018) for a study on predicting high school dropout behaviour using a 5-fold CV).


of each iteration of the outer CV, an additional 5-fold CV is applied, called the inner CV, for model tuning or feature selection purposes. The optimal tuning parameters are identified using the inner CV. Then, these parameters are used to train the model (given a modelling approach) on the entire training set of the outer CV and estimate the predictive accuracy of the modelling approach using the validation set of the outer CV. Finally, the predictive accuracy estimates for each modelling approach are compared to select the best or optimal modelling approach.

Figure 2: Nested 5-Fold Cross-Validation
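A minimal sketch of the nested structure in Figure 2, again with simulated data and scikit-learn as assumptions: GridSearchCV supplies the inner 5-fold CV that selects the tuning parameter, while the outer loop estimates predictive accuracy on folds never used for tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Candidate values of the tuning parameter (here, the inverse penalty strength C).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

outer_aucs = []
for train_idx, valid_idx in outer_cv.split(X, y):
    # Inner CV: choose the tuning parameter using only the outer training set.
    search = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid, scoring="roc_auc", cv=inner_cv)
    search.fit(X[train_idx], y[train_idx])

    # Outer CV: evaluate the refit model on the held-out outer fold.
    scores = search.best_estimator_.predict_proba(X[valid_idx])[:, 1]
    outer_aucs.append(roc_auc_score(y[valid_idx], scores))

print("Nested-CV estimate of predictive accuracy (AUC):", np.mean(outer_aucs))
```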

The main purpose of using a nested CV structure is to avoid using the same data for both tuning models and estimating performance since it has been widely accepted in the literature that such a practice tends to lead to over-optimistic estimates of predictive accuracy. This is because model tuning procedures may also be influenced by idiosyncrasies in the training data and lead to overfitted models. The nested CV structure provides more objective estimates at the cost of additional computation time by using separate subsamples for tuning models and estimating performance.

The performance metrics produced from the outer CV iterations during the selection of the modelling approach may be optimistic estimates of the predictive accuracy of the final selected model. This is because the modelling approach used to produce the final model uses all the historical data to guide model selection as well as estimate the predictive accuracy. Just like in overfitting, it is possible that the determination of the final model might have been affected by idiosyncrasies in the historical data. Therefore, a portion (30%) of the historical data is set aside from the modelling approach selection stage and used for an unbiased assessment of the


predictive performance of the final model estimated using the selected modelling approach.11 We call this portion of the data the external validation set.

Figure 3: Nested 5-Fold CV and the External Validation Set
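Setting aside the external validation set can be sketched as a single stratified split performed before any tuning or approach selection; the seed and the use of scikit-learn are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

# Set aside 30% of the data before any tuning or approach selection.
# Stratifying on the outcome keeps the rate of not graduating on time
# similar in both partitions.
X_model, X_external, y_model, y_external = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# X_model / y_model feed the nested CV; X_external / y_external are used
# only once, to assess the final selected model.
```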

4.4 Modelling Approaches

There are many predictive modelling approaches that could be used for this analysis. Some, however, are overly complicated and do not have clear interpretations, while many others require immense computational resources for training (such as the deep neural networks used for image recognition). Five approaches were identified and investigated to address the classification problem at hand based on how widely used they are, their ease of use, their interpretability, and their computational requirements.

Logistic Regression with Linear Predictors (Baseline Logit)

The logistic regression is probably the most popular method to estimate the relationships between a binary dependent variable (such as not graduating on time) and predictors, and likely

11 This has high computational cost if it is done through a third layer of CV with multiple iterations in the outer-most layer, as the dataset is very large. For this reason, only one partition of the entire data is set aside as the external validation set (a random 30% of the entire data), and, therefore, there are no iterations on the outer-most stage of the predictive modelling process.


the easiest to estimate, understand, and employ. It models the probability of a positive outcome, which is determined as not graduating on time for this project, as:

$$\Pr(\text{Not Graduate on Time}) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots)}$$

where the Xs represent different observable characteristics. Logistic regressions do not have any tuning parameters. The training is to estimate the β parameters. Once these parameters are estimated, the probability of not graduating on time could be predicted using the above formula.

For the purposes of this project, a logistic regression model that includes all the predictors available in only linear form (i.e., no interactive terms or pairwise multiplications among predictors) is defined as the Baseline Logit approach. The first concern with this approach is overfitting. It is possible that the number of predictors included is unnecessarily large and fits the patterns of the specific training data rather than creating a model that will effectively predict not graduating on time for new students.
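For illustration, a minimal sketch of the Baseline Logit on simulated data: all predictors enter linearly, there is no penalty term, and the fitted model returns a risk score for each student. The library choice and variable names are assumptions, not the project's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for the PEN-based predictors and the binary outcome
# (1 = did not graduate on time).
X, y = make_classification(n_samples=5000, n_features=15, random_state=0)

# Baseline Logit: every predictor enters in linear form, with no regularization.
baseline_logit = LogisticRegression(penalty=None, max_iter=1000)
baseline_logit.fit(X, y)

# Risk scores: the predicted probability of not graduating on time for each student,
# i.e., the logistic formula above evaluated at the estimated beta parameters.
risk_scores = baseline_logit.predict_proba(X)[:, 1]
print(risk_scores[:5])
```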

The second concern with the Baseline Logit is underfitting. The Baseline Logit approach uses a relatively simple statistical model that may not be able to capture complex relationships in the historical data. There could, in fact, be nonlinearities in the relationship between not graduating on time and the predictors, or interactions between predictors that could help predict not graduating on time better. To allow for more model complexity and also to avoid underfitting, this report investigates various alternatives.

L1-Regularized Logistic Regression with Linear Predictors (L1-Baseline)

To address the overfitting possibility of the Baseline Logit, the analysis also uses an L1-regularized logistic regression model that includes only linear predictors. This model has a built-in feature selection process: it ensures that an optimal set of predictors is included in the regression using a technique that introduces a special penalty (called an L1-penalty) for each additional predictor used. This penalty estimates coefficients at zero if they are not improving predictions in a meaningful way, and essentially removes predictors that are not useful, which could potentially lead to an increase in predictive performance over the Baseline Logit.12 The L1-regularized regression model has one tuning parameter (the size of the penalty), which is selected within the inner CV. For this project, this special case (linear predictors only) of the L1-regularized logistic regression model is defined as the L1-Baseline approach.
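A sketch of the L1-Baseline idea on simulated data. The report's own implementation uses the glmnet package (footnote 12); scikit-learn is used here purely for illustration, with the penalty size chosen by 5-fold CV playing the role of the single tuning parameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=5000, n_features=30, n_informative=8, random_state=0)

# L1-penalized logistic regression with the penalty size selected by 5-fold CV
# over a grid of candidate values, maximizing the AUC.
l1_baseline = LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear",
                                   cv=5, scoring="roc_auc")
l1_baseline.fit(X, y)

# Coefficients estimated at exactly zero correspond to predictors that the
# penalty has effectively removed from the model.
n_dropped = int(np.sum(l1_baseline.coef_ == 0))
print(f"{n_dropped} of {X.shape[1]} predictors dropped by the L1 penalty")
```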

Interactive L1-Regularized Logistic Regression (L1-Interactive)

The interactive L1-regularized logistic regression represents an extension of the Baseline Logit and L1-Baseline approaches, applying the L1-regularized logistic regression to all predictors and their pairwise

12 This model is implemented via the glmnet software package (Hastie & Tibshirani, 2010).


multiplications, except for the district predictor variable.13 This model is richer than the Baseline Logit and L1-Baseline as it accounts for more complex relationships between predictors and the outcome. Through its built-in penalty term, the resulting model retains only the interactions that the model indicates improve the accuracy of predictions.

In this project, this L1-regularized logistic model with all the pairwise multiplications is defined as the L1-Interactive approach. This approach, before model training, involves an initial data processing procedure, whereby highly correlated and near-zero variance predictors are eliminated.14

While the L1-Baseline addresses the overfitting concern of the Baseline Logit, the L1-Interactive addresses the underfitting of the Baseline Logit.
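The pairwise multiplications of the L1-Interactive approach can be generated mechanically, as in the sketch below. The near-zero-variance filter threshold is an assumed value, the high-correlation filter is omitted for brevity, and the district indicators are simply excluded from the interaction step, as described above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=5000, n_features=15, random_state=0)

# interaction_only=True adds every pairwise product of the predictors (no squared
# terms); the L1 penalty then retains only the interactions that help predictions.
l1_interactive = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    VarianceThreshold(threshold=1e-4),  # drop near-zero-variance columns (assumed cutoff)
    LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear", cv=5, scoring="roc_auc"),
)
l1_interactive.fit(X, y)
```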

Random Forest (RF)

Another popular method in the machine learning literature is to use classification trees for predictions. To understand the idea of a tree, consider the following hypothetical example.

Suppose there are two predictors available to predict outcomes for students, gender and ESL, and that there are two possible values for gender (male and female) and for ESL (yes and no) in the data. This makes four different observable types of students. If a large enough sample is available, it would be possible to find enough students representing each student type. For example, if 60% of male ESL students do not graduate on time, the best estimate of the probability that a new incoming male ESL student will not graduate on time is 60%.

The problem is that finding all types of students and measuring the proportion that do not graduate on time is impossible, since the number of predictors included in this analysis would yield billions of different types of students. Decision trees are used to approximate this logic using a feasible method. A very simple tree is presented in Figure 4.

13 The memory requirement and computation time increase significantly when multiplicative terms with the district predictor variable are included in the model. Therefore, only pairwise multiplications among other predictor variables are included in the model.
14 These problematic predictors occur due to the pairwise multiplicative predictors. Such predictors would make the estimation of logistic regression models impossible.


Figure 4: Decision Tree Example

Instead of dealing with all possible splits in the data, a smaller selection is made to divide the data into groups. This selection is done via an algorithm so that each time the tree splits into two branches, the optimal predictor is selected to maximize predictive power. In practice, the trees are usually much bigger than shown in Figure 4. However, once the tree is completed, regardless of its complexity or size, predictions could easily be produced merely by asking a series of yes or no questions. In Figure 4, if a prediction is needed for a female student who is in the ESL stream, she falls into the rightmost node, and the proportion that did not graduate on time in that node (which is based on historical data) is her predicted probability of not graduating on time.

Classification trees are easy to understand, implement, and interpret, but the literature shows that they tend to underperform. Several methods were developed to address this issue, the two most popular of which are random forests and gradient boosted trees.

A random forest (RF) model involves selecting random subsets of predictors multiple times to develop several trees from the same training data (Breiman, 2001). Each tree in the RF will produce a different prediction for a new student. Then these predictions are aggregated to produce a final prediction for that student.

The challenge with an RF approach is that it is more difficult to tune compared to the L1-Baseline or L1-Interactive, as it has more than one tuning parameter. In fact, the space of all possible values for the tuning parameters is very large, and the computation time for a given set of tuning parameters is also a consideration given the large sample sizes. Therefore, in the analysis below, all of the tuning parameters, except for one, were set to their default values (set by the software).

Another related issue is interpretability. Unlike a simple decision tree, this approach uses hundreds or potentially even thousands of trees to produce predictions.
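A sketch of the RF strategy described above on simulated data: one tuning parameter (here, the number of predictors considered at each split, an assumed choice) is searched over while the remaining parameters stay at their software defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Tune a single parameter (max_features) and leave the rest at their defaults.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": ["sqrt", 0.3, 0.5]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# Each tree votes; averaging the trees' predictions yields a risk score per student.
risk_scores = search.best_estimator_.predict_proba(X)[:, 1]
```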


Extreme Gradient Boosting (XgBoost)

Gradient boosted trees involve building several simple trees successively where each successive tree aims to improve its predecessor’s performance (Friedman, 2001). A popular extended version of this method is called extreme gradient boosting (XgBoost) (Chen & Guestrin, 2016). This modelling approach represents one of the more recent and popular ones in the literature.

The challenges in terms of tuning and interpretability described above for the RF also exist for the XgBoost. The specific set of tuning parameters to search over is determined by first starting with the default parameter set (fixed default values for each type of parameter) and recording the predictive accuracy of the final model. Then, the parameter set is extended slightly to try two or more values for certain sets of parameters. The predictive accuracy resulting from this second parameter search is then compared with the one from the first search to guide the direction of the search for each type of tuning parameter. This process is iterated multiple times.15
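The widen-and-compare search described above can be sketched as follows, using the xgboost Python package and simulated data; the parameter grids shown are illustrative placeholders rather than the values actually searched in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: record the CV accuracy of the default parameter set.
default_auc = cross_val_score(XGBClassifier(), X, y, scoring="roc_auc", cv=cv).mean()

# Step 2: extend the grid slightly around the defaults and compare; this
# widen-and-compare step is repeated for each type of tuning parameter.
search = GridSearchCV(
    XGBClassifier(),
    param_grid={"max_depth": [4, 6, 8], "learning_rate": [0.1, 0.3]},
    scoring="roc_auc", cv=cv)
search.fit(X, y)

print(default_auc, search.best_score_, search.best_params_)
```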

4.5 Evaluation Methodology

Measuring Predictive Accuracy

The predictive models yield a predicted probability of the positive outcome (not graduating on time for this project) ranging between 0 and 1 for each observation, with the predicted probability being closer to 1 meaning the student is more likely to not graduate on time.16 In this project, these predicted probabilities are referred to as risk scores. These risk scores provide a ranking of students in terms of their risk level of not graduating on time.

The risk scores could be used to assign students to either of the two classes: not graduate on time or graduate on time. To do this, one must use a probability threshold to predict a binary outcome (i.e., not graduate on time or graduate on time) for each student. Given a risk score threshold, all students with scores above the threshold are predicted to not graduate on time, and those with scores equal to or below the threshold are predicted to graduate on time. Having predicted the outcome for each student, one could now compare the predicted binary outcomes with the actual outcomes to determine how well the predictions and the actual outcomes match. When dealing with two classes, there are four different possible scenarios for each prediction, which are summarized in the confusion matrix below in Figure 5.

Figure 5: Scenarios when Comparing Actual vs. Predicted Outcomes

                                          Actual Outcome
  Predicted Outcome          Not Graduate on Time    Graduate on Time
  Not Graduate on Time       TRUE POSITIVE           FALSE POSITIVE
  Graduate on Time           FALSE NEGATIVE          TRUE NEGATIVE

15 The parameter sets over which the search is conducted are available upon request.
16 One could also multiply these values by 100 and think in terms of percentages.



It is possible to produce various performance metrics by comparing the predictions produced for the validation data with the actual outcomes observed in the validation data. The key metrics used in this report are the following:

• True positive rate (TPR) represents the proportion of true positive predictions among students who did not graduate on time. For instance, if the TPR is 80%, it means that out of all students who did not graduate on time, the model identified 80% correctly. Since the TPR is a measure of the ability to predict not graduating on time, a high TPR is desirable.

• False positive rate (FPR) represents the proportion of false positive predictions among students who graduated on time. A FPR of 20% means that, out of all students who graduated on time, 20% were incorrectly predicted to not graduate on time. A low FPR is desirable.17

• Precision represents the proportion of true positive predictions among students who are predicted to not graduate on time. A 40% precision means that, out of all students who were predicted to not graduate on time, 40% were correctly predicted.

• Precision at the top 10% (P@10) represents the proportion of true positive predictions among students with risk scores in the top decile. A 70% P@10 means that, out of all the students in the top decile of the risk score distribution, 70% were correctly predicted as not graduating on time. This represents an established measure which focuses on predictive accuracy for those with the predicted probability of not graduating on time within the top decile, and provides a more intuitive predictive accuracy measure compared to average AUCs and ROC curves (see following).
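A minimal sketch of how these four metrics can be computed from risk scores and actual outcomes; the data are simulated and the 0.21 threshold simply echoes the Grade 8 example used later in the report.

```python
import numpy as np

def accuracy_metrics(y_true, risk_scores, threshold):
    """TPR, FPR, and precision at a given threshold, plus threshold-free P@10."""
    y_true = np.asarray(y_true)            # 1 = did not graduate on time
    risk_scores = np.asarray(risk_scores)
    y_pred = (risk_scores > threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    tpr = tp / (tp + fn)          # share of actual non-graduates correctly flagged
    fpr = fp / (fp + tn)          # share of graduates wrongly flagged
    precision = tp / (tp + fp)    # share of flagged students who truly do not graduate

    # P@10: among the 10% of students with the highest risk scores, the share
    # who actually did not graduate on time.
    top_decile = risk_scores >= np.quantile(risk_scores, 0.90)
    p_at_10 = y_true[top_decile].mean()

    return tpr, fpr, precision, p_at_10

# Hypothetical illustration with a 0.21 threshold and simulated outcomes.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
risk_scores = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, size=1000), 0, 1)
print(accuracy_metrics(y_true, risk_scores, threshold=0.21))
```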

The Receiver Operating Characteristic (ROC) curve offers a way to graph the TPR and FPRs. Given a probability threshold, the two predicted outcomes combined with the actual outcomes make up the confusion matrix above and determine the TPR, FPR and precision. If a different threshold is selected, different predicted outcomes are produced, leading to different TPR, FPR and precision. Precision at the top 10% does not change with threshold.

A lower threshold yields a higher TPR, but also a higher FPR. The ROC curve embodies this trade-off: it plots the FPR against the TPR, where each point on the curve represents a different threshold and the associated TPR and FPR values. Figure 6 contains two hypothetical examples of ROC curves: a higher curve A and a lower curve B.

17 There are also two related measures which contain the same information as the ones listed here. The false negative rate (= 100 – TPR) is the proportion of false negatives among students who did not graduate on time. The true negative rate (= 100 – FPR) is the proportion of true negatives among students who graduated on time.


Starting with curve A, the two points on the curve represent two different probability threshold values. When a threshold of 0.5 is used, the TPR is 68% (vertical (y) axis) and the FPR is 4% (horizontal (x) axis). When the threshold is decreased to 0.15, the TPR increases to 80% at the cost of the FPR increasing to 10%.

In Figure 6, curve A represents a model with much better predictive accuracy compared to curve B. Curve A has much higher TPRs for the same levels of FPR. The top left corner of the graph represents a point where the TPR is 100% and the FPR is 0%, an ideal scenario. Therefore, models with ROC curves which are closer to the top-left corner are considered to have higher prediction accuracy. ROC curves, too, provide a threshold-independent comparison of different models based on predictive accuracy.

This is done visually, as well as by measuring the Area Under the Curve (AUC), the size of which could be used to assess the ability of a model to accurately predict outcomes. As mentioned above, ROC curves closer to the top left corner are considered better so larger AUC values are associated with better predictive performance. Since AUC is just the area under the ROC curve, it does not change with threshold.
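A sketch of tracing out an ROC curve and its AUC from risk scores with scikit-learn, using simulated data; each point on the returned curve corresponds to one probability threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # 1 = did not graduate on time
risk_scores = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, size=1000), 0, 1)

# Each point on the ROC curve corresponds to one probability threshold.
fpr, tpr, thresholds = roc_curve(y_true, risk_scores)
auc = roc_auc_score(y_true, risk_scores)

# Lowering the threshold moves up the curve: higher TPR at the cost of higher FPR.
for t in (0.5, 0.15):
    idx = np.argmin(np.abs(thresholds - t))
    print(f"threshold near {t}: TPR={tpr[idx]:.2f}, FPR={fpr[idx]:.2f}")
print("AUC =", round(auc, 3))
```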

The TPR, FPR, and P@10 are selected as the main predictive accuracy metrics of interest, as they are more informative and easier to interpret than the more traditional measures or plots, such as the AUC or ROC curve. The latter, along with the P@10, nevertheless provide a simple and straightforward comparison of the performance of different models and guide the choice of tuning parameters, as explained in the next subsection.

Figure 6: Example ROC Curves

Selection of the Modelling Approach

Each of the modelling approaches explained above, except for the L1-Baseline and L1-Interactive, requires different model training, and while some require model tuning or feature selection and data processing steps, others do not. The approaches that require tuning or have built-in feature selection complete these steps using the inner 5-fold CV (see Figure 2).

Since the AUC, ROC curves, and P@10 are threshold-independent, they are used extensively to compare the predictive performance of different modelling approaches and to guide the selection of the best, or optimal, modelling approach. The selection of tuning parameters within the inner CV, which is done for some of the modelling approaches (explained above), is guided by comparing the average AUC values from the five inner CV iterations.

The outer 5-fold CV is used to train the models on the entire training set of the outer CV given the tuned parameters from the inner CV, and their predictive accuracy is estimated using the validation set of the outer CV. The AUC values, ROC curves, and P@10 are calculated and stored for each approach and, one by one, the 5 parts in the outer CV are used as validation sets, creating 5 AUC values, ROC curves, and P@10 values in total. Then, to choose the best modelling approach, the predictive accuracy of the approaches is compared using the average AUC values, ROC curves, and average P@10 generated over the 5 outer CV iterations.
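The following sketch illustrates the nested 5-fold CV logic described above for a single modelling approach, assuming the predictors and outcomes are held in NumPy arrays `X` and `y` and using a scikit-learn style estimator; the estimator, parameter grid, and variable names are assumptions for illustration, not the project's actual specification.

```python
# Sketch of nested 5-fold CV: tuning parameters are chosen on the inner CV
# (by average AUC) and predictive accuracy is estimated on the outer folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

outer_aucs = []
for train_idx, valid_idx in outer.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_va, y_va = X[valid_idx], y[valid_idx]

    # Inner CV: pick tuning parameters by average AUC across the five inner folds.
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.01, 0.1, 1, 10]},   # illustrative grid
                          scoring="roc_auc", cv=inner)
    search.fit(X_tr, y_tr)

    # Outer CV: refit on the full outer training fold, score on the validation fold.
    risk = search.best_estimator_.predict_proba(X_va)[:, 1]
    outer_aucs.append(roc_auc_score(y_va, risk))

print("average AUC over the 5 outer folds:", np.mean(outer_aucs))
```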

Final Predictive Model and Its Predictive Accuracy

Having selected a modelling approach, this approach is used to train the model on the entire data reserved for the outer CV (i.e., the combination of training and validation sets of the outer CV) to build the final predictive model that could be used to produce student-level predictions for the new cohorts.

As mentioned earlier, the estimate of the predictive accuracy of the selected modelling approach generated using the outer CV may be optimistic, as it is estimated on the dataset that is used for selecting a modelling approach. Therefore, the external validation set (a random 30% of the entire data) that is isolated from the process of modelling approach selection is used to estimate the predictive accuracy of the final predictive model. The predictive accuracy of the final predictive model is assessed by TPR, FPR, and P@10.
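A simplified sketch of this final step, under the same assumptions as above (a 70/30 split and a logistic regression standing in for the selected approach), might look as follows; it is illustrative rather than the project's code.

```python
# Sketch: re-train the selected approach on all data reserved for the outer CV
# (70% of the sample) and measure accuracy once on the external validation set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_cv, X_ext, y_cv, y_ext = train_test_split(X, y, test_size=0.30,
                                            stratify=y, random_state=0)

final_model = LogisticRegression(max_iter=1000).fit(X_cv, y_cv)   # selected approach
risk_ext = final_model.predict_proba(X_ext)[:, 1]                 # risk scores on held-out data
print("external-validation AUC:", roc_auc_score(y_ext, risk_ext))
```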

5. Results for Grade 8

This section first examines the predictive accuracy of the five modelling approaches, the Baseline Logit, L1-Baseline, L1-Interactive, RF, and XgBoost, using the sample of Grade 8 students. It compares the average AUC, distribution of AUC values from the 5-fold CV, ROC curves, average P@10 of risk score distributions, and distribution of P@10 values for all approaches. The section continues with the validation of a selected approach on data that are not used in the training of the models (i.e., the external validation set). It then follows with the presentation of risk scores (i.e., the predicted probability of not graduating on time at the individual student level), including a comparison of the predicted probabilities of not graduating on time with actual outcomes. It subsequently describes the selection of the predicted probability threshold for evaluating predictive accuracy and potentially determining which students to target with student success initiatives. The section concludes with the examination of the importance of the FSA scores for predictive accuracy.


5.1 Comparison of the Modelling Approaches

Average AUC

Figure 7 presents the average AUCs for the five modelling approaches. The average AUC is similar across all approaches, ranging from 0.769 to 0.785, but the XgBoost and L1-Interactive approaches yield slightly higher AUCs than the others, while the RF approach performs below the others.

Figure 7: Average AUC by Approach, Grade 8

The use of average AUC is a simple and clear criterion, but to provide a broader perspective of accuracy, the distribution of AUC values from the CV, ROC curves, as well as P@10 of the risk score distributions are compared across the approaches.

Figure 8 shows the distribution of AUC values from the CV and demonstrates that they are similar across the different iterations of the 5-fold CV (i.e., based on the testing sets from the five different partitions of the entire training data), with the horizontal lines representing the average AUC values given in Figure 7 (i.e., the XgBoost is again seen to perform slightly better than the simpler approaches, with RF lagging behind). Overall, the XgBoost and the RF approaches exhibit slightly more variation in AUC values compared to the other approaches.

Figure 8: Distribution of AUCs by Approach, Grade 8

ROC Curves

Figure 9 presents the ROC curves for the different approaches to compare predictive accuracy at a more detailed level than the summary values provided by the AUCs. As mentioned earlier, higher ROC curves represent greater predictive accuracy, and points on the curve also allow for the comparison of the TPR for a given FPR value. In particular, there is a trade-off between the TPR and FPR: increasing one requires accepting an increase in the other. The shape of the ROC curves represents this trade-off. For example, for an FPR value of 0.3, all the approaches yield a TPR of approximately 0.72, except the RF approach, which shows a TPR of 0.70. The difference in TPR values between the RF and the other approaches changes for different FPR values.

Figure 9: ROC Curves by Approach, Grade 8


The ROC curves for all the approaches, except the RF, largely overlap, which is consistent with the very close AUC values seen in Figure 7 and also indicates very similar TPR values for all given FPR values. This further points to the similar predictive accuracy of the approaches. The one exception is the ROC curve for the RF approach, which lies slightly below the other approaches for almost all possible FPR values, yielding the lowest AUC.

Average Precision at the Top 10% (P@10)

Figure 10 shows the P@10 values for the top decile of the risk score (i.e., predicted probability of not graduating on time) distribution for each approach. This measure represents the predictive accuracy of a model focusing on the proportion of true positive predictions among students with risk scores in the top decile, as discussed in more detail above.

The XgBoost approach has the highest precision value, while the Baseline and L1-Baseline approaches have the lowest. The differences across the approaches are slightly larger compared to the average AUC values. The XgBoost approach outperforms the other approaches by 0.6 to 1.5 percentage points, although these differences are not very large.

For example, suppose there are 1,000 students in the top decile of the risk score distribution. Out of the 1,000, the XgBoost approach would correctly predict 717 to not graduate on time, whereas the Baseline Logit approach would correctly predict 702 students. Therefore, the Baseline Logit approach would fail to predict 15 students who would in fact not graduate on time. This has particular implications for targeting student success initiatives on those students who need them the most.


The distribution of precision values shows a little more variation across the CV iterations (Figure 11) compared to the AUCs (Figure 8), but there are no major outliers that could bias the average P@10 values shown in Figure 10.

A Selected Modelling Approach

Overall, the findings show that the XgBoost approach outperforms the other approaches in terms of average AUC and P@10, and also performs well based on the ROC curves and the distribution of AUC and P@10 values. The XgBoost approach is, however, also one of the more complex approaches considered in this analysis.

The complexity of a model could be an important consideration, because greater complexity could have disadvantages in terms of interpretability and ease of using the model in practice. Therefore, if the improvement in predictive accuracy is relatively small when a more complex approach is used, it may be better to use a simpler approach.

In the present context, the XgBoost approach improves the predictive accuracy only slightly compared to the Baseline Logit approach, which is the simplest approach among all five approaches considered. The Baseline Logit approach is, therefore, selected to illustrate the predictive accuracy that would be expected with new data, because not only is its predictive accuracy comparable to the XgBoost approach (only a 0.3 and 1.5 percentage point difference in the average AUCs and P@10s, respectively), but it would also be considerably easier to implement than the XgBoost approach.

Thus, the remaining results for Grade 8 students use the Baseline Logit approach to illustrate the various predictive accuracy measures on the external validation set and individual risk scores (i.e., predicted probabilities of not graduating on time).

Figure 10: Average P@10 by Approach, Grade 8


Figure 11: Distribution of P@10 by Approach, Grade 8


5.2 External Validation of the Selected Approach

After selecting an approach to build a predictive model using the training data (i.e., 70% of the entire data as outlined in Section 4.3), the model is then trained (or estimated) on the entire training data, and its AUC, ROC curve, and P@10 are calculated using the external validation set (i.e., the remaining 30%) that was set aside and isolated from the initial stage of approach selection. Comparing the AUC and P@10 measures, as well as the ROC curves, from the CV produced during the modelling approach selection stage with those from the external validation set demonstrates how well the predictive model generalizes to new, unseen data.

Figure 12 and Figure 13 show that the AUC and P@10 measures and the ROC curves, respectively, are very close for the CV and the external validation set. Therefore, it is expected that the predictive model estimated using the selected Baseline Logit approach will produce similar results when applied to new data. There may be discrepancies, however, especially if there are significant structural differences in new data in terms of the relationships between observable characteristics and the outcome variable, as well as differences in the way the predictors are coded in future data.

From this point on, the accuracy measures are calculated using the external validation set to better reflect the expected predictive accuracy on new data.


Figure 12: AUC and P@10 for CV and External Validation Set, Grade 8

Figure 13: ROC Curves for CV and External Validation Set, Grade 8


5.3 Risk Scores: Predicted Probability of Not Graduating on Time

As explained in Section 4.5, given the parameter estimates of the predictive model estimated on the entire training data (i.e., 70% of the sample), each student in the external validation set (i.e., the remaining 30%) is assigned a predicted probability of not graduating on time ranging between 0 and 1, which is referred to as a risk score in this report. These risk scores provide a ranking of students, with higher values signalling that a student has a higher expected probability of not graduating on time.

Figure 14 shows the distribution of students across risk scores, which range from 0.01 to 0.99. The peak of the distribution is between 0.05 and 0.10, and the proportion of students declines steadily after that.

Figure 15 shows the cumulative distribution of students across risk scores, with 50% of the students having risk scores of 0.16 or below and 75% having risk scores of 0.30 or below.

Figure 14: Distribution of Risk Scores, Grade 8


Figure 15: Cumulative Distribution of Risk Scores, Grade 8

The performance of the predictive model in terms of how well it ranks students could be examined by comparing these risk scores with actual outcomes for students at each risk score level.


Figure 16 shows the risk scores on the x-axis and the corresponding actual proportion of students who did not graduate on time on the y-axis.18 The actual proportion of students who did not graduate on time not only increases consistently with the risk score level, but there is also a close match between the actual proportion of students and the risk scores. In other words, the Baseline Logit approach generally ranks students correctly and generates predicted probabilities of not graduating on time that are very close to the actual rates of not graduating on time.
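A minimal sketch of how an empirical risk curve of this kind could be constructed is shown below, assuming the `y` and `score` vectors used in the earlier sketches (actual outcomes and risk scores for the external validation set); the bin width is an arbitrary choice made only for illustration.

```python
# Sketch of an empirical risk curve: group students by risk score and compare
# each group's actual rate of not graduating on time with the risk score level.
import numpy as np

def empirical_risk_curve(y, score, bin_width=0.05):
    y, score = np.asarray(y), np.asarray(score)
    bins = np.arange(0, 1 + bin_width, bin_width)
    centres, actual_rates = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (score >= lo) & (score < hi)
        if in_bin.sum() > 0:
            centres.append((lo + hi) / 2)
            actual_rates.append(y[in_bin].mean())  # actual rate of not graduating on time
    return np.array(centres), np.array(actual_rates)
```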

These risk scores could be used to target student success initiatives on students at higher risk levels. In particular, the cumulative distribution shows the number of students with a certain risk score or below (e.g., a pre-determined risk score threshold) or the risk score level that corresponds to a pre-determined number of students with higher risk levels to be targeted.

Figure 16: Empirical Risk Curve, Grade 8

5.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach

An alternative approach to deciding which students will be targeted with initiatives is to choose a risk level threshold which determines which students are expected to graduate on time or not according to the predictive model. These binary predictions (i.e., expected to graduate on time or not), produced given any risk score threshold, could be used to compute predictive accuracy measures such as the TPR, FPR, and precision (not just P@10).

18 Aguiar et al. (2015) call the plot shown in Figure 16 the empirical risk curve.

As discussed in Section 4, each threshold on the ROC curve (Figure 13) represents a different pair of TPR and FPR values. Therefore, in order to discuss the performance of a predictive model in terms of the TPR and FPR values (as well as precision), a probability threshold (or risk score threshold) needs to be chosen.

One method of selecting a threshold is to choose the one that is closest to the top-left corner of the ROC curve, which represents an ideal point where there is 100% TPR and 0% FPR (see the left panel of Figure 13). Measuring the closest value to the top-left corner requires calculating the Euclidean distance of each point on the ROC curve to the top-left corner, using the external validation set (i.e., Figure 13). The threshold associated with the minimum distance to the top-left corner of the ROC curve could be used as the threshold to compute the TPR, FPR, and precision. The top-left corner threshold (i.e., 0.21) is marked with a dot in Figure 13 and a vertical dashed line in Figure 14. Of course, one could always choose a lower threshold, which would lead to a higher TPR, but also a higher FPR.

In contrast, the threshold could also be set by focusing on the students most at risk of not graduating on time. The dotted vertical line in Figure 14 marks a risk score threshold of 0.53, which corresponds to the top 10th percentile; that is, students with risk scores that lie to the right of this dotted line are in the top decile of the risk score distribution and thus are the most likely to not graduate on time.
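The two threshold rules just described could be implemented along the following lines; this is a hedged sketch using assumed variable names (`y_ext` and `risk_ext` for the external validation outcomes and risk scores), not the project's code.

```python
# Sketch of the two threshold rules: (1) the threshold closest to the top-left
# corner of the ROC curve (minimum Euclidean distance to TPR = 1, FPR = 0),
# and (2) the risk score marking the top decile of the risk score distribution.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_ext, risk_ext)
dist = np.sqrt(fpr**2 + (1 - tpr)**2)              # distance of each point to (0, 1)
top_left_threshold = thresholds[np.argmin(dist)]   # reported as about 0.21 for Grade 8

top_decile_threshold = np.quantile(risk_ext, 0.9)  # reported as about 0.53 for Grade 8
```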

Given a risk score or predicted probability threshold, the TPR, FPR, and precision values could now be calculated to evaluate the predictive accuracy of the selected approach on the external validation set. These are calculated using the number of true positive, false negative, false positive, and true negative predictions, which are shown in the confusion matrix in Figure 17, given the threshold of 0.21 selected using the top-left corner method explained above.

There are 109,742 students in the external validation set, with 17,377 and 60,904 of them being correctly predicted to not graduate on time (true positives) and to graduate on time (true negatives), respectively. Of all the students, 7,616 are predicted to graduate on time when they actually did not graduate on time (false negatives), and 23,845 are predicted to not graduate on time when they actually did graduate on time (false positives).

Figure 17: Confusion Matrix using a 0.21 Predicted Probability Threshold, Grade 8

                                          Actual Outcome
Predicted Outcome             Not Graduate on Time    Graduate on Time
Not Graduate on Time                    17,377              23,845
Graduate on Time                         7,616              60,904

The TPR is the proportion of students who actually did not graduate on time that are predicted correctly as not graduating on time, which is 0.70 = 17,377/(17,377 + 7,616) for the threshold 0.21. The FPR is the proportion of students who actually graduated on time that are predicted


incorrectly as not graduating on time, which is 0.28 = 23,845/(23,845+60,904). The precision is the proportion of students who are predicted as not graduating on time that actually did not graduate on time, which is 0.42 = 17,377/(17,377+23,845).
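These rates can be verified directly from the counts reported in Figure 17, for example:

```python
# Check the Grade 8 rates reported above from the Figure 17 counts.
tp, fp, fn, tn = 17_377, 23_845, 7_616, 60_904
print(round(tp / (tp + fn), 2))   # TPR       = 0.70
print(round(fp / (fp + tn), 2))   # FPR       = 0.28
print(round(tp / (tp + fp), 2))   # precision = 0.42
```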

TPR and FPR change with each probability threshold, with lower thresholds yielding higher TPRs and FPRs. In cases where a high TPR is desired, a lower threshold may be preferable. Conversely, a higher threshold would decrease the FPR, but also the TPR, which could be preferable when false positive detections are undesirable.

The precision also depends on the threshold choice, with higher thresholds generally resulting in higher precision values. In Section 5.1, a special case of the precision measure was shown, which is the average P@10.19

Figure 18 shows how the TPR, FPR, and precision change with the predicted probability threshold selected. It helps to illustrate the trade-off between TPR and FPR when the threshold is increased or decreased, and to make a more informed threshold decision by aiming for certain TPR and FPR levels. Similarly, if the focus is more on achieving a certain precision level from the predictive model, one could examine the right-most panel for how precision varies with each probability threshold. For example, for the threshold value of 0.21, the precision is 0.42. From a policy perspective, however, setting a threshold of 0.21 may be too costly, as it would flag close to 40% of all students as not graduating on time, and almost 60% of those flagged would in fact graduate on time.

One could set a higher threshold or focus on a certain portion of the risk score distribution to provide resources to students who may be at the greatest risk of not graduating on time. For example, a school may consider setting initiatives for students with risk scores in the top decile, which corresponds to the portion of the risk score distribution that is to the right of the dotted vertical line (threshold = 0.53). Among these students in the top decile, 70% are correctly predicted to not graduate on time.

Figure 18: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 8

19 The top 10th percentile of the risk score distribution corresponds to a risk score of 0.53.


5.5 Importance of the FSA Scores for Predictive Accuracy

This section examines how FSA scores affect the predictive accuracy of the Baseline Logit approach in terms of AUC, P@10, ROC curves, TPR, and FPR. To do this, the predictive model is re-trained on the entire training set (i.e., 70% of the entire data) without including the Grade 4 and 7 FSA scores in the model, and the various predictive accuracy measures are calculated for the model without the assessment scores.20
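As an illustration of this exercise, one could drop the FSA score columns, re-train the selected model, and re-score the external validation set roughly as follows; `fsa_columns` and the DataFrame-based setup are assumptions made only for illustration.

```python
# Sketch: re-train the selected model without the FSA score columns and
# compare accuracy on the external validation set. Assumes the predictors
# are held in pandas DataFrames and `fsa_columns` lists the FSA variables.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_cv_nofsa = X_cv.drop(columns=fsa_columns)
X_ext_nofsa = X_ext.drop(columns=fsa_columns)

model_nofsa = LogisticRegression(max_iter=1000).fit(X_cv_nofsa, y_cv)
auc_nofsa = roc_auc_score(y_ext, model_nofsa.predict_proba(X_ext_nofsa)[:, 1])
print("AUC without the FSA scores:", auc_nofsa)
```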

Figure 19 shows the AUC and P@10 values for the models with and without the FSA scores. The model including the FSA scores outperforms the one without the FSA scores by 4.1 and 6.7 percentage points in terms of AUC and P@10, respectively.

Figure 19: AUC and P@10 with and without the FSA Scores, Grade 8

20 The predictive accuracy measures computed in this section use the external validation set, rather than the CV sets.


As Figure 20 illustrates, the ROC curve for the model with the FSA scores also lies above the one for the model without the FSA scores for each threshold value, which results in higher AUC values for the model that includes the FSA scores, as seen in Figure 19.

Figure 21 shows the TPR, FPR, and precision graphs for the models with and without the FSA scores. The TPR values for the model without the FSA scores are almost always below the TPR values for the model with the FSA scores, except for a small range of threshold values at the lower end. For the threshold of 0.21, as shown by the vertical dashed line, the TPR for the model with the FSA scores is 10 percentage points higher than the one without (i.e., 0.70 vs. 0.60).21 The FPR values generally look similar between the two models, with a slightly higher FPR for the model with the FSA scores around the 0.21 threshold. However, the higher FPR for the model with the FSA scores is accompanied by a much higher TPR. Thus, the model with the FSA scores generally produces higher precision values than the one without the FSA scores.

21 For a threshold of 0.19, which corresponds to the top-left corner of the ROC curve for the model without the FSA scores, the TPRs are 0.74 and 0.65 for the models with and without the FSA scores, respectively.


Figure 20: ROC Curves with and without the FSA Scores, Grade 8

Figure 21: TPR, FPR and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 8


6. Results for Grade 5

6.1 Comparison of the Modelling Approaches

Average AUC

Figure 22 shows the average AUCs for the five modelling approaches. The average AUC is similar across all approaches, ranging from 0.709 to 0.739, but the XgBoost and L1-Interactive approaches yield slightly higher AUCs than the others, while the RF approach performs below the others. As Figure 23 illustrates, the AUCs also differ only minutely across different iterations of the CV.

Figure 22: Average AUC by Approach, Grade 5

Figure 23: Distribution of AUCs by Approach, Grade 5


ROC Curves

Figure 24 shows that the ROC curves for all approaches, except the RF approach, largely overlap, which is consistent with the similar AUC values seen above and also indicates very similar TPR and FPR values for these approaches. The ROC curve for the RF approach lies below the other approaches for almost all the FPR values, which results in the lowest AUC among all approaches (Figure 22).

Figure 24: ROC Curves by Approach, Grade 5


Average Precision at the Top 10% (P@10)

The XgBoost approach has the highest average P@10, while the RF approach has the lowest. The XgBoost approach outperforms the other approaches by 0.2 to 1 percentage points, although these differences are not very large.

For example, suppose there are 1,000 students in the top decile of the risk score distribution. Out of the 1,000 students, the XgBoost approach would correctly predict 665 to not graduate on time, while the Baseline Logit approach would correctly predict 655. Therefore, the Baseline Logit approach would fail to predict 10 students who would in fact not graduate on time.

Again, as Figure 26 shows, the precision values have more variation across the CV iterations compared to the AUCs. Nevertheless, the values do not vary by more than 2 percentage points across the iterations, with the Baseline Logit approach showing a slightly tighter distribution of P@10 values.

A Selected Modelling Approach

Again, the Baseline Logit approach is selected because not only is its predictive accuracy comparable to the XgBoost approach (e.g., only a 0.3 and 1 percentage point difference for AUC and P@10, respectively), but it is also easier to use in practice. Following the framework used for the Grade 8 results, the rest of the Grade 5 results below use the Baseline Logit approach.

Figure 25: Average P@10 by Approach, Grade 5


Figure 26: Distribution of P@10 by Approach, Grade 5


6.2 External Validation of the Selected Approach

Figure 27 and Figure 28 indicate that the predictive model estimated using the selected Baseline Logit approach will generalize well to new data, as the average AUC, P@10, and ROC curves are very close for the CV and the external validation set.22 The performance measures shown in the subsequent sections are calculated using the external validation set to better reflect expected predictive accuracy on new data.

Figure 27: AUC and P@10 for CV and External Validation Set, Grade 5

22 As noted above, similar predictive accuracy on the new data is conditional on no significant structural differences in new data in terms of the relationships between observable characteristics and the outcome variable.


Figure 28: ROC Curves for CV and External Validation Set, Grade 5


6.3 Risk Scores: Predicted Probability of Not Graduating on Time

Figure 29 shows the distribution of students across risk scores, which range from 0.02 to 0.97. The peak of the distribution is between 0.12 and 0.13, and the proportion of students declines steadily after that. Figure 30 shows the cumulative distribution of risk scores, with 50% of the students having risk scores at or below 0.19 and 75% at or below 0.31.

Figure 29: Distribution of Risk Scores, Grade 5


Figure 30: Cumulative Distribution of Risk Scores, Grade 5

Figure 31 illustrates that the actual proportion of students not graduating on time increases consistently with the risk score, and the risk score values match closely with the actual rates of not graduating on time; that is, the selected Baseline Logit approach generally ranks students correctly and provides good predictions of the probability of not graduating on time. There is a slight decrease in the actual rate for the highest risk scores (above 0.95), where the sample size is very small (10 students). This decrease is likely a result of large sampling error, and therefore may not be very meaningful.

Figure 31: Empirical Risk Curve, Grade 5

6.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach

As explained for the Grade 8 results, the predicted probability threshold associated with the minimum distance to the top-left corner of the ROC curve could be used to set the threshold for computing the TPR, FPR, and precision. The top-left corner probability threshold of 0.23 is marked with a dot in Figure 28 and with a vertical dashed line in Figure 29. The risk score marking the top 10th percentile is 0.50 and is shown as the dotted vertical line in Figure 29.

Figure 32 provides the confusion matrix for a probability threshold of 0.23. There are 111,667 students in the external validation set, with 17,749 predicted correctly to not graduate on time (i.e., true positives) and 58,832 to graduate on time (i.e., true negatives). Of all the students in the sample, 9,725 are predicted as graduating on time when they actually did not graduate on time (i.e., false negatives), and 25,361 are predicted as not graduating on time when they actually did graduate on time (i.e., false positives). Given these numbers, the TPR, FPR, and precision values for a threshold of 0.23 are 0.65, 0.30, and 0.41, respectively.

Figure 32: Confusion Matrix using a 0.23 Predicted Probability Threshold, Grade 5


                                          Actual Outcome
Predicted Outcome             Not Graduate on Time    Graduate on Time
Not Graduate on Time                    17,749              25,361
Graduate on Time                         9,725              58,832

Figure 33 shows how the TPR, FPR, and precision change with the predicted probability threshold. The vertical dashed and dotted lines respectively mark the thresholds of 0.23 and 0.50. Among the students in the top decile of the risk score distribution, 65.5% are correctly predicted to not graduate on time.23

Figure 33: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 5

6.5 Importance of the FSA Scores for Predictive Accuracy

Figure 34 shows the AUC and P@10 values for the models with and without the FSA scores. Akin to the Grade 8 results, the model including the FSA scores outperforms the one without the FSA scores by 3.4 and 5.2 percentage points in terms of the AUC and P@10 values,

23 The precision values at the upper end of risk score distribution show a fluctuating pattern as they are calculated using very few observations. Only 0.1% and 0.009% of the sample (144 and 10 students) have risk scores above 0.90 and 0.95. Therefore, these precision values at the very top of the distribution may not be very meaningful.


respectively. Again, as shown in Figure 35, the ROC curve for the model with the FSA scores lies above the one without the FSA scores.

Figure 34: AUC and P@10 with and without the FSA Scores, Grade 5

The TPR values for the model without the FSA scores are almost always below the TPR values for the model with the FSA scores, except for a small range of threshold values at the lower end. For a threshold of 0.23 (i.e., the vertical dashed line), the TPR for the model with the FSA scores is 7 percentage points higher than the one without (i.e., 0.65 vs. 0.58).24 The FPR values are generally similar between the two models, with a slightly higher FPR for the model with the FSA scores at the 0.23 threshold; however, the higher FPR for the model with the FSA scores is accompanied by a much higher TPR for that probability threshold, which may be desirable.

24 For a threshold of 0.22, which corresponds to the top-left corner of the ROC curve for the model without the FSA scores, the TPRs are 0.67 and 0.61 for the models with and without the FSA scores, respectively.


The model with the FSA scores generally produces higher precision values than the one without the FSA scores.

Figure 35: ROC Curves with and without the FSA Scores, Grade 5

Figure 36: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 5


7. Discussion

Overview

The work presented in this report is part of a broader research project being undertaken by the Education Policy Research Initiative for the BC Ministry of Education. The project is intended to improve policy makers’ understanding of on-time high school graduation and to develop tools that could inform policy initiatives that would ultimately lead to improved student outcomes.

The project is based on the PEN data, which represent an extraordinarily rich data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the BC school system until they leave, as well as province-wide FSA scores in reading, writing, and numeracy administered in Grade 4 and Grade 7, linked by students’ PEN.

The first phase of the project involved an analysis of the relationships between on-time graduation and a range of student characteristics, the Grade 4 and Grade 7 FSA scores, and school district information available in the PEN data.

The second phase of the project, covered in this report, focuses on the development of models that could be used to predict on-time graduation at the individual student level, which could then be used by the Ministry to target student success initiatives on at-risk students with the aim of improving their on-time graduation rates and possibly other outcomes.


This work draws upon recent advancements in machine learning to develop predictive models of not graduating on time using five established approaches implemented with the rich PEN data. This provides a unique and powerful basis for developing models that can predict students’ risk of not graduating on time, all placed in the BC context and intended to be of practical use to policy makers.

Predictive models are developed for students in Grade 5 and Grade 8, a choice guided by the timing of the Grade 4 and Grade 7 FSAs. These two models provide predictions that could be used to implement student success interventions at both an earlier point in time (Grade 5), as well as at a later point (Grade 8) when the predictions are improved, at least in part due to the availability of the later FSA scores. These two models thus provide policy makers with two distinct policy strategy options with respect to the timing of student success initiatives.25

On-time graduation is defined as receiving a Dogwood diploma within six years of starting Grade 8, which represents an established Ministry standard, as well as graduating within nine years of starting Grade 5, a definition developed for this project which is consistent with the Ministry Grade 8 standard.

The project set out to answer the following research questions:

1. Which approach provides the most accurate predictive models of on-time high school graduation in BC based on the PEN data available, including the Grade 4 and Grade 7 FSA scores?

2. How well do the Grade 5 and Grade 8 predictive models perform and how does the accuracy of the two models compare?

3. To what extent do the Grade 4 and Grade 7 FSA scores improve the accuracy of the predictions of on-time graduation?

Main Findings

Of the five approaches used to develop the predictive models, the XgBoost approach outperforms the others for both the Grade 8 and Grade 5 models by a relatively small margin, with the differences in accuracy depending on the specific measure of predictive accuracy used.

The XgBoost approach is, however, also one of the more complex approaches considered in this analysis. More generally, increasing the complexity of the modelling approach tends to have a limited effect on predictive accuracy, which is consistent with other findings in the literature where the data available are limited in terms of their ability to predict the outcome of interest—

25 Initiatives could also be implemented in other grades based on either the predictions generated by the Grade 5 and Grade 8 models developed here or with new models that incorporated any additional information that could potentially be available, such as recent grades.


notwithstanding the richness of the PEN data in the context of information on students and their schooling experiences and outcomes.26

In particular, the XgBoost approach improves the predictive accuracy only slightly compared to the Baseline Logit approach, which is the simplest of all five approaches considered. This simplicity has potential advantages in terms of interpretability and ease of using the model in practice.

In particular, student-level risk scores (i.e., the predicted probabilities of not graduating on time) could be produced quite easily using any spreadsheet program such as Excel by plugging the student-level information corresponding to the variables used in the model (in exactly the same format) into the relatively simple mathematical formula that comprises the Baseline Logit model that has been developed. In contrast, the XgBoost model would require the use of a statistical programming language to produce the student-level risk scores.
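For illustration, the spreadsheet-style calculation amounts to taking the logistic transform of a weighted sum of a student's predictor values; the coefficients and predictors below are made up solely to show the arithmetic and are not the estimated model.

```python
# Illustrative only: a Baseline Logit risk score is the logistic transform of
# a weighted sum of predictor values. Coefficients and predictors are hypothetical.
import math

coefficients = {"intercept": -2.0, "fsa_reading": -0.8, "fsa_numeracy": -0.6}  # hypothetical
student = {"fsa_reading": 0.4, "fsa_numeracy": -0.2}                            # hypothetical

linear_index = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in student.items()
)
risk_score = 1 / (1 + math.exp(-linear_index))   # predicted probability of not graduating on time
print(round(risk_score, 3))
```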

We therefore select the Baseline Logit approach to assess the predictive accuracy that would be expected with new cohorts of students. Given the relatively comparable accuracy across the models developed using the other approaches, similar results could be expected with them. Indeed, the predictive accuracy of the other approaches is similar to that of the Baseline Logit approach.

To assess how the models would perform on “new” data (i.e., new cohorts of students), the final Grade 5 and Grade 8 predictive models developed using the Baseline Logit approach are first trained (or further developed) on the entire training sets of data (representing 70% of the entire data) and then their level of predictive accuracy is calculated using the external validation sets that are isolated from the process of developing the models and the final model training stages.

The comparisons show that the predictions generated using the “new” data are very close to the initial results for both the Grade 8 and Grade 5 students. The evidence, thus, suggests that the models developed would provide good predictions for new cohorts of Grade 5 and Grade 8 students, which is the ultimate goal of the development of these predictive models.

To begin, each model yields a predicted probability of not graduating on time for each student— or their “risk score”. The accuracy tests first show that the risk scores associated with the selected Baseline Logit model do a very good job of ordering students by their actual leaving rates. That is, those students with higher risk scores do in fact tend to have higher actual rates of not graduating on time than those with lower risk scores. Furthermore, this ordering holds at a fine level of detail across relatively small differences in risk scores (e.g., .05 vs. .10, .10 vs. .15, etc.) and also across the entire spectrum of risk scores, from the lowest (i.e., predicted probability of leaving or risk score near 0) to the highest (risk score near 1.0).

26 Perlich, C., Provost, F., & Simonoff, J. S. (2003). Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4(Jun), 211-255.


Furthermore, the tests show that there is a close match between risk scores and actual rates of not graduating on time. For example, those students with a certain risk score level (.05, .10, etc., again over the entire range of risk scores from 0 to 1) generally do not graduate on time at approximately that same rate.

In other words, the selected Baseline Logit model generally ranks students correctly in terms of their actual leaving rates and generates predicted probabilities of not graduating on time that are very close to the actual rates of not graduating on time. This not only attests to the accuracy of the model, but points directly to one way the predictive model could be used in practice: that is, to target students according to their risk scores, a strategy which is discussed further below.

The other main metrics of predictive accuracy are TPR (True Positive Rate), FPR (False Positive Rate), and P@10 (precision at the top 10 percent). Overall, the findings show that the models produce fairly accurate predictions. The TPR, FPR, and P@10 are 70%, 28%, and 70%, respectively, for the Grade 8 model. The TPR means that, among all the students who actually did not graduate on time, 70% are predicted correctly as not graduating on time. The FPR is similarly interpreted and, among all the students who actually graduated on time, 28% are predicted incorrectly as not graduating on time. Finally, the P@10 value means that among all students with a risk score in the top decile, 70% are predicted correctly as not graduating on time. For the Grade 5 model, the TPR, FPR, and P@10 values are 65%, 30%, and 66%, respectively—slightly less accurate measures compared to the Grade 8 results.27 These represent quite good values within the education context.

Finally, the analysis also shows that the Grade 4 and Grade 7 FSA scores substantially improve the predictive accuracy of the model. For the Grade 8 model, including the Grade 4 and Grade 7 FSA scores increases the TPR from 60% to 70% (based on the selected threshold of 0.21). Likewise, for the Grade 5 students, including the FSA scores in the model increases the TPR from 58% to 65% (based on the selected threshold of 0.23). While the FPR values are generally similar between the models with and without the FSA scores, the P@10 values increase from 65% to 71% for the Grade 8 students and from 60% to 66% for the Grade 5 students when the FSA scores are included.

A simple example would also help illustrate the importance of including FSA scores in the predictive model. Suppose there are 1,000 students with risk scores in the top decile. For both Grade 8 and Grade 5 students, models that include the FSA scores would correctly identify 60 more students (710 versus 650 and 660 versus 600) as not graduating on time compared to models without the FSA scores.

27 The TPR and FPR rates are based on risk score thresholds (or predicted probability thresholds) of 0.21 and 0.23 for the Grade 8 and Grade 5 students, respectively, which represent the optimal trade-off between the FPR and TPR values as conventionally defined in the literature.


These findings regarding the importance of the FSA variables to the predictive power of the models are consistent with the first phase of this overall project, where strong relationships were found between FSA scores and on-time graduation.

Predictive Models as Policy Tools

A predictive model of on-time graduation of the type developed in this project could be used to target students in two main ways.

First, if an initiative aimed at improving on-time graduation is intended to support a specific number of students, presumably due to a limit in the resources available (i.e., budgetary restrictions), the risk scores produced by the predictive model could be used to target students with the highest estimated probabilities of not graduating on time. This would be done by ordering students by their risk scores and counting down until the designated number of students is identified. In this way, a policy maker could be assured that the resources available are targeted on those students in greatest need.

Second, true and false positive detections could be used to inform the targeting approach adopted. A true positive detection represents a case where a student is correctly predicted to not graduate on time, whereas a false positive detection represents a case where a student is predicted as not graduating on time when they actually do. If an initiative is directed to those students who are predicted to not graduate on time, the former (i.e., true positives) represents a case where resources would be directed on a student who needed assistance, whereas the latter would result in resources being spent on students who did not need assistance in the first place.

Targeting more students based on risk scores by choosing a lower risk score threshold—above which students would get the initiative and below which they would not—will increase the chances of true positive detection, as would be wished (that is, more of those students who need assistance would receive the initiative). Targeting more students would, however, also catch more students who would graduate on-time without the intervention. The selection of a risk score threshold therefore represents a trade-off between targeting more students in need versus allocating resources on those who do not need the support. It is not possible to increase one and lower the other simultaneously without more information on students.

The predictive models developed in this project could inform these trade-offs by guiding the selection of the threshold to be used to target students. This would be done using the true positive rates (TPRs) and false positive rates (FPRs) for each risk score threshold that have been produced in the course of the development of the Baseline Logit model (if that is, in fact, the model that is applied). Alternatively, if another model was chosen, a similar exercise could be carried out for it as well.

Choosing the preferred trade-off between true and false positives ultimately represents a policy decision, since it essentially reflects whether the policy maker prefers to err on the side of making sure as many of those who may need the assistance receive it at the cost of also helping some who do not need it, and vice versa.


Presumably, this choice will be guided at least in part by considerations of the costs of any initiative(s) and the associated expected benefits in terms of improved high school graduation rates. The subsequent gains that could accrue from higher high school graduation rates include higher PSE participation rates, better labour market outcomes, and improvements in a range of other outcomes at the individual and social level associated with high school graduation and further educational attainment (e.g., better health, increased civic engagement, lower crime rates, etc.). These benefits could also include any fiscal gains realised when a province’s population has a higher high school completion rate and sees improvements in associated outcomes, since these will tend to increase government tax revenues and reduce expenditures, such as those related to income support (e.g., Employment Insurance and Social Assistance), health care, policing, and more.

In principle, some interventions could potentially even pay for themselves if they were not too costly and were effective in increasing high school graduation rates and thereby lead to the other improved outcomes just noted and possibly others. Of course, such self-financing programs are the holy grail of social policy and are not common, but if there is likely one area where this is possible, early interventions to help students improve their life opportunities, starting with high school graduation, probably represents one of the best hopes in this regard.

Limitations

This project provides valuable insights into the development and assessment of predictive models of on-time graduation for BC students. As the final models performed quite well in terms of predictive accuracy, the student-level predictions that could be produced using these models could be used to target student success initiatives on those students who are at higher risk of not graduating on time. However, the analysis carried out here has two important limitations that are worth noting.

First, the incidence of not graduating on time is generally overstated in the PEN data, as students who leave school before graduation are not differentiated from those who leave the province; that is, both groups are simply observed to be no longer enrolled and are categorised as not graduating on time. This is probably a more serious concern for Grade 5 students than Grade 8 students, as the likelihood of leaving the province is presumably higher for younger students.

The identification of inter-provincial migration of students would, therefore, represent a valuable addition to future work, as this would allow the models to be revised to include only those students who remain in the province. That said, it is difficult to imagine how this could be done


with the data currently available, although careful use of linked tax data could potentially provide one option in this regard.28

A second limitation is that predictive modelling relies on historical data, and any model that is developed may become outdated with changes in the relationship between the predictors and the outcome of interest; in this case, on-time graduation. Factors of this type could include changes in schooling and other related policies, curriculum, or student characteristics not captured in the models. If this occurs, predictions for new cohorts of students generated by the predictive model may not be as accurate as those generated before any such changes or for the cohorts included in the actual development of the predictive model (as discussed above, as per the students used in the final assessments of model accuracy). For example, changes in the design and implementation of the FSA in 2017 may affect the performance of the predictive models developed for this project when the new FSA scores are used in the predictive model.

Directions for Future Work

While the predictive models developed here perform well, adding additional information on students could potentially lead to the development of even better models. For example, information related to students’ academic engagement or other aspects of their schooling experiences and outcomes (e.g., attendance rates, suspensions, other behaviour), to their situation outside of school (e.g., being in foster care or having contact with the Ministry for Family and Child Development), or to their families (e.g., family income or parental education levels), could potentially improve predictive accuracy.

It is recognised that the PEN data have been developed based on the underlying system data available and there are a range of considerations regarding which data could and should be included. It is equally understood that the PEN data are extremely rich and of remarkable depth and quality when placed not only in the Canadian context but even at the international level. And finally, it is important to recognise the kind of innovative analysis with practical policy applications that the PEN data have permitted in the two phases of this project on on-time graduation. Even still, it is worth stating that any further data enhancements of the PEN data could push these frontiers even further.

To start, making more of the variables currently included in the PEN data available for the purposes of developing predictive models of on-time graduation could lead to improved models and predictions.

At a broader level, bringing the PEN data into Statistics Canada’s Social Data Linkage Environment (SDLE), as has recently been done, may provide some extremely innovative

28 In particular, it is at least conceivable that students’ tax records could be used to identify their families of origin, with mothers and fathers then followed in their tax data to identify those who moved out of the province while the child (student) was of school-attending age.


opportunities in this regard. This could make the PEN data of even greater value across the range of uses to which they are put—including the development of the predictive models of on-time high school graduation developed here.

Beyond data developments of this nature, future research could involve the design, implementation, and evaluation of student success initiatives aimed at improving on-time graduation. It is, of course, one thing to develop a predictive model that allows initiatives to target at-risk students, but quite another to know which initiatives work best for which students when implemented—or to otherwise implement initiatives already known to improve student outcomes.

Using a risk score threshold to target initiatives, whereby students with risk scores above the threshold receive the initiative while those with scores below the threshold do not, would not only target initiatives on those students at higher risk of not graduating on time, but would also allow the use of regression discontinuity methods to identify the causal effects of the initiative.

Alternatively, random assignment approaches could be used to estimate the causal effects of an initiative, and these effects could be estimated at different risk levels to see which students benefit the most from the initiative.

Initiatives could be targeted on students either in Grade 5 or Grade 8, corresponding to the points at which the predictive models have been developed, or even in subsequent years (i.e., Grades 6 or 7, or Grades 9 to 12) using Grade 5 and Grade 8 risk scores, respectively.29 While on-time high school graduation for initiatives put in place in Grade 5 and Grade 8 would ultimately be measured as far out as nine and six years later, interim assessments could include the monitoring and assessment of academic progress (i.e., who continues to be enrolled and advancing through their studies) in the intervening years.

A third general line of future research could involve examining the relationships between the risk scores generated by the Grade 5 and Grade 8 models and other outcomes, such as access to post-secondary education (as analysed in an earlier project using the PEN data) or even students’ post-schooling labour market earnings, now that the PEN data have been linked to tax data.

As shown in this report, the models developed produce relatively accurate predictions of on-time high school graduation, and—with a high school diploma generally representing a prerequisite for entering PSE (with some special access program exceptions) and PSE in turn generally representing the starting point for labour market success—these predictions of on-time high school graduation could provide a forward view of this series of later outcomes. In addition, the

29 Alternatively, as mentioned earlier, predictive models could be developed as of those other grades rather than Grade 5 and Grade 8.


risk scores related to on-time graduation may also be correlated with other student characteristics and related factors that have their own (independent) influence on labour market success.

Finally, not only would it potentially be interesting to examine the relationships between risk scores and access to PSE and later earnings related to on-time high school graduation, predictive models of these outcomes based on the PEN data could be developed using methods similar to those employed here.

The PEN data represent a remarkable resource for improving our understanding of a range of schooling and post-schooling outcomes and for developing predictive models of a comparable range of outcomes for which the current project represents an excellent starting point.

References

Adelman, M., Haimovich, F., Ham, A., & Vazquez, E. (2018). Predicting school dropout with administrative data: New evidence from Guatemala and Honduras. Education Economics, 26(4), 356-372.

Aguiar, E., Lakkaraju, H., Bhanpuri, N., Miller, D., Yuhas, B., & Addison, K. L. (2015, March). Who, when, and why: a machine learning approach to prioritizing students at risk of not graduating high school on time. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge. https://www3.nd.edu/~dial/publications/aguiar2015who.pdf

Allensworth, E. M., & Easton, J. Q. (2005). The on-track indicator as a predictor of high school graduation. Chicago: Consortium on Chicago School Research. Retrieved from https://consortium.uchicago.edu/sites/default/files/publications/p78.pdf

Arnold, K. E., & Pistilli, M. D. (2012, April). Course signals at Purdue: Using learning analytics to increase student success. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge.

Baker, R. D. (2011). Data mining for education. In B. McGaw, P. Peterson, and E. Baker (Eds.), International Encyclopedia of Education (pp. 112-114). Amsterdam: Elsevier.

Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3-17.

Balfanz, R., Herzog, L., & Mac Iver, D. J. (2007). Preventing student disengagement and keeping students on the graduation path in urban middle-grades schools: Early identification and effective interventions. Educational Psychologist, 42(4), 223-235.

Bowers, A. J. (2010). Analyzing the longitudinal K-12 grading histories of entire cohorts of students: Grades, data driven decision making, dropping out and hierarchical cluster analysis. Practical Assessment Research and Evaluation, 15(7), 1-18.


Bowers, A. J., & Sprott, R. (2012a). Examining the multiple trajectories associated with dropping out of high school: A growth mixture model analysis. Journal of Educational Research, 105(3), 176-195.

Bowers, A. J., Sprott, R., & Taff, S. A. (2012). Do we know who will drop out? A review of the predictors of dropping out of high school: Precision, sensitivity, and specificity. The High School Journal, 96(2), 77-100.

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.

Brooks, C., & Thompson, C. (2017). Chapter 5: Predictive modelling in teaching and learning. In C. Lang, G. Siemens, A. Wise, & D. Gašević (Eds.), Handbook of learning analytics (pp. 61-68). Solar. Retrieved from https://pdfs.semanticscholar.org/2cd4/901b07f3562f98e1e56dc5712e8bc03bdc2e.pdf

Campbell, J. P., deBlois, P. B., & Oblinger, D. G. (2007). Academic analytics: A new tool for a new era. EDUCAUSE Review, 42(4), 40-57.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Chung, J. Y., & Lee, S. (2019). Dropout early warning systems for high school students using machine learning. Children and Youth Services Review, 96, 346-353.

Croninger, R. G., & Lee, V. E. (2001). Social capital and dropping out of high school: Benefits to at-risk students of teachers' support and guidance. Teachers College Record, 103(4), 548-581.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-25.

Gašević, D., Buckingham Shum, S., Nelson, K., Alexander, S., Lockyer, L., Kennedy, G., et al. (2016). Student retention and learning analytics: A snapshot of Australian practices and a framework for advancement. Sydney, NSW: Australian Office of Learning & Teaching. Retrieved from http://www.olt.gov.au/system/files/resources/SP13_3249_Dawson_Report_2016.pdf.

Gleason, P., & Dynarski, M. (2002). Do we know whom to serve? Issues in using risk factors to identify dropouts. Journal of Education for Students Placed at Risk, 7(1), 25-41.


Hlosta, M., Zdrahal, Z., & Zendulka, J. (2017). Ouroboros: Early identification of at-risk students without models based on legacy data. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference (pp. 6-15).

Janosz, M., Archambault, I., Morizot, J., & Pagani, L. S. (2008). School engagement trajectories and their differential predictive relations. Journal of Social Issues, 64(1), 21-40.

Johnson, R. A., Gong, R., Greatorex-Voith, S., Anand, A., & Fritzler, A. (2015). A data-driven framework for identifying high school students at risk of not graduating on time. Conference proceedings in the Bloomberg Data for Good Exchange Conference. Retrieved from https://www3.nd.edu/~dial/publications/johnson2015data.pdf

Knowles, J. E. (2015). Of needles and haystacks: Building an accurate statewide dropout early warning system in Wisconsin. Journal of Educational Data Mining, 7(3), 18-67.

Lakkaraju, H., Aguiar, E., Shan, C., Miller, D., Bhanpuri, N., Ghani, R., & Addison, K. L. (2015). A machine learning framework to identify students at risk of adverse academic outcomes. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1909-1918).

Mahoney, J. L., & Cairns, R. B. (1997). Do extracurricular activities protect against early school dropout? Developmental Psychology, 33(2), 241-253.

O'Cummings, M., & Therriault, S. B. (2015). From accountability to prevention: Early warning systems put data to work for struggling students. Washington, D.C.: American Institutes for Research.

Pagani, L. S., Vitaro, F., Tremblay, R. E., McDuff, P., Japel, C., & Larose, S. (2008). When predictions fail: The case of unexpected pathways toward high school dropout. Journal of Social Issues, 64(1), 175-194.

Perlich, C., Provost, F., & Simonoff, J. S. (2003). Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4(Jun), 211-255.

Sansone, D. (2018). Beyond early warning indicators: High school dropout and machine learning. Oxford Bulletin of Economics and Statistics, early release.

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289-310.

Sullivan, W., Marr, J., & Hu, G. (2017). A predictive model for standardized test performance in Michigan schools. In R. Lee (Ed.), Applied computing and information technology (pp. 31-46). New York: Springer.


Glossary

AUC: The area under the receiver operating characteristic (ROC) curve is a metric used to assess a predictive model’s ability to accurately predict outcomes. The ROC curve considers all probability thresholds; the closer it lies to the top left corner of the graph (which represents an ideal scenario), the more accurate the model. A larger AUC therefore corresponds to a ROC curve that is closer to the top left corner and to better predictive performance.
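For illustration, the AUC could be computed from predicted risk scores as in the following sketch, which uses Python and the scikit-learn library; the labels, scores, and variable names are illustrative only and are not drawn from the PEN data.

    # Illustrative only: synthetic labels and risk scores, not the PEN data.
    from sklearn.metrics import roc_auc_score

    # y_true: 1 = did not graduate on time (the positive outcome), 0 = graduated on time
    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    # y_score: predicted probability of not graduating on time (the risk score)
    y_score = [0.10, 0.35, 0.60, 0.20, 0.80, 0.45, 0.15, 0.90]

    # An AUC of 1.0 indicates a perfect ranking of students by risk; 0.5 is no better than chance.
    print(roc_auc_score(y_true, y_score))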

Classification problem: In this report, determining whether a student will not graduate on time is the classification problem at hand. Since this is a binary classification problem, the output of the predictive model will be one of two mutually exclusive outcomes: will or will not graduate on time.

Cross-validation: Cross-validation (CV) is a validation method that separates the data into two sets (subsamples): a training set and a validation set. This analysis uses a 5-fold CV approach. The dataset is separated into 5 random, non-overlapping parts called folds. Four of the folds are used as a training set to train the model with a given set of tuning parameters, and the 5th fold is used as a validation set to estimate the performance of the model under that set of tuning parameters. This is repeated 5 times (once with each fold serving as the validation set), and the performance metric is then averaged across folds to produce a value reflecting the predictive performance of the selected tuning parameters.
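A minimal sketch of 5-fold cross-validation with tuning-parameter selection is shown below, using scikit-learn; the synthetic data, placeholder model, and tuning grid are assumptions for illustration and do not reproduce the report's actual specification.

    # Illustrative sketch of nested 5-fold cross-validation; placeholders, not the report's specification.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8], random_state=0)

    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # selects tuning parameters
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # estimates performance

    # Inner loop: choose the L1 penalty strength C using 5-fold CV on the training folds.
    model = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=inner_cv,
    )

    # Outer loop: train the tuned model on 4 folds, score it on the 5th, and average the AUCs.
    auc_per_fold = cross_val_score(model, X, y, scoring="roc_auc", cv=outer_cv)
    print(auc_per_fold.mean())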

External validation set: The external validation set used in this report is the portion of the data (a random 30%) that is set aside from the training process and used to evaluate the expected predictive accuracy of the model on new, unseen data.

False positive rate: The false positive rate (FPR) captures the proportion of students who graduate on time (i.e., the negative outcome as defined for the purposes of this project) who are incorrectly predicted to not graduate on time. The false positive rate is the x-axis of a ROC curve. The FPR is defined as: FPR = False Positives / (False Positives + True Negatives). A low false positive rate is preferable.
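The following sketch shows how the FPR (and, analogously, the true positive rate defined below) could be computed from predictions at a single probability threshold; the labels, risk scores, and the 0.21 threshold are used purely for illustration.

    # Illustrative computation of the FPR and TPR at one probability threshold (synthetic data).
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])       # 1 = did not graduate on time
    y_score = np.array([0.10, 0.35, 0.60, 0.20, 0.80, 0.45, 0.15, 0.90])

    y_pred = (y_score >= 0.21).astype(int)             # flag students at or above the threshold

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn)    # share of on-time graduates incorrectly flagged
    tpr = tp / (tp + fn)    # share of non-graduates correctly flagged
    print(fpr, tpr)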

Feature selection: Using too many predictors in a predictive model may lead to overfitting, and it is not immediately evident which subset of variables should be used as predictors. Some predictive modelling approaches have built-in feature selection, which automatically removes predictors that do not improve predictive performance, with little additional computational cost.

Negative outcome: In this report, the negative outcome is defined as graduating on time. Also see the definition of “Positive outcome”.

Outcome: The outcome used in this report is a binary one, which represents whether a student graduated on time or not.

Overfitting: Creating a model that matches the training data so closely that it fails to make correct predictions on new data. This can result from having a small number of observations or from using a very complex and flexible model that fits even the idiosyncrasies of the training data.
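Under illustrative assumptions (synthetic data, scikit-learn decision trees), the sketch below shows how a fully grown tree will typically fit the training data almost perfectly while scoring worse on held-out data than a simpler, depth-limited tree.

    # Illustrative example of overfitting using synthetic data (not the PEN data).
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for depth in (None, 3):   # None = fully grown (flexible) tree; 3 = restricted tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
        print(depth, round(train_auc, 3), round(test_auc, 3))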

Positive outcome: The positive outcome is the event that the predictive model is set to predict. In this report, the positive outcome is defined as not graduating on time.

Precision at top 10% (P@10): Precision is the proportion of true positive predictions among students who are predicted to not graduate on time (i.e., the positive outcome). Precision at top 10% (P@10) is the proportion of true positive predictions among students with the top 10% highest predicted probability of not graduating on time.
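A minimal sketch of how P@10 could be computed from a set of risk scores follows; the labels and scores are synthetic and the variable names are illustrative assumptions.

    # Illustrative computation of precision at top 10% (P@10); labels and scores are synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)    # 1 = did not graduate on time
    y_score = rng.random(size=1000)           # predicted risk scores

    k = int(0.10 * len(y_score))              # number of students in the top 10%
    top_k = np.argsort(y_score)[::-1][:k]     # indices of the k highest risk scores
    print(y_true[top_k].mean())               # share of those students who are true positives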

Predictive accuracy: Often referred to as out-of-sample predictive accuracy in the machine learning literature, predictive accuracy is the ability of a predictive model to produce accurate predictions for new observations.

Predictor: A predictor, also referred to as a feature, is an input variable for each observation (e.g., participant age, sex, and program information are all predictors) used in making predictions.

Probability threshold: A probability or classification threshold is a cut-off applied to the predicted probability in order to separate the positive class from the negative class; each threshold corresponds to a single point on the ROC curve. The threshold is used when mapping predicted probabilities to the two classes of a binary classification problem. Changing the threshold value has a direct impact on the true positive and false positive rates.

Risk score: Predictive models produce a predicted probability of not graduating on time (between 0 and 1), with higher values indicating higher likelihood of not graduating on time. In this report, these predicted probabilities of not graduating on time are referred to as risk scores.


ROC curve: For each student, predictive models produce a predicted probability of not graduating on time (between 0 and 1). If closer to 1, a student is more likely to not graduate on time. The Receiver Operating Characteristic (ROC) curve traces true positive and false positive rates for different probability threshold values.
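The ROC curve can be traced by sweeping the probability threshold and recording the resulting (FPR, TPR) pairs; a sketch using scikit-learn follows, with illustrative labels and scores rather than actual model output.

    # Illustrative trace of a ROC curve using scikit-learn; labels and scores are synthetic.
    from sklearn.metrics import roc_curve

    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    y_score = [0.10, 0.35, 0.60, 0.20, 0.80, 0.45, 0.15, 0.90]

    # roc_curve sweeps the probability threshold and returns one (FPR, TPR) pair per threshold.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")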

Training set: The subset of the data set used to train a model.

Training: Training or fitting a model is the process of estimating the parameters of a model. The coefficients in a logistic regression, for example, are estimated during the training process.

True positive rate: The true positive rate (TPR) captures the proportion of students who do not graduate on time who are correctly predicted to not graduate on time. The true positive rate is the y-axis of a ROC curve. The TPR is defined as: TPR = True Positives / (True Positives + False Negatives). A high true positive rate is preferable.

Tuning parameters: Tuning parameters are optional hyperparameters used by certain predictive models, such as the penalty (regularization) parameter of the L1-regularized logistic regression or the structure of a decision tree. These hyperparameters cannot be inferred directly from the information on the outcomes or the predictors.

Tuning: Tuning is a part of model training but is only required for sophisticated predictive approaches using models with specific hyperparameters or complex structures.

Underfitting: Underfitting refers to the inability of a model to capture the fundamental relationship between the predictors and the outcome of interest.

Validation set: A subset of the data set – separate from the training set – used to adjust hyperparameters and structures or to test predictive performance.
