The Development of a Predictive Model of On-Time High School Graduation in British Columbia
May 1, 2019

Ross Finnie
Eda Suleymanoglu
Ashley Pullman
Michael Dubois

Table of Contents

Executive Summary
1. Introduction
2. Literature Review
3. Data
   3.1 Outcome Variables of Interest
   3.2 Predictor Variables
   3.1 Sample Selection
4. Methodology
   4.1 Predictive Model
   4.2 Predictive Accuracy
   4.3 Cross-Validation Method
   4.4 Modelling Approaches
   4.5 Evaluation Methodology
5. Results for Grade 8
   5.1 Comparison of the Modelling Approaches
   5.2 External Validation of the Selected Approach
   5.3 Risk Scores: Predicted Probability of Not Graduating on Time
   5.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach
   5.5 Importance of the FSA Scores for Predictive Accuracy
6. Results for Grade 5
   6.1 Comparison of the Modelling Approaches
   6.2 External Validation of the Selected Approach
   6.3 Risk Scores: Predicted Probability of Not Graduating on Time
   6.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach
   6.5 Importance of the FSA Scores for Predictive Accuracy
7. Discussion
References
Glossary

Table of Figures

Figure 1: 5-Fold Cross-Validation
Figure 2: Nested 5-Fold Cross-Validation
Figure 3: Nested 5-Fold CV and the External Validation Set
Figure 4: Decision Tree Example
Figure 5: Scenarios when Comparing Actual vs. Predicted Outcomes
Figure 6: Example ROC Curves
Figure 7: Average AUC by Approach, Grade 8
Figure 8: Distribution of AUCs by Approach, Grade 8
Figure 9: ROC Curves by Approach, Grade 8
Figure 10: Average P@10 by Approach, Grade 8
Figure 11: Distribution of P@10 by Approach, Grade 8
Figure 12: AUC and P@10 for CV and External Validation Set, Grade 8
Figure 13: ROC Curves for CV and External Validation Set, Grade 8
Figure 14: Distribution of Risk Scores, Grade 8
Figure 15: Cumulative Distribution of Risk Scores, Grade 8
Figure 16: Empirical Risk Curve, Grade 8
Figure 17: Confusion Matrix using a 0.21 Predicted Probability Threshold, Grade 8
Figure 18: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 8
Figure 19: AUC and P@10 with and without the FSA Scores, Grade 8
Figure 20: ROC Curves with and without the FSA Scores, Grade 8
Figure 21: TPR, FPR and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 8
Figure 22: Average AUC by Approach, Grade 5
Figure 23: Distribution of AUCs by Approach, Grade 5
Figure 24: ROC Curves by Approach, Grade 5
Figure 25: Average P@10 by Approach, Grade 5
Figure 26: Distribution of P@10 by Approach, Grade 5
Figure 27: AUC and P@10 for CV and External Validation Set, Grade 5
Figure 28: ROC Curves for CV and External Validation Set, Grade 5
Figure 29: Distribution of Risk Scores, Grade 5
Figure 30: Cumulative Distribution of Risk Scores, Grade 5
Figure 31: Empirical Risk Curve, Grade 5
Figure 32: Confusion Matrix using a 0.23 Predicted Probability Threshold, Grade 5
Figure 33: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 5
Figure 34: AUC and P@10 with and without the FSA Scores, Grade 5
Figure 35: ROC Curves with and without the FSA Scores, Grade 5
Figure 36: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 5

List of Acronyms

AUC: Area Under the Curve
CV: Cross-Validation
FPR: False Positive Rate
FSA: Foundation Skills Assessment
P@10: Precision at the top 10%
PEN: Personal Education Number
RF: Random Forest
ROC: Receiver Operating Characteristic
TPR: True Positive Rate
XgBoost: Extreme Gradient Boosting

Executive Summary

The work presented in this report is part of a broader research project being undertaken by the Education Policy Research Initiative for the BC Ministry of Education. The project is intended to improve policy makers’ understanding of on-time high school graduation and to develop tools that could be used in policy initiatives that would ultimately lead to improved student outcomes.

The project is based on the BC PEN data, an extraordinarily rich data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the British Columbia (BC) school system until they leave. The data also include province-wide Foundation Skills Assessment (FSA) scores in reading, writing, and numeracy, administered in Grade 4 and Grade 7. All records are linked by students’ Personal Education Numbers (PEN).

The first phase of the project involved an analysis of the relationships between on-time graduation and a range of student characteristics, the