The Development of a Predictive Model of On-Time High School Graduation in British Columbia

May 1, 2019

Ross Finnie, Eda Suleymanoglu, Ashley Pullman, Michael Dubois

Table of Contents

Executive Summary ...... 1
1. Introduction ...... 4
2. Literature Review ...... 5
3. Data ...... 7
3.1 Outcome Variables of Interest ...... 8
3.2 Predictor Variables ...... 8
3.3 Sample Selection ...... 9
4. Methodology ...... 10
4.1 Predictive Model ...... 10
4.2 Predictive Accuracy ...... 11
4.3 Cross-Validation Method ...... 11
4.4 Modelling Approaches ...... 14
4.5 Evaluation Methodology ...... 18
5. Results for Grade 8 ...... 21
5.1 Comparison of the Modelling Approaches ...... 22
5.2 External Validation of the Selected Approach ...... 27
5.3 Risk Scores: Predicted Probability of Not Graduating on Time ...... 29
5.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach ...... 31
5.5 Importance of the FSA Scores for Predictive Accuracy ...... 34
6. Results for Grade 5 ...... 37
6.1 Comparison of the Modelling Approaches ...... 37
6.2 External Validation of the Selected Approach ...... 41
6.3 Risk Scores: Predicted Probability of Not Graduating on Time ...... 43
6.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach ...... 45
6.5 Importance of the FSA Scores for Predictive Accuracy ...... 46
7. Discussion ...... 49
References ...... 57
Glossary ...... 60


Table of Figures

Figure 1: 5-Fold Cross-Validation ...... 12
Figure 2: Nested 5-Fold Cross-Validation ...... 13
Figure 3: Nested 5-Fold CV and the External Validation Set ...... 14
Figure 4: Decision Tree Example ...... 17
Figure 5: Scenarios when Comparing Actual vs. Predicted Outcomes ...... 18
Figure 6: Example ROC Curves ...... 20
Figure 7: Average AUC by Approach, Grade 8 ...... 22
Figure 8: Distribution of AUCs by Approach, Grade 8 ...... 23
Figure 9: ROC Curves by Approach, Grade 8 ...... 23
Figure 10: Average P@10 by Approach, Grade 8 ...... 25
Figure 11: Distribution of P@10 by Approach, Grade 8 ...... 26
Figure 12: AUC and P@10 for CV and External Validation Set, Grade 8 ...... 28
Figure 13: ROC Curves for CV and External Validation Set, Grade 8 ...... 28
Figure 14: Distribution of Risk Scores, Grade 8 ...... 29
Figure 15: Cumulative Distribution of Risk Scores, Grade 8 ...... 30
Figure 16: Empirical Risk Curve, Grade 8 ...... 31
Figure 17: Confusion Matrix using a 0.21 Predicted Probability Threshold, Grade 8 ...... 32
Figure 18: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 8 ...... 33
Figure 19: AUC and P@10 with and without the FSA Scores, Grade 8 ...... 34
Figure 20: ROC Curves with and without the FSA Scores, Grade 8 ...... 36
Figure 21: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 8 ...... 36
Figure 22: Average AUC by Approach, Grade 5 ...... 37
Figure 23: Distribution of AUCs by Approach, Grade 5 ...... 37
Figure 24: ROC Curves by Approach, Grade 5 ...... 38
Figure 25: Average P@10 by Approach, Grade 5 ...... 39
Figure 26: Distribution of P@10 by Approach, Grade 5 ...... 40
Figure 27: AUC and P@10 for CV and External Validation Set, Grade 5 ...... 41
Figure 28: ROC Curves for CV and External Validation Set, Grade 5 ...... 42
Figure 29: Distribution of Risk Scores, Grade 5 ...... 43
Figure 30: Cumulative Distribution of Risk Scores, Grade 5 ...... 44
Figure 31: Empirical Risk Curve, Grade 5 ...... 45
Figure 32: Confusion Matrix using a 0.23 Predicted Probability Threshold, Grade 5 ...... 45
Figure 33: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 5 ...... 46
Figure 34: AUC and P@10 with and without the FSA Scores, Grade 5 ...... 47


Figure 35: ROC Curves with and without the FSA Scores, Grade 5 ...... 48
Figure 36: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 5 ...... 48


List of Acronyms

AUC: Area Under the Curve
CV: Cross-Validation
FPR: False Positive Rate
FSA: Foundation Skills Assessment
P@10: Precision at the Top 10%
PEN: Personal Education Numbers
RF: Random Forest
ROC: Receiver Operating Characteristic
TPR: True Positive Rate
XgBoost: Extreme Gradient Boosting


Executive Summary

The work presented in this report is part of a broader research project undertaken by the Education Policy Research Initiative for the BC Ministry of Education. The project is intended to improve policy makers’ understanding of on-time high school graduation and to develop tools that could potentially be used in policy initiatives that would ultimately lead to improved student outcomes.

The project is based on the BC PEN data, which represent an extraordinarily rich data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the British Columbia (BC) school system until they leave, as well as province-wide Foundation Skills Assessment (FSA) scores in reading, writing, and numeracy administered in Grade 4 and Grade 7, all linked by students’ Personal Education Numbers (PEN).

The first phase of the project involved an analysis of the relationships between on-time graduation and a range of student characteristics, the Grade 4 and Grade 7 FSA scores, and school district information available in the PEN data.

The second phase of the project, covered in this report, focuses on the development of models that predict on-time graduation at the individual student level, which could be used by the Ministry to target student success initiatives toward at-risk students to improve their on-time graduation rates and possibly other outcomes.

This work draws upon recent advancements in predictive modelling to develop models using five established approaches to predict the probability of on-time graduation at the individual student level.

Predictive models are developed for students in Grade 5 and Grade 8, a choice guided by the timing of the Grade 4 and Grade 7 FSAs. These two models provide predictions that could be used to implement student success interventions at both an earlier point in time (Grade 5) and at a later point (Grade 8).

The project addresses the following research questions:

1. Which approach provides the most accurate predictive models of on-time high school graduation in BC based on the PEN data available, including the Grade 4 and Grade 7 FSA scores?
2. How well do the Grade 5 and Grade 8 predictive models perform and how does the accuracy of the two models compare?
3. To what extent do the Grade 4 and Grade 7 FSA scores improve the accuracy of the predictions of on-time graduation?


The first main finding is that more complex modelling approaches tend to bring limited gains in predictive accuracy. Therefore, a logistic regression modelling approach, the simplicity of which has potential advantages in terms of interpretability and implementation in practice, is selected to assess the predictive accuracy of the models developed.

Second, the results suggest that the models would provide good predictions for new cohorts of Grade 5 and Grade 8 students, which is the ultimate goal of the development of these predictive models. In particular, the predicted probabilities of not graduating on time (i.e., “risk scores”) generated by the models do a very good job of ordering students by their actual leaving rates, and the risk scores are very close to the actual rates of not graduating on time.

The models also perform quite well in terms of true positive rates (TPRs), which represent students correctly predicted to not graduate on time, and false positive rates (FPRs), which represent students predicted to not graduate on time who in fact do graduate, in a context where a good model is one that generates higher TPRs and lower FPRs.

Third, the Grade 8 models yield more accurate predictions of not graduating on time compared to the Grade 5 models, at least in part due to the availability of the later FSA scores.

Fourth, the Grade 4 and Grade 7 FSA scores substantially improve the predictive accuracy of the models.

A predictive model of on-time graduation of the type developed in this project could be used to target students in two main ways. First, if an initiative aimed at improving on-time graduation is intended to support a specific number of students, presumably due to a limit in the resources available (i.e., budgetary restrictions), the risk scores produced by the predictive models could be used to target students with the highest estimated probabilities of not graduating on time. This would be done by ordering students by their risk scores and counting down until the designated number of students is identified. In this way, a policy maker could be assured that the available resources are targeted at the students in greatest need.

Second, true and false positive detections could be used to inform the targeting approach adopted. A true positive detection represents a case where a student is correctly predicted to not graduate on time, whereas a false positive detection represents a case where a student is predicted to not graduate on time when they actually do. If an initiative is directed to those students who are predicted to not graduate on time, the former (i.e., true positives) represents a case where resources are directed to a student who needed assistance, whereas the latter represents resources being spent on a student who did not need assistance in the first place.

The predictive models developed in this project could inform these trade-offs by guiding the selection of the threshold to be used to target students. Choosing the preferred trade-off between true and false positives ultimately represents a policy decision, since it essentially reflects whether the policy maker prefers to err on the side of making sure as many of those who may need the assistance receive it at the cost of also helping some who do not need it, or vice versa (i.e., avoiding using resources to help students who do not need it). Presumably, this choice will be guided at least in part by considerations of the costs of any initiative(s) and the associated expected benefits in terms of improved on-time high school graduation rates.


One avenue for future research could involve bringing additional information on students related to their academic engagement or other aspects of their schooling experiences and outcomes, to their situation outside of school, or to their families into the development of the predictive models to improve their accuracy. The PEN data are extremely rich and of remarkable depth and quality when placed not only in the Canadian context but even at the international level, and one source of additional information could include making more of the PEN data that exist available for the purposes of developing these predictive models.

At a broader level, bringing the PEN data into Statistics Canada’s Social Data Linkage Environment (SDLE), as has recently been done, may open up some extremely promising data opportunities for the development of predictive models of on-time high school graduation, and for other purposes.

Another future avenue of research could involve the design, implementation, and evaluation of student success initiatives aimed at improving on-time graduation.

A further area for new work could be to examine the relationships between the risk scores generated by the Grade 5 and Grade 8 models and other outcomes, including access to post-secondary education and students’ post-schooling labour market earnings now that the PEN data have been linked to tax data, among others.

Finally, predictive models of access to post-secondary education, students’ post-schooling labour market earnings, and possibly other student outcomes could be developed using the PEN data and methods similar to those employed here.

The PEN data represent a remarkable resource for improving our understanding of a range of schooling and post-schooling outcomes and for developing predictive models of a comparable range of outcomes for which the current project represents an excellent starting point.


1. Introduction

This report presents the development and assessment of predictive models of on-time high school graduation for students in the British Columbia (BC) school system based on a very rich longitudinal administrative data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the school system until they leave, as well as province-wide Foundation Skills Assessment (FSA) scores in reading, writing and numeracy administered in Grade 4 and Grade 7, linked by students’ Personal Education Numbers (PEN).

Previous research in this area has largely been based on identifying individual indicators associated with high school completion and other related outcomes using relatively simple analytical approaches. In more recent years, the area of predictive modelling has become much more sophisticated and has become more focused on generating individual risk scores (i.e., an estimate for each student that characterizes their individual risk of not graduating on time) based on a larger set of predictors using advanced algorithms (i.e., different predictive approaches). This report draws upon these recent advancements in predictive modelling and, combined with the richness of the PEN data, provides a unique approach to predicting which students are at risk of not graduating on time, all placed in the BC context and intended to be of practical use to policy makers.

This work builds on a previous analysis that uses descriptive and regression modelling approaches to investigate the relationships between on-time graduation and a range of student characteristics and province-wide Foundation Skills Assessment (FSA) scores in reading, writing, and numeracy administered in Grade 4 and Grade 7. That earlier work not only provides a detailed profile of on-time graduation, but also points to the various factors that are likely to represent good predictors of on-time graduation in the predictive models developed here.

Predictive models are developed for students at two points in time, Grade 5 and Grade 8, each based on the information available at that time. The choice of these years was based principally on the timing of the Grade 4 and Grade 7 FSAs, which represent strong early predictors of on-time graduation.

Furthermore, the Grade 5 model generates very early predictions of on-time graduation, which could potentially be used to guide commensurately early interventions aimed at improving student success for those students predicted to be at greater risk of not graduating on time at that point in their studies. In comparison, the Grade 8 model generates more accurate predictions of on-time graduation due to the additional information available (including the Grade 7 FSA scores) and, therefore, better targeting of interventions, but at a later point in time.

Both the Grade 5 and Grade 8 predictive models estimate the probability that a student will not graduate from high school “on time.” On-time graduation is defined as 1) graduating within six years of starting Grade 8, and 2) graduating within nine years of starting Grade 5. The Grade 8 measure represents the standard definition of on-time graduation used by the BC Ministry of Education, while the Grade 5 measure developed for this project provides earlier predictions of on-time graduation which are consistent with the standard Grade 8 measure.


The analysis involves the development and comparison of predictive models of on-time graduation based on a variety of approaches established in the predictive analytics literature, using information on a range of student characteristics, the Grade 4 and Grade 7 FSA scores, and school district. It then further tests the accuracy of the predictions of on-time graduation that would be expected for new cohorts of students entering Grade 5 and Grade 8 using one particular modelling approach selected for its accuracy and its relative ease of implementation, and finds that this expected accuracy is very good. The report then shows how the FSA scores contribute to the overall accuracy of the model.

This report should inform policy makers and others on a range of questions, including:

1. Which approach provides the most accurate predictive models of on-time high school graduation in BC based on the PEN data available, including the Grade 4 and Grade 7 FSA scores?
2. How well do the Grade 5 and Grade 8 predictive models perform and how does the accuracy of the two models compare?
3. To what extent do the Grade 4 and Grade 7 FSA scores improve the accuracy of the predictions of on-time graduation?

The report starts with a review of the related predictive modelling literature with a focus on previous work pertaining most closely to the on-time graduation models developed here. It then describes the PEN data, the predictor variables used in the analysis, the samples over which the models are developed, and the methodology employed. The results sections then present and discuss the findings for the Grade 8 and Grade 5 models. The discussion section concludes the paper by summarizing the main findings, outlining the policy purposes to which these predictive models could be put and how they would be implemented in practice, and discussing the limitations of this project as well as potential directions for further research.

2. Literature Review

In educational research, predictive modelling encompasses scholarship on educational data mining (Baker, 2011; Baker & Yacef, 2009) and academic and learning analytics (Arnold & Pistilli, 2012; Brooks & Thompson, 2017; Campbell, deBlois, & Oblinger, 2007; Gašević et al., 2016). The main intent of using predictive modelling to measure high school graduation is to identify groups of students at risk of non-completion. Predicting high school completion is different than understanding the range of factors associated with graduation (e.g., the focus of the first report). Rather, predictive modelling provides insight into how well information from a set of predictor variables (e.g., demographic, academic, or other indicators) can be used to explain a given outcome.

Often with the aim of providing early interventions for students at risk of not graduating high school before they even approach their senior years, predictive modelling can ascertain the degree to which a set of predictor variables can be applied to new data (Aguiar et al., 2015; Gleason & Dynarski, 2002). As predictive modelling aims to generate inferences regarding


uncertain outcomes like high school graduation, it is defined as “the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations” (Shmueli, 2010, p. 291).

With increasing use of machine-learning techniques and other advanced algorithms, predictive modelling is a rapidly expanding field. Many forms of dropout identification use simpler models that aim to find the “best” predictor variables that “accurately identifies students who will ultimately dropout of school” (Bowers, 2010, p. 12). Jurisdictional programs with the explicit aim of dropout identification exist, such as the Wisconsin Dropout Early Warning System (Knowles, 2015) and The Chicago On-Track system (Allensworth & Easton, 2005). Researchers also use similar approaches to estimate outcomes that are linked to high school graduation, such as test performance (Sullivan, Marr, & Hu, 2017).

Predictive modelling approaches to dropout identification also use all of the information from a set of predictor variables, rather than identifying the best indicators, and incorporate more computationally complex algorithms, from linear probability models (Adelman, Haimovich, Ham, & Vazquez, 2018) to random forest (Chung & Lee, 2019; Lakkaraju et al., 2015), logit post-LASSO (Sansone, 2018), and XgBoost (Hlosta, Zdrahal, & Zendulka, 2017) approaches. Although the intent of using these methods is to make better predictions, it is often necessary to balance the information gained from a more complex algorithm with a simpler approach that is easier to understand and implement at the school and district level.

Along with variation in approaches, there are also major differences in what information researchers use to predict high school graduation—differences that are often due to data availability. For example, a predictive model may have the explicit intention of only using course enrollment and completion information (Allensworth & Easton, 2005). Other studies include factors outside of school, such as coming from a single-parent household (Croninger & Lee, 2001; Pagani et al., 2008). For these reasons, predictive models measuring high school graduation vary widely across jurisdictions and data sources and no single model is applicable across all contexts.

Although there are major differences across studies, prior research does indicate there are key predictor variables that provide information that can predict not completing high school, such as failing key courses, attendance records, and socioeconomic status (Chung & Lee, 2019; Lakkaraju et al., 2015; Sansone, 2018). Academic achievement indicators—such as those measured through standardized tests or single or multiple course grades—are often one of the most influential indicators in a model predicting high school completion (Bowers, Sprott, & Taff, 2012; Gleason & Dynarski, 2002).

Prior research also demonstrates that certain predictor variables can correctly classify students who are likely to not complete high school; yet, these same predictor variables may also misidentify students who do go on to successfully complete the credential. For example, in Mahoney and Cairns’ (1997) early study, a predictor variable representing participation in one or no extracurricular activities identified 95% of students who did not complete high school. Yet, the same variable also captured 82% of students who did go on to graduate. Thus, while most non-completers do not participate in any or many extracurricular activities, many completers are


also not involved. While a predictor variable of extracurricular involvement can correctly identify students who will not graduate on time, it ultimately is a weak predictor of high school graduation for all students and thus would perform poorly in a predictive model.

Even with more advanced approaches and a greater number of predictor variables, researchers have not yet developed a model that can correctly classify all students who will not graduate. Predictive models of high school completion also include “false alarm” students (i.e., students who are incorrectly classified as not likely to graduate) (Bowers, Sprott, & Taff, 2012). A main aim of predicting high school graduation is finding which model has a high true-positive rate (i.e., correctly identifies at-risk students) and a low false-positive rate (i.e., minimizes the number of “false alarm” students).

As a key concern in predictive modelling is accuracy (e.g., how well it predicts high school completion), prior research also examines how predictions change at different grade levels (Aguiar et al., 2015). Although models using information from higher grade levels are typically more informative and can better predict who will not graduate, early predictions may still offer useful information that is necessary for early interventions (Balfanz, Herzog, & Mac Iver, 2007; Johnson et al., 2015). Further, measures associated with high school graduation are often longitudinal in nature. For example, marks are strongly related to experiences in prior years (e.g., low marks in Grade 9 will be associated with low marks in Grade 8 or earlier). Thus, some predictive studies model longitudinal trajectories through cumulative information or year-by-year change (Bowers & Sprott, 2012; Janosz, Archambault, Morizot & Pagani, 2008). Although often offering high true-positive and low false-positive rates, these longitudinal models may be difficult to replicate from year-to-year and for all students.

Although the strength of predictive modelling is its ability to identify students at higher risk of not graduating, it is important from the outset to highlight two aspects of predictive modelling and what it attempts to achieve. First, predictive models have varying levels of misidentification that result in either a failure to identify and help at-risk students or the targeting of resources towards those who would graduate on time without intervention (i.e., “false alarm” students) (Gleason & Dynarski, 2002). Second, predictive modelling is just the first step in establishing an early warning system, and it is necessary to generate customized prevention policy at the district and school level, where administrators and teachers are responsible for identifying at-risk students and implementing preventative measures (O'Cummings & Therriault, 2015). The end of the report will address these issues in detail.

3. Data

This study uses administrative data on students who attended primary or secondary schools in BC at any point during the 1991/1992 to 2016/2017 school years. These administrative records, provided by the BC Ministry of Education, capture information for all students in all grades (i.e., from kindergarten to Grade 12). The complete dataset is therefore longitudinal in nature and captures year-by-year enrollment information for each student, matched using the PEN.


3.1 Outcome Variables of Interest

Two binary outcome variables capture on-time graduation (i.e., 1=yes, 0=no): one for Grade 5 and the other for Grade 8.

The Grade 8 on-time completion measure is derived directly from the definition used by the BC Ministry of Education: graduating with a Certificate of Graduation (i.e., the “Dogwood Diploma”) within six years of beginning Grade 8. For example, if a student began Grade 8 in the 2009/2010 school year, they graduated “on time” by receiving a Dogwood Diploma by the 2014/2015 school year.1

Following the logic underlying the Grade 8 on-time graduation measure, for this project a Grade 5 outcome variable defines on-time graduation as graduating with a Dogwood Diploma within nine years of beginning Grade 5. For example, a student who began Grade 5 in the 2008/2009 school year graduated “on time” if they received a diploma by the 2016/2017 school year.

3.2 Predictor Variables

Student Characteristics and Related Variables

The PEN-based administrative dataset includes a range of predictor variables capturing a student’s personal and program characteristics, as well as geographic information that can be used to add other variables to the analysis.

The individual characteristics considered in the analysis are gender, self-reported Indigenous ancestry,2 a “special needs” designation in Grade 5 or 8, ESL during Grade 5 or Grade 8, ESL prior to Grade 5 or Grade 8, French immersion during Grade 5 or 8, enrollment in a “gifted” stream during Grade 5 or Grade 8, and if a student repeated a grade prior to Grade 5 or Grade 8.3

In terms of student geographical information, a student’s forward sortation area (i.e., the first three digits of their postal code) in Grade 5 and Grade 8 is used to construct an indicator

1 A limitation of the data is that they cannot differentiate between a student who does not complete a Dogwood Diploma and one who leaves the province entirely. For both on-time graduation measures, this limitation will result in the underestimation of on-time completion for the entire sample, and particularly for sub-groups that are more transitory.
2 In the administrative data, the Indigenous predictor variable captures whether a student is ever identified as Indigenous.
3 The analysis estimates grade repetition based on the number of years between the grade in which a student first appeared in the data and Grade 5 or Grade 8. For example, if a student is first observed in Grade 1 and takes more than four years to start Grade 5, they are flagged as a repeater. Similarly, if Grade 2 is the first grade in which they appear in the data, a student is flagged as a repeater if they take more than three years to start Grade 5.


representing area size (i.e., rural or urban area).4 Additionally, the same code is also matched to neighbourhood median family income as measured by the 2006 Census.5

The analysis also considers where a student went to school in Grade 5 and Grade 8 by including a set of binary predictor variables representing a student’s school district. In the data, 60 public school districts across BC are identified.

Another set of binary predictor variables represents cohort, defined as the year in which a student first started Grade 5 or Grade 8; these variables capture students who entered Grade 5 between 2000 and 2008 and students who entered Grade 8 between 2003 and 2011.

Foundation Skills Assessment (FSA) Scores

In the 1999/2000 school year, BC introduced the FSA in three domains: numeracy, reading, and writing. There are two assessment periods, one in Grade 4 and another in Grade 7, for all students enrolled in public schools or schools that receive provincial subsidies.6

The FSA scores processed by the Ministry are manipulated to create categorical predictor variables representing the level of achievement in each domain. These ordinal variables capture reading, writing, and numeracy percentage scores that fall into eight categories: 1-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, and 90+.7 Using these categorical predictor variables allows the relationships between the FSA scores and on-time graduation to be non-linear.
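As an illustration of this coding scheme, the short sketch below bins percentage scores into the eight categories using pandas; the column name fsa_reading_pct and the example values are hypothetical rather than taken from the PEN data.

```python
import pandas as pd

# Hypothetical FSA reading percentage scores; a missing value stands in for "no attempt".
df = pd.DataFrame({"fsa_reading_pct": [12, 35, 47, 58, 63, 71, 88, 92, None]})

# Bin the scores into the eight ordinal categories used in the analysis.
bins = [1, 30, 40, 50, 60, 70, 80, 90, 101]
labels = ["1-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+"]
df["fsa_reading_cat"] = pd.cut(df["fsa_reading_pct"], bins=bins, labels=labels, right=False)

# Missing scores remain NaN and can be coded as a separate "no attempt" category.
print(df)
```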

3.3 Sample Selection

Given the outcome variables of interest, the analysis is separated into two samples: Grade 5 and Grade 8 students. Each sample captures student-level information available in the year after the Grade 4 and Grade 7 FSAs.8 The data cleaning phase removed a small number of students whose age appeared to be extremely atypical relative to others at their grade level (e.g., over the age of 18 in

4 A postal code with a “0” in the second character is classified as a rural location by Canada Post. Rather than based on a strict population size, these postal locations are serviced by rural route postal drivers and/or outlets.
5 A limitation of this approximation is that some forward sortation areas can be fairly large and contain households with very different socio-economic characteristics.
6 All students are expected to participate; however, students with a language proficiency level or type of disability which would prevent their successful completion of the assessment are exempted and categorized as “no attempt” in this study.
7 The underlying reading and numeracy scores are continuous, whereas the writing scores are heavily clustered. Additionally, the writing scores do not evenly range from 1 to 100. For example, individuals who score in the 50-59 category have a mark of 50, 53, 57 or 58, while students in the 90+ category have a mark of 92 or 100.
8 A small number of individuals are missing grade level information but appear as enrolled in the data (n = 255 in the Grade 5 sample and n = 545 in the Grade 8 sample). For this small group, a student’s age and the timing of their FSAs serve as a proxy to estimate their grade level, a technique which is in accordance with the Ministry of Education.


Grade 8) (n = 38 in the Grade 5 sample and n = 49 in the Grade 8 sample) or were missing geographical-related information (n = 1,065 in the Grade 5 sample and n = 1,262 in the Grade 8 sample).

Sample restrictions also account for when the FSAs began in BC and the date range necessary to measure on-time completion (e.g., within nine years of starting Grade 5 and within six years of starting Grade 8). Because the FSAs began in the 1999/2000 school year, the analysis does not include students who would have taken the assessments prior to this period. With these exclusions, the Grade 5 samples cover the 1999/2000 to 2007/2008 school years and the Grade 8 samples cover the 2002/2003 to 2010/2011 school years.

4. Methodology

4.1 Predictive Model

Whether a student does not graduate on time represents a binary outcome; that is, only two outcomes are observed in the data: a student either does not graduate on time or graduates on time (as defined in Section 3.1). Determining which outcome is expected for a student is called a classification problem.9

A predictive model is a rule or formula that produces predictions for the classification problem. In other words, the model produces predictions on the outcomes of new observations (i.e., new students) using observable characteristics (e.g., gender, special needs, ESL, etc.). These observable characteristics are called features or predictor variables.

Predictive models could be simple mathematical formulas or more complex mathematical structures. There are several approaches that could be used to address classification problems in the machine learning literature, ranging from basic extensions of linear regression models, such as logistic regressions, to complicated approaches such as random forests and deep neural networks.

Regardless of its complexity, a predictive model is developed (i.e., its parameters and structures are determined) using historical data. The process of determining the exact parameters and the structure of a predictive model is called training, meaning that the model is trained to produce predictions based on the historical data.

9 Machine learning is a fairly new and emerging field and the terminology changes depending on the field of study in which it is used (statistics, computer science, business, etc.). Refer to the Glossary for the terminology used in this report.


4.2 Predictive Accuracy

The predictive accuracy of a model refers to the accuracy of the predictions when they are produced using new observations. Different metrics could be used to measure the predictive accuracy, which will be discussed in detail later in this section.

Since the accuracy of the predictions for new students cannot be known before observing their outcomes, the historical data is used to assess the predictive accuracy of different models and to build a predictive model that is expected to produce the most accurate predictions among all the candidate models.

However, there are basic concerns with predictive modelling regarding how well a predictive model explains the relationships between the predictors and the outcome, which is called fit. The most basic concern is underfitting: the inability of a model to capture the fundamental relationship between the predictors and the outcome of interest. Ultimately, flexible and rich predictive models are more likely to be able to explain complex relationships and to fit the historical data better.

This creates an incentive to employ richer and more sophisticated models to fit the historical data as well as possible, which may lead to another problem called overfitting. A predictive model that fits the historical data too well will not necessarily produce accurate predictions. This happens when an unnecessarily rich or complex model captures even the random patterns specific to the historical data, rather than the core relationship between the predictors and the outcome. Given the recent advances in modern machine learning and computation, which allow researchers to develop increasingly complex models, overfitting, rather than underfitting, is arguably the more serious concern.

The best method to objectively assess the predictive accuracy of a model and assess whether overfitting is a problem is to treat a segment of the historical data as new information. This is done by splitting the data into two sets, one of which is used to train the model and the rest to measure the accuracy of predictions of the trained model. This is called external validation, which is used extensively in predictive modelling. A special case of external validation and how it is used is explained in detail below.

4.3 Cross-Validation Method

A key aspect of predictive modelling is tuning. Some predictive models have additional parameters called tuning parameters and they determine the complexity, size, or flexibility of the model to be trained. The tuning parameters need to be extensively tested as these parameters cannot be inferred directly from the information on the outcomes or the predictors. External validation methods are used to choose these parameters based on their implications for the out-of-sample prediction accuracy. The most popular of these is called cross-validation (CV).

To find the right set of tuning parameters, this analysis uses a 5-fold CV approach. In this approach, the dataset is separated into five random, non-overlapping parts called folds (see Figure 1).10


Then, four of the folds are used as a training set to train the model with a given set of tuning parameters and the fifth fold is used as a validation set to estimate the performance of the model when it uses this specific set of tuning parameters. This is repeated five times (once for each alternative validation fold), and then the performance metric is averaged to create a value reflecting the predictive performance of the tuning parameters selected. This procedure is repeated for different sets of tuning parameters to identify the optimal tuning parameters. Then, these parameters are used on the entire data to train the final model.

Figure 1: 5-Fold Cross-Validation
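The fold-by-fold logic of Figure 1 can be sketched in a few lines. The snippet below is illustrative only: it uses simulated data in place of the PEN-based predictors, scikit-learn in place of the project's actual software, and a logistic regression as the model being evaluated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder data standing in for the predictors and the binary
# "did not graduate on time" outcome.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Stratified folds keep the proportion of positive outcomes roughly equal
# across folds, as described in footnote 10.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

aucs = []
for train_idx, valid_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # train on four folds
    scores = model.predict_proba(X[valid_idx])[:, 1]   # predict on the fifth fold
    aucs.append(roc_auc_score(y[valid_idx], scores))

print("Average AUC across the 5 folds:", np.mean(aucs))
```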

Another aspect of predictive modelling is feature selection. It is not immediately clear which subset of predictors should be used in the model. Using too many predictors may lead to overfitting. Models could be fitted with different subsets of predictors, one-by-one, and the set that provides better predictive performance, as measured through the CV process described above, could be selected. Alternatively, some machine learning approaches have built-in feature selection, offering automatic removal of unhelpful predictors with little additional computational cost.

Cross-validation could also be used to compare competing modelling approaches, which are explained in the next subsection, based on their predictive accuracy, which is an important objective of this report. If one or more modelling approaches require CV to tune model parameters or select features, it is best to nest another CV process inside each training set of the CV. This leads to a structure called nested CV (see Figure 2).

This procedure involves applying a 5-fold CV to the entire data to compare the predictive accuracy of the modelling approaches, called outer CV. Within each training set (or subsample)

10 Random separation is done in a way that ensures all parts have roughly the same proportion of students not graduating on time. The choice of the number of folds is not guided by any rule, but popular choices for the number of folds in a CV are 5 and 10 (e.g., see Sansone (2018) for a study on predicting high school dropout behaviour using a 5-fold CV).


of each iteration of the outer CV, an additional 5-fold CV is applied, called the inner CV, for model tuning or feature selection purposes. The optimal tuning parameters are identified using the inner CV. Then, these parameters are used to train the model (given a modelling approach) on the entire training set of the outer CV and estimate the predictive accuracy of the modelling approach using the validation set of the outer CV. Finally, the predictive accuracy estimates for each modelling approach are compared to select the best or optimal modelling approach.

Figure 2: Nested 5-Fold Cross-Validation
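A minimal sketch of the nested structure in Figure 2, again with simulated data and scikit-learn as assumptions: GridSearchCV supplies the inner 5-fold CV that selects the tuning parameter, while the outer loop estimates predictive accuracy on folds never used for tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Candidate values of the tuning parameter (here, the inverse penalty strength C).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

outer_aucs = []
for train_idx, valid_idx in outer_cv.split(X, y):
    # Inner CV: choose the tuning parameter using only the outer training set.
    search = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid, scoring="roc_auc", cv=inner_cv)
    search.fit(X[train_idx], y[train_idx])

    # Outer CV: evaluate the refit model on the held-out outer fold.
    scores = search.best_estimator_.predict_proba(X[valid_idx])[:, 1]
    outer_aucs.append(roc_auc_score(y[valid_idx], scores))

print("Nested-CV estimate of predictive accuracy (AUC):", np.mean(outer_aucs))
```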

The main purpose of using a nested CV structure is to avoid using the same data for both tuning models and estimating performance since it has been widely accepted in the literature that such a practice tends to lead to over-optimistic estimates of predictive accuracy. This is because model tuning procedures may also be influenced by idiosyncrasies in the training data and lead to overfitted models. The nested CV structure provides more objective estimates at the cost of additional computation time by using separate subsamples for tuning models and estimating performance.

The performance metrics produced from the outer CV iterations during the selection of the modelling approach may be optimistic estimates of the predictive accuracy of the final selected model. This is because the modelling approach used to produce the final model uses all the historical data to guide model selection as well as estimate the predictive accuracy. Just like in overfitting, it is possible that the determination of the final model might have been affected by idiosyncrasies in the historical data. Therefore, a portion (30%) of the historical data is set aside from the modelling approach selection stage and used for an unbiased assessment of the


predictive performance of the final model estimated using the selected modelling approach.11 We call this portion of the data the external validation set.

Figure 3: Nested 5-Fold CV and the External Validation Set
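Setting aside the external validation set can be sketched as a single stratified split performed before any tuning or approach selection; the seed and the use of scikit-learn are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

# Set aside 30% of the data before any tuning or approach selection.
# Stratifying on the outcome keeps the rate of not graduating on time
# similar in both partitions.
X_model, X_external, y_model, y_external = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# X_model / y_model feed the nested CV; X_external / y_external are used
# only once, to assess the final selected model.
```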

4.4 Modelling Approaches

There are many predictive modelling approaches that could be used for this analysis. Some, however, are overly complicated and do not have clear interpretations, while many others require immense computational resources for training (such as the deep neural networks used for image recognition). Five approaches were identified and investigated to address the classification problem at hand based on how widely used they are, their ease of use, their interpretability, and their computational requirements.

Logistic Regression with Linear Predictors (Baseline Logit)

The logistic regression is probably the most popular method to estimate the relationships between a binary dependent variable (such as not graduating on time) and predictors, and likely

11 This has high computational cost if it is done through a third layer of CV with multiple iterations in the outer-most layer, as the dataset is very large. For this reason, only one partition of the entire data is set aside as the external validation set (a random 30% of the entire data), and, therefore, there are no iterations on the outer-most stage of the predictive modelling process.


the easiest to estimate, understand, and employ. It models the probability of a positive outcome, which is determined as not graduating on time for this project, as:

$$\Pr(\text{Not Graduate on Time}) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots)}$$

where the Xs represent different observable characteristics. Logistic regressions do not have any tuning parameters. The training is to estimate the β parameters. Once these parameters are estimated, the probability of not graduating on time could be predicted using the above formula.

For the purposes of this project, a logistic regression model that includes all the predictors available in only linear form (i.e., no interactive terms or pairwise multiplications among predictors) is defined as the Baseline Logit approach. The first concern with this approach is overfitting. It is possible that the number of predictors included is unnecessarily large and fits the patterns of the specific training data rather than creating a model that will effectively predict not graduating on time for new students.
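For illustration, a minimal sketch of the Baseline Logit on simulated data: all predictors enter linearly, there is no penalty term, and the fitted model returns a risk score for each student. The library choice and variable names are assumptions, not the project's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for the PEN-based predictors and the binary outcome
# (1 = did not graduate on time).
X, y = make_classification(n_samples=5000, n_features=15, random_state=0)

# Baseline Logit: every predictor enters in linear form, with no regularization.
baseline_logit = LogisticRegression(penalty=None, max_iter=1000)
baseline_logit.fit(X, y)

# Risk scores: the predicted probability of not graduating on time for each student,
# i.e., the logistic formula above evaluated at the estimated beta parameters.
risk_scores = baseline_logit.predict_proba(X)[:, 1]
print(risk_scores[:5])
```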

The second concern with the Baseline Logit is underfitting. The Baseline Logit approach uses a relatively simple statistical model that may not be able to capture complex relationships in the historical data. There could, in fact, be nonlinearities in the relationship between not graduating on time and the predictors, or interactions between predictors that could help predict not graduating on time better. To allow for more model complexity and also to avoid underfitting, this report investigates various alternatives.

L1-Regularized Logistic Regression with Linear Predictors (L1-Baseline)

To address the overfitting possibility of the Baseline Logit, the analysis also uses an L1-regularized logistic regression model that includes only linear predictors. This model has a built-in feature selection process: it ensures that an optimal set of predictors is included in the regression using a technique that introduces a special penalty (called an L1-penalty) for each additional predictor used. This penalty estimates coefficients at zero if they are not improving predictions in a meaningful way, and essentially removes predictors that are not useful, which could potentially lead to an increase in predictive performance over the Baseline Logit.12 The L1-regularized regression model has one tuning parameter (the size of the penalty), which is selected within the inner CV. For this project, this special case (linear predictors only) of the L1-regularized logistic regression model is defined as the L1-Baseline approach.
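A sketch of the L1-Baseline idea on simulated data. The report's own implementation uses the glmnet package (footnote 12); scikit-learn is used here purely for illustration, with the penalty size chosen by 5-fold CV playing the role of the single tuning parameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=5000, n_features=30, n_informative=8, random_state=0)

# L1-penalized logistic regression with the penalty size selected by 5-fold CV
# over a grid of candidate values, maximizing the AUC.
l1_baseline = LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear",
                                   cv=5, scoring="roc_auc")
l1_baseline.fit(X, y)

# Coefficients estimated at exactly zero correspond to predictors that the
# penalty has effectively removed from the model.
n_dropped = int(np.sum(l1_baseline.coef_ == 0))
print(f"{n_dropped} of {X.shape[1]} predictors dropped by the L1 penalty")
```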

Interactive L1-Regularized Logistic Regression (L1-Interactive)

The interactive L1-regularized logistic regression represents an extension of the Baseline Logit and L1-Baseline approaches, applying the L1-regularized logistic regression to all predictors and their pairwise

12 This model is implemented via the glmnet software package (Hastie & Tibshirani, 2010).


multiplications, except for the district predictor variable.13 This model is richer than the Baseline Logit and L1-Baseline as it accounts for more complex relationships between predictors and the outcome. Through its built-in penalty term, the resulting model retains only the interactions that the model indicates improve the accuracy of predictions.

In this project, this L1-regularized logistic model with all the pairwise multiplications is defined as the L1-Interactive approach. This approach, before model training, involves an initial data processing procedure, whereby highly correlated and near-zero variance predictors are eliminated.14

While the L1-Baseline addresses the overfitting concern of the Baseline Logit, the L1-Interactive addresses the underfitting of the Baseline Logit.
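The pairwise multiplications of the L1-Interactive approach can be generated mechanically, as in the sketch below. The near-zero-variance filter threshold is an assumed value, the high-correlation filter is omitted for brevity, and the district indicators are simply excluded from the interaction step, as described above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=5000, n_features=15, random_state=0)

# interaction_only=True adds every pairwise product of the predictors (no squared
# terms); the L1 penalty then retains only the interactions that help predictions.
l1_interactive = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    VarianceThreshold(threshold=1e-4),  # drop near-zero-variance columns (assumed cutoff)
    LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear", cv=5, scoring="roc_auc"),
)
l1_interactive.fit(X, y)
```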

Random Forest (RF)

Another popular method in the machine learning literature is to use classification trees for predictions. To understand the idea of a tree, consider the following hypothetical example.

Suppose there are two predictors available to predict outcomes for students, gender and ESL, and that there are two possible values for gender (male and female) and for ESL (yes and no) in the data. This makes four different observable types of students. If a large enough sample is available, it would be possible to find enough students representing each student type. For example, if 60% of male ESL students do not graduate on time, the best estimate of the probability that a new incoming male ESL student will not graduate on time is 60%.

The problem is that finding all types of students and measuring the proportion that do not graduate on time is impossible, since the number of predictors included in this analysis would yield billions of different types of students. Decision trees are used to approximate this logic using a feasible method. A very simple tree is presented in Figure 4.

13 The memory requirement and computation time increase significantly when multiplicative terms with the district predictor variable are included in the model. Therefore, only pairwise multiplications among other predictor variables are included in the model.
14 These problematic predictors occur due to the pairwise multiplicative predictors. Such predictors would make the estimation of logistic regression models impossible.


Figure 4: Decision Tree Example

Instead of dealing with all possible splits in the data, a smaller selection is made to divide the data into groups. This selection is done via an algorithm so that each time the tree splits into two branches, the optimal predictor is selected to maximize predictive power. In practice, the trees are usually much bigger than shown in Figure 4. However, once the tree is completed, regardless of its complexity or size, predictions could easily be produced merely by asking a series of yes or no questions. In Figure 4, if a prediction is needed for a female student who is in the ESL stream, she falls into the rightmost node, and the proportion that did not graduate on time in that node (which is based on historical data) is her predicted probability of not graduating on time.

Classification trees are easy to understand, implement, and interpret, but the literature shows that they tend to underperform. Several methods were developed to address this issue, the two most popular of which are random forests and gradient boosted trees.

A random forest (RF) model involves selecting random subsets of predictors multiple times to develop several trees from the same training data (Breiman, 2001). Each tree in the RF will produce a different prediction for a new student. Then these predictions are aggregated to produce a final prediction for that student.

The challenge with an RF approach is that it is more difficult to tune compared to the L1-Baseline or L1-Interactive, as it has more than one tuning parameter. In fact, the space of all possible values for the tuning parameters is very large, and the computation time for a given set of tuning parameters is also a consideration given the large sample sizes. Therefore, in the analysis below, all of the tuning parameters, except for one, were set to their default values (set by the software).

Another related issue is interpretability. Unlike a simple decision tree, this approach uses hundreds or potentially even thousands of trees to produce predictions.
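A sketch of the RF strategy described above on simulated data: one tuning parameter (here, the number of predictors considered at each split, an assumed choice) is searched over while the remaining parameters stay at their software defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Tune a single parameter (max_features) and leave the rest at their defaults.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": ["sqrt", 0.3, 0.5]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# Each tree votes; averaging the trees' predictions yields a risk score per student.
risk_scores = search.best_estimator_.predict_proba(X)[:, 1]
```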


Extreme Gradient Boosting (XgBoost)

Gradient boosted trees involve building several simple trees successively where each successive tree aims to improve its predecessor’s performance (Friedman, 2001). A popular extended version of this method is called extreme gradient boosting (XgBoost) (Chen & Guestrin, 2016). This modelling approach represents one of the more recent and popular ones in the literature.

The challenges in terms of tuning and interpretability described above for the RF also exist for the XgBoost. The specific set of tuning parameters to search over is determined by first starting with the default parameter set (fixed default values for each type of parameter) and recording the predictive accuracy of the final model. Then, the parameter set is extended slightly to try two or more values for certain sets of parameters. The predictive accuracy resulting from this second parameter search is then compared with the one from the first search to guide the direction of the search for each type of tuning parameter. This process is iterated multiple times.15
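The widen-and-compare search described above can be sketched as follows, using the xgboost Python package and simulated data; the parameter grids shown are illustrative placeholders rather than the values actually searched in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: record the CV accuracy of the default parameter set.
default_auc = cross_val_score(XGBClassifier(), X, y, scoring="roc_auc", cv=cv).mean()

# Step 2: extend the grid slightly around the defaults and compare; this
# widen-and-compare step is repeated for each type of tuning parameter.
search = GridSearchCV(
    XGBClassifier(),
    param_grid={"max_depth": [4, 6, 8], "learning_rate": [0.1, 0.3]},
    scoring="roc_auc", cv=cv)
search.fit(X, y)

print(default_auc, search.best_score_, search.best_params_)
```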

4.5 Evaluation Methodology

Measuring Predictive Accuracy

The predictive models yield a predicted probability of the positive outcome (not graduating on time for this project) ranging between 0 and 1 for each observation, with the predicted probability being closer to 1 meaning the student is more likely to not graduate on time.16 In this project, these predicted probabilities are referred to as risk scores. These risk scores provide a ranking of students in terms of their risk level of not graduating on time.

The risk scores could be used to assign students to either of the two classes: not graduate on time or graduate on time. To do this, one must use a probability threshold to predict a binary outcome (i.e., not graduate on time or graduate on time) for each student. Given a risk score threshold, all students with scores above the threshold are predicted to not graduate on time, and those with scores equal to or below the threshold are predicted to graduate on time. Having predicted the outcome for each student, one could now compare the predicted binary outcomes with the actual outcomes to determine how well the predictions and the actual outcomes match. When dealing with two classes, there are four different possible scenarios for each prediction, which are summarized in the confusion matrix below in Figure 5.

Figure 5: Scenarios when Comparing Actual vs. Predicted Outcomes

                                          Actual Outcome
  Predicted Outcome          Not Graduate on Time    Graduate on Time
  Not Graduate on Time       TRUE POSITIVE           FALSE POSITIVE
  Graduate on Time           FALSE NEGATIVE          TRUE NEGATIVE

15 The parameter sets over which the search is conducted are available upon request.
16 One could also multiply these values by 100 and think in terms of percentages.



It is possible to produce various performance metrics by comparing the predictions produced for the validation data with the actual outcomes observed in the validation data. The key metrics used in this report are the following:

• True positive rate (TPR) represents the proportion of true positive predictions among students who did not graduate on time. For instance, if the TPR is 80%, it means that out of all students who did not graduate on time, the model identified 80% correctly. Since the TPR is a measure of the ability to predict not graduating on time, a high TPR is desirable.

• False positive rate (FPR) represents the proportion of false positive predictions among students who graduated on time. A FPR of 20% means that, out of all students who graduated on time, 20% were incorrectly predicted to not graduate on time. A low FPR is desirable.17

• Precision represents the proportion of true positive predictions among students who are predicted to not graduate on time. A 40% precision means that, out of all students who were predicted to not graduate on time, 40% were correctly predicted.

• Precision at the top 10% (P@10) represents the proportion of true positive predictions among students with risk scores in the top decile. A 70% P@10 means that, out of all the students in the top decile of the risk score distribution, 70% were correctly predicted as not graduating on time. This represents an established measure which focuses on predictive accuracy for those with the predicted probability of not graduating on time within the top decile, and provides a more intuitive predictive accuracy measure compared to average AUCs and ROC curves (see following).
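A minimal sketch of how these four metrics can be computed from risk scores and actual outcomes; the data are simulated and the 0.21 threshold simply echoes the Grade 8 example used later in the report.

```python
import numpy as np

def accuracy_metrics(y_true, risk_scores, threshold):
    """TPR, FPR, and precision at a given threshold, plus threshold-free P@10."""
    y_true = np.asarray(y_true)            # 1 = did not graduate on time
    risk_scores = np.asarray(risk_scores)
    y_pred = (risk_scores > threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    tpr = tp / (tp + fn)          # share of actual non-graduates correctly flagged
    fpr = fp / (fp + tn)          # share of graduates wrongly flagged
    precision = tp / (tp + fp)    # share of flagged students who truly do not graduate

    # P@10: among the 10% of students with the highest risk scores, the share
    # who actually did not graduate on time.
    top_decile = risk_scores >= np.quantile(risk_scores, 0.90)
    p_at_10 = y_true[top_decile].mean()

    return tpr, fpr, precision, p_at_10

# Hypothetical illustration with a 0.21 threshold and simulated outcomes.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
risk_scores = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, size=1000), 0, 1)
print(accuracy_metrics(y_true, risk_scores, threshold=0.21))
```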

The Receiver Operating Characteristic (ROC) curve offers a way to graph the TPR and FPRs. Given a probability threshold, the two predicted outcomes combined with the actual outcomes make up the confusion matrix above and determine the TPR, FPR and precision. If a different threshold is selected, different predicted outcomes are produced, leading to different TPR, FPR and precision. Precision at the top 10% does not change with threshold.

A lower threshold yields a higher TPR, but also a higher FPR. The ROC curve embodies this trade-off: it plots the FPR against the TPR, where each point on the curve represents a different threshold and the associated TPR and FPR values. Figure 6 contains two hypothetical examples of ROC curves: a higher curve A and a lower curve B.

17 There are also two related measures which contain the same information as the ones listed here. The false negative rate (= 100 – TPR) is the proportion of false negatives among students who did not graduate on time. The true negative rate (= 100 – FPR) is the proportion of true negatives among students who graduated on time.


Starting with curve A, the two points on the curve represent two different probability threshold values. When a threshold of 0.5 is used, the TPR is 68% (vertical (y) axis) and the FPR is 4% (horizontal (x) axis). When the threshold is decreased to 0.15, the TPR increases to 80% at the cost of the FPR increasing to 10%.

In Figure 6, curve A represents a model with much better predictive accuracy compared to curve B. Curve A has much higher TPRs for the same levels of FPR. The top left corner of the graph represents a point where the TPR is 100% and the FPR is 0%, an ideal scenario. Therefore, models with ROC curves which are closer to the top-left corner are considered to have higher prediction accuracy. ROC curves, too, provide a threshold-independent comparison of different models based on predictive accuracy.

This is done visually, as well as by measuring the Area Under the Curve (AUC), the size of which could be used to assess the ability of a model to accurately predict outcomes. As mentioned above, ROC curves closer to the top left corner are considered better so larger AUC values are associated with better predictive performance. Since AUC is just the area under the ROC curve, it does not change with threshold.
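A sketch of tracing out an ROC curve and its AUC from risk scores with scikit-learn, using simulated data; each point on the returned curve corresponds to one probability threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # 1 = did not graduate on time
risk_scores = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, size=1000), 0, 1)

# Each point on the ROC curve corresponds to one probability threshold.
fpr, tpr, thresholds = roc_curve(y_true, risk_scores)
auc = roc_auc_score(y_true, risk_scores)

# Lowering the threshold moves up the curve: higher TPR at the cost of higher FPR.
for t in (0.5, 0.15):
    idx = np.argmin(np.abs(thresholds - t))
    print(f"threshold near {t}: TPR={tpr[idx]:.2f}, FPR={fpr[idx]:.2f}")
print("AUC =", round(auc, 3))
```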

The TPR, FPR, and P@10 are selected as the main predictive accuracy metrics of interest, as they are more informative and easier to interpret than the more traditional measures or plots, such as the AUC or ROC curve. The latter, along with the P@10, nevertheless provide a simple and straightforward comparison of the performance of different models and guide the choice of tuning parameters, as explained in the next subsection.

Figure 6: Example ROC Curves

Selection of the Modelling Approach

Each of the modelling approaches explained above, except for the L1-Baseline and L1-Interactive, requires different model training, and while some require model tuning or feature selection and data processing steps, others do not. The approaches that require tuning or have built-in feature selection complete these steps using the inner 5-fold CV (see Figure 2).

Since the AUC, ROC curves, and P@10 are threshold-independent, they are used extensively to compare the predictive performance of different modelling approaches and to guide the selection of the best, or optimal, modelling approach. The selection of tuning parameters within the inner CV, which is done for some of the modelling approaches (explained above), is guided by comparing the average AUC values from the five inner CV iterations.

The outer 5-fold CV is used to train the models on the entire training set of the outer CV given the tuned parameters from the inner CV, and their predictive accuracy is estimated using the validation set of the outer CV. The AUC values, ROC curves, and P@10 are calculated and stored for each approach and, one by one, the 5 parts in the outer CV are used as validation sets, creating 5 AUC values, ROC curves, and P@10 values in total. Then, to choose the best modelling approach, the predictive accuracy of the approaches is compared using the average AUC values, ROC curves, and average P@10 generated over the 5 outer CV iterations.
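The following sketch illustrates the nested 5-fold CV logic described above for a single modelling approach, assuming the predictors and outcomes are held in NumPy arrays `X` and `y` and using a scikit-learn style estimator; the estimator, parameter grid, and variable names are assumptions for illustration, not the project's actual specification.

```python
# Sketch of nested 5-fold CV: tuning parameters are chosen on the inner CV
# (by average AUC) and predictive accuracy is estimated on the outer folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

outer_aucs = []
for train_idx, valid_idx in outer.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_va, y_va = X[valid_idx], y[valid_idx]

    # Inner CV: pick tuning parameters by average AUC across the five inner folds.
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.01, 0.1, 1, 10]},   # illustrative grid
                          scoring="roc_auc", cv=inner)
    search.fit(X_tr, y_tr)

    # Outer CV: refit on the full outer training fold, score on the validation fold.
    risk = search.best_estimator_.predict_proba(X_va)[:, 1]
    outer_aucs.append(roc_auc_score(y_va, risk))

print("average AUC over the 5 outer folds:", np.mean(outer_aucs))
```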

Final Predictive Model and Its Predictive Accuracy

Having selected a modelling approach, this approach is used to train the model on the entire data reserved for the outer CV (i.e., the combination of training and validation sets of the outer CV) to build the final predictive model that could be used to produce student-level predictions for the new cohorts.

As mentioned earlier, the estimate of the predictive accuracy of the selected modelling approach generated using the outer CV may be optimistic, as it is estimated on the dataset that is used for selecting a modelling approach. Therefore, the external validation set (a random 30% of the entire data) that is isolated from the process of modelling approach selection is used to estimate the predictive accuracy of the final predictive model. The predictive accuracy of the final predictive model is assessed by TPR, FPR, and P@10.
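A simplified sketch of this final step, under the same assumptions as above (a 70/30 split and a logistic regression standing in for the selected approach), might look as follows; it is illustrative rather than the project's code.

```python
# Sketch: re-train the selected approach on all data reserved for the outer CV
# (70% of the sample) and measure accuracy once on the external validation set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_cv, X_ext, y_cv, y_ext = train_test_split(X, y, test_size=0.30,
                                            stratify=y, random_state=0)

final_model = LogisticRegression(max_iter=1000).fit(X_cv, y_cv)   # selected approach
risk_ext = final_model.predict_proba(X_ext)[:, 1]                 # risk scores on held-out data
print("external-validation AUC:", roc_auc_score(y_ext, risk_ext))
```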

5. Results for Grade 8

This section first examines the predictive accuracy of the five modelling approaches, the Baseline Logit, L1-Baseline, L1-Interactive, RF, and XgBoost, using the sample of Grade 8 students. It compares the average AUC, distribution of AUC values from the 5-fold CV, ROC curves, average P@10 of risk score distributions, and distribution of P@10 values for all approaches. The section continues with the validation of a selected approach on data that are not used in the training of the models (i.e., the external validation set). It then follows with the presentation of risk scores (i.e., the predicted probability of not graduating on time at the individual student level), including a comparison of the predicted probabilities of not graduating on time with actual outcomes. It subsequently describes the selection of the predicted probability threshold for evaluating predictive accuracy and potentially determining which students to target with student success initiatives. The section concludes with the examination of the importance of the FSA scores for predictive accuracy.


5.1 Comparison of the Modelling Approaches

Average AUC

Figure 7 presents the average AUCs for the five modelling approaches. The average AUC is similar across all approaches, ranging from 0.769 to 0.785, but the XgBoost and L1-Interactive approaches yield slightly higher AUCs than the others, while the RF approach performs below the others.

Figure 7: Average AUC by Approach, Grade 8

The use of average AUC is a simple and clear criterion, but to provide a broader perspective of accuracy, the distribution of AUC values from the CV, ROC curves, as well as P@10 of the risk score distributions are compared across the approaches.

Figure 8 shows the distribution of AUC values from the CV and demonstrates that they are similar across the different iterations of the 5-fold CV (i.e., based on the testing sets from the five different partitions of the entire training data), with the horizontal lines representing the average AUC values given in Figure 7 (i.e., the XgBoost is again seen to perform slightly better than the simpler approaches, with RF lagging behind). Overall, the XgBoost and the RF approaches exhibit slightly more variation in AUC values compared to the other approaches.

Figure 8: Distribution of AUCs by Approach, Grade 8

ROC Curves

Figure 9 presents the ROC curves for the different approaches to compare predictive accuracy at a more detailed level than the summary values provided by the AUCs. As mentioned earlier, higher ROC curves represent greater predictive accuracy, and points on the curve also allow for the comparison of the TPR for a given FPR value. In particular, there is a trade-off between the TPR and FPR: increasing one requires accepting an increase in the other. The shape of the ROC curves represents this trade-off. For example, for an FPR value of 0.3, all the approaches yield a TPR of approximately 0.72, except the RF approach, which shows a TPR of 0.70. The difference in TPR values between the RF and the other approaches changes for different FPR values.

Figure 9: ROC Curves by Approach, Grade 8


The ROC curves for all the approaches, except the RF, largely overlap, which is consistent with the very close AUC values seen in Figure 7 and also indicates very similar TPR values for all given FPR values. This further points to the similar predictive accuracy of the approaches. The one exception is the ROC curve for the RF approach, which lies slightly below the other approaches for almost all possible FPR values, yielding the lowest AUC.

Average Precision at the Top 10% (P@10)

Figure 10 shows the P@10 values for the top decile of the risk score (i.e., predicted probability of not graduating on time) distribution for each approach. This measure represents the predictive accuracy of a model focusing on the proportion of true positive predictions among students with risk scores in the top decile, as discussed in more detail above.

The XgBoost approach has the highest precision value, while the Baseline and L1-Baseline approaches have the lowest. The differences across the approaches are slightly larger compared to the average AUC values. The XgBoost approach outperforms the other approaches by 0.6 to 1.5 percentage points, although these differences are not very large.

For example, suppose there are 1,000 students in the top decile of the risk score distribution. Out of the 1,000, the XgBoost approach would correctly predict 717 to not graduate on time, whereas the Baseline Logit approach would correctly predict 702 students. Therefore, the Baseline Logit approach would fail to predict 15 students who would in fact not graduate on time. This has particular implications for targeting student success initiatives on those students who need them the most.


The distribution of precision values shows a little more variation across the CV iterations (Figure 11) compared to the AUCs (Figure 8), but there are no major outliers that could bias the average P@10 values shown in Figure 10.

A Selected Modelling Approach

Overall, the findings show that the XgBoost approach outperforms the other approaches in terms of average AUC and P@10, and also performs well based on the ROC curves and the distribution of AUC and P@10 values. The XgBoost approach is, however, also one of the more complex approaches considered in this analysis.

The complexity of a model could be an important consideration, because greater complexity could have disadvantages in terms of interpretability and ease of using the model in practice. Therefore, if the improvement in predictive accuracy is relatively small when a more complex approach is used, it may be better to use a simpler approach.

In the present context, the XgBoost approach improves the predictive accuracy only slightly compared to the Baseline Logit approach, which is the simplest approach among all five approaches considered. The Baseline Logit approach is, therefore, selected to illustrate the predictive accuracy that would be expected with new data, because not only is its predictive accuracy comparable to the XgBoost approach (only a 0.3 and 1.5 percentage point difference in the average AUCs and P@10s, respectively), but it would also be considerably easier to implement than the XgBoost approach.

Thus, the remaining results for Grade 8 students use the Baseline Logit approach to illustrate the various predictive accuracy measures on the external validation set and individual risk scores (i.e., predicted probabilities of not graduating on time).

Figure 10: Average P@10 by Approach, Grade 8


Figure 11: Distribution of P@10 by Approach, Grade 8


5.2 External Validation of the Selected Approach

After selecting an approach to build a predictive model using the training data (i.e., 70% of the entire data as outlined in Section 4.3), the model is then trained (or estimated) on the entire training data, and its AUC, ROC curve, and P@10 are calculated using the external validation set (i.e., the remaining 30%) that was set aside and isolated from the initial stage of approach selection. Comparing the AUC and P@10 measures, as well as the ROC curves, from the CV produced during the modelling approach selection stage with those from the external validation set demonstrates how well the predictive model generalizes to new, unseen data.

Figure 12 and Figure 13 show that the AUC and P@10 measures and the ROC curves, respectively, are very close for the CV and the external validation set. Therefore, it is expected that the predictive model estimated using the selected Baseline Logit approach will produce similar results when applied to new data. There may be discrepancies, however, especially if there are significant structural differences in new data in terms of the relationships between observable characteristics and the outcome variable, as well as differences in the way the predictors are coded in future data.

From this point on, the accuracy measures are calculated using the external validation set to better reflect the expected predictive accuracy on new data.


Figure 12: AUC and P@10 for CV and External Validation Set, Grade 8

Figure 13: ROC Curves for CV and External Validation Set, Grade 8


5.3 Risk Scores: Predicted Probability of Not Graduating on Time

As explained in Section 4.5, given the parameter estimates of the predictive model estimated on the entire training data (i.e., 70% of the sample), each student in the external validation set (i.e., the remaining 30%) is assigned a predicted probability of not graduating on time ranging between 0 and 1, which is referred to as a risk score in this report. These risk scores provide a ranking of students, with higher values signalling that a student has a higher expected probability of not graduating on time.

Figure 14 shows the distribution of students across risk scores, which range from 0.01 to 0.99. The peak of the distribution is between 0.05 and 0.10, and the proportion of students declines steadily after that.

Figure 15 shows the cumulative distribution of students across risk scores, with 50% of the students having risk scores of 0.16 or below and 75% having risk scores of 0.30 or below.

Figure 14: Distribution of Risk Scores, Grade 8


Figure 15: Cumulative Distribution of Risk Scores, Grade 8

The performance of the predictive model in terms of how well it ranks students could be examined by comparing these risk scores with actual outcomes for students at each risk score level.


Figure 16 shows the risk scores on the x-axis and the corresponding actual proportion of students who did not graduate on time on the y-axis.18 The actual proportion of students who did not graduate on time not only increases consistently with the risk score level, but there is also a close match between the actual proportion of students and the risk scores. In other words, the Baseline Logit approach generally ranks students correctly and generates predicted probabilities of not graduating on time that are very close to the actual rates of not graduating on time.
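A minimal sketch of how an empirical risk curve of this kind could be constructed is shown below, assuming the `y` and `score` vectors used in the earlier sketches (actual outcomes and risk scores for the external validation set); the bin width is an arbitrary choice made only for illustration.

```python
# Sketch of an empirical risk curve: group students by risk score and compare
# each group's actual rate of not graduating on time with the risk score level.
import numpy as np

def empirical_risk_curve(y, score, bin_width=0.05):
    y, score = np.asarray(y), np.asarray(score)
    bins = np.arange(0, 1 + bin_width, bin_width)
    centres, actual_rates = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (score >= lo) & (score < hi)
        if in_bin.sum() > 0:
            centres.append((lo + hi) / 2)
            actual_rates.append(y[in_bin].mean())  # actual rate of not graduating on time
    return np.array(centres), np.array(actual_rates)
```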

These risk scores could be used to target student success initiatives on students at higher risk levels. In particular, the cumulative distribution shows the number of students with a certain risk score or below (e.g., a pre-determined risk score threshold) or the risk score level that corresponds to a pre-determined number of students with higher risk levels to be targeted.

Figure 16: Empirical Risk Curve, Grade 8

5.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach

An alternative approach to deciding which students will be targeted with initiatives is to choose a risk level threshold which determines which students are expected to graduate on time or not according to the predictive model. These binary predictions (i.e., expected to graduate on time or not), produced given any risk score threshold, could be used to compute predictive accuracy measures such as the TPR, FPR, and precision (not just P@10).

18 Aguiar et al. (2015) call the plot shown in Figure 16 the empirical risk curve.

As discussed in Section 4, each threshold on the ROC curve (Figure 13) represents a different pair of TPR and FPR values. Therefore, in order to discuss the performance of a predictive model in terms of the TPR and FPR values (as well as precision), a probability threshold (or risk score threshold) needs to be chosen.

One method of selecting a threshold is to choose the one that is closest to the top-left corner of the ROC curve, which represents an ideal point where there is 100% TPR and 0% FPR (see the left panel of Figure 13). Measuring the closest value to the top-left corner requires calculating the Euclidean distance of each point on the ROC curve to the top-left corner, using the external validation set (i.e., Figure 13). The threshold associated with the minimum distance to the top-left corner of the ROC curve could be used as the threshold to compute the TPR, FPR, and precision. The top-left corner threshold (i.e., 0.21) is marked with a dot in Figure 13 and a vertical dashed line in Figure 14. Of course, one could always choose a lower threshold, which would lead to a higher TPR, but also a higher FPR.

In contrast, the threshold could also be set by focusing on the students most at risk of not graduating on time. The dotted vertical line in Figure 14 marks a risk score threshold of 0.53, which corresponds to the top 10th percentile; that is, students with risk scores that lie to the right of this dotted line are in the top decile of the risk score distribution and thus are the most likely to not graduate on time.
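The two threshold rules just described could be implemented along the following lines; this is a hedged sketch using assumed variable names (`y_ext` and `risk_ext` for the external validation outcomes and risk scores), not the project's code.

```python
# Sketch of the two threshold rules: (1) the threshold closest to the top-left
# corner of the ROC curve (minimum Euclidean distance to TPR = 1, FPR = 0),
# and (2) the risk score marking the top decile of the risk score distribution.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_ext, risk_ext)
dist = np.sqrt(fpr**2 + (1 - tpr)**2)              # distance of each point to (0, 1)
top_left_threshold = thresholds[np.argmin(dist)]   # reported as about 0.21 for Grade 8

top_decile_threshold = np.quantile(risk_ext, 0.9)  # reported as about 0.53 for Grade 8
```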

Given a risk score or predicted probability threshold, the TPR, FPR, and precision values could now be calculated to evaluate the predictive accuracy of the selected approach on the external validation set. These are calculated using the number of true positive, false negative, false positive, and true negative predictions, which are shown in the confusion matrix in Figure 17, given the threshold of 0.21 selected using the top-left corner method explained above.

There are 109,742 students in the external validation set, with 17,377 and 60,904 of them being correctly predicted to not graduate on time (true positives) and to graduate on time (true negatives), respectively. Of all the students, 7,616 are predicted to graduate on time when they actually did not graduate on time (false negatives), and 23,845 are predicted to not graduate on time when they actually did graduate on time (false positives).

Figure 17: Confusion Matrix using a 0.21 Predicted Probability Threshold, Grade 8

                                          Actual Outcome
Predicted Outcome             Not Graduate on Time    Graduate on Time
Not Graduate on Time                    17,377              23,845
Graduate on Time                         7,616              60,904

The TPR is the proportion of students who actually did not graduate on time that are predicted correctly as not graduating on time, which is 0.70 = 17,377/(17,377 + 7,616) for the threshold 0.21. The FPR is the proportion of students who actually graduated on time that are predicted


incorrectly as not graduating on time, which is 0.28 = 23,845/(23,845+60,904). The precision is the proportion of students who are predicted as not graduating on time that actually did not graduate on time, which is 0.42 = 17,377/(17,377+23,845).
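These rates can be verified directly from the counts reported in Figure 17, for example:

```python
# Check the Grade 8 rates reported above from the Figure 17 counts.
tp, fp, fn, tn = 17_377, 23_845, 7_616, 60_904
print(round(tp / (tp + fn), 2))   # TPR       = 0.70
print(round(fp / (fp + tn), 2))   # FPR       = 0.28
print(round(tp / (tp + fp), 2))   # precision = 0.42
```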

TPR and FPR change with each probability threshold, with lower thresholds yielding higher TPRs and FPRs. In cases where a high TPR is desired, a lower threshold may be preferable. Conversely, a higher threshold would decrease the FPR, but also the TPR, which could be preferable when false positive detections are undesirable.

The precision also depends on the threshold choice, with higher thresholds generally resulting in higher precision values. In Section 5.1, a special case of the precision measure was shown, which is the average P@10.19

Figure 18 shows how the TPR, FPR, and precision change with the predicted probability threshold selected. It helps to illustrate the trade-off between TPR and FPR when the threshold is increased or decreased, and to make a more informed threshold decision by aiming for certain TPR and FPR levels. Similarly, if the focus is more on achieving a certain precision level from the predictive model, one could examine the right-most panel for how precision varies with each probability threshold. For example, for the threshold value of 0.21, the precision is 0.42. From a policy perspective, however, setting a threshold of 0.21 may be too costly, as it would flag close to 40% of all students as not graduating on time, and almost 60% of those flagged would in fact graduate on time.

One could set a higher threshold or focus on a certain portion of the risk score distribution to provide resources to students who may be at the greatest risk of not graduating on time. For example, a school may consider setting initiatives for students with risk scores in the top decile, which corresponds to the portion of the risk score distribution that is to the right of the dotted vertical line (threshold = 0.53). Among these students in the top decile, 70% are correctly predicted to not graduate on time.

Figure 18: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 8

19 The top 10th percentile of the risk score distribution corresponds to a risk score of 0.53.


5.5 Importance of the FSA Scores for Predictive Accuracy

This section examines how FSA scores affect the predictive accuracy of the Baseline Logit approach in terms of AUC, P@10, ROC curves, TPR, and FPR. To do this, the predictive model is re-trained on the entire training set (i.e., 70% of the entire data) without including the Grade 4 and 7 FSA scores in the model, and the various predictive accuracy measures are calculated for the model without the assessment scores.20
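As an illustration of this exercise, one could drop the FSA score columns, re-train the selected model, and re-score the external validation set roughly as follows; `fsa_columns` and the DataFrame-based setup are assumptions made only for illustration.

```python
# Sketch: re-train the selected model without the FSA score columns and
# compare accuracy on the external validation set. Assumes the predictors
# are held in pandas DataFrames and `fsa_columns` lists the FSA variables.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_cv_nofsa = X_cv.drop(columns=fsa_columns)
X_ext_nofsa = X_ext.drop(columns=fsa_columns)

model_nofsa = LogisticRegression(max_iter=1000).fit(X_cv_nofsa, y_cv)
auc_nofsa = roc_auc_score(y_ext, model_nofsa.predict_proba(X_ext_nofsa)[:, 1])
print("AUC without the FSA scores:", auc_nofsa)
```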

Figure 19 shows the AUC and P@10 values for the models with and without the FSA scores. The model including the FSA scores outperforms the one without the FSA scores by 4.1 and 6.7 percentage points in terms of AUC and P@10, respectively.

Figure 19: AUC and P@10 with and without the FSA Scores, Grade 8

20 The predictive accuracy measures computed in this section use the external validation set, rather than the CV sets.


As Figure 20 illustrates, the ROC curve for the model with the FSA scores also lies above the one for the model without the FSA scores for each threshold value, which results in higher AUC values for the model that includes the FSA scores, as seen in Figure 19.

Figure 21 shows the TPR, FPR, and precision graphs for the models with and without the FSA scores. The TPR values for the model without the FSA scores are almost always below the TPR values for the model with the FSA scores, except for a small range of threshold values at the lower end. For the threshold of 0.21, as shown by the vertical dashed line, the TPR for the model with the FSA scores is 10 percentage points higher than the one without (i.e., 0.70 vs. 0.60).21 The FPR values generally look similar between the two models, with a slightly higher FPR for the model with the FSA scores around the 0.21 threshold. However, the higher FPR for the model with the FSA scores is accompanied by a much higher TPR. Thus, the model with the FSA scores generally produces higher precision values than the one without the FSA scores.

21 For a threshold of 0.19, which corresponds to the top-left corner of the ROC curve for the model without the FSA scores, the TPRs are 0.74 and 0.65 for the models with and without the FSA scores, respectively.


Figure 20: ROC Curves with and without the FSA Scores, Grade 8

Figure 21: TPR, FPR and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 8


6. Results for Grade 5

6.1 Comparison of the Modelling Approaches

Average AUC

Figure 22 shows the average AUCs for the five modelling approaches. The average AUC is similar across all approaches, ranging from 0.709 to 0.739, but the XgBoost and L1-Interactive approaches yield slightly higher AUCs than the others, while the RF approach performs below the others. As Figure 23 illustrates, the AUCs also differ only minutely across different iterations of the CV.

Figure 22: Average AUC by Approach, Grade 5

Figure 23: Distribution of AUCs by Approach, Grade 5


ROC Curves

Figure 24 shows that the ROC curves for all approaches, except the RF approach, largely overlap, which is consistent with the similar AUC values seen above and also indicates very similar TPR and FPR values for these approaches. The ROC curve for the RF approach lies below the other approaches for almost all the FPR values, which results in the lowest AUC among all approaches (Figure 22).

Figure 24: ROC Curves by Approach, Grade 5


Average Precision at the Top 10% (P@10)

The XgBoost approach has the highest average P@10, while the RF approach has the lowest. The XgBoost approach outperforms the other approaches by 0.2 to 1 percentage points, although these differences are not very large.

For example, suppose there are 1,000 students in the top decile of the risk score distribution. Out of the 1,000 students, the XgBoost approach would correctly predict 665 to not graduate on time, while the Baseline Logit approach would correctly predict 655. Therefore, the Baseline Logit approach would fail to predict 10 students who would in fact not graduate on time.

Again, as Figure 26 shows, the precision values have more variation across the CV iterations compared to the AUCs. Nevertheless, the values do not vary by more than 2 percentage points across the iterations, with the Baseline Logit approach showing a slightly tighter distribution of P@10 values.

A Selected Modelling Approach

Again, the Baseline Logit approach is selected because not only is its predictive accuracy comparable to the XgBoost approach (e.g., only a 0.3 and 1 percentage point difference for AUC and P@10, respectively), but it is also easier to use in practice. Following the framework used for the Grade 8 results, the rest of the Grade 5 results below use the Baseline Logit approach.

Figure 25: Average P@10 by Approach, Grade 5


Figure 26: Distribution of P@10 by Approach, Grade 5


6.2 External Validation of the Selected Approach

Figure 27 and Figure 28 indicate that the predictive model estimated using the selected Baseline Logit approach will generalize well to new data, as the average AUC, P@10, and ROC curves are very close for the CV and the external validation set.22 The performance measures shown in the subsequent sections are calculated using the external validation set to better reflect expected predictive accuracy on new data.

Figure 27: AUC and P@10 for CV and External Validation Set, Grade 5

22 As noted above, similar predictive accuracy on the new data is conditional on no significant structural differences in new data in terms of the relationships between observable characteristics and the outcome variable.


Figure 28: ROC Curves for CV and External Validation Set, Grade 5


6.3 Risk Scores: Predicted Probability of Not Graduating on Time

Figure 29 shows the distribution of students across risk scores, which range from 0.02 to 0.97. The peak of the distribution is between 0.12 and 0.13, and the proportion of students declines steadily after that. Figure 30 shows the cumulative distribution of risk scores, with 50% of the students having risk scores at or below 0.19 and 75% at or below 0.31.

Figure 29: Distribution of Risk Scores, Grade 5


Figure 30: Cumulative Distribution of Risk Scores, Grade 5

Figure 31 illustrates that the actual proportion of students not graduating on time increases consistently with the risk score, and the risk score values match closely with the actual rates of not graduating on time; that is, the selected Baseline Logit approach generally ranks students correctly and provides good predictions of the probability of not graduating on time. There is a slight decrease in the actual rate for the highest risk scores (above 0.95), where the sample size is very small (10 students). This decrease is likely a result of large sampling error, and therefore may not be very meaningful.

Figure 31: Empirical Risk Curve, Grade 5

6.4 Selection of a Predicted Probability Threshold and Predictive Accuracy of the Selected Approach

As explained for the Grade 8 results, the predicted probability threshold associated with the minimum distance to the top-left corner of the ROC curve could be used to set the threshold for computing the TPR, FPR, and precision. The top-left corner probability threshold of 0.23 is marked with a dot in Figure 28 and with a vertical dashed line in Figure 29. The risk score marking the top 10th percentile is 0.50 and is shown as the dotted vertical line in Figure 29.

Figure 32 provides the confusion matrix for a probability threshold of 0.23. There are 111,667 students in the external validation set, with 17,749 predicted correctly to not graduate on time (i.e., true positives) and 58,832 to graduate on time (i.e., true negatives). Of all the students in the sample, 9,725 are predicted as graduating on time when they actually did not graduate on time (i.e., false negatives), and 25,361 are predicted as not graduating on time when they actually did graduate on time (i.e., false positives). Given these numbers, the TPR, FPR, and precision values for a threshold of 0.23 are 0.65, 0.30, and 0.41, respectively.

Figure 32: Confusion Matrix using a 0.23 Predicted Probability Threshold, Grade 5


                                          Actual Outcome
Predicted Outcome             Not Graduate on Time    Graduate on Time
Not Graduate on Time                    17,749              25,361
Graduate on Time                         9,725              58,832

Figure 33 shows how the TPR, FPR, and precision change with the predicted probability threshold. The vertical dashed and dotted lines respectively mark the thresholds of 0.23 and 0.50. Among the students in the top decile of the risk score distribution, 65.5% are correctly predicted to not graduate on time.23

Figure 33: TPR, FPR, and Precision by Predicted Probability Threshold, Grade 5

6.5 Importance of the FSA Scores for Predictive Accuracy

Figure 34 shows the AUC and P@10 values for the models with and without the FSA scores. Akin to the Grade 8 results, the model including the FSA scores outperforms the one without the FSA scores by 3.4 and 5.2 percentage points in terms of the AUC and P@10 values,

23 The precision values at the upper end of risk score distribution show a fluctuating pattern as they are calculated using very few observations. Only 0.1% and 0.009% of the sample (144 and 10 students) have risk scores above 0.90 and 0.95. Therefore, these precision values at the very top of the distribution may not be very meaningful.


respectively. Again, as shown in Figure 35, the ROC curve for the model with the FSA scores lies above the one without the FSA scores.

Figure 34: AUC and P@10 with and without the FSA Scores, Grade 5

The TPR values for the model without the FSA scores are almost always below the TPR values for the model with the FSA scores, except for a small range of threshold values at the lower end. For a threshold of 0.23 (i.e., the vertical dashed line), the TPR for the model with the FSA scores is 7 percentage points higher than the one without (i.e., 0.65 vs. 0.58).24 The FPR values are generally similar between the two models, with a slightly higher FPR for the model with the FSA scores at the 0.23 threshold; however, the higher FPR for the model with the FSA scores is accompanied by a much higher TPR for that probability threshold, which may be desirable.

24 For a threshold of 0.22, which corresponds to the top-left corner of the ROC curve for the model without the FSA scores, the TPRs are 0.67 and 0.61 for the models with and without the FSA scores, respectively.


The model with the FSA scores generally produces higher precision values than the one without the FSA scores.

Figure 35: ROC Curves with and without the FSA Scores, Grade 5

Figure 36: TPR, FPR, and Precision by Predicted Probability Threshold with and without the FSA Scores, Grade 5


7. Discussion

Overview

The work presented in this report is part of a broader research project being undertaken by the Education Policy Research Initiative for the BC Ministry of Education. The project is intended to improve policy makers’ understanding of on-time high school graduation and to develop tools that could inform policy initiatives that would ultimately lead to improved student outcomes.

The project is based on the PEN data, which represent an extraordinarily rich data platform that captures student characteristics and enrollment information on a year-by-year basis from the point students enter the BC school system until they leave, as well as province-wide FSA scores in reading, writing, and numeracy administered in Grade 4 and Grade 7, linked by students’ PEN.

The first phase of the project involved an analysis of the relationships between on-time graduation and a range of student characteristics, the Grade 4 and Grade 7 FSA scores, and school district information available in the PEN data.

The second phase of the project, covered in this report, focuses on the development of models that could be used to predict on-time graduation at the individual student level, which could then be used by the Ministry to target student success initiatives on at-risk students with the aim of improving their on-time graduation rates and possibly other outcomes.


This work draws upon recent advancements in machine learning to develop predictive models of not graduating on time using five established approaches implemented with the rich PEN data. This provides a unique and powerful basis for developing models that can predict students’ risk of not graduating on time, all placed in the BC context and intended to be of practical use to policy makers.

Predictive models are developed for students in Grade 5 and Grade 8, a choice guided by the timing of the Grade 4 and Grade 7 FSAs. These two models provide predictions that could be used to implement student success interventions at both an earlier point in time (Grade 5), as well as at a later point (Grade 8) when the predictions are improved, at least in part due to the availability of the later FSA scores. These two models thus provide policy makers with two distinct policy strategy options with respect to the timing of student success initiatives.25

On-time graduation is defined as receiving a Dogwood diploma within six years of starting Grade 8, which represents an established Ministry standard, as well as graduating within nine years of starting Grade 5, a definition developed for this project which is consistent with the Ministry Grade 8 standard.

The project set out to answer the following research questions:

1. Which approach provides the most accurate predictive models of on-time high school graduation in BC based on the PEN data available, including the Grade 4 and Grade 7 FSA scores?

2. How well do the Grade 5 and Grade 8 predictive models perform and how does the accuracy of the two models compare?

3. To what extent do the Grade 4 and Grade 7 FSA scores improve the accuracy of the predictions of on-time graduation?

Main Findings

Of the five approaches used to develop the predictive models, the XgBoost approach outperforms the others for both the Grade 8 and Grade 5 models by a relatively small margin, with the differences in accuracy depending on the specific measure of predictive accuracy used.

The XgBoost approach is, however, also one of the more complex approaches considered in this analysis. More generally, increasing the complexity of the modelling approach tends to have a limited effect on predictive accuracy, which is consistent with other findings in the literature where the data available are limited in terms of their ability to predict the outcome of interest—

25 Initiatives could also be implemented in other grades based on either the predictions generated by the Grade 5 and Grade 8 models developed here or with new models that incorporated any additional information that could potentially be available, such as recent grades.


notwithstanding the richness of the PEN data in the context of information on students and their schooling experiences and outcomes.26

In particular, the XgBoost approach improves the predictive accuracy only slightly compared to the Baseline Logit approach, which is the simplest of all five approaches considered. This simplicity has potential advantages in terms of interpretability and ease of using the model in practice.

In particular, student-level risk scores (i.e., the predicted probabilities of not graduating on time) could be produced quite easily using any spreadsheet program such as Excel by plugging the student-level information corresponding to the variables used in the model (in exactly the same format) into the relatively simple mathematical formula that comprises the Baseline Logit model that has been developed. In contrast, the XgBoost model would require the use of a statistical programming language to produce the student-level risk scores.
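For illustration, the spreadsheet-style calculation amounts to taking the logistic transform of a weighted sum of a student's predictor values; the coefficients and predictors below are made up solely to show the arithmetic and are not the estimated model.

```python
# Illustrative only: a Baseline Logit risk score is the logistic transform of
# a weighted sum of predictor values. Coefficients and predictors are hypothetical.
import math

coefficients = {"intercept": -2.0, "fsa_reading": -0.8, "fsa_numeracy": -0.6}  # hypothetical
student = {"fsa_reading": 0.4, "fsa_numeracy": -0.2}                            # hypothetical

linear_index = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in student.items()
)
risk_score = 1 / (1 + math.exp(-linear_index))   # predicted probability of not graduating on time
print(round(risk_score, 3))
```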

We therefore select the Baseline Logit approach to assess the predictive accuracy that would be expected with new cohorts of students. Given the relatively comparable accuracy across the models developed using the other approaches, similar results could be expected with them. Indeed, the predictive accuracy of the other approaches is similar to that of the Baseline Logit approach.

To assess how the models would perform on “new” data (i.e., new cohorts of students), the final Grade 5 and Grade 8 predictive models developed using the Baseline Logit approach are first trained (or further developed) on the entire training sets of data (representing 70% of the entire data) and then their level of predictive accuracy is calculated using the external validation sets that are isolated from the process of developing the models and the final model training stages.

The comparisons show that the predictions generated using the “new” data are very close to the initial results for both the Grade 8 and Grade 5 students. The evidence, thus, suggests that the models developed would provide good predictions for new cohorts of Grade 5 and Grade 8 students, which is the ultimate goal of the development of these predictive models.

To begin, each model yields a predicted probability of not graduating on time for each student— or their “risk score”. The accuracy tests first show that the risk scores associated with the selected Baseline Logit model do a very good job of ordering students by their actual leaving rates. That is, those students with higher risk scores do in fact tend to have higher actual rates of not graduating on time than those with lower risk scores. Furthermore, this ordering holds at a fine level of detail across relatively small differences in risk scores (e.g., .05 vs. .10, .10 vs. .15, etc.) and also across the entire spectrum of risk scores, from the lowest (i.e., predicted probability of leaving or risk score near 0) to the highest (risk score near 1.0).

26 Perlich, C., Provost, F., & Simonoff, J. S. (2003). Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4(Jun), 211-255.


Furthermore, the tests show that there is a close match between risk scores and actual rates of not graduating on time. For example, those students with a certain risk score level (.05, .10, etc., again over the entire range of risk scores from 0 to 1) generally do not graduate on time at approximately that same rate.

In other words, the selected Baseline Logit model generally ranks students correctly in terms of their actual leaving rates and generates predicted probabilities of not graduating on time that are very close to the actual rates of not graduating on time. This not only attests to the accuracy of the model, but points directly to one way the predictive model could be used in practice: that is, to target students according to their risk scores, a strategy which is discussed further below.

The other main metrics of predictive accuracy are TPR (True Positive Rate), FPR (False Positive Rate), and P@10 (precision at the top 10 percent). Overall, the findings show that the models produce fairly accurate predictions. The TPR, FPR, and P@10 are 70%, 28%, and 70%, respectively, for the Grade 8 model. The TPR means that, among all the students who actually did not graduate on time, 70% are predicted correctly as not graduating on time. The FPR is similarly interpreted and, among all the students who actually graduated on time, 28% are predicted incorrectly as not graduating on time. Finally, the P@10 value means that among all students with a risk score in the top decile, 70% are predicted correctly as not graduating on time. For the Grade 5 model, the TPR, FPR, and P@10 values are 65%, 30%, and 66%, respectively—slightly less accurate measures compared to the Grade 8 results.27 These represent quite good values within the education context.

Finally, the analysis also shows that the Grade 4 and Grade 7 FSA scores substantially improve the predictive accuracy of the model. For the Grade 8 model, including the Grade 4 and Grade 7 FSA scores increases the TPR from 60% to 70% (based on the selected threshold of 0.21). Likewise, for the Grade 5 students, including the FSA scores in the model increases the TPR from 58% to 65% (based on the selected threshold of 0.23). While the FPR values are generally similar between the models with and without the FSA scores, the P@10 values increase from 65% to 71% for the Grade 8 students and from 60% to 66% for the Grade 5 students when the FSA scores are included.

A simple example would also help illustrate the importance of including FSA scores in the predictive model. Suppose there are 1,000 students with risk scores in the top decile. For both Grade 8 and Grade 5 students, models that include the FSA scores would correctly identify 60 more students (710 versus 650 and 660 versus 600) as not graduating on time compared to models without the FSA scores.

27 The TPR and FPR rates are based on risk score thresholds (or predicted probability thresholds) of 0.21 and 0.23 for the Grade 8 and Grade 5 students, respectively, which represent the optimal trade-off between the FPR and TPR values as conventionally defined in the literature.


These findings regarding the importance of the FSA variables to the predictive power of the models are consistent with the first phase of this overall project, where strong relationships were found between FSA scores and on-time graduation.

Predictive Models as Policy Tools

A predictive model of on-time graduation of the type developed in this project could be used to target students in two main ways.

First, if an initiative aimed at improving on-time graduation is intended to support a specific number of students, presumably due to a limit in the resources available (i.e., budgetary restrictions), the risk scores produced by the predictive model could be used to target students with the highest estimated probabilities of not graduating on time. This would be done by ordering students by their risk scores and counting down until the designated number of students is identified. In this way, a policy maker could be assured that the resources available are targeted on those students in greatest need.

Second, true and false positive detections could be used to inform the targeting approach adopted. A true positive detection represents a case where a student is correctly predicted to not graduate on time, whereas a false positive detection represents a case where a student is predicted as not graduating on time when they actually do. If an initiative is directed to those students who are predicted to not graduate on time, the former (i.e., true positives) represents a case where resources would be directed on a student who needed assistance, whereas the latter would result in resources being spent on students who did not need assistance in the first place.

Targeting more students based on risk scores by choosing a lower risk score threshold—above which students would get the initiative and below which they would not—will increase the chances of true positive detection, as would be wished (that is, more of those students who need assistance would receive the initiative). Targeting more students would, however, also catch more students who would graduate on-time without the intervention. The selection of a risk score threshold therefore represents a trade-off between targeting more students in need versus allocating resources on those who do not need the support. It is not possible to increase one and lower the other simultaneously without more information on students.

The predictive models developed in this project could inform these trade-offs by guiding the selection of the threshold to be used to target students. This would be done using the true positive rates (TPRs) and false positive rates (FPRs) for each risk score threshold that have been produced in the course of the development of the Baseline Logit model (if that is, in fact, the model that is applied). Alternatively, if another model was chosen, a similar exercise could be carried out for it as well.

Choosing the preferred trade-off between true and false positives ultimately represents a policy decision, since it essentially reflects whether the policy maker prefers to err on the side of making sure as many of those who may need the assistance receive it at the cost of also helping some who do not need it, and vice versa.


Presumably, this choice will be guided at least in part by considerations of the costs of any initiative(s) and the associated expected benefits in terms of improved high school graduation rates. The subsequent gains that could accrue from higher high school graduation rates include higher PSE participation rates, better labour market outcomes, and improvements in a range of other outcomes at the individual and social level associated with high school graduation and further educational attainment (e.g., better health, increased civic engagement, lower crime rates, etc.). These benefits could also include any fiscal gains realised when a province’s population has a higher high school completion rate and sees improvements in associated outcomes, since these will tend to increase government tax revenues and reduce expenditures, such as those related to income support (e.g., Employment Insurance and Social Assistance), health care, policing, and more.

In principle, some interventions could potentially even pay for themselves if they were not too costly and were effective in increasing high school graduation rates and thereby lead to the other improved outcomes just noted and possibly others. Of course, such self-financing programs are the holy grail of social policy and are not common, but if there is likely one area where this is possible, early interventions to help students improve their life opportunities, starting with high school graduation, probably represents one of the best hopes in this regard.

Limitations

This project provides valuable insights into the development and assessment of predictive models of on-time graduation for BC students. As the final models performed quite well in terms of predictive accuracy, the student-level predictions that could be produced using these models could be used to target student success initiatives on those students who are at higher risk of not graduating on time. However, the analysis carried out here has two important limitations that are worth noting.

First, the incidence of not graduating on time is generally overstated in the PEN data, as students who leave school before graduation are not differentiated from those who leave the province; that is, both groups are simply observed to be no longer enrolled and are categorised as not graduating on time. This is probably a more serious concern for Grade 5 students than Grade 8 students, as the likelihood of leaving the province is presumably higher for younger students.

The identification of inter-provincial migration of students would, therefore, represent a valuable addition to future work, as this would allow the models to be revised to include only those students who remain in the province. That said, it is difficult to imagine how this could be done


with the data currently available, although careful use of linked tax data could potentially provide one option in this regard.28

A second limitation is that predictive modelling relies on historical data, and any model that is developed may become outdated with changes in the relationship between the predictors and the outcome of interest; in this case, on-time graduation. Factors of this type could include changes in schooling and other related policies, curriculum, or student characteristics not captured in the models. If this occurs, predictions for new cohorts of students generated by the predictive model may not be as accurate as those generated before any such changes or for the cohorts included in the actual development of the predictive model (as discussed above, as per the students used in the final assessments of model accuracy). For example, changes in the design and implementation of the FSA in 2017 may affect the performance of the predictive models developed for this project when the new FSA scores are used in the predictive model.

Directions for Future Work

While the predictive models developed here perform well, adding additional information on students could potentially lead to the development of even better models. For example, information related to students’ academic engagement or other aspects of their schooling experiences and outcomes (e.g., attendance rates, suspensions, other behaviour), to their situation outside of school (e.g., being in foster care or having contact with the Ministry for Family and Child Development), or to their families (e.g., family income or parental education levels), could potentially improve predictive accuracy.

It is recognised that the PEN data have been developed based on the underlying system data available and there are a range of considerations regarding which data could and should be included. It is equally understood that the PEN data are extremely rich and of remarkable depth and quality when placed not only in the Canadian context but even at the international level. And finally, it is important to recognise the kind of innovative analysis with practical policy applications that the PEN data have permitted in the two phases of this project on on-time graduation. Even still, it is worth stating that any further data enhancements of the PEN data could push these frontiers even further.

To start, making more of the variables currently included in the PEN data available for the purposes of developing predictive models of on-time graduation could lead to improved models and predictions.

At a broader level, bringing the PEN data into Statistics Canada’s Social Data Linkage Environment (SDLE), as has recently been done, may provide some extremely innovative

28 In particular, it is at least conceivable that students’ tax records could be used to identify their families of origin, with mothers and fathers then followed in their tax data to identify those who moved out of the province while the child (student) was of school-attending age.


opportunities in this regard. This could make the PEN data of even greater value across the range of uses to which they are put—including the development of the predictive models of on-time high school graduation developed here.

Beyond data developments of this nature, future research could involve the design, implementation, and evaluation of student success initiatives aimed at improving on-time graduation. It is, of course, one thing to develop a predictive model that allows initiatives to target at-risk students, but quite another to know which initiatives work best for which students when implemented—or to otherwise implement initiatives already known to improve student outcomes.

Using a risk score threshold to target initiatives, whereby students with risk scores above the threshold receive the initiative while those with scores below the threshold do not, would not only target initiatives on those students at higher risk of not graduating on time, but would also allow the use of regression discontinuity methods to identify the causal effects of the initiative.

Alternatively, random assignment approaches could be used to estimate the causal effects of an initiative, and these effects could be estimated at different risk levels to see which students benefit the most from the initiative.

Initiatives could be targeted on students either in Grade 5 or Grade 8, corresponding to the points at which the predictive models have been developed, or even in subsequent years (i.e., Grades 6 or 7, or Grades 9 to 12) using Grade 5 and Grade 8 risk scores, respectively.29 While on-time high school graduation for initiatives put in place in Grade 5 and Grade 8 would ultimately be measured as far out as nine and six years later, interim assessments could include the monitoring and assessment of academic progress (i.e., who continues to be enrolled and advancing through their studies) in the intervening years.

A third general line of future research could involve examining the relationships between the risk scores generated by the Grade 5 and Grade 8 models and other outcomes, such as access to post-secondary education (as analysed in an earlier project using the PEN data) or even students’ post-schooling labour market earnings, now that the PEN data have been linked to tax data.

As shown in this report, the models developed produce relatively accurate predictions of on-time high school graduation, and—with a high school diploma generally representing a prerequisite for entering PSE (with some special access program exceptions) and PSE in turn generally representing the starting point for labour market success—these predictions of on-time high school graduation could provide a forward view of this series of later outcomes. In addition, the

29 Alternatively, as mentioned earlier, predictive models could be developed as of those other grades rather than Grade 5 and Grade 8.


risk scores related to on-time graduation may also be correlated with other student characteristics and related factors that have their own (independent) influence on labour market success.

Finally, not only would it potentially be interesting to examine the relationships between risk scores and access to PSE and later earnings related to on-time high school graduation, predictive models of these outcomes based on the PEN data could be developed using methods similar to those employed here.

The PEN data represent a remarkable resource for improving our understanding of a range of schooling and post-schooling outcomes and for developing predictive models of a comparable range of outcomes for which the current project represents an excellent starting point.

References

Adelman, M., Haimovich, F., Ham, A., & Vazquez, E. (2018). Predicting school dropout with administrative data: New evidence from Guatemala and Honduras. Education Economics, 26(4), 356-372.

Aguiar, E., Lakkaraju, H., Bhanpuri, N., Miller, D., Yuhas, B., & Addison, K. L. (2015, March). Who, when, and why: a machine learning approach to prioritizing students at risk of not graduating high school on time. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge. https://www3.nd.edu/~dial/publications/aguiar2015who.pdf

Allensworth, E. M., & Easton, J. Q. (2005). The on-track indicator as a predictor of high school graduation. Chicago: Consortium on Chicago School Research. Retrieved from https://consortium.uchicago.edu/sites/default/files/publications/p78.pdf

Arnold, K. E., & Pistilli, M. D. (2012, April). Course signals at Purdue: Using learning analytics to increase student success. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge.

Baker, R. D. (2011). Data mining for education. In B. McGaw, P. Peterson, and E. Baker (Eds.), International Encyclopedia of Education (pp. 112-114). Amsterdam: Elsevier.

Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3-17.

Balfanz, R., Herzog, L., & Mac Iver, D. J. (2007). Preventing student disengagement and keeping students on the graduation path in urban middle-grades schools: Early identification and effective interventions. Educational Psychologist, 42(4), 223-235.

Bowers, A. J. (2010). Analyzing the longitudinal K-12 grading histories of entire cohorts of students: Grades, data driven decision making, dropping out and hierarchical cluster analysis. Practical Assessment Research and Evaluation, 15(7), 1-18.


Bowers, A. J., & Sprott, R. (2012a). Examining the multiple trajectories associated with dropping out of high school: A growth mixture model analysis. Journal of Educational Research, 105(3), 176-195.

Bowers, A. J., Sprott, R., & Taff, S. A. (2012). Do we know who will drop out? A review of the predictors of dropping out of high school: Precision, sensitivity, and specificity. The High School Journal, 96(2), 77-100.

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.

Brooks, C., & Thompson, C. (2017). Chapter 5: Predictive modelling in teaching and learning. In C. Lang, G. Siemens, A. Wise, & D. Gašević (Eds.), Handbook of learning analytics (pp. 61-68). Solar. Retrieved from https://pdfs.semanticscholar.org/2cd4/901b07f3562f98e1e56dc5712e8bc03bdc2e.pdf

Campbell, J. P., deBlois, P. B., & Oblinger, D. G. (2007). Academic analytics: A new tool for a new era. EDUCAUSE Review, 42(4), 40-57.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Chung, J. Y., & Lee, S. (2019). Dropout early warning systems for high school students using machine learning. Children and Youth Services Review, 96, 346-353.

Croninger, R. G., & Lee, V. E. (2001). Social capital and dropping out of high school: Benefits to at-risk students of teachers' support and guidance. Teachers College Record, 103(4), 548-581.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-25.

Gašević, D., Buckingham Shum, S., Nelson, K., Alexander, S., Lockyer, L., Kennedy, G., et al. (2016). Student retention and learning analytics: A snapshot of Australian practices and a framework for advancement. Sydney, NSW: Australian Office of Learning & Teaching. Retrieved from http://www.olt.gov.au/system/files/resources/SP13_3249_Dawson_Report_2016.pdf.

Gleason, P., & Dynarski, M. (2002). Do we know whom to serve? Issues in using risk factors to identify dropouts. Journal of Education for Students Placed at Risk, 7(1), 25-41.


Hlosta, M., Zdrahal, Z., & Zendulka, J. (2017). Ouroboros: Early identification of at-risk students without models based on legacy data. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference (pp. 6-15).

Janosz, M., Archambault, I., Morizot, J., & Pagani, L. S. (2008). School engagement trajectories and their differential predictive relations. Journal of Social Issues, 64(1), 21-40.

Johnson, R. A., Gong, R., Greatorex-Voith, S., Anand, A., & Fritzler, A. (2015). A data-driven framework for identifying high school students at risk of not graduating on time. Conference proceedings in the Bloomberg Data for Good Exchange Conference. Retrieved from https://www3.nd.edu/~dial/publications/johnson2015data.pdf

Knowles, J. E. (2015). Of needles and haystacks: Building an accurate statewide dropout early warning system in Wisconsin. Journal of Educational Data Mining, 7(3), 18-67.

Lakkaraju, H., Aguiar, E., Shan, C., Miller, D., Bhanpuri, N., Ghani, R., & Addison, K. L. (2015). A machine learning framework to identify students at risk of adverse academic outcomes. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1909-1918).

Mahoney, J. L., & Cairns, R. B. (1997). Do extracurricular activities protect against early school dropout? Developmental Psychology, 33(2), 241-253.

O'Cummings, M., & Therriault, S. B. (2015). From accountability to prevention: Early warning systems put data to work for struggling students. Washington, D.C.: American Institutes for Research.

Pagani, L. S., Vitaro, F., Tremblay, R. E., McDuff, P., Japel, C., & Larose, S. (2008). When predictions fail: The case of unexpected pathways toward high school dropout. Journal of Social Issues, 64(1), 175-194.

Perlich, C., Provost, F., & Simonoff, J. S. (2003). Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4(Jun), 211-255.

Sansone, D. (2018). Beyond early warning indicators: High school dropout and machine learning. Oxford Bulletin of Economics and Statistics, early release.

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289-310.

Sullivan, W., Marr, J., & Hu, G. (2017). A predictive model for standardized test performance in Michigan schools. In R. Lee (Ed.), Applied computing and information technology (pp. 31-46). New York: Springer.


Glossary

AUC: The area under the receiver operating characteristic (ROC) curve is a metric used to assess a predictive model’s ability to accurately predict outcomes. The ROC curve considers all probability thresholds; the closer it lies to the top left corner of the graph (which represents an ideal scenario), the more accurate the model. A larger AUC therefore corresponds to a ROC curve that is closer to the top left corner and to better predictive performance.
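For illustration, the AUC could be computed from predicted risk scores as in the following sketch, which uses Python and the scikit-learn library; the labels, scores, and variable names are illustrative only and are not drawn from the PEN data.

    # Illustrative only: synthetic labels and risk scores, not the PEN data.
    from sklearn.metrics import roc_auc_score

    # y_true: 1 = did not graduate on time (the positive outcome), 0 = graduated on time
    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    # y_score: predicted probability of not graduating on time (the risk score)
    y_score = [0.10, 0.35, 0.60, 0.20, 0.80, 0.45, 0.15, 0.90]

    # An AUC of 1.0 indicates a perfect ranking of students by risk; 0.5 is no better than chance.
    print(roc_auc_score(y_true, y_score))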

Classification problem: In this report, determining whether a student will not graduate on time is the classification problem at hand. Since this is a binary classification problem, the output of the predictive model will be one of two mutually exclusive outcomes: will or will not graduate on time.

Cross-validation: Cross-validation (CV) is a validation method that separates the data into two sets (subsamples): a training set and a validation set. This analysis uses a 5-fold CV approach. The dataset is separated into 5 random, non-overlapping parts called folds. Four of the folds are used as a training set to train the model with a given set of tuning parameters, and the 5th fold is used as a validation set to estimate the performance of the model under that set of tuning parameters. This is repeated 5 times (once with each fold serving as the validation set), and the performance metric is then averaged across folds to produce a value reflecting the predictive performance of the selected tuning parameters.
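A minimal sketch of 5-fold cross-validation with tuning-parameter selection is shown below, using scikit-learn; the synthetic data, placeholder model, and tuning grid are assumptions for illustration and do not reproduce the report's actual specification.

    # Illustrative sketch of nested 5-fold cross-validation; placeholders, not the report's specification.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8], random_state=0)

    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # selects tuning parameters
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # estimates performance

    # Inner loop: choose the L1 penalty strength C using 5-fold CV on the training folds.
    model = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=inner_cv,
    )

    # Outer loop: train the tuned model on 4 folds, score it on the 5th, and average the AUCs.
    auc_per_fold = cross_val_score(model, X, y, scoring="roc_auc", cv=outer_cv)
    print(auc_per_fold.mean())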

External validation set: The external validation set used in this report is the portion of the data (a random 30%) that is set aside from the training process and used to evaluate the expected predictive accuracy of the model on new, unseen data.

False positive rate: The false positive rate (FPR) captures the proportion of students who graduate on time (i.e., the negative outcome as defined for the purposes of this project) who are incorrectly predicted to not graduate on time. The false positive rate is the x-axis of a ROC curve. The FPR is defined as: FPR = False Positives / (False Positives + True Negatives). A low false positive rate is preferable.
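The following sketch shows how the FPR (and, analogously, the true positive rate defined below) could be computed from predictions at a single probability threshold; the labels, risk scores, and the 0.21 threshold are used purely for illustration.

    # Illustrative computation of the FPR and TPR at one probability threshold (synthetic data).
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])       # 1 = did not graduate on time
    y_score = np.array([0.10, 0.35, 0.60, 0.20, 0.80, 0.45, 0.15, 0.90])

    y_pred = (y_score >= 0.21).astype(int)             # flag students at or above the threshold

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn)    # share of on-time graduates incorrectly flagged
    tpr = tp / (tp + fn)    # share of non-graduates correctly flagged
    print(fpr, tpr)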

Feature selection: Using too many predictors in a predictive model may lead to overfitting, and it is not immediately evident which subset of variables should be used as predictors. Some predictive modelling approaches have built-in feature selection, which automatically removes predictors that do not improve predictive performance, with little additional computational cost.

Negative outcome: In this report, the negative outcome is defined as graduating on time. Also see the definition of “Positive outcome”.

Outcome: The outcome used in this report is a binary one, which represents whether a student graduated on time or not.

Overfitting: Creating a model that matches the training data so closely that it fails to make correct predictions on new data. This can result from having a small number of observations or from using a very complex and flexible model that fits even the idiosyncrasies of the training data.
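Under illustrative assumptions (synthetic data, scikit-learn decision trees), the sketch below shows how a fully grown tree will typically fit the training data almost perfectly while scoring worse on held-out data than a simpler, depth-limited tree.

    # Illustrative example of overfitting using synthetic data (not the PEN data).
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for depth in (None, 3):   # None = fully grown (flexible) tree; 3 = restricted tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
        print(depth, round(train_auc, 3), round(test_auc, 3))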

Positive outcome: The positive outcome is the event that the predictive model is set to predict. In this report, the positive outcome is defined as not graduating on time.

Precision at top 10% (P@10): Precision is the proportion of true positive predictions among students who are predicted to not graduate on time (i.e., the positive outcome). Precision at top 10% (P@10) is the proportion of true positive predictions among students with the top 10% highest predicted probability of not graduating on time.
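A minimal sketch of how P@10 could be computed from a set of risk scores follows; the labels and scores are synthetic and the variable names are illustrative assumptions.

    # Illustrative computation of precision at top 10% (P@10); labels and scores are synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)    # 1 = did not graduate on time
    y_score = rng.random(size=1000)           # predicted risk scores

    k = int(0.10 * len(y_score))              # number of students in the top 10%
    top_k = np.argsort(y_score)[::-1][:k]     # indices of the k highest risk scores
    print(y_true[top_k].mean())               # share of those students who are true positives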

Predictive accuracy: Often referred to as out-of-sample predictive accuracy in the machine learning literature, predictive accuracy is the ability of a predictive model to produce accurate predictions for new observations.

Predictor: A predictor, also referred to as a feature, is an input variable for each observation (e.g., participant age, sex, and program information are all predictors) used in making predictions.

Probability threshold: A probability or classification threshold is a cut-off applied to the predicted probability in order to separate the positive class from the negative class; each threshold corresponds to a single point on the ROC curve. The threshold is used when mapping predicted probabilities to the two classes of a binary classification problem. Changing the threshold value has a direct impact on the true positive and false positive rates.

Risk score: Predictive models produce a predicted probability of not graduating on time (between 0 and 1), with higher values indicating higher likelihood of not graduating on time. In this report, these predicted probabilities of not graduating on time are referred to as risk scores.


ROC curve: For each student, predictive models produce a predicted probability of not graduating on time (between 0 and 1). If closer to 1, a student is more likely to not graduate on time. The Receiver Operating Characteristic (ROC) curve traces true positive and false positive rates for different probability threshold values.
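The ROC curve can be traced by sweeping the probability threshold and recording the resulting (FPR, TPR) pairs; a sketch using scikit-learn follows, with illustrative labels and scores rather than actual model output.

    # Illustrative trace of a ROC curve using scikit-learn; labels and scores are synthetic.
    from sklearn.metrics import roc_curve

    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    y_score = [0.10, 0.35, 0.60, 0.20, 0.80, 0.45, 0.15, 0.90]

    # roc_curve sweeps the probability threshold and returns one (FPR, TPR) pair per threshold.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")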

Training set: The subset of the data set used to train a model.

Training: Training or fitting a model is the process of estimating the parameters of a model. The coefficients in a logistic regression, for example, are estimated during the training process.

True positive rate: The true positive rate (TPR) captures the proportion of students who do not graduate on time who are correctly predicted to not graduate on time. The true positive rate is the y-axis of a ROC curve. The TPR is defined as: TPR = True Positives / (True Positives + False Negatives). A high true positive rate is preferable.

Tuning parameters: Tuning parameters are optional hyperparameters used by certain predictive models, such as the penalty (regularization) parameter of the L1-regularized logistic regression or the structure of a decision tree. These hyperparameters cannot be inferred directly from the information on the outcomes or the predictors.

Tuning: Tuning is a part of model training but is only required for sophisticated predictive approaches using models with specific hyperparameters or complex structures.

Underfitting: Underfitting refers to the inability of a model to capture the fundamental relationship between the predictors and the outcome of interest.

Validation set: A subset of the data set – separate from the training set – used to adjust hyperparameters and structures or to test predictive performance.
