
Receiver Operating Characteristic (ROC) Area Under the Curve (AUC): A Diagnostic Measure for Evaluating the Accuracy of Predictors of Education Outcomes [1,2]

Alex J. Bowers, Teachers College, Columbia University, [email protected]
Xiaoliang Zhou, Teachers College, Columbia University, [email protected]

1. This document is a pre-print of this manuscript, published in the Journal of Education for Students Placed At Risk (JESPAR). Citation: Bowers, A.J., Zhou, X. (2019). Receiver Operating Characteristic (ROC) Area Under the Curve (AUC): A Diagnostic Measure for Evaluating the Accuracy of Predictors of Education Outcomes. Journal of Education for Students Placed At Risk, 24(1), 20-46. https://doi.org/10.1080/10824669.2018.1523734
2. This work was supported by a grant from the National Science Foundation (NSF IIS-1546653). Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect the views of the funding agencies.

ABSTRACT: Early Warning Systems (EWS) and Early Warning Indicators (EWI) have recently emerged as an attractive domain for states and school districts interested in predicting student outcomes using data that schools already collect, with the intention to better time and tailor interventions. However, current diagnostic measures used across the domain do not consider the dual issues of sensitivity and specificity of predictors, key components for considering accuracy. We apply signal detection theory using Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) analysis adapted from the engineering and medical domains, and using the pROC package in R. Using nationally generalizable data from the Education Longitudinal Study of 2002 (ELS:2002), we provide examples of applying ROC accuracy analysis to a variety of predictors of student outcomes, such as dropping out of high school, college enrollment, and postsecondary STEM degrees and careers.

Keywords: ROC, AUC, Early Warning System, Early Warning Indicator, signal detection theory, dropout, college enrollment, Postsecondary STEM Degree, hard STEM career, soft STEM career

BACKGROUND and LITERATURE REVIEW:
Using Early Warning Systems (EWS) and Early Warning Indicators (EWI) to predict student outcomes has become an emerging domain of interest (Agasisti & Bowers, 2017; Allensworth, 2006; Allensworth & Easton, 2007; Allensworth, Nagaoka, & Johnson, 2018; Allensworth, Gwynne, Moore, & de la Torre, 2014; Bowers, Sprott, & Taff, 2013; Faria et al., 2017; Kemple, Segeritz, & Stephenson, 2013; Ocumpaugh et al., 2017). EWS and EWI are defined as "an intentional process whereby school personnel collectively analyze student data to monitor students at risk of falling off track for graduation and to provide the interventions and resources to intervene" (Davis, Herzog, & Legters, 2013, p. 84). EWS and EWI have been applied to predicting education outcomes, such as high school dropout (Knowles, 2015; Stuit et al., 2016; Tamhane, Ikbal, Sengupta, Duggirala, & Appleton, 2014), successful middle school to high school transition (Allensworth et al., 2014; Faria et al., 2017), college readiness (Hughes & Petscher, 2016; Koon & Petscher, 2016; Phillips et al., 2015), and academic performance (Ikbal et al., 2015; Lacefield & Applegate, 2018; Lacefield, Applegate, Zeller, & Carpenter, 2012; Macfadyen & Dawson, 2010), to name but a few. EWS and EWI are becoming increasingly popular because they can identify a small set of variables that allow educators to make timely decisions (Frazelle & Nagel, 2015; Macfadyen & Dawson, 2010), that can inform personalized interventions and help to reduce student retention and dropout (Ikbal et al., 2015; Tamhane et al., 2014), and that "often prompt[s] improvements in a district's system of supports" (Supovitz, Foley, & Mishook, 2012, p. 2). For example, randomly assigning 73 schools to treatment and control groups to use an Early Warning Intervention and Monitoring System (EWIMS), Faria et al. (2017) found that EWIMS was helpful for improving course completion and student attendance, each of which is a major indicator of students being off track for graduation (Allensworth & Luppescu, 2018).

Soland (2013) examined the accuracy of an EWS with the nationally representative National Education Longitudinal Survey (NELS) data of high school students, teachers, and parents from 1988-2000, predicting dropout and college going. After several longitudinal logistic analyses, Soland (2013) found that

    …EWS forecasts could be valuable both as organizational tools and for their precision. For example, model predictions of college going were more accurate than teacher predictions and remained accurate for students about whom teachers were unsure or disagreed. (p. 259)

Soland's (2013) study is a recent and robust investigation into the performance of an EWS. However, as with the vast majority of the literature in this domain, missing from this conversation is an examination of the accuracy of the predictors used in the early warning system.

More recently, in a follow-up study, Soland (2017) used a decision tree and a lasso regression analysis framework with n=9,601 students from the NELS sample from 1988-2000. In this study, the author worked to reduce thousands of measures with cross-validation techniques down to a set of variables measuring SES, postsecondary aspirations, academic readiness, and teachers' perceptions of readiness. The author then used the reduced set of variables to predict students enrolling in college and persisting for a semester, noting that his model achieved almost 90% accuracy. However, Soland (2017) calculated classification accuracy with two separate measures, one dividing the number of correct predictions by the total number of students, the other dividing false positives by false negatives. Although the second method used false positives and false negatives, the two metrics do not take into account the current literature on the accuracy of predictors from the research domain of signal detection theory. The core issues of signal detection theory in assessing predictor accuracy are a central missing detail across almost all of the current research in the EWI/EWS domain.

Signal detection theory is a domain of research that originated in engineering and medicine (Gönen, 2007; Hanley & McNeil, 1982; Swets, 1988; Swets, Dawes, & Monahan, 2000; Zweig & Campbell, 1993) and has more recently been applied to domains such as law, education, and youth development to understand the accuracy of at-risk indicators and predictors (Bowers, Sprott, & Taff, 2013; Knowles, 2015; Olver, Stockdale, & Wormith, 2009; Rice & Harris, 2005; Vivo & Franco, 2008). A core measure of accuracy within signal detection theory is the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC), in which the sensitivity of a predictor (the true-positive proportion) is compared to the specificity (the true-negative proportion), a calculation which is the central concern of the present study and which we review and apply in example applications in more detail below. Although there have been some reviews of metrics for evaluating the accuracy of student at-risk prediction models (National Center on Response to Intervention, 2010; Pelánek, 2014; Pelánek, 2015) and case studies on the use of ROC AUC in human-computer interaction research (Fogarty, Baker, & Hudson, 2005), only a few studies have used ROC to examine the accuracy of diagnostic measures for predictors of outcomes in education research (e.g., Carlson, 2018; D'Agostino, Rodgers, & Mauck, 2018; Johnson & Semmelroth, 2010; Jordan, Glutting, Ramineni, & Watkins, 2010; Liao, Yao, Chien, Cheng, & Hsieh, 2014; Nicholls, Wolfe, Besterfield-Sacre, & Shuman, 2010; Stuit et al., 2016). Research studies in education have rarely tested the accuracy of flags that are used to predict students at risk of specific outcomes (Bowers, Sprott, & Taff, 2013), which has resulted in a lack of reported accuracy for predictors across this domain.

Thus, the question for EWI and EWS in education is how to identify and report accurate predictors of students at risk. Education researchers, policymakers, and practitioners need a means to measure the accuracy of diagnostic systems, as misidentification of students at risk of negative outcomes has two broad drawbacks. As Gleason and Dynarski (2002) noted in reference to this issue of accuracy with early warning indicators of dropping out of high school, low accuracy EWIs lead to students being identified as at risk of dropping out who would never have dropped out. These students receive dropout interventions, but as noted by the authors, if the students would not have dropped out, perhaps these resources could be spent in better ways, in addition to the problem of negatively labeling a student as at risk with an inaccurate predictor. Conversely, students who will drop out but are never identified by the indicator do not receive needed resources and interventions that could help them persist in school. Importantly for this domain, in reviewing 110 dropout flags from the literature, Bowers et al. (2013) showed that the vast majority of EWIs in the dropout domain were not much better than a 50-50 guess, thus misidentifying large percentages of students as dropping out when they would not have dropped out, as well as missing a large percentage of students who will drop out but are never identified. The authors term this second set of students the "quiet dropouts" (Bowers & Sprott, 2012a, 2012b).

In the present study, we detail the use of a measure of accuracy known as the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC), drawn from signal detection theory in the engineering and medical domains. Whereas ROC AUC is commonly used to evaluate the accuracy of predictors in fields such as medicine, radiology, engineering, and machine learning, it is a relatively new concept for education early warning systems.

Signal Detection Theory
Signal detection theory deals with sensitivity and specificity. In Figure 1 we adapt a contingency table from the previous literature, noting how to compute sensitivity, specificity, and other related diagnostics (Swets, 1988, p. 1286; Bowers, Sprott, & Taff, 2013, p. 83).

The upper part of the figure is a contingency table, with rows as predictions and columns as events. The table also shows the row and column marginal sums and the total sum. The lower part of the figure gives the calculations of the measures of interest for accuracy. Discussing the contingency table in Figure 1 in terms of Type I and Type II error can help in understanding the calculations for sensitivity and specificity. Sensitivity, analogous to "1 - Type II error", is defined as the probability of identifying positive events as positive (True-positive Proportion), so 1-Sensitivity, analogous to "Type II error", is the probability of identifying positive events as negative (False-negative Proportion). Specificity, analogous to "1 - Type I error", is defined as the probability of identifying negative events as negative (True-negative Proportion), so 1-Specificity, analogous to "Type I error", is the probability of identifying negative events as positive (False-positive Proportion). Another related concept, precision (Positive Predictive Value), is defined as the probability of correctly predicting positive events. In other words, for an EWI predicting dropping out of high school, sensitivity can be thought of as the "hits": the proportion of students predicted to drop out who actually did drop out. Conversely, 1-specificity can be thought of as the "false alarms": the proportion of students who had the dropout predictor but still graduated.


                            Observation
                            Positive              Negative              Total
Prediction   Positive       True-positive (a)     False-positive (b)    a + b
             Negative       False-negative (c)    True-negative (d)     c + d
             Total          a + c                 b + d                 N

Accuracy = (a + d) / N
Precision = a / (a + b)
Sensitivity (Recall / True-positive Proportion) = a / (a + c)
Specificity (True-negative Proportion) = d / (b + d)
1 - Specificity (False-positive Proportion) = b / (b + d)
Kappa = (Accuracy - R) / (1 - R), where R = ((a + c)(a + b) + (b + d)(c + d)) / N^2

Figure 1: Confusion Table for Calculating Contingency Proportions
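To make the calculations in Figure 1 concrete, the following sketch computes each diagnostic from the four cell counts of a confusion table in R; the counts are hypothetical, chosen only to make the arithmetic visible:

```r
# Hypothetical cell counts for the Figure 1 confusion table:
# a = true-positives, b = false-positives,
# c = false-negatives, d = true-negatives
a <- 120; b <- 380; c <- 80; d <- 1420
N <- a + b + c + d

accuracy    <- (a + d) / N    # overall proportion of correct predictions
precision   <- a / (a + b)    # positive predictive value
sensitivity <- a / (a + c)    # recall / true-positive proportion
specificity <- d / (b + d)    # true-negative proportion
fpp         <- b / (b + d)    # 1 - specificity, the false-positive proportion

# Kappa: accuracy adjusted for the chance agreement R (Figure 1)
R_chance <- ((a + c) * (a + b) + (b + d) * (c + d)) / N^2
kappa    <- (accuracy - R_chance) / (1 - R_chance)

round(c(accuracy = accuracy, precision = precision, sensitivity = sensitivity,
        specificity = specificity, fpp = fpp, kappa = kappa), 3)
```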
As noted in the signal detection theory literature (Gönen, 2007; Hanley & McNeil, 1982; Swets, 1988; Swets et al., 2000; Zweig & Campbell, 1993), the central issue in the accuracy of predictors of outcomes is that both the sensitivity and the specificity must be taken into account, as the two dimensions of the true-positive proportion (sensitivity) and the false-positive proportion (1-specificity) act in concert when discussing accuracy (Bowers et al., 2013). Said another way, for an early warning indicator to be accurate, the net that is cast should catch the intended targets (it must be sensitive) while not misidentifying the wrong targets (1-specificity, the opposite of specificity). Interestingly, these two dimensions of accuracy can then be plotted in an x,y coordinate plane, with the sensitivity and 1-specificity for each early warning indicator plotted in these two dimensions. In this way, a Receiver Operating Characteristic (ROC) plot incorporates sensitivity and specificity into one figure, making it possible for researchers to visually compare the accuracy of different indicators (Allensworth et al., 2014; Bowers et al., 2013; Knowles, 2015; Swets, 1988; Swets et al., 2000). The false-positive proportion (1-specificity) is on the x-axis and the true-positive proportion (sensitivity) is on the y-axis. For any predictor, a change in specificity leads to a corresponding change in sensitivity, and this relation is plotted as a curve in the ROC plot. The area under this curve is the ROC AUC, which ranges from 0 to 1.0, with 0.5 a 50-50 guess and 1.0 a predictor that is 100% accurate.

As a diagnostic measure, the AUC calculation takes into account both the true-positive proportion and the false-positive proportion, and AUCs can be compared to identify which predictors are more accurate than others, and which are closest to a 50-50 guess, or worse. An accurate predictor should have a large AUC (closer to the point 0,1 on the ROC plot, with an AUC above 0.5 and closer to 1.0), which means that the predictor performs well in both dimensions of sensitivity and specificity. While the broader ROC AUC literature encourages researchers and practitioners to compare the AUCs of individual diagnostic predictors within the same ROC analysis to identify the most accurate predictors (Bowers et al., 2013; Hanley & McNeil, 1982; Swets, 1988; Swets et al., 2000; Zweig & Campbell, 1993), certain fields have established rules of thumb for levels of accuracy as determined by AUC. For example, in screener validity analysis in the Response to Intervention (RTI) field (D'Agostino, Rodgers, & Mauck, 2018), the National Center on Response to Intervention (2010) (NCRTI), as its "Technical Standard 1", has indicated that for the classification accuracy of screening tests for RTI, an AUC above 0.85 is considered "convincing evidence" of classification accuracy, between 0.75 and 0.85 is "partially convincing evidence", and less than 0.75 is "unconvincing evidence".

Framework and Research Questions
Currently, there are only a small number of studies in education using ROC or AUC as a diagnostic measure for evaluating predictors in the EWI/EWS literature. Even for those studies that do use ROC (Allensworth et al., 2014; Torres, Bancroft, & Stroub, 2015) or AUC (Becker et al., 2014; Cummings & Smolkowski, 2015; Laracy, Hojnoski, & Dever, 2016; Knowles, 2015; Vivo & Franco, 2008; Wilson et al., 2016) to evaluate predictors, researchers used the technique among several other measures and did not necessarily consider ROC as more accurate than other measures in terms of measuring a predictor's accuracy. One exception is the research by Horn and Lee (2017), who advocated the use of sensitivity, specificity, and precision together to judge the accuracy of their performance indicators. Bowers, Sprott, and Taff (2013) provide a comprehensive study aimed at demonstrating the accuracy of ROC as a diagnostic measure, in which the authors analyzed 110 high school dropout flags from the dropout prediction literature using a ROC analysis. They illustrated the possibility of visually comparing predictor accuracy in terms of both sensitivity and specificity. The Bowers et al. (2013) study demonstrated how researchers and practitioners could use a ROC plot to graphically display and compare the accuracy of dichotomous flags as points in the ROC space (such as high suspensions, failed English, failed mathematics, high absences).
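As a minimal sketch of this kind of ROC-space display, the base R code below plots a handful of dichotomous flags as points against the 45 degree chance line; the flag names follow the examples above, but the sensitivity and 1-specificity values are invented for illustration, not taken from Bowers et al. (2013):

```r
# Hypothetical (1 - specificity, sensitivity) points for four dichotomous flags
flags <- data.frame(
  flag        = c("High suspensions", "Failed English",
                  "Failed mathematics", "High absences"),
  fpp         = c(0.15, 0.30, 0.28, 0.40),   # 1 - specificity
  sensitivity = c(0.35, 0.55, 0.50, 0.65)
)

plot(flags$fpp, flags$sensitivity, xlim = c(0, 1), ylim = c(0, 1), pch = 19,
     xlab = "1 - Specificity (False-positive proportion)",
     ylab = "Sensitivity (True-positive proportion)")
abline(0, 1, lty = 2)   # the 45 degree line: a 50-50 random guess
text(flags$fpp, flags$sensitivity, labels = flags$flag, pos = 4, cex = 0.8)
```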



For example, the study concluded that the most accurate cross-sectional single time point dropout flag in the literature was the Chicago on-track indicator, which includes low course credits and failing at least one core course in ninth grade (Allensworth, 2013; Allensworth & Easton, 2007), while more accurate predictors each used longitudinal data and growth mixture modeling to identify significantly different trajectories of student performance (Bowers et al., 2013), such as declining trajectories of non-cumulative grade point average in the first three semesters of high school (Bowers, 2019; Bowers & Sprott, 2012b; Brookhart et al., 2016). However, a critical missing component for the use of these measures in EWS was that Bowers et al. (2013) did not consider the use of AUC in the ROC accuracy analysis. The signal detection theory literature (Swets, 1988) demonstrates that indicator accuracy can be better compared through plotting continuous variables (rather than just dichotomous indicators) within the ROC space and then assessing the area under the curve (AUC) for each continuous variable, in which a larger AUC indicates a higher level of accuracy. Additionally, AUC provides a means to statistically compare the AUCs of two different indicators to assess the extent to which they are statistically significantly different (DeLong, DeLong, & Clarke-Pearson, 1988; Hanley & McNeil, 1983).

With increasing attention in the education analytics literature to creating and analyzing early warning indicators and systems (Agasisti & Bowers, 2017; Bowers, 2017; Piety, Hickey, & Bishop, 2014), there is a need for a discussion of how to assess and compare the accuracy of indicators in the EWI/EWS domain using ROC AUC, with examples and code to help inform current and future research, policy, and practice. In this study we select a range of early warning predictors and outcomes of interest using publicly available nationally generalizable data from the U.S. Department of Education National Center for Education Statistics, and drawing on the EWI/EWS literature noted above, we provide the ROC AUC analysis for each using open source software, comparing multiple early warning indicators for dropping out of high school, post-secondary college enrollment, and high school students eventually choosing a career in STEM (science, technology, engineering, or mathematics) by age 26.

METHODS:
Data
This study is a secondary analysis of the publicly accessible Education Longitudinal Study of 2002 (ELS:2002), a US high school student sample collected by the National Center for Education Statistics (NCES) from 2002 to 2012 (Ingels et al., 2014). ELS:2002 includes a nationally generalizable sample of about 16,000 students who were in grade 10 in 2002 across 750 high schools. For the present analysis, the sample sizes of our selected variables range from 10,511 (college enrollment) to 16,197 (dropout, one or more flags). We selected ELS:2002 as it is currently the most recent comprehensive ten-year longitudinal U.S. high school student survey of its type. Appendix A lists the descriptive statistics and labels of the variables analyzed in this study.

Variables Included in the Analysis: Predictors of Dropout, College Enrollment, Postsecondary STEM Degree, and Hard/Soft STEM Occupation
We drew on the current EWI/EWS literature to inform our variable selection for this study, including the outcomes of interest of 1) dropping out of high school, 2) college enrollment, 3) postsecondary STEM degree, and 4) STEM career occupation at age 26, as well as the indicators used to predict these outcomes. We selected to focus on both secondary and post-secondary schooling indicators and outcomes for our examples of ROC AUC, as the EWI/EWS literature is developing rapidly for K-12 and colleges in the education analytics, learning analytics, education data science, and academic analytics domains, especially in relation to student STEM outcomes, as researchers and practitioners wish to identify accurate indicators of positive transitions from high school, through college, into careers (Agasisti & Bowers, 2017; Baker, 2013; Bowers, 2017; Knight & Shum, 2018; Krumm, Means, & Bienkowski, 2018; Piety, Hickey, & Bishop, 2014; Siemens, 2013). Nevertheless, as noted above, across this literature there are few examples of accuracy analysis; however, as the literature across these domains uses multiple predictors for early warning indicators for each of the four outcomes of interest in this study, we draw on this literature to inform our variable selection.

The first education outcome we examine is high school dropout. Following Bowers et al. (2013), our first set of indicators to predict high school dropout mirrors the dichotomous indicators from Balfanz, Herzog, and Mac Iver (2007) of students in sixth grade, including attendance, suspension, misbehavior, failed math, failed English, any one or more of these flags, any one flag, any two flags, any three flags, and all four flags. The Balfanz et al. (2007) study is an important study to start with as a focus, as it examines a large sample of sixth graders from Philadelphia, is widely cited, and these dropout flags are used in many state policy recommendations for at-risk prediction systems (Rumberger et al., 2017). However, currently there are no public national datasets that include the sixth grade as well as overall high school and career outcomes, thus we adapted the Balfanz et al. (2007) dropout flags to the ELS:2002 dataset, which includes a representative sample of students in grade 10 from across 750 high schools in 2002. To provide an example of ROC AUC accuracy analysis with variables similar to Balfanz, Herzog, and Mac Iver (2007) using the public ELS:2002 data, we included the variable "1st quantile standardized math score", where we dichotomized students' standardized math scores with the cutoff point being the 1st quantile score. To follow Balfanz et al. (2007), we also dichotomized both "absent" and "suspension" to derive composite flags with Boolean operators such as "and" and "or", as illustrated in the sketch below. In addition to the Balfanz et al. (2007) predictors of dropout, we examined the accuracy of multiple continuous predictors of dropout from the literature, such as attendance (Adelman, 2006; Allensworth, Gwynne, Moore, & de la Torre, 2014; Allensworth, Nagaoka, & Johnson, 2018; Balfanz & Boccanfuso, 2007; Bowers & Sprott, 2012a, 2012b; Geiser & Santelices, 2007; Kemple et al., 2013; Neild, Balfanz, & Herzog, 2007; Noble & Sawyer, 2004; Quadri & Kalyankar, 2010; Soland, 2013; Soland, 2017), suspension (Balfanz, Herzog, & Mac Iver, 2007; Bowers & Sprott, 2012a; Soland, 2013), extracurricular activities (Mahoney & Cairns, 1997; Renzulli & Park, 2000; Soland, 2013), and standardized test scores (Bowers, 2010; Kemple et al., 2013; Soland, 2013; Soland, 2017).
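As an illustration of deriving such composite flags, the sketch below combines four dichotomized 0/1 flags with the Boolean operators "or" and "and" in R; the data frame els and its column names are hypothetical stand-ins for the ELS:2002 variables:

```r
# Assumed data frame `els` with four dichotomized flags coded 0/1;
# the column names below are hypothetical stand-ins for the ELS:2002 variables
els$any_one_flag   <- as.integer(els$absent | els$suspension |
                                 els$misbehavior | els$low_math)    # Boolean "or"
els$all_four_flags <- as.integer(els$absent & els$suspension &
                                 els$misbehavior & els$low_math)    # Boolean "and"
els$any_two_flags  <- as.integer((els$absent + els$suspension +
                                  els$misbehavior + els$low_math) >= 2)
```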


As for accurate predictors of college enrollment, researchers find that high school grades (Allensworth et al., 2018; Bowen, Chingos, & McPherson, 2009; Geiser & Santelices, 2007; Kemple et al., 2013; Soland, 2013; Soland, 2017), attending at least one Advanced Placement (AP) program (Becker et al., 2014; Soland, 2013), and extracurricular activities (Soland, 2013) are significant predictors of four-year college enrollment. As shown in Appendix A, the sample size for the outcome variable is 10,511. For the outcome variables, "Item legitimate skip/NA" and "Survey component legitimate skip/NA" were considered as in the comparison group, coded as 0, because students having these two options did not receive a college education and should be included in the comparison group. pROC, the analysis package used below, carries out listwise deletion on the missing values of both the outcome variable and the predictor before calculating AUC.

For predictors of postsecondary STEM (Science Technology Engineering Mathematics) degree, researchers have shown that STEM course selection is a strong predictor of whether a student graduates from a postsecondary institution with a STEM degree (Dutta-Moscato, Gopalakrishnan, Lotze, & Becich, 2014). Specifically, we examine the accuracy of the number of STEM courses (Chen, 2013). We also evaluate the accuracy of standardized math score in high school (Chen, 2013; Crisp, Nora, & Taggart, 2009) and STEM course GPA (Chen, 2013; Crisp et al., 2009) in predicting if a student graduates with a postsecondary STEM degree. As shown in Appendix A, the sample size for the outcome variable is 6,936. For the outcome variables, "Item legitimate skip/NA" and "Survey component legitimate skip/NA" were coded as missing data because students having these two options did not receive postsecondary education and should not be included in the comparison group.

We mix college and high school variables in predicting occupations at age 26 because previous research (Perna et al., 2009) has shown that variables at both the high school and college levels are significant predictors of STEM careers. We evaluate the accuracy of standardized math score at grade 10 (Robnett & Leaper, 2013), college STEM course number (Perna et al., 2009), and STEM course GPA in college (Clark Blickenstaff, 2005) in predicting the occupation of a student as either a "hard STEM" or a "soft STEM" career by age 26 at the third follow-up for ELS:2002 in 2012. We draw on the literature in this domain for the definitions of STEM, hard STEM, and soft STEM. Here we use the "S.M.A.R.T." definition of STEM courses as courses related to technical fields, foreign languages critical to national security, or qualifying liberal arts (U.S. Department of Education, 2014). We selected this definition of STEM courses for our study as NCES adopted this definition of STEM for ELS:2002. Following the recommendations of the previous literature (Willis, 2013; Whittaker, 2014), typical hard STEM courses include engineering and computer science, whereas typical soft STEM courses include forensic and archaeological science and the social sciences. As shown in Appendix A, the sample size for the variable of either hard STEM occupations or soft STEM occupations at age 26 is 12,796. For the outcome variables, "Item legitimate skip/NA" and "Survey component legitimate skip/NA" were coded as 0 because students having these two options did not enter STEM careers and should be included in the comparison group.

Receiver Operating Characteristic (ROC) Analysis
The signal detection theory literature (Hanley & McNeil, 1982; Swets, 1988; Swets et al., 2000; Vivo & Franco, 2008; Zweig & Campbell, 1993) recommends that studies calculate precision, sensitivity, and specificity (see Figure 1). In the contingency table of Figure 1, an event indicates whether a student experiences the at-risk outcome (columns), whereas a predictor predicts whether the student will have the outcome (rows). According to the signal detection theory literature, precision is defined as the true-positives divided by the number of students predicted to have the flag, and the true-positive proportion (sensitivity) as the true-positives divided by the actual number of students at risk. The true-negative proportion (specificity) is defined as the true-negatives divided by the number of students without the risk, and the false-positive proportion (1-specificity) as the false-positives divided by the number of students without the risk. Indeed, as noted above in this literature (Swets, 1988), two crucial measures of the accuracy of predictors are the true-positive proportion ("hits") and the false-positive proportion ("false alarms"). Any detection system will always have a trade-off between "hits" and "false alarms": when one maximizes the number of hits by "casting a wider net", one tends to also increase the number of false alarms (Bowers et al., 2013, p. 83).

Area Under the Curve (AUC)
The focus of the present study is AUC, which has rarely been addressed in the education diagnostics literature. AUC is a step further over ROC analysis, and the purpose of the present study is to encourage the consistent use of AUC for predictors of outcomes in education, with illustrative examples for different variables of risk. AUC is calculated by summing the area under the ROC curve, and the bigger the area, the more accurate the predictor. Formally, the formula for calculating AUC is

$$\mathrm{AUC} = \int_{0}^{1} f(x)\, dx \qquad (1)$$

where f(x) is the function of the ROC curve, with x the false-positive proportion (1-specificity). However, since f(x) tends not to have an integrable shape like a parabola, methodologists suggest using approximation methods to calculate AUC. For example, Robin et al. (2011) approximated AUC by connecting "empirical ROC points on linear probability scales" via straight lines and calculating "the subtended area by the trapezoidal rule" (Swets & Pickett, 1982, p. 31).

In other words, Robin et al. divide the x axis into k equally spaced intervals, cut the area under the curve into k trapezoids, and add the areas of the trapezoids to approximate the AUC. For feasibility of computation, this approximation approach to calculating AUC, or one of its variants, is what is commonly used as the algorithm in ROC AUC software packages.
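A small sketch of this trapezoidal approximation in R, applied to a handful of hypothetical empirical ROC points (each point a 1-specificity, sensitivity pair obtained by sweeping the decision threshold):

```r
# Hypothetical empirical ROC points, ordered along the x axis,
# including the (0,0) and (1,1) endpoints
x <- c(0, 0.10, 0.25, 0.50, 0.75, 1)   # 1 - specificity at each threshold
y <- c(0, 0.40, 0.65, 0.85, 0.95, 1)   # sensitivity at each threshold

# Trapezoidal rule: sum the areas of the trapezoids between adjacent points
auc_trapezoid <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc_trapezoid
```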

Significance Testing for AUC Difference
To test whether one AUC is significantly different from another, we need a statistic for the AUC difference. One such statistic for significance testing between two continuous predictors is discussed in Hanley and McNeil (1983), with the formula for calculating the z score as

$$z = \frac{A_1 - A_2}{\sqrt{SE_1^2 + SE_2^2 - 2 r\, SE_1 SE_2}} \qquad (2)$$

where $A_1$ and $A_2$ are the AUCs of the first and second predictors; $SE_1$ and $SE_2$ are the standard errors of $A_1$ and $A_2$; and $r$ is the correlation between $A_1$ and $A_2$. By comparing the calculated z score with the corresponding critical z value of the standard normal distribution, we know whether the AUC difference of the two predictors is significant (see the worked sketch below).

However, the z score is suitable only for continuous variables (DeLong et al., 1988), and for non-continuous variables we need to use other statistics, such as

$$\chi^2 = (\hat{\theta} - \theta)\, L^{T} \big[ L \hat{S} L^{T} \big]^{-1} L\, (\hat{\theta} - \theta)^{T} \qquad (3)$$

where $\theta$ and $\hat{\theta}$ are the true and estimated AUC vectors; $L$ is a row vector of coefficients; $p$ and $q$ are the numbers of positive and negative cases in reality, which enter the estimate of the covariance; and $\hat{S}$ is the estimated covariance matrix for $\hat{\theta}$. Equation (3) has a chi-square distribution with degrees of freedom equal to the rank of $L\hat{S}L^{T}$. Similar to equation (2), equation (3) has an AUC difference in the numerator and a pooled covariance in the denominator. Again, we can compare equation (3) to the corresponding value of the chi-square distribution to test the significance of the AUC difference. In the present study, as we discuss below, we use the open source R software with the pROC package (Robin et al., 2011), as pROC adopts this statistic; readers may refer to DeLong et al. (1988) for a more in-depth discussion of the derivation of equation (3).

Comparison with the Mann–Whitney–Wilcoxon Test
If approximated by the trapezoid method, AUC is equivalent to the Mann–Whitney U-statistic, or the Mann–Whitney–Wilcoxon test, which compares the distributions of different samples (Bamber, 1975; DeLong et al., 1988). Since the trapezoid rule underestimates AUC when the number of values of a variable is small (DeLong et al., 1988; Hanley & McNeil, 1983; Swets & Pickett, 1982), DeLong et al. (1988) used generalized U-statistic theory to develop a nonparametric approach to calculating and comparing AUCs with equation (3). Simulation studies (Fanjul-Hevia & González-Manteiga, 2018) show that DeLong et al.'s method produces better power than other resampling methods when one ROC curve dominates the other. DeLong et al.'s (1988) method is incorporated into pROC (Robin et al., 2011), and the present study selected DeLong et al.'s (1988) approach in the pROC package to compare predictors' AUCs.
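As a worked sketch of equation (2), the function below computes the Hanley and McNeil (1983) z score from two AUCs, their standard errors, and the correlation between them; the input values are hypothetical:

```r
# z score for the difference between two correlated AUCs (equation 2)
auc_z <- function(A1, A2, SE1, SE2, r) {
  (A1 - A2) / sqrt(SE1^2 + SE2^2 - 2 * r * SE1 * SE2)
}

# Hypothetical inputs: two predictors measured on the same students
z <- auc_z(A1 = 0.722, A2 = 0.706, SE1 = 0.012, SE2 = 0.013, r = 0.60)
p <- 2 * pnorm(-abs(z))   # two-sided p-value from the standard normal
c(z = z, p = p)
```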

AUC vs. Kappa
In the education data mining and learning analytics literature, researchers tend to use Kappa rather than ROC AUC (Baker, 2014; Ocumpaugh, Baker, Gowda, Heffernan, & Heffernan, 2014). Kappa measures the degree of agreement between events and predictors. With no agreement, Kappa equals 0 and the relationship between events and predictors is random. With perfect agreement, Kappa equals 1 (Langenbucher, Labouvie, & Morgenstern, 1996).

For the present study, following the recommendations of the signal detection theory literature noted above, we used AUC as the diagnostic measure for evaluating predictors, as AUC has advantages over Kappa. First, the meaning of AUC is invariant across different datasets, so a higher value is always better than a lower value (Baker, 2015). Second, AUC provides a more straightforward statistical interpretation (Baker, 2015). Third, AUC allows the calculation of confidence intervals, letting researchers compute whether two AUCs are significantly different from each other (Baker, 2015), as we show below in the application examples. Fourth, AUC can compare variables in which some categories dominate over other categories, while Kappa cannot handle such variables (Baker, 2015). Fifth, AUC is robust to skewed data, whereas Kappa calculations can be distorted for skewed data (Jeni, Cohn, & De La Torre, 2013).

R Packages
To provide accessible example applications of the use of ROC AUC for education EWI/EWS, we rely on the open source R software (R Development Core Team, 2018) and the pROC R package (Robin et al., 2011). Among the several open source R packages for the calculation of AUC, we selected the pROC package by Robin et al. (2011) as this R package plots ROC curves, calculates AUC, and tests the significance of AUC differences. We provide the R markdown code for the present study in Online Appendix S1 (http://doi.org/10.7916/D8K94RDD). pROC allows for significance testing between two predictors, and we used OptimalCutpoints (López-Ratón et al., 2014) in R to calculate cutoff points along the ROC curves.
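A minimal sketch of this workflow, assuming a data frame els with a 0/1 outcome dropout and continuous predictors math_score and reading_score (hypothetical column names); the Youden index used for the cutoff is one of several criteria OptimalCutpoints offers:

```r
library(pROC)              # Robin et al. (2011)
library(OptimalCutpoints)  # López-Ratón et al. (2014)

# ROC curve, AUC, and confidence interval for one continuous predictor
roc_math <- roc(response = els$dropout, predictor = els$math_score)
auc(roc_math)     # area under the curve
ci.auc(roc_math)  # confidence interval for the AUC
plot(roc_math)    # ROC plot

# DeLong et al. (1988) test for the difference between two correlated AUCs
roc_read <- roc(response = els$dropout, predictor = els$reading_score)
roc.test(roc_math, roc_read, method = "delong")

# Optimal cutoff point along the ROC curve (Youden index criterion)
oc <- optimal.cutpoints(X = "math_score", status = "dropout",
                        tag.healthy = 0, methods = "Youden", data = els)
summary(oc)
```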


Panel (A). Predictor: AUC
Any one flag: 0.729
One or more flags: 0.700
Any two flags: 0.685
1st quantile Math t score (2002): 0.647
Absent: 0.643
Misbehavior: 0.593
Suspension: 0.584
Any three flags: 0.551
All four flags: 0.509

Panel (B). Predictor: AUC
Math t score (2002): 0.722
Reading t score (2002): 0.706
Extracurricular activities (2002): 0.672
Absence: 0.647
Suspension: 0.584

Figure 2: ROC Curves for Predicting Dropout. Panel (A) replicates Balfanz et al.'s (2007) research on predictors of dropout with ELS:2002 data; the curves in black are for composite flags and those in gray are for raw flags. Panel (B) uses continuous variables to predict dropout with ELS:2002 data.

RESULTS:
The purpose of this study is to provide an overview and introduction of the ROC AUC literature to help inform EWI and EWS research and practice in determining the accuracy of at-risk predictors, and then to provide example applications in education using open source software and publicly available data. In the results below, we provide four main examples of ROC AUC using the NCES ELS:2002 dataset and the variables noted in the methods, from high school and college, to compare the accuracy of at-risk predictors for 1) high school dropout, 2) post-secondary college enrollment, 3) STEM 4-year degree attainment, and 4) hard STEM and soft STEM career outcomes at age 26.

High School Dropout
Figure 2 illustrates the ROC curves for predictors of high school dropout. Panel (A) replicates Balfanz et al.'s (2007) research on predictors of dropout with ELS:2002 data. Some curves, such as misbehavior and suspension, are very close and difficult to distinguish. This is because the predictors are dichotomous, as we wished to mirror the Balfanz et al. (2007) predictors as closely as we could using the public ELS:2002 dataset, to provide a link and entry point from the previous dropout indicator literature to the ROC AUC framework. We can evaluate the accuracy of each predictor through calculating the AUC for each (see methods) and comparing the results.


Table 1: Significance of AUC Difference for Continuous Dropout Predictors

Predictor 1                        | Predictor 2                        | Z       | p-value
Reading t score (2002)             | Math t score (2002)                | 3.220   | 0.001
Reading t score (2002)             | Extracurricular activities (2002)  | -3.641  | <0.001
Reading t score (2002)             | Absent                             | -33.716 | <0.001
Reading t score (2002)             | Suspension                         | -32.105 | <0.001
Math t score (2002)                | Extracurricular activities (2002)  | -5.744  | <0.001
Math t score (2002)                | Absent                             | -34.194 | <0.001
Math t score (2002)                | Suspension                         | -34.840 | <0.001
Extracurricular activities (2002)  | Absent                             | -28.758 | <0.001
Extracurricular activities (2002)  | Suspension                         | -28.183 | <0.001
Absent                             | Suspension                         | 5.846   | <0.001

As detailed in the signal detection theory literature discussed above (Bowers et al., 2013; Swets, 1988), in a ROC plot such as Figure 2 Panel A, each predictor or indicator is plotted in the two dimensions of 1-specificity versus sensitivity. The 45 degree line across the plot indicates a 50-50 random guess, whereas predictors that approach the point 0,1 in the upper left-hand corner approach a more perfect prediction, as the point 0,1 indicates a predictor that had no false positives and captured only true positives. For our example here in Figure 2, this would indicate a predictor that perfectly predicted 100% of all students who dropped out and did not misidentify any students who eventually graduated as dropouts. As noted in the previous literature (Bowers et al., 2013), the vast majority of the predictors in the high school dropout literature fall close to the 45 degree random guess line. The AUC is the area under the curve for each indicator and predictor, which ranges from a random guess of 0.5 (half the area of the plot) to 1.0 for a perfect prediction. Predictors with a higher AUC are more accurate. As discussed in the methods, we also provide the statistical pair-wise comparisons for each AUC to determine if each indicator is significantly different from the others in the analysis (see Appendix B).

As shown in Figure 2 Panel A, the composite flags in which dropout flags are combined with Boolean operators were more accurate than individual flags alone by the ROC AUC, as "any one flag" and "one or more flags" had AUCs of 0.729 and 0.700 respectively. The other dropout flags are listed in decreasing order of accuracy, with "all four flags" being close to a random guess. These results mirror the previous findings for these flags at grade six (Balfanz et al., 2007; Bowers et al., 2013), and replicate and extend the results to the grade 10 context while calculating the AUC for each predictor for the first time. Appendix B provides the p-values for which predictors are significantly different from each other in Figure 2 Panel A. Most pairwise comparisons among the nine predictors are significant, with p values less than 0.001. Exceptions are those for "absent" and "1st quantile Math t score", "1st quantile Math t score" and "any two flags", "misbehavior" and "suspension", and "one or more flags" and "any one flag", as each of these pairs is not significantly different in their accuracy to predict dropping out of high school (see Appendix B).

In comparison to the dichotomous predictors in Figure 2A, in Figure 2 Panel B continuous variables are used to predict dropout with ELS:2002 data. Although the predictors "suspension" and "absent" are continuous variables, with five and four categories respectively, their ROC curves look like those of dichotomous variables, as students in the dataset are suspended or absent on average about once (see Appendix A). As in Figure 2A, the AUC of each curve is provided below the plot of the ROC curves in Figure 2B. The most accurate predictor of dropout in Figure 2B is the mathematics standardized assessment score, with an AUC of 0.722. Table 1 provides the pair-wise AUC comparisons, showing that each predictor is significantly different from each of the others in Figure 2B.
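A sketch of how a set of pair-wise DeLong comparisons such as Table 1 can be assembled with pROC, again assuming a data frame els with a 0/1 dropout outcome and hypothetical predictor column names:

```r
library(pROC)

# Hypothetical predictor column names in the assumed data frame `els`
predictors <- c("reading_score", "math_score", "extracurricular",
                "absent", "suspension")
rocs <- lapply(predictors, function(p) roc(els$dropout, els[[p]]))
names(rocs) <- predictors

# DeLong test for every pair of predictors, one row per comparison
pairs <- combn(predictors, 2)
rows <- lapply(seq_len(ncol(pairs)), function(i) {
  pr   <- pairs[, i]
  test <- roc.test(rocs[[pr[1]]], rocs[[pr[2]]], method = "delong")
  data.frame(predictor1 = pr[1], predictor2 = pr[2],
             Z = unname(test$statistic), p_value = test$p.value)
})
do.call(rbind, rows)
```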
College Enrollment, Postsecondary STEM Degree, and Hard STEM/Soft STEM Occupations
Next, drawing on the prior research noted in the literature review and methods, we provide a series of example applications of ROC AUC across K-12, post-secondary, and occupational prediction outcomes. Our goal is to provide a series of illustrative examples as a ready means for K-12 and post-secondary EWI/EWS researchers and practitioners to find accessible examples of the method that may relate to their practice.

Figure 3 illustrates ROC curves predicting college enrollment (Figure 3 Panel A) and postsecondary STEM degree attainment (Figure 3 Panel B). Figure 3A compares the accuracy of predicting college enrollment with three variables: overall high school Grade Point Average (GPA), number of extracurricular activities, and whether a student was ever enrolled in an Advanced Placement (AP) course. As demonstrated in Figure 3A, overall high school GPA is the most accurate of these three early warning indicators, with an AUC of 0.767, which as shown in Table 2 is significantly different than the other two predictors (p<0.001).

In Figure 3B we plot the ROC AUC accuracy of predictors of whether a student graduates with a postsecondary STEM degree, using three continuous variables: number of STEM courses, math t score (2002), and STEM course Grade Point Average (GPA). As demonstrated in Figure 3B, the number of college STEM courses is the most accurate of these three early warning indicators, with an AUC of 0.957, which as shown in Table 2 is significantly different than the other two predictors (p<0.001).


Panel (A). Predictor: AUC
GPA: 0.767
Extracurricular activities (2004): 0.642
AP: 0.565

Panel (B). Predictor: AUC
Number of STEM Courses: 0.957
Math t score (2002): 0.663
STEM course GPA: 0.566

Figure 3: ROC Curves for Predicting College Enrollment and Postsecondary STEM Degree

Table 2: Significance of AUC Difference for College Enrollment Predictors and Postsecondary STEM Degree Predictors

Predictor 1                | Predictor 2                        | Z       | p-value
College Enrollment Predictors
GPA                        | AP                                 | 32.679  | <0.001
GPA                        | Extracurricular activities (2004)  | 17.070  | <0.001
AP                         | Extracurricular activities (2004)  | -10.939 | <0.001
Postsecondary STEM Degree Predictors
Number of STEM Courses     | STEM course GPA                    | 39.649  | <0.001
Number of STEM Courses     | Math t score (2002)                | 31.625  | <0.001
STEM course GPA            | Math t score (2002)                | -8.712  | <0.001



Panel (A). Predictor: AUC
Number of STEM Courses: 0.799
Math t score (2002): 0.712
STEM course GPA: 0.593

Panel (B). Predictor: AUC
Number of STEM Courses: 0.693
STEM course GPA: 0.673
Math t score (2002): 0.611

Figure 4: ROC Curves for Predicting Hard STEM Career (Panel A) and Soft STEM Career (Panel B)

Table 3: Significance of AUC Difference for Hard and Soft STEM Career Predictors

Predictor 1                | Predictor 2           | Z      | p-value
Hard STEM Career Predictors
Number of STEM Courses     | STEM course GPA       | 15.055 | <0.001
Number of STEM Courses     | Math t score (2002)   | 8.943  | <0.001
STEM course GPA            | Math t score (2002)   | -7.401 | <0.001
Soft STEM Career Predictors
Number of STEM Courses     | STEM course GPA       | 0.846  | 0.397
Number of STEM Courses     | Math t score (2002)   | 10.523 | <0.001
STEM course GPA            | Math t score (2002)   | 11.111 | <0.001


Additionally, the R package OptimalCutpoints (López-Ratón et al., 2014) allows us to calculate the optimal cutoff point along the ROC curve of college STEM course number, at 23 (at the apex of the curve), although some students took up to 56 STEM courses in college. This shows that the optimal number of courses in predicting a postsecondary STEM degree is 23 STEM courses. Also, this result shows that in comparison to STEM course grades, the number of STEM courses is more accurate in predicting overall STEM degree completion.

And finally, in Figure 4 we plot the ROC curves predicting hard and soft STEM career occupation at age 26. Figure 4A compares the accuracy of predictors of hard STEM careers with three continuous variables: the Number of STEM Courses, math t score (2002), and college STEM course Grade Point Average (GPA). As demonstrated in Figure 4A, the Number of STEM Courses is the most accurate of these three early warning indicators, with an AUC of 0.799, which as shown in Table 3 is significantly different than that of math t score (2002) (p<0.001). In addition, the AUC of STEM course GPA (0.593) is significantly different than that of math t score (2002) (p<0.001).

In Figure 4B we plot the accuracy of predictors of soft STEM careers with three continuous variables: college STEM course number, math t score (2002), and college STEM course Grade Point Average (GPA). In contrast to hard STEM, we show that there is no statistical difference in the accuracy of using either the Number of STEM Courses or STEM course GPA in predicting if a student worked in a soft STEM occupation at age 26 (Z=0.846, p=0.397). However, these two predictors are each significantly different from the mathematics t score, each with a higher AUC.

DISCUSSION:
The purpose of this study is to provide a means for researchers, policymakers, and practitioners to evaluate and compare the accuracy of educationally relevant predictors of educational outcomes, to help inform the EWI/EWS literature through the application of signal detection theory. We demonstrate the applicability of the technique using open source code in R with public data from ELS:2002. Our intention is to help build capacity and inform future EWI/EWS research and applications, as having a means to know and correctly compare the accuracy of early warning flags, predictors, and indicators is critical as researchers and educators rely evermore on early warning systems to help identify which students may need additional supports and resources to help them persist and succeed in school. When early warning systems rely on "at risk" indicators that have low accuracy, students are misidentified, resources are expended inefficiently, and many students who could use the resources are missed, at times as many as 40-50% of the students at risk (Bowers et al., 2013). Additionally, it is difficult to improve the indicators and the research in the EWI/EWS domain without a rigorous means to assess and compare the accuracy of the predictors, as how are we to know if the predictors and indicators are getting any better without strong accuracy metrics? Our findings in this study show that the application of signal detection theory through ROC AUC and open source R code works well to help identify which indicators are most accurate for an outcome. Our recommendation is that any research or tool that uses a predictor or indicator to identify students at risk provide the ROC plot and AUC data with the significance tests, to demonstrate that the indicator under discussion is more accurate than the standard practice as well as similar variables. A strong recent example in this domain is Sullivan, Marr, and Hu (2017), in which the authors used ROC AUC as the criterion to compare logistic regression models and decision tree models in predicting standardized test performance in the Michigan Educational Assessment Program, determining that decision tree models were more accurate with an AUC of 0.92.

Our findings on the accuracy of predictors for high school dropout provide a context to understand results from previous studies. First, our study concurs with Bowers, Sprott, and Taff (2013), as both demonstrate ROC as an accurate diagnostic measure for predictors of dropout. However, whereas Bowers, Sprott, and Taff (2013) use predictor point-estimate positions on the x and y coordinates to compare predictor accuracy, we use AUC to compare predictors, allowing for easy, vivid comparisons. Second, our study provides a new perspective to evaluate predictors from previous studies (e.g., Balfanz et al., 2007). For instance, we show that "absent" has relatively low sensitivity, "any one flag" has low specificity, and "all four flags" and "any three flags" have both low sensitivity and low specificity. The finding that "any one flag" has the highest accuracy of the dichotomous predictors replicates the previous research (Balfanz et al., 2007; Bowers et al., 2013) and demonstrates the utility of an ensemble approach to constructing accurate predictors of dropout using Boolean operators such as "or" versus "and".

Third, our findings indicate that some continuous predictors considered an accurate means to predict dropout, such as absence and suspension (Adelman, 2006; Allensworth, Nagaoka, & Johnson, 2018; Balfanz et al., 2007; Bowers & Sprott, 2012a, 2012b; Kemple et al., 2013; Quadri & Kalyankar, 2010; Soland, 2017), actually have low accuracy in comparison to the other predictors. For these continuous predictors of high school dropout, it is interesting that grade 10 mathematics and reading standardized test scores were the two most accurate predictors, with mathematics having the highest accuracy. Across the research on dropout flags (Bowers et al., 2013), standardized test scores have not received a large amount of attention (Bowers, 2010; Kemple et al., 2013; Soland, 2013; Soland, 2017), as authors have previously focused on a wide constellation of dropout predictors, having previously lacked a useful means to compare the accuracy of the predictors (see literature review). However, here, the NCES ELS:2002 standardized mathematics and reading scores are test scores equated to the national NAEP (National Assessment of Educational Progress) standardized tests (Ingels et al., 2014).


NAEP, known colloquially as "the nation's report card", is well known to be a difficult assessment in reading and mathematics, and state standardized test scores have a poor history of alignment with NAEP (Bandeira de Mello, Bohrnstedt, Blankenship, & Sherman, 2015). Thus, the finding that the ELS:2002 grade 10 mathematics and reading test scores perform well in comparison to the other continuous dropout flags tested here does not necessarily mean that most or any other state standardized test score will perform in the same way. We encourage future research in this area to further include standardized subject assessments in dropout flag and predictor research.

Our findings on the accuracy of college enrollment predictors in terms of ROC AUC provide a means to identify differences between the accuracy of these predictors. To date, as noted in the literature review above, a central problem in this research domain is that researchers tend to depend on regression outcomes and the statistical significance of individual coefficients, rather than on the magnitude of effect, to make claims that a variable "significantly predicts" an outcome. Indeed, using this style of regression coefficient framework, researchers are hampered in their ability to assess the differences in predictor accuracy for variables such as GPA, AP, and extracurricular activities, as all three predictors are significant in predicting if high school students will enter college. Nevertheless, from the perspective of signal detection theory and sensitivity and specificity, we find that these three predictors do have significant differences in their accuracy in predicting college enrollment. For instance, high school GPA was significantly more accurate in predicting college enrollment than students having taken Advanced Placement (AP) courses, with extracurricular activities lying between these two in accuracy. The point that AP courses are the lowest of these three makes sense: given that AP is billed as preparation for college, taking AP courses may not particularly differentiate between students enrolling in post-secondary institutions (Ackerman, Kanfer, & Calderwood, 2013; Sublett & Gottfried, 2017). That high school GPA performs well in the accuracy of predicting college enrollment corresponds with the research on grades and GPA (Bowers, 2009; Bowers, 2011; Bowers, 2019; Brookhart et al., 2016), which indicates that grades represent a valid assessment of engaged participation in the education system, and thus present a strong signaling effect for advancing through the system or not (Pattison, Grodsky, & Muller, 2013).

Likewise, our findings of using ROC AUC to examine the accuracy of predictors for postsecondary STEM degrees, hard STEM careers, and soft STEM careers allow for comparing predictors from the perspective of sensitivity and specificity. First, we identify Number of STEM Courses as a highly accurate predictor for postsecondary STEM degrees, with an AUC above 0.9. While this result makes sense, as the total number of STEM courses is strongly related to the requirements to graduate with a STEM degree in college, the point we show here is that it is the total number of STEM courses that is highly accurate, in comparison to STEM GPA, and that the optimal number of STEM courses students need to graduate from a postsecondary institution with a STEM degree is 23, aligning with previous research (Chen, 2013). Additionally, this finding aligns well with the previous research in postsecondary STEM degree attainment that has highlighted the number of STEM courses as a central component of successful STEM degree attainment (Crisp, Nora, & Taggart, 2009; Engberg & Wolniak, 2013).

Second, for predicting a hard STEM career or a soft STEM career at age 26, we demonstrate interesting differences in the accuracy of the predictors, finding that the total number of STEM courses is the most accurate predictor for hard STEM, but that there is no statistical difference in the accuracy of the number of STEM courses versus STEM GPA for predicting soft STEM careers. Additionally, while the grade 10 mathematics standardized test score sits between the number of STEM courses and STEM course GPA for predicting hard STEM careers, the mathematics standardized test score has the lowest accuracy of these three predictors for a soft STEM career. This work extends the previous research on STEM college experiences that lead to careers, which has shown that the number of STEM courses, STEM GPA, and mathematics scores are all related to STEM career outcomes (Chen, 2013; Sithole et al., 2017). Additionally, it is interesting to note the difference in the STEM course GPA AUC between hard STEM and soft STEM, as studies have shown that students in STEM overall in college receive lower GPAs than non-STEM students (Chen, 2013; Sithole et al., 2017). Overall, our results may suggest that while the number of STEM courses in college is an important variable in predicting STEM career outcomes, there may be significant differences for STEM course GPA and mathematics scores between the hard STEM and soft STEM disciplines. Thus, overall, while we provide these findings as illustrative examples for the application of ROC AUC to EWI/EWS, the reasons for the differences in the accuracy of the predictors are of further interest, but outside the scope of the present study. We encourage future research in these areas.

Limitations:
While we believe our results are rigorous and applicable, this study is limited in three main ways. First, we note that the auc function of pROC gives the AUC of a predictor, but sometimes this AUC is slightly different from the one given by the function roc.test when the AUC of this predictor is compared with the AUC of another predictor. The deviation may be as large as 0.05, unsubstantial in practice, and we have not found researchers to have mentioned this point elsewhere. This is most likely due to DeLong et al.'s trapezoid rule dividing the area differently when calculating the AUC of a single predictor than when calculating the AUCs of two correlated predictors. Second, since pROC does listwise deletion on both the predictor and the outcome variable, the sample sizes of the ELS:2002 dataset for the outcomes listed in Appendix A were shrunk for the calculations of the AUC, making the results of the present study generalizable only to the data included in the analyses. In the future, we encourage researchers to delve more deeply into the missing values issue, such as through multiple imputation, before using pROC to calculate AUC, so that the results will be more generalizable.

Practical Implications and Conclusion:
Our findings have three main practical implications. First, with the need to report the accuracy of outcomes, education researchers can calculate AUC with R packages such as pROC. Practitioners and analysts can build models on the data in their institutions, build ROC plots, and calculate AUC for each predictor, with the R code provided here for reference (see Online Appendix S1, http://doi.org/10.7916/D8K94RDD). Second, ROC AUC allows policymakers to check and confirm the accuracy of predictors currently in use, to make decisions on which at-risk predictors are the most accurate for predicting important educational outcomes and which predictors have low accuracy. Third, ROC AUC helps policymakers to identify accurate EWIs for the outcomes that are most important to their communities, such that policies may cover the largest possible proportion of students at risk through using the most accurate predictors, maximizing the "hits" while minimizing the "false-alarms." For education administrators and policymakers, this type of accuracy information is crucial in their work to direct the limited resources of the education system to help support specific student needs.
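The authors' full code is provided in Online Appendix S1; as a generic, hypothetical sketch of this workflow (the data frame dat, the outcome column, and the predictor names below are placeholders), one ROC plot and AUC per predictor might be produced as follows:

# Sketch of the workflow described above: one ROC curve, plot, and
# AUC (with confidence interval) per candidate predictor.
# `dat`, `outcome`, and the predictor names are placeholders.
library(pROC)

predictors <- c("gpa", "attendance", "math_score")
results <- lapply(predictors, function(p) {
  r <- roc(dat$outcome, dat[[p]])
  plot(r, main = paste("ROC curve:", p))   # ROC plot for this predictor
  data.frame(predictor = p,
             auc     = as.numeric(auc(r)),
             ci_low  = as.numeric(ci.auc(r)[1]),
             ci_high = as.numeric(ci.auc(r)[3]))
})
do.call(rbind, results)   # summary table of AUCs across predictors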
Overall, we believe that the present research contributes to promoting AUC as an accurate measure to evaluate predictors for Early Warning Systems (EWS) and Early Warning Indicators (EWI), having shown the advantages of AUC in comparing the accuracy of predictors of different education outcomes. The findings of this study suggest that in the future, education researchers should 1) calculate the AUC of each predictor to give a baseline indication of the predictor's accuracy; and 2) carry out pairwise significance tests on the AUCs of different predictors to show whether the predictors are significantly different in terms of accuracy.

Suggested Citation:
Bowers, A.J., Zhou, X. (2019). Receiver Operating Characteristic (ROC) Area Under the Curve (AUC): A Diagnostic Measure for Evaluating the Accuracy of Predictors of Education Outcomes. Journal of Education for Students Placed At Risk, 24(1), 20-46. https://doi.org/10.1080/10824669.2018.1523734

REFERENCES:
Adelman, C. (2006). The toolbox revisited: Paths to degree completion from high school through college. Washington, DC: US Department of Education. Retrieved from http://files.eric.ed.gov/fulltext/ED490195.pdf
Agasisti, T., Bowers, A.J. (2017). Data Analytics and Decision-Making in Education: Towards the Educational Data Scientist as a Key Actor in Schools and Higher Education Institutions. In Johnes, G., Johnes, J., Agasisti, T., López-Torres, L. (Eds.), Handbook of Contemporary Education Economics (pp. 184-210). Cheltenham, UK: Edward Elgar Publishing. https://doi.org/10.7916/D8PR95T2
Allensworth, E. M. (2006). From High School to the Future: A First Look at Chicago Public School Graduates' College Enrollment, College Preparation, and Graduation from Four-Year Colleges. Consortium on Chicago School Research. http://files.eric.ed.gov/fulltext/ED499368.pdf
Allensworth, E. M. (2013). The use of ninth-grade early warning indicators to improve Chicago schools. Journal of Education for Students Placed at Risk (JESPAR), 18(1), 68-83. DOI: 10.1080/10824669.2013.745181
Allensworth, E. M., & Easton, J. Q. (2007). What Matters for Staying On-Track and Graduating in Chicago Public High Schools: A Close Look at Course Grades, Failures, and Attendance in the Freshman Year. Research Report. Consortium on Chicago School Research. http://files.eric.ed.gov/fulltext/ED498350.pdf
Allensworth, E. M., Gwynne, J. A., Moore, P., & de la Torre, M. (2014). Looking Forward to High School and College: Middle Grade Indicators of Readiness in Chicago Public Schools. University of Chicago Consortium on Chicago School Research. http://files.eric.ed.gov/fulltext/ED553149.pdf
Allensworth, E., & Luppescu, S. (2018). Why do students get good grades, or bad ones? The influence of the teacher, class, school, and student. Chicago, IL. Retrieved from https://consortium.uchicago.edu/sites/default/files/publications/Why%20Do%20Students%20Get-Apr2018-Consortium.pdf
Allensworth, E.M., Nagaoka, J., & Johnson, D.W. (2018). High school graduation and college readiness indicator systems: What we know, what we need to know. Chicago, IL: University of Chicago Consortium on School Research.
Baker, R. S. (2013). Learning, Schooling, and Data Analytics. In M. Murphy, S. Redding, & J. Twyman (Eds.), Handbook on innovations in learning. Philadelphia, PA: Center on Innovations in Learning, Temple University; Charlotte, NC: Information Age Publishing. Retrieved from http://www.centeril.org/
Baker, R. S. (2014). Educational data mining: An advance for intelligent systems in education. IEEE Intelligent Systems, 29(3), 78-82. DOI: 10.1109/MIS.2014.42
Baker, R.S. (2015). Big Data and Education (2nd ed.). New York, NY: Teachers College, Columbia University.
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387-415. DOI: 10.1016/0022-2496(75)90001-2


Balfanz, R., & Boccanfuso, C. (2007). Falling off the path to graduation: Middle grade indicators in [an unidentified northeastern city]. Baltimore, MD: Center for Social Organization of Schools.
Balfanz, R., Herzog, L., & Mac Iver, D. J. (2007). Preventing student disengagement and keeping students on the graduation path in urban middle-grades schools: Early identification and effective interventions. Educational Psychologist, 42(4), 223-235. DOI: 10.1080/00461520701621079
Bandeira de Mello, V., Bohrnstedt, G., Blankenship, C., & Sherman, D. (2015). Mapping State Proficiency Standards Onto NAEP Scales: Results From the 2013 NAEP Reading and Mathematics Assessments (NCES 2015-046). Washington, DC. Retrieved from https://files.eric.ed.gov/fulltext/ED557749.pdf
Becker, J., Schools, P. P., Levinger, B., Schools, P. G. S. C. P., Sims, A., & Whittington, A. (2014). Student Success and College Readiness: Translating Predictive Analytics Into Action. Retrieved from http://sdp.cepr.harvard.edu/files/cepr-sdp/files/sdp-fellowship-capstone-student-success-college-readiness.pdf
Bowen, W. G., Chingos, M. M., & McPherson, M. S. (2009). Crossing the finish line: Completing college at America's public universities. Princeton University Press.
Bowers, A.J. (2019). Report Card Grades and Educational Outcomes. In Guskey, T., Brookhart, S. (Eds.), What We Know About Grading: What Works, What Doesn't, and What's Next (pp. 32-56). Alexandria, VA: Association for Supervision and Curriculum Development.
Bowers, A.J. (2017). Quantitative Research Methods Training in Education Leadership and Administration Preparation Programs as Disciplined Inquiry for Building School Improvement Capacity. Journal of Research on Leadership Education, 12(1), 72-96. DOI: 10.1177/194277511665946
Bowers, A. J. (2011). What's in a grade? The multidimensional nature of what teacher-assigned grades assess in high school. Educational Research and Evaluation, 17(3), 141-159. DOI: 10.1080/13803611.2011.597112
Bowers, A. J. (2010). Analyzing the longitudinal K-12 grading histories of entire cohorts of students: Grades, data driven decision making, dropping out and hierarchical cluster analysis. Practical Assessment Research and Evaluation, 15(7), 1-18. Retrieved from http://www.pareonline.net/pdf/v15n7.pdf
Bowers, A. J. (2009). Reconsidering grades as data for decision making: More than just academic knowledge. Journal of Educational Administration, 47(5), 609-629. DOI: 10.1108/09578230910981080
Bowers, A. J., Sprott, R., & Taff, S. A. (2013). Do we know who will drop out? A review of the predictors of dropping out of high school: Precision, sensitivity, and specificity. The High School Journal, 96(2), 77-100. Retrieved from https://muse.jhu.edu/article/496998/pdf
Bowers, A.J., Sprott, R. (2012a). Why Tenth Graders Fail to Finish High School: A Dropout Typology Latent Class Analysis. The Journal of Education for Students Placed at Risk (JESPAR), 17(3), 129-148. DOI: 10.1080/10824669.2012.692071
Bowers, A.J., Sprott, R. (2012b). Examining the Multiple Trajectories Associated with Dropping Out of High School: A Growth Mixture Model Analysis. The Journal of Educational Research, 105(3), 176-195. DOI: 10.1080/00220671.2011.552075
Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L. F., . . . Welsh, M. E. (2016). A Century of Grading Research: Meaning and Value in the Most Common Educational Measure. Review of Educational Research, 86(4), 803-848. DOI: 10.3102/0034654316672069
Carlson, S. E. (2018). Identifying Students at Risk of Dropping Out: Indicators and Thresholds Using ROC Analysis. Retrieved from http://digitalcommons.georgefox.edu/edd/114
Chen, X. (2013). STEM Attrition: College Students' Paths Into and Out of STEM Fields. Statistical Analysis Report. NCES 2014-001. National Center for Education Statistics.
Clark Blickenstaff, J. (2005). Women and science careers: Leaky pipeline or gender filter? Gender and Education, 17(4), 369-386. DOI: 10.1080/09540250500145072
Crisp, G., Nora, A., & Taggart, A. (2009). Student characteristics, pre-college, college, and environmental factors as predictors of majoring in and earning a STEM degree: An analysis of students attending a Hispanic serving institution. American Educational Research Journal, 46(4), 924-942. DOI: 10.3102/0002831209349460
Cummings, K. D., & Smolkowski, K. (2015). Selecting students at risk of academic difficulties. Assessment for Effective Intervention, 41(1), 55-61. DOI: 10.1177/1534508415590396
D'Agostino, J. V., Rodgers, E., & Mauck, S. (2018). Addressing Inadequacies of the Observation Survey of Early Literacy Achievement. Reading Research Quarterly, 53(1), 51-69. DOI: 10.1002/rrq.181
Davis, M., Herzog, L., & Legters, N. (2013). Organizing Schools to Address Early Warning Indicators (EWIs): Common Practices and Challenges. Journal of Education for Students Placed at Risk (JESPAR), 18(1), 84-100. DOI: 10.1080/10824669.2013.745210
DeLong, E., DeLong, D., & Clarke-Pearson, D. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics, 44(3), 837-845. DOI: 10.2307/2531595
Dutta-Moscato, J., Gopalakrishnan, V., Lotze, M. T., & Becich, M. J. (2014). Creating a pipeline of talent for informatics: STEM initiative for high school students in computer science, biology, and biomedical informatics. Journal of Pathology Informatics, 5. DOI: 10.4103/2153-3539.129448
Engberg, M. E., & Wolniak, G. C. (2013). College student pathways to the STEM disciplines. Teachers College Record, 115(1), 1-27.


Fanjul-Hevia, A., & González-Manteiga, W. (2018). A comparative study of methods for testing the equality of two or more ROC curves. Computational Statistics, 1-21. DOI: 10.1007/s00180-017-0783-6
Faria, A. M., Sorensen, N., Heppen, J., Bowdon, J., Taylor, S., Eisner, R., & Foster, S. (2017). Getting Students on Track for Graduation: Impacts of the Early Warning Intervention and Monitoring System after One Year (REL 2017-272). Regional Educational Laboratory Midwest. Retrieved from https://ies.ed.gov/ncee/edlabs/projects/project.asp?projectID=388
Fogarty, J., Baker, R. S., & Hudson, S. E. (2005, May). Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In Proceedings of Graphics Interface 2005 (pp. 129-136). Canadian Human-Computer Communications Society.
Frazelle, S., & Nagel, A. (2015). A practitioner's guide to implementing early warning systems. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northwest.
Geiser, S., & Santelices, M. V. (2007). Validity of High-School Grades in Predicting Student Success beyond the Freshman Year: High-School Record vs. Standardized Tests as Indicators of Four-Year College Outcomes (Research & Occasional Paper Series: CSHE.6.07). Center for Studies in Higher Education. Retrieved from http://files.eric.ed.gov/fulltext/ED502858.pdf
Gleason, P., & Dynarski, M. (2002). Do we know whom to serve? Issues in using risk factors to identify dropouts. Journal of Education for Students Placed At Risk, 7(1), 25-41. DOI: 10.1207/S15327671ESPR0701_3
Gönen, M. (2007). Analyzing Receiver Operating Characteristic Curves With SAS. Cary, NC: SAS Publishing.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36. DOI: 10.1148/radiology.143.1.7063747
Hanley, J. A., & McNeil, B. J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3), 839-843. DOI: 10.1148/radiology.148.3.6878708
Horn, A. S., & Lee, G. (2017). Evaluating the Accuracy of Productivity Indicators in Performance Funding Models. Educational Policy. DOI: 10.1177/0895904817719521
Hughes, J., & Petscher, Y. (2016). A guide to developing and evaluating a college readiness screener. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southeast.
Ikbal, S., Tamhane, A., Sengupta, B., Chetlur, M., Ghosh, S., & Appleton, J. (2015). On early prediction of risks in academic performance for students. IBM Journal of Research and Development, 59(6), 5:1-5:14. DOI: 10.1147/JRD.2015.2458631
Ingels, S. J., Pratt, D. J., Alexander, C. P., Jewell, D. M., Lauff, E., Mattox, T. L., . . . Christopher, E. (2014). Education Longitudinal Study of 2002 (ELS:2002) Third Follow-Up Data File Documentation (NCES 2014-364). Washington, DC. Retrieved from http://nces.ed.gov/pubs2014/2014364.pdf
Jeni, L. A., Cohn, J. F., & De La Torre, F. (2013, September). Facing imbalanced data--recommendations for the use of performance metrics. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on (pp. 245-251). IEEE. DOI: 10.1109/ACII.2013.47
Johnson, E., & Semmelroth, C. (2010). The Predictive Validity of the Early Warning System Tool. NASSP Bulletin, 94(2), 120-134. DOI: 10.1177/0192636510380789
Jordan, N., Glutting, J., Ramineni, C., & Watkins, M. (2010). Validating a Number Sense Screening Tool for use in kindergarten and first grade: Prediction of mathematics proficiency in third grade. School Psychology Review, 39(2), 181-195. Retrieved from http://ezproxy.cul.columbia.edu/login?url=https://search.proquest.com/docview/608709943?accountid=10226
Kemple, J. J., Segeritz, M. D., & Stephenson, N. (2013). Building On-Track Indicators for High School Graduation and College Readiness: Evidence from New York City. Journal of Education for Students Placed at Risk (JESPAR), 18(1), 7-28. DOI: 10.1080/10824669.2013.747945
Knight, S., & Shum, S. B. (2018). Theory and Learning Analytics. In C. Lang, G. Siemens, A. Wise, & D. Gašević (Eds.), Handbook of Learning Analytics (pp. 17-22). Society for Learning Analytics Research.
Knowles, J. E. (2015). Of needles and haystacks: Building an accurate statewide dropout early warning system in Wisconsin. JEDM-Journal of Educational Data Mining, 7(3), 18-67. Retrieved from https://jedm.educationaldatamining.org/index.php/JEDM/article/view/JEDM082
Koon, S., & Petscher, Y. (2016). Can scores on an interim high school reading assessment accurately predict low performance on college readiness exams? Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southeast.
Krumm, A. E., Means, B., & Bienkowski, M. (2018). Learning Analytics Goes to School: A Collaborative Approach to Improving Education. New York: Routledge.
Lacefield, W. E., & Applegate, E. B. (2018). in Public Education: Longitudinal Student-, Intervention-, School-, and District-Level Performance Modeling. Paper presented at the Annual meeting of the American Educational Research Association, New York, NY.
Lacefield, W., Applegate, B., Zeller, P. J., & Carpenter, S. (2012). Tracking students' academic progress in data rich but analytically poor environments. Paper presented at the Annual meeting of the American Educational Research Association, Vancouver, BC.


Langenbucher, J., Labouvie, E., & Morgenstern, J. (1996). Measuring diagnostic agreement. Journal of Consulting and Clinical Psychology, 64(6), 1285. DOI: 10.1037/0022-006X.64.6.1285
Laracy, S. D., Hojnoski, R. L., & Dever, B. V. (2016). Assessing the classification accuracy of early numeracy curriculum-based measures using receiver operating characteristic curve analysis. Assessment for Effective Intervention, 41(3), 172-183. DOI: 10.1177/1534508415621542
Liao, H. F., Yao, G., Chien, C. C., Cheng, L. Y., & Hsieh, W. S. (2014). Likelihood ratios of multiple cutoff points of the Developmental Checklist for Preschoolers, 2nd version. Journal of the Formosan Medical Association, 113(3), 179-186. DOI: 10.1016/j.jfma.2011.10.55
López-Ratón, M., Rodríguez-Álvarez, M. X., Suarez, C. C., & Sampedro, F. G. (2014). OptimalCutpoints: An R package for selecting optimal cutpoints in diagnostic tests. Journal of Statistical Software, 61(8), 1-36.
Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an "early warning system" for educators: A proof of concept. Computers & Education, 54(2), 588-599. DOI: 10.1016/j.compedu.2009.09.008
Mahoney, J. L., & Cairns, R. B. (1997). Do extracurricular activities protect against early school dropout? Developmental Psychology, 33(2), 241. DOI: 10.1037/0012-1649.33.2.241
National Center on Response to Intervention. (2010). Users guide to universal screening tools chart. Washington, DC: National Center on Response to Intervention, Office of Special Education Programs, U.S. Department of Education. Retrieved from http://www.rti4success.org/sites/default/files/UniversalScreeningUsersGuide.pdf
Neild, R. C., Balfanz, R., & Herzog, L. (2007). An early warning system. Educational Leadership, 65(2), 28-33. Retrieved from http://new.every1graduates.org/wp-content/uploads/2012/03/Early_Warning_System_Neild_Balfanz_Herzog.pdf
Nicholls, G., Wolfe, H., Besterfield-Sacre, M., & Shuman, L. (2010). Predicting STEM degree outcomes based on eighth grade data and standard test scores. Journal of Engineering Education, 99(3), 209-223. DOI: 10.1002/j.2168-9830.2010.tb01057.x
Noble, J. P., & Sawyer, R. L. (2004). Is high school GPA better than admission test scores for predicting academic success in college? College and University, 79(4), 17-22.
Ocumpaugh, J., Baker, R. S., Gowda, S., Heffernan, N., & Heffernan, C. (2014). Population validity for Educational Data Mining models: A case study in affect detection. British Journal of Educational Technology, 45(3), 487-501. DOI: 10.1111/bjet.12156
Ocumpaugh, J., Baker, R. S., San Pedro, M. O., Hawn, M. A., Heffernan, C., Heffernan, N., & Slater, S. A. (2017, March). Guidance counselor reports of the ASSISTments college prediction model (ACPM). In Proceedings of the Seventh International Learning Analytics & Knowledge Conference (pp. 479-488). ACM. DOI: 10.1145/3027385.3027435
Olver, M. E., Stockdale, K. C., & Wormith, J. S. (2009). Risk assessment with young offenders: A meta-analysis of three assessment measures. Criminal Justice and Behavior, 36(4), 329-353. DOI: 10.1177/009385480933145
Pattison, E., Grodsky, E., & Muller, C. (2013). Is the Sky Falling? Grade Inflation and the Signaling Power of Grades. Educational Researcher, 42(5), 259-265. DOI: 10.3102/0013189x13481382
Pelánek, R. (2014). A brief overview of metrics for evaluation of student models. In Workshop Approaching Twenty Years of Knowledge Tracing (BKT20y) (p. 6). Taipei City.
Pelánek, R. (2015). Metrics for evaluation of student models. JEDM-Journal of Educational Data Mining, 7(2), 1-19.
Perna, L., Lundy-Wagner, V., Drezner, N. D., Gasman, M., Yoon, S., Bose, E., & Gary, S. (2009). The contribution of HBCUs to the preparation of African American women for STEM careers: A case study. Research in Higher Education, 50(1), 1-23. DOI: 10.1007/s11162-008-9110-y
Phillips, M., Yamashiro, K., Farrukh, A., Lim, C., Hayes, K., Wagner, N., . . . Chen, H. (2015). Using Research to Improve College Readiness: A Research Partnership Between the Los Angeles Unified School District and the Los Angeles Education Research Institute. Journal of Education for Students Placed at Risk, 20(1), 141-168. DOI: 10.1080/10824669.2014.990562
Piety, P. J., Hickey, D. T., & Bishop, M. (2014). Educational data sciences: Framing emergent practices for analytics of learning, organizations, and systems. Paper presented at the Proceedings of the Fourth International Conference on Learning Analytics and Knowledge. DOI: 10.1145/2567574.2567582
Quadri, M. M., & Kalyankar, N. V. (2010). Drop out feature of student data for academic performance using decision tree techniques. Global Journal of Computer Science and Technology, 10(2), 2-5. Retrieved from https://computerresearch.org/index.php/computer/article/view/891
R Development Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org
Renzulli, J. S., & Park, S. (2000). Gifted dropouts: The who and the why. Gifted Child Quarterly, 44(4), 261-271. DOI: 10.1177/001698620004400407
Rice, M. E., & Harris, G. T. (2005). Comparing effect sizes in follow-up studies: ROC area, Cohen's d, and r. Law and Human Behavior, 29(5), 615-620. DOI: 10.1007/s10979-005-6832-
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J. C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12(1), 77. DOI: 10.1186/1471-2105-12-77


Robnett, R. D., & Leaper, C. (2013). Friendship groups, personal motivation, and gender in relation to high school students' STEM career interest. Journal of Research on Adolescence, 23(4), 652-664. DOI: 10.1111/jora.12013
Rumberger, R. W., Addis, H., Allensworth, E., Balfanz, R., Duardo, D., Dynarski, M., . . . Tuttle, C. (2017). Preventing dropout in secondary schools (NCEE 2017-4028). Washington, DC. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/PracticeGuide/wwc_dropout_092617.pdf
Siemens, G. (2013). Learning Analytics: The Emergence of a Discipline. American Behavioral Scientist, 57(10), 1380-1400. DOI: 10.1177/0002764213498851
Sithole, A., Chiyaka, E. T., McCarthy, P., Mupinga, D. M., Bucklein, B. K., & Kibirige, J. (2017). Student Attraction, Persistence and Retention in STEM Programs: Successes and Continuing Challenges. Higher Education Studies, 7(1), 46-59.
Soland, J. (2013). Predicting high school graduation and college enrollment: Comparing early warning indicator data and teacher intuition. Journal of Education for Students Placed at Risk (JESPAR), 18(3-4), 233-262. DOI: 10.1080/10824669.2013.833047
Soland, J. (2017). Combining Academic, Noncognitive, and College Knowledge Measures to Identify Students Not on Track for College: A Data-Driven Approach. Research & Practice in Assessment, 12.
Stuit, D., O'Cummings, M., Norbury, H., Heppen, J., Dhillon, S., Lindsay, J., & Zhu, B. (2016). Identifying early warning indicators in three Ohio school districts (REL 2016-118). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Midwest. Retrieved from http://ies.ed.gov/ncee/edlabs
Sullivan, W., Marr, J., & Hu, G. (2017). A Predictive Model for Standardized Test Performance in Michigan Schools. In Applied Computing and Information Technology (pp. 31-46). Cham: Springer.
Supovitz, J. A., Foley, E., & Mishook, J. (2012). In search of leading indicators in education. Education Policy Analysis Archives, 20(19), 1-23.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285-1293. DOI: 10.1126/science.3287615
Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological Science Can Improve Diagnostic Decisions. Psychological Science in the Public Interest, 1(1), 1-26. DOI: 10.1111/1529-1006.001
Swets, J. A., & Pickett, R. M. (1982). Evaluation of diagnostic systems: Methods from signal detection theory. New York, NY: Academic Press.
Tamhane, A., Ikbal, S., Sengupta, B., Duggirala, M., & Appleton, J. (2014). Predicting student risks through longitudinal analysis. Paper presented at the Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA. DOI: 10.1145/2623330.2623355
Torres, D. D., Bancroft, A., & Stroub, K. (2015). Evaluating High School Dropout Indicators and Assessing Their Strength.
U.S. Department of Education. (2014, March 26). Academic Competitiveness Grants and National Science and Mathematics Access to Retain Talent (SMART) Grants. Retrieved from https://www2.ed.gov/programs/smart/index.html
Van Inwegen, E. G., Wang, Y., Adjei, S., & Heffernan, N. (n.d.). An Examination of Metrics that Describe User Models.
Vivo, J. M., & Franco, M. (2008). How does one assess the accuracy of academic success predictors? ROC analysis applied to university entrance factors. International Journal of Mathematical Education in Science and Technology, 39(3), 325-340. DOI: 10.1080/00207390701691566
Whittaker, R. C. (2014). Teaching in context using a mobile phone scenario. Master's thesis, University of Salford.
Willis. (2013). Higher education in science, technology, engineering and mathematics: Science and Technology Committee Report. Motion to Take Note in the House of Lords at 5:21 pm on 21st March 2013.
Wilson, J., Olinghouse, N. G., McCoach, D. B., Santangelo, T., & Andrada, G. N. (2016). Comparing the accuracy of different scoring methods for identifying sixth graders at risk of failing a state writing assessment. Assessing Writing, 27, 11-23. DOI: 10.1016/j.asw.2015.06.003
Zweig, M. H., & Campbell, G. (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39(4), 561-577.

About the Authors:
Alex J. Bowers is an associate professor of Education Leadership at Teachers College, Columbia University. His research interests include organizational-level data analytics, organizational behavior, school and district leadership, data-driven decision making, high school dropouts and completion, educational assessment and accountability, education technology, and school facilities financing. ORCID: 0000-0002-5140-6428

Xiaoliang Zhou is a PhD candidate in Measurement, Evaluation, and Statistics at Teachers College, Columbia University. His research interests include cognitive diagnostic modeling, item response theory, hierarchical modeling, rater effect models, structural equation modeling, data mining, data visualization, natural language processing, and deep learning. ORCID: 0000-0003-0467-6123


APPENDIX:

Appendix A: Variable Descriptive Statistics and Labels

Variable Name | N | M | SD | Min. | Max. | ELS:2002 Variable
Dropout | 16,197 | 0.113 | 0.317 | 0 | 1 | F2EVERDO; 1 = Evidence of a dropout episode
Absent | 12,286 | 0.130 | 0.336 | 0 | 1 | BYP52E; 1 = School contacted parent about poor attendance 1-4 times
Suspension | 14,476 | 0.081 | 0.272 | 0 | 1 | BYS24F; 1 = Suspended/put on probation 1-5 times
Misbehavior | 12,457 | 0.072 | 0.259 | 0 | 1 | BYP51; 1 = Ever had behavior problem at school
1st quantile Math t score (2002) | 15,892 | 0.250 | 0.433 | 0 | 1 | BYTXMSTD; 1 = Math test standardized score below 1st quantile
One or more flags | 16,197 | 0.371 | 0.483 | 0 | 1 | 1 = One or more flags (absent, lowest quantile reading score, suspension, misbehavior)
Any one flag | 13,506 | 0.445 | 0.497 | 0 | 1 | 1 = Any one flag
Any two flags | 12,381 | 0.124 | 0.330 | 0 | 1 | 1 = Any two flags
Any three flags | 13,949 | 0.025 | 0.155 | 0 | 1 | 1 = Any three flags
All four flags | 15,580 | 0.004 | 0.059 | 0 | 1 | 1 = All four flags
Extracurricular activities (2002) | 14,446 | 4.773 | 5.700 | 0 | 21 | BYS42; Hours/week spent on extracurricular activities
Math t score (2002) | 15,892 | 50.710 | 9.912 | 19.380 | 86.680 | BYTXMSTD; Math test standardized score
Reading t score (2002) | 15,892 | 50.526 | 9.885 | 22.570 | 78.760 | BYTXRSTD; Reading test standardized score
College enrollment | 10,511 | 0.546 | 0.498 | 0 | 1 | F2PS0601; 1 = Enrolled in a 4-yr institution
GPA | 14,796 | 3.912 | 1.543 | 0 | 6 | F1RGPP2; GPA for all courses taken in the 9th-12th grades
Extracurricular activities (2004) | 14,073 | 3.182 | 1.898 | 1 | 8 | F1S27; Hours/week spent on extracurricular activities
AP | 14,368 | 0.182 | 0.386 | 0 | 1 | BYS33A; 1 = Ever in Advanced Placement program
Postsecondary STEM degree | 6,936 | 0.167 | 0.373 | 0 | 1 | F3TZSTEM1CRED; 1 = Ever earned a postsecondary credential in a STEM field as of June 2013 (SMART grant definition)
Number of STEM Courses | 11,540 | 9.257 | 10.396 | 0 | 56 | F3TZSTEM1TOT; Transcript: Number of known STEM courses taken (using SMART Grant definition of STEM)
STEM course GPA | 10,755 | 2.586 | 0.938 | 0 | 4 | F3TZSTEM2GPA; Transcript: GPA for all known STEM courses (using NSF definition of science, engineering, and related fields)
Hard STEM career | 12,796 | 0.063 | 0.243 | 0 | 1 | F3STEMOCCCUR; 1 = Life and Physical Science, Engineering, Mathematics, and Information Technology Occupations
Soft STEM career | 12,796 | 0.078 | 0.268 | 0 | 1 | F3STEMOCCCUR; 1 = Social Science Occupations/Health Occupations


Appendix B: Significance of AUC Difference for Balfanz et al. (2007) Dropout Predictors

(Each cell is the p-value for the test of the difference between the AUCs of the row and column predictors.)

 | Absent | 1st quantile Math t score (2002) | Suspension | Misbehavior | One or more flags | Any one flag | Any two flags | Any three flags
1st quantile Math t score (2002) | 0.330 | | | | | | |
Suspension | <0.001 | <0.001 | | | | | |
Misbehavior | <0.001 | <0.001 | 0.719 | | | | |
One or more flags | <0.001 | <0.001 | <0.001 | <0.001 | | | |
Any one flag | <0.001 | <0.001 | <0.001 | <0.001 | 1.000 | | |
Any two flags | <0.001 | 0.089 | <0.001 | <0.001 | <0.001 | <0.001 | |
Any three flags | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
All four flags | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
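A matrix of pairwise tests such as Appendix B can be generated by looping roc.test() over all predictor pairs. A minimal R sketch follows; the flag names and the data frame dat are placeholders standing in for the Appendix A variables:

# Sketch: lower-triangular matrix of DeLong p-values comparing the
# AUCs of each pair of dropout flags, as in Appendix B.
# `dat$dropout` and the flag columns are placeholders.
library(pROC)

flags <- c("absent", "low_math", "suspension", "misbehavior")
# Restrict to one common complete-case sample so the ROC curves are
# paired (pROC otherwise drops missing cases per predictor).
cc <- complete.cases(dat[, c("dropout", flags)])
rocs <- lapply(flags, function(f) roc(dat$dropout[cc], dat[[f]][cc]))
names(rocs) <- flags

pvals <- matrix(NA_real_, length(flags), length(flags),
                dimnames = list(flags, flags))
for (i in seq_along(flags)) {
  for (j in seq_len(i - 1)) {
    pvals[i, j] <- roc.test(rocs[[i]], rocs[[j]], method = "delong")$p.value
  }
}
round(pvals, 3)   # lower triangle of pairwise p-values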
