TEM Journal. Volume 7, Issue 3, Pages 617-625, ISSN 2217-8309, DOI: 10.18421/TEM73-19, August 2018.

Predicting Student Success Using Data Generated in Traditional Educational Environments

Marian Bucos, Bogdan Drăgulescu

Politehnica University Timisoara, Vasile Parvan Blvd., Timisoara, Romania

Abstract – Educational Data Mining (EDM) techniques offer unique opportunities to discover knowledge from data generated in educational environments. These techniques can assist tutors and researchers to predict future trends and behavior of students. This study examines the possibility of only using traditional, already available course report data, generated over years by tutors, to apply EDM techniques. Based on five algorithms and two cross-validation methods, we developed and evaluated five classification models in our experiments to identify the one with the best performance. A time segmentation approach and specific course performance attributes, collected in a classical manner from course reports, were used to determine students' performance. The models developed in this study can be used early in identifying students at risk and allow tutors to improve the academic performance of the students. By following the steps described in this paper, other practitioners can revive their old data and use it to gain insight for their classes in the next academic year.

Keywords – Classification, Educational Data Mining, predicting student performance, traditional educational environments.

DOI: 10.18421/TEM73-19
https://dx.doi.org/10.18421/TEM73-19

Corresponding author: Marian Bucos, Politehnica University Timisoara, Vasile Parvan Blvd., Timisoara, Romania
Email: [email protected]

Received: 30 June 2018. Accepted: 11 August 2018. Published: 27 August 2018.

© 2018 Marian Bucos, Bogdan Drăgulescu; published by UIKTEN. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. The article is published with Open Access at www.temjournal.com

1. Introduction

Data Mining, also referred to as knowledge discovery from data, is the automatic process of extracting implicit, potentially useful information from data, for better decision-making [1]. The primary function of data mining is the application of specific methods to develop models and discover previously unknown patterns [2]. The increased adoption of Data Mining methods (association, classification, clustering, density estimation, regression, sequential patterns) has brought changes in the way tutors, researchers, and educational institutions investigate data to make optimal, informed decisions. The implementation of these Data Mining methods for analyzing data available in education defines the field of Educational Data Mining (EDM).

According to Baker and Yacef, the areas that most often attract the attention of EDM researchers include individual learning from educational software, computer-adaptive testing, computer-supported collaborative learning, and the factors that are associated with student failure [3]. On the other hand, the authors summarize the approaches used with EDM: prediction, clustering, relationship mining, distillation of data for human judgment, and discovery with models. Predicting students' performance is one of the oldest and most popular applications of EDM. Among the techniques that have been used to estimate student success, the following are mentioned: neural networks, Bayesian networks, rule-based systems, regression, and correlation analysis [4].

Peña-Ayala, in his analysis of recent EDM works, observed that student performance modeling is one of the favorite targets of this domain. The author identifies several indicators of performance that deserve to be modeled: efficiency, evaluation, achievement, competence, resource consumption, and elapsed time [5].

Higher education institutions are evidently interested in modeling student academic performance to improve the quality and efficiency of traditional processes [6]. For instance, researchers and tutors could: identify students at risk of failing some of their courses so they can benefit from an intervention [7]; determine the utility of students' response time in performance prediction [8]; prescribe activities that maximize the knowledge learned as evaluated by expected post-test success [9]; or offer individual support according to students' academic skills.

Romero et al. compared several data mining approaches for classifying students' final marks based on the activities carried out in web-based courses [10]. They focused on seven courses from Cordoba University. All the algorithms were evaluated using stratified tenfold cross-validation. Re-balancing preprocessing techniques were also applied to the original numerical data to evaluate the models. From the conducted experiments, they observed better results for most of the algorithms (17 out of 25) with re-balanced data enabled.

In this paper, we focus on predicting students' performance by only using traditional course report data, gathered over the years by tutors. We aim to prove that old data can hold hidden information that can be harnessed by teachers to provide insight for their classes, and to provide a good example of how they can accomplish that. Our data is from one distributed examination course, gathered from 2009 to 2017. A binary classification problem is addressed. Our approach to solving this problem includes student performance attributes inside the course, time segmentation of the initial data set, different classification algorithms, and two cross-validation methods. The prediction is based on performance attributes (e.g., examination score, activity grade, class attendance) extracted from course reports at the Politehnica University Timisoara, Faculty of Electronics, Telecommunications, and Information Technologies. The output class to be predicted is the academic status, which has two possible values: 1 (passed), for students who completed the course, and 0 (failed), for those who need to repeat the course.

The most important aspects that differentiate our study from other works are the following. First, we are modeling student academic performance for distributed examination courses, an examination strategy specific to several Romanian universities, including the Politehnica University Timisoara. An approach to data preprocessing that utilizes course report data, already available to tutors and researchers, is described. Next, a data set time segmentation is addressed to enable the algorithms to predict the student course status from early on. Finally, the classification models are evaluated using different cross-validation strategies, thus allowing an analysis of how the results generalize from one year to the next.

The paper is structured as follows. The next section outlines the methodology underlying our experiments, the data sets, and the classification models. Section 3 describes the experiments carried out and the results obtained. In Section 4, there is a discussion of this work, and in Section 5, we conclude and discuss future work.

2. Methodology

To build reliable models, the Knowledge Discovery and Data Mining process was considered. The process consists of a number of steps, which can be followed in a knowledge discovery project in order to identify patterns in data [11] [12]. The process of knowledge discovery included the following steps: data collection, preprocessing, data mining, and interpretation of the results.

This study aims to reveal student academic performance in distributed examination courses, based on models developed using five classification algorithms implemented in Scikit-learn [13] (Decision Tree CART, Extra Trees Classifier, Random Forest Classifier, Logistic Regression, and C-Support Vector Classification). Each model was evaluated using two cross-validation methods (stratified tenfold cross-validation and leave-one-label-out cross-validation). Students were classified as either 1 (passed) or 0 (failed), in accordance with their performance indicators. The latest version of Scikit-learn (a Python package for machine learning), version 0.19, was used to design the experiments for the proposed models.

2.1 Data collection

The data used in this study was obtained from the Faculty of Electronics, Telecommunications, and Information Technologies, Politehnica University Timisoara. The data sets include course records gathered over a period of nine years, from 2009 to 2017, for a distributed examination course, Object Oriented Programming. Therefore, the data has not been gathered expressly for educational data mining purposes. In accordance with the Politehnica University Timisoara regulations, course examination can be achieved through a final exam or through distributed examination. For courses with a final examination, optional midterm examinations may be established during the study period, while the final exam is scheduled at the end of the semester (final examination period), after completion of the study period. For distributed examination courses, a number of mandatory examinations are provided during the study period. At least two examination-re-examination sets must be established to test the knowledge, skills, and abilities acquired by a student.


Regarding the distributed examination course considered for this study, students participate in 9 optional course sessions, 14 mandatory face-to-face practical activity meetings, and 4 examinations (two of which are re-examinations). For each examination-re-examination set, the greatest value is preserved. In Romanian universities, the marking system is a 10-grade system, where 1 is the lowest grade and 10 is the highest attainable. For this course, a student needs a mark greater than or equal to 5, for both the average activity mark and each examination mark, to complete the course. The final course mark, at the end of the semester, is calculated as the sum of 50% of the average activity mark and 50% of the average examination mark.

For each of the 1077 student records, 43 attributes were collected, such as: name, gender, student membership to advanced study group, study year, grade point average, the number of credits earned in the previous year, attendance for each practical activity meeting (14 attributes), mark for each practical activity meeting (14 attributes), attendance for each examination (4 attributes), mark for each examination (4 attributes), and final course status. The obtained data set consists of student records with a total of 1077 x 43 data points.

The output class to be predicted in our binary classification problem is the academic status, which has two possible values: 1 (passed), for students who completed the course, and 0 (failed), for those who need to repeat the course.

2.2 Preprocessing

Data preprocessing refers to any type of processing performed on the initial data set to prepare it for applying a data mining technique. During the data collection step, the data was integrated into a single data set. The data used in this study was taken from the reports that tutors generate each year for this course. The initial data set included 1077 student records. A total of 169 student records was discarded from this data set due to incomplete information (students without class activity and examination data) or duplicate information (the second, the third, or even the fourth registration of a failed student in a course). After removing these student records, the data set contains a total of 908 instances. The data cleaning involved processes like: filling in missing values for some numeric predictor attributes by using the attribute mean; correcting inconsistencies in the data; and converting categorical attributes to numeric values due to the requirements of some classification algorithms (C-Support Vector Classification) [14].
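The cleaning operations described above map directly onto pandas and scikit-learn primitives. The snippet below is a minimal sketch under assumed names (the file name, a student_id column, and a GEN gender column are hypothetical), not the authors' original code.

```python
import pandas as pd

# Load the integrated course report data (hypothetical file and column names).
records = pd.read_csv("course_reports_2009_2017.csv")

# Drop incomplete records (no activity and no examination data) and keep a
# single registration per student, discarding repeated registrations.
records = records.dropna(subset=["AA", "AE"], how="all")
records = records.drop_duplicates(subset=["student_id"], keep="first")

# Fill remaining missing numeric values with the attribute mean.
numeric_cols = records.select_dtypes(include="number").columns
records[numeric_cols] = records[numeric_cols].fillna(records[numeric_cols].mean())

# Convert a categorical attribute to a numeric value, as required by SVC.
records["GEN"] = records["GEN"].map({"F": 0, "M": 1})
```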
The course considered for this study, Object Oriented Programming, takes place in the second year of study. The attributes that were not relevant to the problem were removed from the data sets. For example, the student name is not relevant to the student success evaluation.

Predicting student performance as soon as possible is a major challenge for tutors who want to better meet the needs of students. To achieve this, four data sets (DS) were derived by including all attributes available after each examination, based on four time segments: the first time segment in week 6 (DS1), the second time segment in week 8 (DS2), the third time segment in week 12 (DS3), and the fourth time segment in week 14 (DS4). The data sets included six common attributes: gender, student membership to advanced study group, study year, grade point average, number of credits earned by each student in the previous year, and final course status. In addition, attributes that describe the performance of students in the course and the number of students from each activity group or course were computed through data aggregation, corresponding to each time segment: average activity mark, the number of attendances in practical activity meetings, average examination mark, the number of examinations, the number of students in one year, and the number of students by study group. The last data set (DS4) was not preserved because it corresponds to the end of the semester.

Each data set contained a total of 908 x 12 data points, with the following attributes: gender (GEN), student membership to advanced study group (MG), study year (SY), grade point average (GPA), number of credits earned in the previous year (CR), number of students in one year (NSY), number of students by study group (NSG), average activity mark (AA), the number of attendances in practical activity meetings (NA), average examination mark (AE), number of examinations (NE), and final course status (SP).

In the data preprocessing step, feature selection based on univariate statistical tests was performed. We computed the chi-squared statistical test to select the best features from our data sets. During this iterative process, the predictor attributes that have the strongest relationship with the output class were selected: student membership to advanced study group, number of credits earned in the previous year, average activity mark, number of attendances in practical activity meetings, average examination mark, and number of examinations. The selected predictor attributes and the output class are given in Table 1 for reference.
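A chi-squared univariate selection of this kind corresponds to scikit-learn's SelectKBest. The sketch below assumes a DataFrame ds1 holding one time-segmented data set with the eleven candidate predictors and the SP output class; the variable names are illustrative, not taken from the paper.

```python
from sklearn.feature_selection import SelectKBest, chi2

# Candidate predictors of one time segment (chi2 requires non-negative values,
# which holds for the marks, counts, and 0/1-encoded attributes used here).
predictors = ["GEN", "MG", "SY", "GPA", "CR", "NSY", "NSG", "AA", "NA", "NE", "AE"]
X = ds1[predictors]
y = ds1["SP"]

# Keep the attributes with the strongest relationship to the output class.
selector = SelectKBest(score_func=chi2, k=6).fit(X, y)
selected = [name for name, keep in zip(predictors, selector.get_support()) if keep]
print(selected)  # e.g. ['MG', 'CR', 'AA', 'NA', 'AE', 'NE'] on data like ours
```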


Table 1. Attributes used in classification models.

Attribute   Description                                            Values
CR          number of credits earned in the previous year          numeric value
NA          number of attendances in practical activity meetings   numeric value
NE          number of examinations                                 numeric value
AA          average activity mark                                  numeric value
AE          average examination mark                               numeric value
MG          student membership to advanced study group             {0=false, 1=true}
SP          final course status                                    {0=failed, 1=passed}

Data normalization is a good preprocessing practice, required by many classification algorithms. To avoid cases in which predictor attributes carry different weights in the decision process, standard normalization (z-score) was applied [15]. The normalization was performed using the mean (μ) and the standard deviation (σ) of each attribute. The normalized values were obtained by subtracting the mean of each predictor attribute from the initial data and dividing the result by the standard deviation:

Xn = (X - μ) / σ    (1)

The standard normalization ensures that each attribute has a distribution with zero mean and a standard deviation equal to one.

In our study, where only a limited amount of data was available, cross-validation methods were used to randomly partition the data into training and test segments. Each of the data sets previously obtained on time segments was divided into a training set, the part of the data used for learning the models, and a test set, used for evaluating the performance of the classification models [16]. Two cross-validation methods were applied: stratified tenfold cross-validation and leave-one-label-out cross-validation.

In the first experiment, the stratified tenfold cross-validation method was employed. Stratification is the process of providing each class equal representation in every fold [17], generally performing better than regular cross-validation in terms of bias and variance [18]. Using this method, the data was partitioned into ten approximately equal-sized segments, or folds. By splitting the data sets this way, each of the ten folds would have roughly 90 records. Several performance metrics (true positive rate, true negative rate, accuracy, area under the ROC curve, and f-score) were calculated for each model over ten iterations. Each time, one of the ten stratified folds served as the test set, and the remaining folds were used as the training set. Thus, for each iteration, a student record is either in the training set or in the test set. Finally, the average metrics were computed to evaluate the models.

The second experiment involved the use of another method, leave-one-label-out cross-validation. In this approach, each training set is generated by taking all the samples except the ones related to a specific label, while the test set contains the samples left out. For instance, to analyze how the results generalize from one year to another, the data sets were partitioned into segments containing only records for one academic year. This way, the data set was partitioned into nine folds, each of them containing only one data group. During nine iterations, the models were trained using student records from eight academic years and evaluated on the left-out year. Similarly to the previous experiment, the average metrics were computed from the values obtained in each iteration. Both cross-validation strategies mentioned in this study were implemented using the cross-validation and metrics modules from the Scikit-learn package [13].

Next, the problem of imbalanced data sets was analyzed. Imbalance in the class distribution may cause a significant deterioration in classification performance [19]. In the first experiment, the class distribution is slightly imbalanced, with a percentage of 30% for the minority class. Table 2. shows the distribution of the student academic status.

Table 2. Distribution of the student academic status.

Class   Description   #Students   %Students
1       Passed        639         0.70
0       Failed        269         0.30

The class distribution is quite imbalanced with the second experiment data sets as well, with a percentage from 19% to 45% for the minority class. Table 3. summarizes the data set by year. It shows, for each academic year, the number of students, the percentage of passed students, and the percentage of failed students. In this study, the class imbalance problem has been addressed by assigning weights to classes during classification. The values of the output class have been used to balance weights inversely proportional to class frequencies [13].

Table 3. Data set summary by year.

Year   #Students   %Passed   %Failed
2009   101         0.81      0.19
2010   97          0.77      0.23
2011   74          0.64      0.36
2012   137         0.55      0.45
2013   89          0.55      0.45
2014   87          0.75      0.25
2015   110         0.78      0.22
2016   107         0.75      0.25
2017   106         0.75      0.25
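The z-score normalization, the balanced class weights, and the stratified tenfold evaluation described above can be combined in a single scikit-learn pipeline. This is a minimal sketch, not the authors' code: it assumes a feature matrix X and labels y from one time segment, and uses cross_validate from sklearn.model_selection (the 0.19-era equivalents live in the cross-validation and metrics modules cited in the text).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# z-score scaling (Eq. 1) followed by a classifier whose class weights are
# inversely proportional to class frequencies, compensating the 70/30 imbalance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced"),
)

# Stratified tenfold cross-validation: every fold keeps the class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "recall", "roc_auc", "f1"])
print(scores["test_accuracy"].mean())
```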


After preprocessing, according to our data set time segmentation approach, both experiments benefit from three data sets, one for each time segment (DS1, DS2, and DS3), with six predictor attributes. For the first experiment, the data sets were partitioned into ten folds, and for the second experiment, the data sets were divided into nine folds.

2.3 Classification models

Classification refers to the process of learning a model that maps an input data set to a predetermined set of classes. This process consists of two steps: a learning step, where a training set is provided to build the model, and a classification step, where the model is evaluated with a different set of data. According to Tan et al., a classifier is a systematic approach for building classification models from an input data set [20].

To conduct our experiments, we used the most common classification approaches from Scikit-learn: decision tree, logistic regression, ensemble classifiers, and support vector machines. Experiments were performed with: a decision tree (Decision Tree CART) similar to the C4.5 decision tree algorithm [21]; an ensemble method (Extra Trees Classifier) that randomizes both attribute and cut-point choice when splitting a node during the construction of the tree [22]; an ensemble classifier (Random Forest Classifier) using the perturb-and-combine technique for decision trees [23]; a linear model (Logistic Regression Classifier, or logit) for classification using a logistic function [24]; and an implementation of support vector machines (C-Support Vector Classification) for classification based on libsvm [25].
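The five classifiers correspond to concrete scikit-learn estimators. The sketch below shows one plausible instantiation with balanced class weights; the hyperparameters are illustrative defaults, not the settings reported in the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# The five algorithms evaluated in the study, keyed by the abbreviations used
# in Tables 5 and 6. Class weights counter the passed/failed imbalance.
classifiers = {
    "DT": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "ET": ExtraTreesClassifier(n_estimators=100, class_weight="balanced", random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0),
    "logit": LogisticRegression(class_weight="balanced"),
    "SVC": SVC(class_weight="balanced", probability=True),  # libsvm-based C-SVC
}
```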

Table 4. Confusion matrix.

                       Predicted
                       Class 1           Class 0
Actual   Class 1       True Positive     False Negative
         Class 0       False Positive    True Negative

The performance of a binary classification model can be evaluated based on the number of test records correctly and incorrectly predicted by the model. The following quantities are important for this problem: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). These quantities are associated with a confusion matrix (Table 4.). In the context of predicting students' academic performance: true positives are students who have correctly been classified as passed; false positives are failed students who have incorrectly been classified as passed; true negatives are students who have correctly been classified as failed; false negatives are passed students who have incorrectly been classified as failed.

Based on the confusion matrix values, qualitative metrics can be defined. In this study, the following metrics were used for evaluating the performance of the classifiers:

• True positive rate (TP rate), also known as sensitivity or recall (Rec), is the proportion of passing students who are predicted to be passed;

TP rate = TP / (TP + FN)    (2)

• True negative rate (TN rate), also known as specificity, is the proportion of failed students who are predicted to be failed;

TN rate = TN / (TN + FP)    (3)

• Accuracy (Acc) is the proportion of the total number of student predictions that were correct;

Acc = (TP + TN) / (TP + TN + FP + FN)    (4)

• F-score (F1), or f-measure, is the weighted harmonic average of precision and recall.

F1 = 2 x Pre x Rec / (Pre + Rec)    (5)

Another way to evaluate the performance of the classifiers is the area under the ROC curve (AUC). The ROC (Receiver Operating Characteristic) curve is generated by plotting the true positive rate against the false positive rate using different decision thresholds and represents a summary of the qualitative error of a model. The area under the ROC curve corresponds to the probability that a randomly chosen positive instance ranks above a randomly chosen negative instance [26]. For all the classifiers, these metrics were computed using stratified tenfold cross-validation for the first experiment and leave-one-label-out cross-validation for the second experiment.
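Equations (2)-(5) and the AUC can be computed for one test fold directly with scikit-learn's metrics module. The snippet is a sketch that assumes a fitted model exposing predict and predict_proba, and test data X_test, y_test; these names are placeholders.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)

# For a binary problem, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

tp_rate = tp / (tp + fn)                               # Eq. (2): sensitivity / recall
tn_rate = tn / (tn + fp)                               # Eq. (3): specificity
acc = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (4): accuracy
precision = tp / (tp + fp)
f1 = 2 * precision * tp_rate / (precision + tp_rate)   # Eq. (5): f-score

# AUC needs a continuous score, here the predicted probability of class 1.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```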


3. Results

The purpose of the first experiment was to evaluate in which time segment the data set best predicts students' performance. The experiments presented in this study were conducted using the Scikit-learn package. The data sets corresponding to the first three examinations (time segments) were used. For each time segment (DS1, DS2, and DS3), five classification algorithms were executed using stratified tenfold cross-validation: Decision Tree CART (DT), Extra Trees Classifier (ET), Random Forest Classifier (RF), Logistic Regression Classifier (logit), and C-Support Vector Classification (SVC). The prediction results for each time segment and classification algorithm are shown in Table 5., with: true positive rate (TP rate), true negative rate (TN rate), accuracy (Acc), area under the ROC curve (AUC), and f-score (F1).

As the results indicate, the five classifiers performed reasonably well in predicting students' performance, with accuracy ranging from 84% to 86% in week 8 (DS2). Within each time segment, the difference between the highest and the lowest accuracy of the classifiers varies slightly: 5% for DS1, 2% for DS2, and 4% for DS3. From the obtained results, we can notice that the best accuracy in the first time segment is achieved using Random Forest Classifier; for the second time segment, the best accuracy is achieved using any of the following classifiers: Random Forest Classifier, Logistic Regression Classifier, and C-Support Vector Classification; while in the third time segment, Logistic Regression Classifier produces the best result.

Table 5. Classification results using stratified tenfold cross-validation.

Segment         Algorithm   TP rate   TN rate   Acc    AUC    F1
DS1 (week 6)    DT          0.84      0.56      0.76   0.70   0.83
                ET          0.82      0.62      0.76   0.80   0.83
                RF          0.85      0.63      0.79   0.83   0.85
                logit       0.72      0.80      0.74   0.85   0.80
                SVC         0.72      0.85      0.76   0.85   0.81
DS2 (week 8)    DT          0.89      0.74      0.85   0.82   0.89
                ET          0.87      0.75      0.84   0.89   0.88
                RF          0.90      0.75      0.86   0.90   0.90
                logit       0.87      0.85      0.86   0.94   0.90
                SVC         0.87      0.84      0.86   0.93   0.90
DS3 (week 12)   DT          0.87      0.71      0.82   0.79   0.87
                ET          0.87      0.72      0.83   0.90   0.87
                RF          0.87      0.78      0.85   0.92   0.89
                logit       0.85      0.88      0.86   0.94   0.89
                SVC         0.82      0.90      0.84   0.93   0.88

The best algorithm in predicting passing students (even from the first time segment), with more than 85% in terms of the true positive rate, was Random Forest Classifier. However, the same algorithm offers low performance for the true negative rate. Predicting student academic performance is most often linked to the classification models that produce the best results in predicting student failure [27]. According to [28], failing to identify students at risk can be far costlier than incorrectly identifying someone as a failure. C-Support Vector Classification produced the best results in predicting student failure, with values from 85% to 90% in terms of true negative rate. Also, Logistic Regression Classifier offers the best value (85%) in predicting student failure for the second time segment.

The experimental results were used in Fig. 1. to illustrate the ROC curves for the DS2 and DS3 data sets. The diagrams show that the Logistic Regression Classifier and C-Support Vector Classification models are the better classifiers for this case. Logistic Regression Classifier produced the best results for the AUC metric in every time segment. The highest value was obtained with the DS2 and DS3 data sets, with an AUC of 0.94.

Figure 1. ROC (Receiver Operating Characteristic) curves for classifiers on DS2 and DS3 data sets

A second experiment was conducted to specifically verify how the models from the first experiment would generalize to an independent year data set. To achieve this, the leave-one-label-out cross-validation method was applied. The initial data sets were divided into nine folds, each of which preserves the records of only one academic year. Each classifier used eight folds for training, while the validation was performed using the left-out fold. The metrics were computed as an overall average of the results across the nine iterations. Table 6. presents the metrics obtained for each classification algorithm and time segment by using leave-one-label-out cross-validation.

The results are similar to those obtained in the first experiment. Once more, the models generated by applying Random Forest Classifier produced the best accuracy with DS1, while in the third time segment Logistic Regression Classifier produces the best result. The highest accuracy (86%) was obtained with the DS2 and DS3 data sets.
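The year-wise partitioning of the second experiment corresponds to grouped cross-validation. In current scikit-learn versions the leave-one-label-out strategy is exposed as LeaveOneGroupOut; the sketch below assumes NumPy arrays X and y plus an array years holding the academic year of each record, and is illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

# One fold per academic year (2009-2017): train on eight years, test on the ninth.
logo = LeaveOneGroupOut()
fold_acc = []
for train_idx, test_idx in logo.split(X, y, groups=years):
    model.fit(X[train_idx], y[train_idx])
    fold_acc.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Overall average across the nine left-out years, as reported in Table 6.
print(np.mean(fold_acc))
```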


Table 6. Classification results using leave-one-label-out cross-validation (year partition).

Segment         Algorithm   TP rate   TN rate   Acc    AUC    F1
DS1 (week 6)    DT          0.81      0.54      0.74   0.68   0.81
                ET          0.82      0.54      0.75   0.80   0.82
                RF          0.85      0.60      0.78   0.80   0.84
                logit       0.71      0.80      0.74   0.86   0.78
                SVC         0.70      0.79      0.73   0.83   0.78
DS2 (week 8)    DT          0.89      0.70      0.84   0.80   0.88
                ET          0.88      0.76      0.84   0.90   0.88
                RF          0.90      0.77      0.86   0.91   0.90
                logit       0.87      0.87      0.86   0.95   0.90
                SVC         0.87      0.85      0.86   0.94   0.90
DS3 (week 12)   DT          0.87      0.69      0.82   0.78   0.86
                ET          0.89      0.73      0.84   0.91   0.89
                RF          0.88      0.75      0.84   0.91   0.88
                logit       0.85      0.88      0.86   0.94   0.89
                SVC         0.82      0.90      0.84   0.93   0.88

Regarding the true positive rate, the Random Forest Classifier based model offers the highest values, predicting more than 85% of passing students on each data set. As in the previous experiment, the best algorithms in predicting failing students were Logistic Regression Classifier and C-Support Vector Classification. The Logistic Regression Classifier model offers the best results for the AUC metric on each time segment, with a value of 0.86 for the first time segment, 0.95 for the second time segment, and 0.94 for the third time segment.

4. Discussion

In this paper, we set out to show that data generated in a traditional educational environment can be used to predict student academic performance at an early stage. The student performance modeling was conducted at the Politehnica University Timisoara with a new type of course, the distributed examination course. We consider that, in the context of this examination strategy at our university, it is possible to build proper classification models to identify students at risk early on.

The data used in this study was collected in one course, Object Oriented Programming. The student performance modeling approach described in this study is based on course report data generated over years by tutors. We consider that this kind of data could be summoned by tutors and researchers from higher education institutions to make informed decisions. The first three examinations, carried out during the semester, allowed us to introduce the course-specific data sets (DS1, DS2, and DS3). The fourth examination data set was not preserved due to its correspondence to the end of the semester. Based on our time segmentation approach, performance attributes were computed through data aggregation after each examination, such as: number of attendances in practical activity meetings, average activity mark, number of examinations sustained, and average examination mark. We gave details of how and why such data was collected, cleaned, and normalized. The data collection and aggregation steps are of high importance in the overall problem of developing a prediction model and can take more than half the time needed to develop the models.

Classification models using six attributes were developed in this study. The models were built with different algorithms: Decision Tree CART, Extra Trees Classifier, Random Forest Classifier, Logistic Regression Classifier, and C-Support Vector Classification. These algorithms were chosen from the ones available in the Scikit-learn toolkit. Different algorithms may perform better on our data set or on the data sets used by other practitioners who want to implement a prediction model. The point here is not to limit the list of useful algorithms for prediction models to the ones evaluated in this study. It may be regarded as a start, but not an exhaustive list. To validate the performance of our classifiers, stratified tenfold cross-validation and leave-one-label-out cross-validation were employed.

In the first experiment, the performance of the classifiers was evaluated based on three metrics: true positive rate, true negative rate, and accuracy. The results of the first experiment are summarized in Table 5. We have shown that, by using data generated in a traditional educational environment, not collected for prediction model purposes, classification models can be developed that predict student academic performance at an early stage with decent metrics. As we expected, the predictive performance of the classifiers changes from one segment to another (more data is available on the activities of students in courses), with significant increases in DS2 against DS1, and minor changes in DS3 against DS2. Even if the classifiers performed reasonably well in predicting students' performance from the first time segment (DS1), the best results relative to the period were obtained for the second time segment, in week 8 (DS2). These results are correlated with the structure of our course. Therefore, other practitioners must choose the time segments based on the structure of the course for which they are building a prediction model.

The second experiment considered the performance of the developed classifiers by using the leave-one-label-out cross-validation method. Similarly to the previous experiment, the performance of the classifiers was evaluated based on the true positive rate, true negative rate, and accuracy. Our goal was to verify how the models would generalize to an independent year data set. We have shown that the leave-one-label-out cross-validation method can be used to validate models against time-based folds and, therefore, to evaluate how the model generalizes over the years. It is mandatory for a practitioner to evaluate the models for quality over the years. Some tested algorithms presented a significant drop in performance from tenfold cross-validation to leave-one-label-out; for example, SVC for the first time segment (DS1) had a drop of 0.06 on the TN rate metric. To implement the leave-one-label-out approach, all but one of the available academic years were used in the training set, and the left-out year records were used to validate the results. The results obtained for each classification algorithm and time segment by using leave-one-label-out cross-validation are shown in Table 6. After measuring the metrics of the classifiers by using both stratified cross-validation and leave-one-label-out cross-validation, we can conclude that Logistic Regression Classifier produced the best values for this classification problem.

While this study extends our work in predicting student academic performance, focusing on applying data mining techniques on traditional report data, we acknowledge several limitations, some already mentioned in the above discussion. A major limitation is the use of data generated within one course and, therefore, the possibility of using the models for other courses. To mix data and build a general model for multiple courses, the following problems must be considered: how the models could be validated for courses that have a different structure, courses at the same university but on different academic tracks, or courses from different universities. If the practitioner has well-organized data from the previous years, the more logical approach is to develop the models for each individual course.

In addition, a downside of the proposed approach is that we mainly rely on student performance attributes. What to do if there is data collected by electronic means? In this study, we focused on only using old data collected traditionally, to prove that it can be used to build reliable prediction models. If the practitioner has access to data collected by an e-learning platform, it is logical to mine that data and produce extra attributes for the training sets that can improve the performance of the models.

5. Conclusions

This paper describes a study in predicting student academic performance by using only report data generated over years in one distributed examination course from Politehnica University Timisoara. The experimental results have shown that students at risk can be identified at an early stage, even from course performance data that has not been gathered expressly for educational data mining purposes. From our results, we found that Logistic Regression Classifier is the best algorithm for this classification problem, offering good values in terms of accuracy, true negative rate, and true positive rate. However, the other classifiers also perform well, even from the first time segment. Although we only used a limited amount of data, we have shown that classifier performance evaluation may be conducted through cross-validation (stratified tenfold cross-validation, leave-one-label-out cross-validation).

Properly used, the classification models developed in this study could help us identify students at risk of failing so that they can benefit from an intervention, or design students' activities according to their skills or interests. By following the steps described in this paper, practitioners who have not used their old data can develop models to help them gain insight into their students' performance.

Directions for further research emerged during the conduct of this study. The most important direction would be to extract data, regarding this course or some other courses, from our university Learning Management System, based on Moodle.

References

[1]. Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
[2]. Maimon, O., & Rokach, L. (2009). Introduction to knowledge discovery and data mining. In Data Mining and Knowledge Discovery Handbook (pp. 1-15). Springer, Boston, MA.
[3]. Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. JEDM – Journal of Educational Data Mining, 1(1), 3-17.
[4]. Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(6), 601-618.
[5]. Peña-Ayala, A. (2014). Educational data mining: A survey and a data mining-based analysis of recent works. Expert Systems with Applications, 41(4), 1432-1462.
[6]. Delavari, N., Phon-Amnuaisuk, S., & Beikzadeh, M. R. (2008). Data mining application in higher learning institutions. Informatics in Education – International Journal, 7, 31-54.
[7]. Vihavainen, A., Luukkainen, M., & Kurhila, J. (2013, July). Using students' programming behavior to predict success in an introductory mathematics course. In Educational Data Mining 2013.


[8]. Xiong, X., Pardos, Z., & Heffernan, N. (2011). An analysis of response time data for improving student performance prediction. KDD 2011 Workshop: Knowledge Discovery in Educational Data.

[9]. Yudelson, M., & Brunskill, E. (2012). Policy building – an extension to user modeling. Proceedings of the Fifth International Conference on Educational Data Mining, (pp. 188-191).
[10]. Romero, C., Ventura, S., Espejo, P., & Hervas, C. (2008). Data mining algorithms to classify students. Proceedings of Educational Data Mining 2008: 1st International Conference on Educational Data Mining, (pp. 8-17).
[11]. Fayyad, U. (1996). Data mining and knowledge discovery: making sense out of data. IEEE Expert, 11(5), 20-25.
[12]. Kurgan, L., & Musilek, P. (2006). A survey of knowledge discovery and data mining process models. The Knowledge Engineering Review, 21(1), 1-24.
[13]. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
[14]. Lee, N., & Kim, J. (2010). Conversion of categorical variables into numerical variables via Bayesian network classifiers for binary classifications. Computational Statistics & Data Analysis, 54(5), 1247-1265.
[15]. Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann Publishers.
[16]. Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79.
[17]. Hens, A. B., & Tiwari, M. K. (2012). Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications, 39(8), 6774-6781.
[18]. Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence, 14(2), 1137-1143.
[19]. Van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. (2007, June). Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (pp. 935-942). ACM.
[20]. Tan, P. N. (2007). Introduction to Data Mining. Pearson Education India.
[21]. Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and Regression Trees. CRC Press.
[22]. Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
[23]. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
[24]. Yu, H., Huang, F., & Lin, C. (2011). Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2), 41-75.
[25]. Chang, C., & Lin, C. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1-27.
[26]. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
[27]. Márquez-Vera, C., Cano, A., Romero, C., & Ventura, S. (2013). Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Applied Intelligence, 38(3), 315-330.
[28]. Aguiar, E., Chawla, N. V., Brockman, J., Ambrose, G. A., & Goodrich, V. (2014, March). Engagement vs performance: Using electronic portfolios to predict first semester engineering student retention. In Proceedings of the Fourth International Conference on Learning Analytics and Knowledge (pp. 103-112). ACM.
