Chapter 5: Predictive Modelling in Teaching and Learning

Total Page:16

File Type:pdf, Size:1020Kb

Chapter 5: Predictive Modelling in Teaching and Learning Chapter 5: Predictive Modelling in Teaching and Learning Christopher Brooks 1, Craig Thompson 2 1 School of Information, University of Michigan, USA 2 Department of Computer Science, University of Saskatchewan, Canada DOI: 10.18608/hla17.005 ABSTRACT This article describes the process, practice, and challenges of using predictive modelling LQWHDFKLQJDQGOHDUQLQJ,QERWKWKHāHOGVRIHGXFDWLRQDOGDWDPLQLQJ ('0 DQGOHDUQLQJ analytics (LA) predictive modelling has become a core practice of researchers, largely with DIRFXVRQSUHGLFWLQJVWXGHQWVXFFHVVDVRSHUDWLRQDOL]HGE\DFDGHPLFDFKLHYHPHQW,QWKLV chapter, we provide a general overview of considerations when using predictive modelling, the steps that an educational data scientist must consider when engaging in the process, DQGDEULHIRYHUYLHZRIWKHPRVWSRSXODUWHFKQLTXHVLQWKHāHOG Keywords: 3UHGLFWLYHPRGHOLQJPDFKLQHOHDUQLQJHGXFDWLRQDOGDWDPLQLQJ ('0 IHDWXUH selection, model evaluation 3UHGLFWLYHDQDO\WLFVDUHDJURXSRIWHFKQLTXHVXVHG and journals associated with the Society for Learning to make inferences about uncertain future events. Analytics and Research (SoLAR) and the International In the educational domain, one may be interested in (GXFDWLRQDO'DWD0LQLQJ6RFLHW\ ,('06 IRUPRUH predicting a measurement of learning (e.g., student H[DPSOHVRIDSSOLHGHGXFDWLRQDOSUHGLFWLYHPRGHOOLQJ academic success or skill acquisition), teaching (e.g., First, it is important to distinguish predictive mod- WKHLPSDFWRIDJLYHQLQVWUXFWLRQDOVW\OHRUVSHFLāF HOOLQJIURPH[SODQDWRU\PRGHOOLQJ 7,QH[SODQDWRU\ LQVWUXFWRURQDQLQGLYLGXDO RURWKHUSUR[\PHWULFV modelling, the goal is to use all available evidence of value for administrations (e.g., predictions of re- WRSURYLGHDQH[SODQDWLRQIRUDJLYHQRXWFRPH)RU WHQWLRQRUFRXUVHUHJLVWUDWLRQ 3UHGLFWLYHDQDO\WLFV instance, observations of age, gender, and socioeco- in education is a well-established area of research, nomic status of a learner population might be used and several commercial products now incorporate LQDUHJUHVVLRQPRGHOWRH[SODLQKRZWKH\FRQWULEXWH predictive analytics in the learning content manage- to a given student achievement result. The intent of 1 2 PHQWV\VWHP HJ'/ 6WDUāVK5HWHQWLRQ6ROXWLRQV WKHVHH[SODQDWLRQVLVJHQHUDOO\WREHFDXVDO YHUVXV 3 4 Ellucian, and Blackboard )XUWKHUPRUHVSHFLDOL]HG correlative alone), though results presented using these 5 companies (e.g., Blue Canary, Civitas Learning ) now DSSURDFKHVRIWHQHVFKHZH[SHULPHQWDOVWXGLHVDQG provide predictive analytics consulting and products rely on theoretical interpretation to imply causation for higher education. (as described well by Shmueli, 2010). ,QWKLVFKDSWHUZHLQWURGXFHWKHWHUPVDQGZRUNĂRZ In predictive modelling, the purpose is to create a model related to predictive modelling, with a particular that will predict the values (or class if the prediction emphasis on how these techniques are being applied does not deal with numeric data) of new data based on in teaching and learning. While a full review of the REVHUYDWLRQV8QOLNHH[SODQDWRU\PRGHOOLQJSUHGLFWLYH literature is beyond the scope of this chapter, we en- modelling is based on the assumption that a set of known courage readers to consider the conference proceedings data (referred to as training instances in data mining 1 http://www.d2l.com/ 7 Shmueli (2010) notes a third form of modelling, descriptive 2KWWSZZZVWDUāVKVROXWLRQVFRP PRGHOOLQJZKLFKLVVLPLODUWRH[SODQDWRU\PRGHOOLQJEXWLQZKLFK 3 http://www.ellucian.com/ there are no claims of causation. In the higher education literature, 4 http://www.blackboard.com/ we would suggest that causation is often implied, and the majority 5 http://bluecanarydata.com/ of descriptive analyses are actually intended to be used as causal http://www.civitaslearning.com/ HYLGHQFHWRLQĂXHQFHGHFLVLRQPDNLQJ CHAPTER 5 PREDICTIVE MODELLING IN TEACHING & LEARNING PG 61 literature) can be used to predict the value or class of PREDICTIVE MODELLING new data based on observed variables (referred to as WORKFLOW features in predictive modelling literature). Thus the SULQFLSDOGLIIHUHQFHEHWZHHQH[SODQDWRU\PRGHOOLQJ Problem Identification and predictive modelling is with the application of the In the domain of teaching and learning, predictive PRGHOWRIXWXUHHYHQWVZKHUHH[SODQDWRU\PRGHOOLQJ modelling tends to sit within a larger action-oriented does not aim to make any claims about the future, HGXFDWLRQDOSROLF\DQGWHFKQRORJ\FRQWH[WZKHUHLQ- while predictive modelling does. stitutions use these models to react to student needs 0RUHFDVXDOO\H[SODQDWRU\PRGHOOLQJDQGSUHGLFWLYH in real-time. The intent of the predictive modelling modelling often have a number of pragmatic differ- activity is to set up a scenario that would accurately HQFHVZKHQDSSOLHGWRHGXFDWLRQDOGDWD([SODQDWRU\ describe the outcomes of a given student assuming PRGHOOLQJLVDSRVWKRFDQGUHĂHFWLYHDFWLYLW\DLPHG no new intervention. For instance, one might use a at generating an understanding of a phenomenon. predictive model to determine when a given individual 3UHGLFWLYHPRGHOOLQJLVDQLQVLWXDFWLYLW\LQWHQGHGWR is likely to complete their academic degree. Applying make systems responsive to changes in the underlying this model to individual students will provide insight data. It is possible to apply both forms of modelling into when they might complete their degrees assuming to technology in higher education. For instance, Lonn no intervention strategy is employed. Thus, while it is and Teasley (2014) describe a student-success system important for a predictive model to generate accurate EXLOWRQH[SODQDWRU\PRGHOVZKLOH%URRNV7KRPSVRQ scenarios, these models are not generally deployed and Teasley (2015) describe an approach based upon without an intervention or remediation strategy in mind. predictive modelling. While both methods intend Strong candidate problems for a successful predictive to inform the design of intervention systems, the modelling approach are those in which there are quan- former does so by building software based on theory WLāDEOHFKDUDFWHULVWLFVRIWKHVXEMHFWEHLQJPRGHOOHG GHYHORSHGGXULQJWKHUHYLHZRIH[SODQDWRU\PRGHOVE\ a clear outcome of interest, the ability to intervene in H[SHUWVZKLOHWKHODWWHUGRHVVRXVLQJGDWDFROOHFWHG situ, and a large set of data. Most importantly, there IURPKLVWRULFDOORJāOHV LQWKLVFDVHFOLFNVWUHDPGDWD must be a recurring need, such as a class being ordered The largest methodological difference between the two year after year, where the historical data on learners modelling approaches is in how they address the issue (the training set) is indicative of future learners (the RIJHQHUDOL]DELOLW\,QH[SODQDWRU\PRGHOOLQJDOORIWKH testing set). data collected from a sample (e.g., students enrolled in Conversely, several factors make predictive modelling a given course) is used to describe a population more PRUHGLIāFXOWRUOHVVDSSURSULDWH)RUH[DPSOHERWK generally (e.g., all students who could or might enroll in sparse and noisy data present challenges when trying DJLYHQFRXUVH 7KHLVVXHVUHODWHGWRJHQHUDOL]DELOLW\ WRFUHDWHDFFXUDWHSUHGLFWLYHPRGHOV'DWDVSDUVLW\RU are largely based on sampling techniques. Ensuring the missing data, can occur for a variety of reasons, such as sample represents the general population by reducing students choosing not to provide optional information. VHOHFWLRQELDVRIWHQWKURXJKUDQGRPRUVWUDWLāHGVDP - Noisy data occurs when a measurement fails to capture pling, and determining the amount of power needed the intended data accurately, such as determining a to ensure an appropriate sample, through an analysis VWXGHQWÚVORFDWLRQIURPWKHLU,3DGGUHVVZKHQVRPH RISRSXODWLRQVL]HDQGOHYHOVRIHUURUWKHLQYHVWLJDWRU VWXGHQWVDUHXVLQJYLUWXDOSULYDWHQHWZRUNV SUR[LHV is willing to accept. In a predictive model, a hold out used to circumvent region restrictions, a not uncommon dataset is used to evaluate the suitability of a model practice in countries such as China). Finally, in some IRUSUHGLFWLRQDQGWRSURWHFWDJDLQVWWKHRYHUāWWLQJ domains, inferences produced by predictive models of models to data being used for training. There are may be at odds with ethical or equitable practice, several different strategies for producing hold out such as using models of student at-risk predictions datasets, including k-fold cross validation, leave-one- WROLPLWWKHDGPLVVLRQVRIVDLGVWXGHQWV H[HPSOLāHG RXWFURVVYDOLGDWLRQUDQGRPL]HGVXEVDPSOLQJDQG LQ6WULSOLQJHWDO DSSOLFDWLRQVSHFLāFVWUDWHJLHV Data Collection With these comparisons made, the remainder of this In predictive modelling, historical data is used to gen- chapter will focus on how predictive modelling is erate models of relationships between features. One being used in the domain of teaching and learning, RIWKHāUVWDFWLYLWLHVIRUDUHVHDUFKHULVWRLGHQWLI\WKH and provide an overview of how researchers engage outcome variable (e.g., grade or achievement level) as in the predictive modelling process. well as the suspected correlates of this variable (e.g., JHQGHUHWKQLFLW\DFFHVVWRJLYHQUHVRXUFHV *LYHQ the situational nature of the modelling activity, it is PG 62 HANDBOOK OF LEARNING ANALYTICS important to choose only those correlates available at categorical, and interval and ratio are considered as or before the time in which an intervention might be numeric. Categorical values may be binary (such as HPSOR\HG)RULQVWDQFHDPLGWHUPH[DPLQDWLRQJUDGH predicting whether a student will pass or fail a course) PLJKWEHSUHGLFWLYHRIDāQDOJUDGHLQWKHFRXUVHEXW or multivalued (such as predicting which of a given set of if the intent is to intervene before the midterm, this possible practice questions would be most appropriate data value should be left out of the modelling activity. IRUDVWXGHQW
Recommended publications
  • Health Outcomes from Home Hospitalization: Multisource Predictive Modeling
    JOURNAL OF MEDICAL INTERNET RESEARCH Calvo et al Original Paper Health Outcomes from Home Hospitalization: Multisource Predictive Modeling Mireia Calvo1, PhD; Rubèn González2, MSc; Núria Seijas2, MSc, RN; Emili Vela3, MSc; Carme Hernández2, PhD; Guillem Batiste2, MD; Felip Miralles4, PhD; Josep Roca2, MD, PhD; Isaac Cano2*, PhD; Raimon Jané1*, PhD 1Institute for Bioengineering of Catalonia (IBEC), Barcelona Institute of Science and Technology (BIST), Universitat Politècnica de Catalunya (UPC), CIBER-BBN, Barcelona, Spain 2Hospital Clínic de Barcelona, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Universitat de Barcelona (UB), Barcelona, Spain 3Àrea de sistemes d'informació, Servei Català de la Salut, Barcelona, Spain 4Eurecat, Technology Center of Catalonia, Barcelona, Spain *these authors contributed equally Corresponding Author: Isaac Cano, PhD Hospital Clínic de Barcelona Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS) Universitat de Barcelona (UB) Villarroel 170 Barcelona, 08036 Spain Phone: 34 932275540 Email: [email protected] Abstract Background: Home hospitalization is widely accepted as a cost-effective alternative to conventional hospitalization for selected patients. A recent analysis of the home hospitalization and early discharge (HH/ED) program at Hospital Clínic de Barcelona over a 10-year period demonstrated high levels of acceptance by patients and professionals, as well as health value-based generation at the provider and health-system levels. However, health risk assessment was identified as an unmet need with the potential to enhance clinical decision making. Objective: The objective of this study is to generate and assess predictive models of mortality and in-hospital admission at entry and at HH/ED discharge. Methods: Predictive modeling of mortality and in-hospital admission was done in 2 different scenarios: at entry into the HH/ED program and at discharge, from January 2009 to December 2015.
    [Show full text]
  • Review on Predictive Modelling Techniques for Identifying Students at Risk in University Environment
    MATEC Web of Conferences 255, 03002 (2019) https://doi.org/10.1051/matecconf/20192 5503002 EAAI Conference 2018 Review on Predictive Modelling Techniques for Identifying Students at Risk in University Environment Nik Nurul Hafzan Mat Yaacob1,*, Safaai Deris1,3, Asiah Mat2, Mohd Saberi Mohamad1,3, Siti Syuhaida Safaai1 1Faculty of Bioengineering and Technology, Universiti Malaysia Kelantan, Jeli Campus, 17600 Jeli, Kelantan, Malaysia 2Faculty of Creative and Heritage Technology, Universiti Malaysia Kelantan,16300 Bachok, Kelantan, Malaysia 3Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, City Campus, Pengkalan Chepa, 16100 Kota Bharu, Kelantan, Malaysia Abstract. Predictive analytics including statistical techniques, predictive modelling, machine learning, and data mining that analyse current and historical facts to make predictions about future or otherwise unknown events. Higher education institutions nowadays are under increasing pressure to respond to national and global economic, political and social changes such as the growing need to increase the proportion of students in certain disciplines, embedding workplace graduate attributes and ensuring that the quality of learning programs are both nationally and globally relevant. However, in higher education institution, there are significant numbers of students that stop their studies before graduation, especially for undergraduate students. Problem related to stopping out student and late or not graduating student can be improved by applying analytics. Using analytics, administrators, instructors and student can predict what will happen in future. Administrator and instructors can decide suitable intervention programs for at-risk students and before students decide to leave their study. Many different machine learning techniques have been implemented for predictive modelling in the past including decision tree, k-nearest neighbour, random forest, neural network, support vector machine, naïve Bayesian and a few others.
    [Show full text]
  • From Exploratory Data Analysis to Predictive Modeling and Machine Learning Gilbert Saporta
    50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning Gilbert Saporta To cite this version: Gilbert Saporta. 50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning. Christos Skiadas; James Bozeman. Data Analysis and Applications 1. Clus- tering and Regression, Modeling-estimating, Forecasting and Data Mining, ISTE-Wiley, 2019, Data Analysis and Applications, 978-1-78630-382-0. 10.1002/9781119597568.fmatter. hal-02470740 HAL Id: hal-02470740 https://hal-cnam.archives-ouvertes.fr/hal-02470740 Submitted on 11 Feb 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 50 years of data analysis: from exploratory data analysis to predictive modelling and machine learning Gilbert Saporta CEDRIC-CNAM, France In 1962, J.W.Tukey wrote his famous paper “The future of data analysis” and promoted Exploratory Data Analysis (EDA), a set of simple techniques conceived to let the data speak, without prespecified generative models. In the same spirit J.P.Benzécri and many others developed multivariate descriptive analysis tools. Since that time, many generalizations occurred, but the basic methods (SVD, k-means, …) are still incredibly efficient in the Big Data era.
    [Show full text]
  • Data Mining and Exploration Michael Gutmann the University of Edinburgh
    Lecture Notes Data Mining and Exploration Michael Gutmann The University of Edinburgh Spring Semester 2017 February 27, 2017 Contents 1 First steps in exploratory data analysis1 1.1 Distribution of single variables....................1 1.1.1 Numerical summaries.....................1 1.1.2 Graphs.............................6 1.2 Joint distribution of two variables................... 10 1.2.1 Numerical summaries..................... 10 1.2.2 Graphs............................. 14 1.3 Simple preprocessing.......................... 15 1.3.1 Simple outlier detection.................... 15 1.3.2 Data standardisation...................... 15 2 Principal component analysis 19 2.1 PCA by sequential variance maximisation.............. 19 2.1.1 First principal component direction............. 19 2.1.2 Subsequent principal component directions......... 21 2.2 PCA by simultaneous variance maximisation............ 23 2.2.1 Principle............................ 23 2.2.2 Sequential maximisation yields simultaneous maximisation∗ 23 2.3 PCA by minimisation of approximation error............ 25 2.3.1 Principle............................ 26 2.3.2 Equivalence to PCA by variance maximisation....... 27 2.4 PCA by low rank matrix approximation............... 28 2.4.1 Approximating the data matrix................ 28 2.4.2 Approximating the sample covariance matrix........ 30 2.4.3 Approximating the Gram matrix............... 30 3 Dimensionality reduction 33 3.1 Dimensionality reduction by PCA.................. 33 3.1.1 From data points........................ 33 3.1.2 From inner products...................... 34 3.1.3 From distances......................... 35 3.1.4 Example............................. 36 3.2 Dimensionality reduction by kernel PCA............... 37 3.2.1 Idea............................... 37 3.2.2 Kernel trick........................... 38 3.2.3 Example............................. 39 3.3 Multidimensional scaling........................ 40 iv CONTENTS 3.3.1 Metric MDS..........................
    [Show full text]
  • Comparing Out-Of-Sample Predictive Ability of PLS, Covariance, and Regression Models Completed Research Paper
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by AIS Electronic Library (AISeL) Comparing Out-of-Sample Predictive Ability of PLS, Covariance, and Regression Models Completed Research Paper Joerg Evermann Mary Tate Memorial University of Newfoundland Victoria University of Wellington St. John’s, Canada Wellington, New Zealand [email protected] [email protected] Abstract Partial Least Squares Path Modelling (PLSPM) is a popular technique for estimating structural equation models in the social sciences, and is frequently presented as an alternative to covariance-based analysis as being especially suited for predictive modeling. While existing research on PLSPM has focused on its use in causal- explanatory modeling, this paper follows two recent papers at ICIS 2012 and 2013 in examining how PLSPM performs when used for predictive purposes. Additionally, as a predictive technique, we compare PLSPM to traditional regression methods that are widely used for predictive modelling in other disciplines. Specifically, we employ out-of- sample k-fold cross-validation to compare PLSPM to covariance-SEM and a range of a- theoretical regression techniques in a simulation study. Our results show that PLSPM offers advantages over covariance-SEM and other prediction methods. Keywords: Research Methods, Statistical methods, Structural equation modeling (SEM), Predictive modeling, Partial Least Squares, Covariance Analysis, Regression Introduction Explanation and prediction are two main purposes of theories and statistical methods (Gregor, 2006). Explanation is concerned with the identification of causal mechanisms underlying a phenomenon. On the statistical level, explanation is primarily concerned with testing the faithful representation of causal mechanisms by the statistical model and the estimation of true population parameter values from samples.
    [Show full text]
  • Deciding the Best Machine Learning Algorithm for Customer Attrition Prediction for a Telecommunication Company
    American Journal of Engineering Research (AJER) 2020 American Journal of Engineering Research (AJER) e-ISSN: 2320-0847 p-ISSN : 2320-0936 Volume-9, Issue-2, pp-1-7 www.ajer.org Research Paper Open Access Deciding the Best Machine Learning Algorithm for Customer Attrition Prediction for a Telecommunication Company Omijeh Bourdillon O. 1, Abarugo Goodness Chidinma2 1(Electronic and Computer Engineering,University of Port-Harcourt, Rivers State, Nigeria. 2Centre for Information and Telecommunications Engineering (CITE),University of Port-Harcourt, Rivers State, Nigeria.) Corresponding Author: Abarugo Goodness Chidinma. ABSTRACT : Customers are so important in business that every firm should put great effort into retaining them. To achieve that with some measure of success, the firm needs to be able to predict the behaviour of their customers with respect to churn or attrition. There are many machine learning algorithms that may be used to predict attrition, but this paper considers only four of them. Logistic regression, k-nearest neighbour, random forest and XGBoost machine learning algorithms were applied in different ways to the dataset gotten from Kaggle in order to decide the best algorithm to suggest to the company for customer attrition prediction. Results showed that the logistic regression or random forest algorithm may be adopted by the telecom company to predict which of their customers may leave in the future based on their recall and precision scores as well as the AUC values of approximately 75%. The logistic regression algorithm gave metrics of 73% accuracy and 59% f1-score while the random forest algorithm yielded 70% accuracy and a 58% f1-score.
    [Show full text]
  • Machine Learning and Deep Learning Methods for Predictive Modelling from Raman Spectra in Bioprocessing
    Semion Rozov Machine Learning and Deep Learning methods for predictive modelling from Raman spectra in bioprocessing Master Thesis Institute for Chemical and Bioengineering Swiss Federal Institute of Technology (ETH) Zurich arXiv:2005.02935v1 [q-bio.QM] 6 May 2020 Supervision Dr. Michael Sokolov Dr. Fabian Feidl Prof. Dr. Gonzalo Guillen Gosalbez February 2020 Abstract In chemical processing and bioprocessing, conventional online sensors are limited to measure only basic process variables like pressure and temperature, pH, dissolved O and CO2 and viable cell density (VCD). The concentration of other chemical species is more difficult to measure, as it usu- ally requires an at-line or off-line approach. Such approaches are invasive and slow compared to on-line sensing. It is known that different molecules can be distinguished by their interaction with monochromatic light, producing different profiles for the resulting Raman spectrum, depending on the concentration. Given the availability of reference measurements for the target variable, regres- sion methods can be used to model the relationship between the profile of the Raman spectra and the concentration of the analyte. This work focused on pretreatment methods of Raman spectra for the facilitation of the regres- sion task using Machine Learning and Deep Learning methods, as well as the development of new regression models based on these methods. In the majority of cases, this allowed to outperform conventional Raman models in terms of pre- diction error and prediction robustness. 1 2 Contents Abstract 1 1 Introduction 11 1.1 Goals of the project . 12 2 Experimental Section and Setup 13 2.1 Evaluation criteria and model selection .
    [Show full text]
  • Predictive Modelling Applied to Propensity to Buy Personal Accidents Insurance Products
    Predictive Modelling Applied to Propensity to Buy Personal Accidents Insurance Products Esdras Christo Moura dos Santos Internship report presented as partial requirement for obtaining the Master’s degree in Advanced Analytics i 2017 Title: Predictive Models Applied to Propensity to Buy Personal Accidents Student: Esdras Christo Moura dos Santos Insurance Products MAA i i NOVA Information Management School Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa PREDICTIVE MODELLING APPLIED TO PROPENSITY TO BUY PERSONAL ACCIDENTS INSURANCE PRODUCTS by Esdras Christo Moura dos Santos Internship report presented as partial requirement for obtaining the Master’s degree in Advanced Analytics Advisor: Mauro Castelli ii February 2018 DEDICATION Dedicated to my beloved family. iii ACKNOWLEDGEMENTS I would like to express my gratitude to my supervisor, Professor Mauro Castelli of Information Management School of Universidade Nova de Lisboa for all the mentoring and assistance. I also want to show my gratitude for the data mining team at Ocidental, Magdalena Neate and Franklin Minang. I deeply appreciate all the guidance, patience and support during this project. iv ABSTRACT Predictive models have been largely used in organizational scenarios with the increasing popularity of machine learning. They play a fundamental role in the support of customer acquisition in marketing campaigns. This report describes the development of a propensity to buy model for personal accident insurance products. The entire process from business understanding to the deployment of the final model is analyzed with the objective of linking the theory to practice. KEYWORDS Predictive models; data mining; supervised learning; propensity to buy; logistic regression; decision trees; artificial neural networks; ensemble models.
    [Show full text]
  • 50 Years of Data Analysis from EDA to Predictive Modelling and Machine Learning
    50 Years of Data Analysis from EDA to predictive modelling and machine learning Gilbert Saporta CEDRIC- CNAM, 292 rue Saint Martin, F-75003 Paris [email protected] http://cedric.cnam.fr/~saporta Data analysis vs mathematical statistics • An international movement in reaction to the abuses of formalization • Let the data speak • Computerized statistics ASMDA 2017 2 • « He (Tukey) seems to identify statistics with the grotesque phenomenon generally known as mathematical statistics and find it necessary to replace statistics by data analysis » (Anscombe, 1967). John Wilder Tukey (1915-2000) ASMDA 2017 3 • « Statistics is not probability, under the name of mathematical statistics was built a pompous discipline based on theoretical assumptions that are rarely met in practice » (Benzécri, 1972) • « Data analysis is a tool to release from the gangue of the data the pure diamond of the true nature » Jean-Paul Benzécri (1932- ) ASMDA 2017 4 Japan, Netherlands, Canada, Italy… Chikio Hayashi (1918-2002) Jan de Leeuw (1945-) Shizuiko Nishisato (1935-) Carlo Lauro (1943-) ASMDA 2017 5 Meetings • Data Analysis and Informatics • ASMDA from 1981 from 1977 Edwin Diday Ludovic Lebart Jacques Janssen ASMDA 2017 6 Part 1 Exploratory data analysis « Data Analysis »: a collection of unsupervised methods for dimension reduction • « Factor » analysis • PCA, • CA, MCA • Cluster analysis • k-means partitioning • Hierarchical clustering ASMDA 2017 8 and came the time of syntheses: • All (factorial) methods are particular cases of: PCA canonical correlation analysis 1976 J.Douglas Carroll (1939 -2011) ASMDA 2017 9 Even more methods are particular cases of the maximum association principle: p MaxYj(,) Y X j1 M.Tenenhaus (1977), J.F.
    [Show full text]
  • The Development of a Predictive Model of On-Time High School Graduation in British Columbia
    The Development of a Predictive Model of On-Time High School Graduation in British Columbia May 1, 2019 Ross Finnie Eda Suleymanoglu Ashley Pullman Michael Dubois Table of Contents Executive Summary ........................................................................................................................ 1 1. Introduction .............................................................................................................................. 4 2. Literature Review .................................................................................................................... 5 3. Data .......................................................................................................................................... 7 3.1 Outcome Variables of Interest ............................................................................................ 8 3.2 Predictor Variables .............................................................................................................. 8 3.1 Sample Selection ................................................................................................................. 9 4. Methodology .......................................................................................................................... 10 4.1 Predictive Model ............................................................................................................... 10 4.2 Predictive Accuracy .........................................................................................................
    [Show full text]
  • Quantification of Predictive Uncertainty in Hydrological Modelling by Harnessing the Wisdom of the Crowd: Methodology Development and Investigation Using Toy Models
    Quantification of predictive uncertainty in hydrological modelling by harnessing the wisdom of the crowd: Methodology development and investigation using toy models Georgia Papacharalampous1,*, Demetris Koutsoyiannis2, and Alberto Montanari3 1 Department of Water Resources and Environmental Engineering, School of Civil Engineering, National Technical University of Athens, Heroon Polytechneiou 5, 157 80 Zographou, Greece; [email protected]; https://orcid.org/0000-0001-5446-954X 2 Department of Water Resources and Environmental Engineering, School of Civil Engineering, National Technical University of Athens, Heroon Polytechneiou 5, 157 80 Zographou, Greece; [email protected]; https://orcid.org/0000-0002-6226-0241 3 Department of Civil, Chemical, Environmental and Materials Engineering (DICAM), University of Bologna, via del Risorgimento 2, 40136 Bologna, Italy; [email protected]; https://orcid.org/0000-0001-7428-0410 * Correspondence: [email protected], tel: +30 69474 98589 This is the accepted manuscript of an article published in Advances in Water Resources. Please cite the article as: Papacharalampous GA, Koutsoyiannis D, Montanari A (2020) Quantification of predictive uncertainty in hydrological modelling by harnessing the wisdom of the crowd: Methodology development and investigation using toy models. Advances in Water Resources 136:103471. https://doi.org/10.1016/j.advwatres.2019.103471 Abstract: We introduce an ensemble learning post-processing methodology for probabilistic hydrological modelling. This methodology generates numerous point predictions by applying a single hydrological model, yet with different parameter values drawn from the respective simulated posterior distribution. We call these predictions “sister predictions”. Each sister prediction extending in the period of interest is converted into a probabilistic prediction using information about the hydrological model’s errors.
    [Show full text]
  • Exploration of Customer Churn Routes Using Machine Learning Probabilistic Models
    UNIVERSITAT POLITECNICA` DE CATALUNYA Exploration of Customer Churn Routes Using Machine Learning Probabilistic Models DOCTORAL THESIS Author: David L. Garc´ıa Advisors: Dra. Angela` Nebot and Dr. Alfredo Vellido Soft Computing Research Group Departament de Llenguatges i Sistemes Informatics` Universitat Politecnica` de Catalunya. BarcelonaTech Barcelona, 10th April 2014 A bird doesn’t sing because it has an answer, it sings because it has a song. Maya Angelou AMonica,´ Olivia, Pol i Jana 1 Acknowledgments Siempre he pensado que las cosas, en la vida, no ocurren por casualidad; son el conjunto de pequenas˜ (y grandes) aportaciones, casualidades, ideas, retos, azar, amor, exigencia y apoyo, animo´ y desanimo,´ es- fuerzo, inspiracion,´ reflexiones y cr´ıticas y, en definitiva, de todo aquello que nos configura como personas. Y la presente Tesis Doctoral no es una excepcion.´ Sin la aportacion´ en pequenas˜ dosis de estos ingredientes por parte de muchas personas esta Tesis no hubiera visto nunca la luz. A todos ellos mi publica´ gratitud. A los Dres. Angela` Nebot y Alfredo Vellido, en primer lugar por reconocer algun´ tipo de ’valor’ difer- encial en un perfil como el m´ıo y aceptarme en el Soft Computing Group como estudiante de doctorado. Y en segundo lugar, y mas´ importante, por ser el alma de este trabajo. Si os fijais´ observareis´ su presencia e ideas a lo largo de la Tesis Doctoral. Para ellos mi gratitud y admiracion.´ A mis clientes y colegas en el mundo empresarial, con los que he compartido infinitas horas y problemas y cuya exigencia ha puesto a prueba, cada d´ıa, el sentido y utilidad practica´ de las ideas y metodolog´ıas aqu´ı presentadas.
    [Show full text]