Exploration of Customer Churn Routes Using Machine Learning Probabilistic Models

Total Pages: 16

File Type: pdf, Size: 1020 KB

UNIVERSITAT POLITÈCNICA DE CATALUNYA

Exploration of Customer Churn Routes Using Machine Learning Probabilistic Models

DOCTORAL THESIS

Author: David L. García
Advisors: Dra. Àngela Nebot and Dr. Alfredo Vellido

Soft Computing Research Group, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, BarcelonaTech. Barcelona, 10th April 2014.

"A bird doesn't sing because it has an answer, it sings because it has a song." (Maya Angelou)

To Mònica, Olivia, Pol and Jana

Acknowledgments

I have always thought that things in life do not happen by chance; they are the sum of small (and large) contributions, coincidences, ideas, challenges, chance, love, demands and support, encouragement and discouragement, effort, inspiration, reflections and criticism, and, in short, of everything that shapes us as people. This Doctoral Thesis is no exception. Without the contribution, in small doses, of these ingredients by many people, this Thesis would never have seen the light. To all of them, my public gratitude.

To Drs. Àngela Nebot and Alfredo Vellido: first, for recognizing some kind of differential "value" in a profile like mine and accepting me into the Soft Computing Group as a doctoral student; and second, and more importantly, for being the soul of this work. If you look closely, you will notice their presence and ideas throughout the Doctoral Thesis. To them, my gratitude and admiration.

To my clients and colleagues in the business world, with whom I have shared countless hours and problems, and whose demands have tested, every day, the sense and practical usefulness of the ideas and methodologies presented here. One of them, Francesc Massip, has been of invaluable help in operationalizing these ideas in some of the large companies we have worked with. My recognition here of his tireless work, which I extend to the endless group of collaborators alongside whom I have been consulting for more than 20 years. Thank you for letting me share your standards and excellence at work.

And, finally, to my family (starting with my wife Mònica and her infinite patience, my children Olivia, Pol and Jana, and also my mother Pepita, my brother Jordi and my sister-in-law Montse), the most important thing in my life, who have always encouraged me to keep going and who can finally breathe again now that they see me freed from the author's slavery. Without their love, example and support, I would never have been able to complete this doctoral study.

Abstract

The ongoing processes of globalization and deregulation are changing the competitive framework in the majority of economic sectors. The appearance of new competitors and technologies entails a sharp increase in competition and a growing preoccupation among service-providing companies with creating stronger bonds with customers. Many of these companies are shifting resources away from the goal of capturing new customers and are instead focusing on retaining existing ones. In this context, anticipating the customer's intention to abandon, a phenomenon also known as churn, and facilitating the launch of retention-focused actions represent clear elements of competitive advantage. Data mining, as applied to market surveyed information, can provide assistance to churn management processes.
In this thesis, we mine real market data for churn analysis, placing a strong emphasis on the applicability and interpretability of the results. Statistical Machine Learning models for simultaneous data clustering and visualization lay the foundations for the analyses, which yield an interpretable segmentation of the surveyed markets. To achieve interpretability, much attention is paid to the intuitive visualization of the experimental results. Given that the modelling techniques under consideration are nonlinear in nature, this represents a non-trivial challenge. Newly developed techniques for data visualization in nonlinear latent models are presented. They are inspired by geographical representation methods and suited to both static and dynamic data representation.

Contents

1 Introduction
  1.1 Main goals of the thesis
    1.1.1 Market analysis goals of the thesis
    1.1.2 Data modelling goals of the thesis
  1.2 Structure of the Thesis
2 Customer Continuity Management as a foundation for churn Data Mining
  2.1 Customer churn: the business case
    2.1.1 Customer continuity management
  2.2 Customer churn prevention: loyalty construction
    2.2.1 Basic concepts
    2.2.2 Models of loyalty
3 Predictive models in churn management
  3.1 Building predictive models of abandonment
    3.1.1 Stage 1: Identifying and obtaining the best data
    3.1.2 Stage 2: Selection of attributes
    3.1.3 Stage 3: Development of a predictive model
    3.1.4 Stage 4: Validation of results
  3.2 A summarized review of the literature
4 Manifold learning: visualizing and clustering data
  4.1 Latent variable models
  4.2 Self Organizing Maps
    4.2.1 Basic SOM
    4.2.2 The batch-SOM algorithm
  4.3 Generative Topographic Mapping
    4.3.1 The GTM Standard Model
    4.3.2 The Expectation-Maximization (EM) algorithm
    4.3.3 Data visualization and clustering through GTM
    4.3.4 Magnification Factors for the GTM
5 Supervised customer loyalty analysis
  5.1 Petrol station customer satisfaction, loyalty and switching barriers
  5.2 Methods
    5.2.1 Bayesian ANN with ARD
    5.2.2 Orthogonal Search-based Rule Extraction
  5.3 Experiments
    5.3.1 Results and discussion
  5.4 Conclusions
6 Unsupervised churn analysis in a telecommunications company
  6.1 Problem Description
  6.2 Telecommunications customer data
  6.3 Methods
    6.3.1 The FRD-GTM
    6.3.2 Two-Tier market segmentation
  6.4 Experiments
    6.4.1 Visualization and clustering using GTM
    6.4.2 Visualization and clustering using FRD-GTM
  6.5 Conclusions
7 Distortion visualization in GTM
  7.1 Visualization of multivariate data using cartograms
    7.1.1 Methods
      7.1.1.1 Density-equalizing cartograms
    7.1.2 Cartogram visualization of the GTM magnification factors
  7.2 Experiments
    7.2.1 Experiments with artificial data
      7.2.1.1 Preliminary experiment with 3-D artificial data
      7.2.1.2 Further experiments with artificial data
    7.2.2 Experiments with real data
  7.3 Conclusions
8 Distortion and Flow Maps visualization for churn analysis
  8.1 Methods and Materials
    8.1.1 Flow Maps for the visualization of customer migrations in GTM
    8.1.2 Brazilian telecommunication company
    8.1.3 Spanish pay-per-view television company
  8.2 Experiments
    8.2.1 Brazilian telecommunication company
      8.2.1.1 Experimental Setup
      8.2.1.2 Results
    8.2.2 Discussion
    8.2.3 Spanish pay-per-view company
      8.2.3.1 Results and discussion
  8.3 Conclusions
9 Conclusions and future research
  9.1 Conclusions
  9.2 Suggestions for future research
Publications
References

List of Figures

2 Customer Continuity Management as a foundation for churn Data Mining
  2.1 Commercial development alternatives in mature markets
  2.2 Stages in a client's lifecycle
  2.3 Customer lifecycle management model: Customer Continuity Management
  2.4 Illustration of the effect of developing satisfaction excellence policies and procedures and positive switching barriers
  2.5 Models analyzed by Cronin
  2.6 Model proposed by Jones
  2.7 Model proposed by Chen
  2.8 Model proposed by Ranaweera
  2.9 Model proposed by Patterson
  2.10 Model proposed by Gounaris
  2.11 Model proposed by Kim
  2.12 Model proposed by Lam
  2.13 Integrated model of retail service relationships proposed by Fullerton
  2.14 Second variant of integrated model proposed by Fullerton
  2.15 Model proposed by Vázquez
  2.16.
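Chapter 4 grounds the thesis's analyses in latent-model clustering and visualization methods such as the Self-Organizing Map (SOM) and the Generative Topographic Mapping (GTM). As a purely illustrative sketch of the simpler of the two, the batch-SOM of Section 4.2.2, the following numpy-only code places toy "customer" records on a 2-D map grid; it is not the thesis's implementation, and the grid size, neighbourhood schedule and synthetic data are assumptions made for the example.

```python
# Illustrative batch-SOM on a 2-D map grid (numpy only).
import numpy as np

def batch_som(X, grid=(10, 10), n_iter=20, sigma0=3.0):
    """Fit a batch Self-Organizing Map; return node coordinates, prototypes and BMUs."""
    rng = np.random.default_rng(0)
    gx, gy = np.meshgrid(np.arange(grid[0]), np.arange(grid[1]))
    nodes = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)   # (K, 2) map coordinates
    K = nodes.shape[0]
    W = X[rng.choice(len(X), K, replace=False)].astype(float)         # prototype vectors (K, D)
    for t in range(n_iter):
        sigma = sigma0 * (0.1 / sigma0) ** (t / max(n_iter - 1, 1))   # shrinking neighbourhood
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)           # (N, K) squared distances
        bmu = d2.argmin(axis=1)                                       # best-matching unit per sample
        nd2 = ((nodes[:, None, :] - nodes[bmu][None, :, :]) ** 2).sum(-1)   # (K, N) grid distances
        H = np.exp(-nd2 / (2 * sigma ** 2))                           # neighbourhood weights
        W = (H @ X) / np.clip(H.sum(axis=1, keepdims=True), 1e-12, None)    # batch update
    return nodes, W, bmu

# toy "customer" data: two behavioural segments in 5 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5))])
nodes, W, bmu = batch_som(X)
print("samples mapped onto", len(np.unique(bmu)), "distinct map units")
```

The GTM used throughout Chapters 6 to 8 can be read as a probabilistic counterpart of this map: a constrained mixture model fitted by EM, whose posterior responsibilities provide the visualization and clustering.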
Recommended publications
  • Evolutionary Data Mining Approach to Creating Digital Logic
    EVOLUTIONARY DATA MINING APPROACH TO CREATING DIGITAL LOGIC. James F. Smith III, ThanhVu H. Nguyen, Code 5741, Naval Research Laboratory, Washington, DC, 20375-5320, USA. Keywords: Optimization problems in signal processing, signal reconstruction, system identification, time series and system modeling. Abstract: A data mining based procedure for automated reverse engineering has been developed. The data mining algorithm for reverse engineering uses a genetic program (GP) as a data mining function. A genetic program is an algorithm based on the theory of evolution that automatically evolves populations of computer programs or mathematical expressions, eventually selecting one that is optimal in the sense it maximizes a measure of effectiveness, referred to as a fitness function. The system to be reverse engineered is typically a sensor. Design documents for the sensor are not available and conditions prevent the sensor from being taken apart. The sensor is used to create a database of input signals and output measurements. Rules about the likely design properties of the sensor are collected from experts. The rules are used to create a fitness function for the genetic program. Genetic program based data mining is then conducted. This procedure incorporates not only the experts' rules into the fitness function, but also the information in the database. The information extracted through this process is the internal design specifications of the sensor. Significant mathematical formalism and experimental results related to GP based data mining for reverse engineering will be provided. 1 INTRODUCTION: An engineer must design a signal that will yield a particular type of output from a sensor device (SD). The rules are used to create a fitness function for the genetic program. Genetic program based data mining is then conducted (Bigus 1996, Smith 2003a,
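The abstract's central idea is that the GP fitness function blends fit to the measured input/output database with penalties from experts' design rules. A hypothetical sketch of such a fitness function is shown below; the candidate expression, the toy rules and the weighting are invented for illustration and are not the paper's actual formulation.

```python
# Hypothetical GP-style fitness: data fit on the I/O database plus expert-rule penalties.
import numpy as np

def fitness(candidate, inputs, outputs, expert_rules, rule_weight=0.5):
    """Lower is better: mean-squared error plus a penalty for each violated expert rule."""
    predictions = np.array([candidate(x) for x in inputs])
    mse = np.mean((predictions - outputs) ** 2)
    violations = sum(0 if rule(candidate) else 1 for rule in expert_rules)
    return mse + rule_weight * violations

# toy "sensor" database and one candidate expression a GP might have evolved
rng = np.random.default_rng(0)
inputs = rng.uniform(-1, 1, 100)
outputs = 2.0 * inputs ** 2 + 0.1 * rng.normal(size=100)    # unknown sensor behaviour
candidate = lambda x: 2.1 * x ** 2                          # evolved expression (assumed)
expert_rules = [
    lambda f: f(0.0) >= 0.0,        # "the sensor response is non-negative at rest"
    lambda f: f(1.0) > f(0.5),      # "the response grows with input magnitude"
]
print("fitness:", fitness(candidate, inputs, outputs, expert_rules))
```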
  • Health Outcomes from Home Hospitalization: Multisource Predictive Modeling
    JOURNAL OF MEDICAL INTERNET RESEARCH Calvo et al Original Paper Health Outcomes from Home Hospitalization: Multisource Predictive Modeling Mireia Calvo1, PhD; Rubèn González2, MSc; Núria Seijas2, MSc, RN; Emili Vela3, MSc; Carme Hernández2, PhD; Guillem Batiste2, MD; Felip Miralles4, PhD; Josep Roca2, MD, PhD; Isaac Cano2*, PhD; Raimon Jané1*, PhD 1Institute for Bioengineering of Catalonia (IBEC), Barcelona Institute of Science and Technology (BIST), Universitat Politècnica de Catalunya (UPC), CIBER-BBN, Barcelona, Spain 2Hospital Clínic de Barcelona, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Universitat de Barcelona (UB), Barcelona, Spain 3Àrea de sistemes d'informació, Servei Català de la Salut, Barcelona, Spain 4Eurecat, Technology Center of Catalonia, Barcelona, Spain *these authors contributed equally Corresponding Author: Isaac Cano, PhD Hospital Clínic de Barcelona Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS) Universitat de Barcelona (UB) Villarroel 170 Barcelona, 08036 Spain Phone: 34 932275540 Email: [email protected] Abstract Background: Home hospitalization is widely accepted as a cost-effective alternative to conventional hospitalization for selected patients. A recent analysis of the home hospitalization and early discharge (HH/ED) program at Hospital Clínic de Barcelona over a 10-year period demonstrated high levels of acceptance by patients and professionals, as well as health value-based generation at the provider and health-system levels. However, health risk assessment was identified as an unmet need with the potential to enhance clinical decision making. Objective: The objective of this study is to generate and assess predictive models of mortality and in-hospital admission at entry and at HH/ED discharge. Methods: Predictive modeling of mortality and in-hospital admission was done in 2 different scenarios: at entry into the HH/ED program and at discharge, from January 2009 to December 2015.
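At its core, the modelling task described here, predicting mortality and in-hospital admission at program entry and again at discharge, is a pair of binary classification problems. A minimal sketch with scikit-learn follows; the synthetic features stand in for the study's multisource variables, and the model choice is an assumption rather than the paper's method.

```python
# Minimal sketch: separate risk models at entry and at discharge (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X_entry = rng.normal(size=(n, 6))                             # e.g. age, comorbidities, prior use ...
X_discharge = np.hstack([X_entry, rng.normal(size=(n, 3))])   # plus in-episode variables
y_mortality = (rng.uniform(size=n) < 0.15).astype(int)        # synthetic outcome labels

for name, X in [("entry", X_entry), ("discharge", X_discharge)]:
    model = LogisticRegression(max_iter=1000)
    auc = cross_val_score(model, X, y_mortality, cv=5, scoring="roc_auc").mean()
    print(f"{name} model, cross-validated AUC: {auc:.2f}")
```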
  • A New Optimization Model for Market Basket Analysis with Allocation Considerations: A Genetic Algorithm Solution Approach
    A new optimization model for market basket analysis with allocation considerations: A genetic algorithm solution approach. Majeed HEYDARI, University of Zanjan, Zanjan, Iran, [email protected]. Amir YOUSEFLI, Imam Khomeini International University, Qazvin, Iran, [email protected]. Abstract. Nowadays market basket analysis is one of the data mining research areas that has received growing attention from researchers. However, most of the related research has focused on traditional and heuristic algorithms with a limited set of factors, which are not the only influential factors in market basket analysis. In this paper, to model and analyse market basket data more effectively, an optimization model is proposed that considers the allocation parameter as one of the important factors influencing the selling rate. The genetic algorithm approach is applied to solve the formulated non-linear binary programming problem, and a numerical example is used to illustrate the presented model. The provided results reveal that the obtained solutions seem to be more realistic and applicable. Keywords: market basket analysis, association rule, non-linear programming, genetic algorithm, optimization. Please cite the article as follows: Heydari, M. and Yousefli, A. (2017), “A new optimization model for market basket analysis with allocation considerations: A genetic algorithm solution approach”, Management & Marketing. Challenges for the Knowledge Society, Vol. 12, No. 1, pp. 1-11. DOI: 10.1515/mmcks-2017-0001. Introduction. Nowadays, data mining is widely used in several aspects of science such as manufacturing, marketing, CRM, retail trade etc. Data mining, or knowledge discovery, is a process of data analysis that extracts information from large databases.
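The paper casts the basket/allocation decision as a non-linear binary program solved with a genetic algorithm. The rough sketch below applies the same recipe to a toy allocation problem; the objective, capacity penalty and GA settings are invented for the example and do not reproduce the paper's model.

```python
# Toy GA for a non-linear binary allocation problem (numpy only).
import numpy as np

rng = np.random.default_rng(0)
n_items, capacity = 12, 5
profit = rng.uniform(1, 10, n_items)
synergy = rng.uniform(0, 1, (n_items, n_items))              # cross-selling effect between items

def objective(x):
    value = profit @ x + x @ synergy @ x                     # non-linear in the binary decisions
    penalty = 50.0 * max(0, x.sum() - capacity)               # at most `capacity` allocation slots
    return value - penalty

def ga(pop_size=60, n_gen=100, p_mut=0.05):
    pop = rng.integers(0, 2, (pop_size, n_items))
    for _ in range(n_gen):
        fit = np.array([objective(x) for x in pop])
        parents = pop[np.argsort(fit)[-pop_size // 2:]]       # truncation selection
        kids = []
        while len(kids) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_items)
            child = np.concatenate([a[:cut], b[cut:]])        # one-point crossover
            flip = rng.uniform(size=n_items) < p_mut          # bit-flip mutation
            kids.append(np.where(flip, 1 - child, child))
        pop = np.array(kids)
    best = max(pop, key=objective)
    return best, objective(best)

best, value = ga()
print("selected items:", np.flatnonzero(best), "objective:", round(float(value), 2))
```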
  • Optimization-Based Evolutionary Data Mining Techniques for Structural Health Monitoring
    Journal of Civil Engineering and Construction 2020;9(1):14-23 https://doi.org/10.32732/jcec.2020.9.1.14 Optimization-Based Evolutionary Data Mining Techniques for Structural Health Monitoring. Meisam Gordan, Zubaidah Binti Ismail, Hashim Abdul Razak, Khaled Ghaedi, Haider Hamad Ghayeb. Department of Civil Engineering, University of Malaya, 50603, Kuala Lumpur, Malaysia. E-mail: [email protected]. Received: 27 July 2019; Accepted: 24 August 2019; Available online: 20 November 2019. Abstract: In recent years, data mining technology has been employed to solve various Structural Health Monitoring (SHM) problems as a comprehensive strategy because of its computational capability. Optimization is one of the most important functions in data mining. In an engineering optimization problem, it is not easy to find an exact solution. In this regard, evolutionary techniques have been applied as part of the procedure of achieving the exact solution. Therefore, various metaheuristic algorithms have been developed to solve a variety of engineering optimization problems in SHM. This study presents the most applicable as well as effective evolutionary techniques used in structural damage identification. To this end, a brief overview of metaheuristic techniques is discussed in this paper. Then the most applicable optimization-based algorithms in structural damage identification are presented, i.e. Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Imperialist Competitive Algorithm (ICA) and Ant Colony Optimization (ACO). Some related examples are also detailed in order to indicate the efficiency of these algorithms. Keywords: Structural damage detection; Data mining; Metaheuristic algorithms; Optimization. 1. Introduction. In-service aerospace, mechanical, and civil structures are damage-prone during their service life.
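Of the metaheuristics named (PSO, GA, ICA, ACO), particle swarm optimization is the most compact to sketch. In damage identification such algorithms typically search for element stiffness-reduction factors that reproduce measured modal data; the toy objective below merely stands in for that model-updating error and is an assumption, not the paper's formulation.

```python
# Minimal particle swarm optimization over stiffness-reduction factors (toy objective).
import numpy as np

rng = np.random.default_rng(0)
true_damage = np.array([0.0, 0.3, 0.0, 0.15])   # "measured" damage pattern to recover

def objective(x):
    # stands in for the mismatch between measured and model-predicted modal data
    return np.sum((x - true_damage) ** 2)

n_particles, dim, n_iter = 30, 4, 200
w, c1, c2 = 0.7, 1.5, 1.5                        # inertia and acceleration coefficients
pos = rng.uniform(0, 1, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.uniform(size=(2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0, 1)
    vals = np.array([objective(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("estimated damage factors:", np.round(gbest, 3))
```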
  • Review on Predictive Modelling Techniques for Identifying Students at Risk in University Environment
    MATEC Web of Conferences 255, 03002 (2019) https://doi.org/10.1051/matecconf/201925503002 EAAI Conference 2018. Review on Predictive Modelling Techniques for Identifying Students at Risk in University Environment. Nik Nurul Hafzan Mat Yaacob1,*, Safaai Deris1,3, Asiah Mat2, Mohd Saberi Mohamad1,3, Siti Syuhaida Safaai1. 1Faculty of Bioengineering and Technology, Universiti Malaysia Kelantan, Jeli Campus, 17600 Jeli, Kelantan, Malaysia. 2Faculty of Creative and Heritage Technology, Universiti Malaysia Kelantan, 16300 Bachok, Kelantan, Malaysia. 3Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, City Campus, Pengkalan Chepa, 16100 Kota Bharu, Kelantan, Malaysia. Abstract. Predictive analytics includes statistical techniques, predictive modelling, machine learning, and data mining that analyse current and historical facts to make predictions about future or otherwise unknown events. Higher education institutions nowadays are under increasing pressure to respond to national and global economic, political and social changes, such as the growing need to increase the proportion of students in certain disciplines, embedding workplace graduate attributes and ensuring that the quality of learning programs are both nationally and globally relevant. However, in higher education institutions, there are significant numbers of students that stop their studies before graduation, especially among undergraduate students. Problems related to students stopping out and graduating late or not at all can be addressed by applying analytics. Using analytics, administrators, instructors and students can predict what will happen in the future. Administrators and instructors can decide on suitable intervention programs for at-risk students before those students decide to leave their studies. Many different machine learning techniques have been implemented for predictive modelling in the past, including decision tree, k-nearest neighbour, random forest, neural network, support vector machine, naïve Bayesian and a few others.
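A minimal version of the kind of comparison this review surveys, trying several of the named classifiers on historical student records to flag at-risk students, might look like the following scikit-learn sketch; the features and labels are synthetic placeholders.

```python
# Hypothetical comparison of classifiers for at-risk student prediction (synthetic data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))                 # e.g. grades, attendance, engagement scores ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) < -0.5).astype(int)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbour": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {score:.2f}")
```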
  • Multi Objective Optimization of Classification Rules Using Cultural Algorithms
    Available online at www.sciencedirect.com. Procedia Engineering 30 (2012) 457-465, www.elsevier.com/locate/procedia. International Conference on Communication Technology and System Design 2011. Multi Objective Optimization of classification rules using Cultural Algorithms. Sujatha Srinivasan, Sivakumar Ramakrishnan. Dept. of Computer Science, AVVM Sri Pushpam College, Poondi, Tanjore-613503, Tamil Nadu, India. Abstract. Classification rule mining is the most sought out by users since rules represent a highly comprehensible form of knowledge. The rules are evaluated based on objective and subjective metrics. The user must be able to specify the properties of the rules. The rules discovered must have some of these properties to render them useful. These properties may be conflicting. Hence discovery of rules with specific properties is a multi objective optimization problem. The Cultural Algorithm (CA), which derives from social structures, incorporates evolutionary systems and agents, and uses five knowledge sources (KS's) for the evolution process, is well suited to solving multi objective optimization problems. In the current study a cultural algorithm for classification rule mining is proposed for multi objective optimization of rules. © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of ICCTSD 2011. Keywords: Multi-objective-optimization; Classification; Cultural Algorithm; Data mining. 1. Introduction. Classification rule mining is a class of problems which is the most sought out by decision makers since they produce a comprehensible form of knowledge. The rules produced by the rule mining approach are evaluated using various metrics which are called the properties of the rule.
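The multi-objective aspect, rules judged on several possibly conflicting properties, can be illustrated independently of the cultural algorithm itself. The sketch below simply keeps the Pareto-optimal rules under two common objective metrics, support and confidence; the rule set is made up, and the paper's CA evolves rules rather than filtering a fixed list.

```python
# Pareto filter over candidate classification rules scored on two objectives.
def pareto_front(rules):
    """Keep rules not dominated on (support, confidence) by any other rule."""
    front = []
    for r in rules:
        dominated = any(
            o["support"] >= r["support"] and o["confidence"] >= r["confidence"]
            and (o["support"] > r["support"] or o["confidence"] > r["confidence"])
            for o in rules
        )
        if not dominated:
            front.append(r)
    return front

rules = [
    {"rule": "age>60 & usage<10 -> churn", "support": 0.12, "confidence": 0.81},
    {"rule": "tenure<6 -> churn",          "support": 0.30, "confidence": 0.55},
    {"rule": "complaints>2 -> churn",      "support": 0.08, "confidence": 0.90},
    {"rule": "plan=basic -> churn",        "support": 0.25, "confidence": 0.50},
]
for r in pareto_front(rules):
    print(r["rule"], "| support", r["support"], "| confidence", r["confidence"])
```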
  • 50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning
    50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning. Gilbert Saporta. To cite this version: Gilbert Saporta. 50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning. Christos Skiadas; James Bozeman. Data Analysis and Applications 1. Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, ISTE-Wiley, 2019, Data Analysis and Applications, 978-1-78630-382-0. 10.1002/9781119597568.fmatter. hal-02470740. HAL Id: hal-02470740, https://hal-cnam.archives-ouvertes.fr/hal-02470740. Submitted on 11 Feb 2020. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. 50 years of data analysis: from exploratory data analysis to predictive modelling and machine learning. Gilbert Saporta, CEDRIC-CNAM, France. In 1962, J.W. Tukey wrote his famous paper “The future of data analysis” and promoted Exploratory Data Analysis (EDA), a set of simple techniques conceived to let the data speak, without prespecified generative models. In the same spirit J.P. Benzécri and many others developed multivariate descriptive analysis tools. Since that time, many generalizations occurred, but the basic methods (SVD, k-means, …) are still incredibly efficient in the Big Data era.
  • Data Mining and Exploration, Michael Gutmann, The University of Edinburgh
    Lecture Notes: Data Mining and Exploration. Michael Gutmann, The University of Edinburgh. Spring Semester 2017. February 27, 2017. Contents:
    1 First steps in exploratory data analysis
      1.1 Distribution of single variables: numerical summaries; graphs
      1.2 Joint distribution of two variables: numerical summaries; graphs
      1.3 Simple preprocessing: simple outlier detection; data standardisation
    2 Principal component analysis
      2.1 PCA by sequential variance maximisation: first principal component direction; subsequent principal component directions
      2.2 PCA by simultaneous variance maximisation: principle; sequential maximisation yields simultaneous maximisation*
      2.3 PCA by minimisation of approximation error: principle; equivalence to PCA by variance maximisation
      2.4 PCA by low rank matrix approximation: approximating the data matrix; approximating the sample covariance matrix; approximating the Gram matrix
    3 Dimensionality reduction
      3.1 Dimensionality reduction by PCA: from data points; from inner products; from distances; example
      3.2 Dimensionality reduction by kernel PCA: idea; kernel trick; example
      3.3 Multidimensional scaling: metric MDS ...
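Chapter 2 of these notes derives PCA as variance maximisation, whose solution is given by the leading eigenvectors of the sample covariance matrix. A compact numpy illustration of that result on toy data, retaining two components, is shown below.

```python
# PCA via eigendecomposition of the sample covariance matrix (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Xc = X - X.mean(axis=0)                                   # centre the data
C = Xc.T @ Xc / (len(X) - 1)                              # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)                      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]                                 # top-2 principal directions
Z = Xc @ W                                                # projected (reduced) data

explained = eigvals[order[:2]].sum() / eigvals.sum()
print("variance explained by 2 components:", round(float(explained), 3))
```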
  • Efficient Evolutionary Data Mining Algorithms Applied to the Insurance Fraud Prediction
    International Journal of Machine Learning and Computing, Vol. 2, No. 3, June 2012. Efficient Evolutionary Data Mining Algorithms Applied to the Insurance Fraud Prediction. Jenn-Long Liu, Chien-Liang Chen, and Hsing-Hui Yang. Abstract—This study proposes two kinds of Evolutionary Data Mining (EvoDM) algorithms to the insurance fraud prediction. One is GA-Kmeans by combining K-means algorithm with genetic algorithm (GA). The other is MPSO-Kmeans by combining K-means algorithm with Momentum-type Particle Swarm Optimization (MPSO). The dataset used in this study is composed of 6 attributes with 5000 instances for car insurance claim. These 5000 instances are divided into 4000 training data and 1000 test data. Two different initial cluster centers for each attribute are set by means of (a) selecting the centers randomly from the training set and (b) averaging all data of training set, respectively. Thereafter, the proposed GA-Kmeans and MPSO-Kmeans are employed to determine the optimal weights and final cluster centers for attributes, and the accuracy of prediction for test set is computed based on the optimal weights and final cluster centers. … instances of insurance cases for data mining. The 5000 instances are divided into 4000 instances to be the training set and 1000 instances to be the test set. Furthermore, this work applies K-means, GA-Kmeans and MPSO-Kmeans algorithms to evaluate the fraud or not from the training set and also evaluate the accuracy of fraud prediction for the test set. II. CRISP-DM. CRISP-DM (Cross Industry Standard Process for Data Mining) is a data mining process model that describes commonly used approaches that expert data miners use to solve problems. CRISP-DM was conceived in late 1996 by SPSS (then ISL), NCR and DaimlerChrysler (then Daimler-Benz).
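The core idea, an evolutionary search for per-attribute weights under which K-means clusters line up with the fraud labels of the training set, can be sketched roughly as follows. The data are synthetic, MPSO is omitted, and the fitness definition and GA operators are simplifying assumptions rather than the paper's algorithm.

```python
# Rough sketch of GA-weighted K-means for fraud prediction (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d = 1000, 6
X = rng.normal(size=(n, d))
y = (X[:, 0] + 2 * X[:, 3] > 1).astype(int)                 # toy "fraud" labels
X_tr, y_tr, X_te, y_te = X[:800], y[:800], X[800:], y[800:]

def accuracy(weights, X_fit, y_fit, X_eval, y_eval):
    """Cluster the weighted data, map clusters to majority classes, score predictions."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_fit * weights)
    fit_labels = km.predict(X_fit * weights)
    mapping = {c: int(round(y_fit[fit_labels == c].mean())) for c in (0, 1)}
    pred = np.array([mapping[c] for c in km.predict(X_eval * weights)])
    return (pred == y_eval).mean()

pop = rng.uniform(0, 1, (20, d))                            # GA population of weight vectors
for _ in range(15):
    fit = np.array([accuracy(w, X_tr, y_tr, X_tr, y_tr) for w in pop])
    parents = pop[np.argsort(fit)[-10:]]                    # keep the best half
    children = parents[rng.integers(10, size=10)] + rng.normal(scale=0.1, size=(10, d))
    pop = np.vstack([parents, np.clip(children, 0, 1)])     # mutated offspring

best = pop[np.argmax([accuracy(w, X_tr, y_tr, X_tr, y_tr) for w in pop])]
print("test accuracy with evolved weights:",
      round(accuracy(best, X_tr, y_tr, X_te, y_te), 3))
```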
  • Evolutionary Data Mining for Link Analysis: Preliminary Experiments on a Social Network Test Bed
    Evolutionary Data Mining for Link Analysis: Preliminary Experiments on a Social Network Test Bed. William H. Hsu, Andrew L. King, Martin S. R. Paradesi, Tejaswi Pydimarri, Tim Weninger. Department of Computing and Information Sciences, Kansas State University, 234 Nichols Hall, Manhattan, KS 66506, (785) 532-6350, {bhsu | aking | pmsr | tejaswi | weninger}@ksu.edu. ABSTRACT: In this paper, we address the problem of graph feature extraction and selection for link analysis in weblogs and similar social networks. First, we present an approach based on collaborative recommendation using the link structure of a social network and content-based recommendation using mutual declared interests. Next, we describe the application of this approach to a small representative subset of a large real-world social network: the user/community network of the blog service LiveJournal. We then discuss the ground features available in LiveJournal's public user information pages and describe some graph algorithms for analysis of the social network along with a feature set for classifying users as friends or non-friends. … between communities and users they denote accepted members and moderator privileges. Specific recommendation targets for weblogs include links (new friends and subscriberships), security levels (link strength), requests for reciprocal links (trust, membership and moderatorship applications), and definition of security levels (filters). There are analogous applications of this functionality in social networks such as citation and collaboration networks. We investigate the problem of link recommendation in such weblog-based social networks and describe an annotated graph-based representation for such networks. Our approach uses graph feature analysis to recommend links (u, v) given structural
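A stripped-down version of the scoring this abstract describes, combining a structural graph feature with declared-interest overlap to rank candidate links, might look like the toy example below; the network, interests and weights are invented for illustration.

```python
# Toy link scoring from graph structure plus declared interests.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"dan"},
}
interests = {
    "ann": {"ml", "jazz"}, "bob": {"ml", "chess"}, "cat": {"jazz"},
    "dan": {"chess"}, "eve": {"ml", "jazz"},
}

def score(u, v, alpha=1.0, beta=0.5):
    common_neighbours = len(friends[u] & friends[v])     # structural (link-based) feature
    shared_interests = len(interests[u] & interests[v])  # content (interest-based) feature
    return alpha * common_neighbours + beta * shared_interests

# rank non-existing links as recommendation candidates
candidates = [(u, v) for u in friends for v in friends
              if u < v and v not in friends[u]]
for u, v in sorted(candidates, key=lambda p: -score(*p)):
    print(f"recommend {u} -- {v}: score {score(u, v):.1f}")
```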
  • Comparing Out-Of-Sample Predictive Ability of PLS, Covariance, and Regression Models Completed Research Paper
    Comparing Out-of-Sample Predictive Ability of PLS, Covariance, and Regression Models. Completed Research Paper. Joerg Evermann, Memorial University of Newfoundland, St. John's, Canada, [email protected]. Mary Tate, Victoria University of Wellington, Wellington, New Zealand, [email protected]. Abstract. Partial Least Squares Path Modelling (PLSPM) is a popular technique for estimating structural equation models in the social sciences, and is frequently presented as an alternative to covariance-based analysis as being especially suited for predictive modeling. While existing research on PLSPM has focused on its use in causal-explanatory modeling, this paper follows two recent papers at ICIS 2012 and 2013 in examining how PLSPM performs when used for predictive purposes. Additionally, as a predictive technique, we compare PLSPM to traditional regression methods that are widely used for predictive modelling in other disciplines. Specifically, we employ out-of-sample k-fold cross-validation to compare PLSPM to covariance-SEM and a range of a-theoretical regression techniques in a simulation study. Our results show that PLSPM offers advantages over covariance-SEM and other prediction methods. Keywords: Research Methods, Statistical methods, Structural equation modeling (SEM), Predictive modeling, Partial Least Squares, Covariance Analysis, Regression. Introduction. Explanation and prediction are two main purposes of theories and statistical methods (Gregor, 2006). Explanation is concerned with the identification of causal mechanisms underlying a phenomenon. On the statistical level, explanation is primarily concerned with testing the faithful representation of causal mechanisms by the statistical model and the estimation of true population parameter values from samples.
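The evaluation design described here, out-of-sample k-fold cross-validation used to compare the predictive accuracy of different estimation techniques, is easy to sketch. The code below compares PLS regression with ordinary least squares on synthetic collinear data; note that scikit-learn's PLSRegression is plain PLS regression, not PLS path modelling, so this is only an analogy to the paper's setup.

```python
# k-fold out-of-sample comparison of PLS regression vs. ordinary least squares.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.3 * rng.normal(size=(n, 8))   # collinear indicators
y = latent @ np.array([1.5, -1.0]) + 0.5 * rng.normal(size=n)

models = {"PLS (2 components)": PLSRegression(n_components=2),
          "OLS regression": LinearRegression()}
for name, model in models.items():
    errors = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model.fit(X[train], y[train])
        pred = np.ravel(model.predict(X[test]))        # PLS returns a 2-D array
        errors.append(mean_squared_error(y[test], pred))
    print(f"{name}: mean out-of-sample MSE = {np.mean(errors):.3f}")
```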
  • Deciding the Best Machine Learning Algorithm for Customer Attrition Prediction for a Telecommunication Company
    American Journal of Engineering Research (AJER), 2020, e-ISSN: 2320-0847, p-ISSN: 2320-0936, Volume-9, Issue-2, pp-1-7, www.ajer.org. Research Paper, Open Access. Deciding the Best Machine Learning Algorithm for Customer Attrition Prediction for a Telecommunication Company. Omijeh Bourdillon O.1, Abarugo Goodness Chidinma2. 1(Electronic and Computer Engineering, University of Port-Harcourt, Rivers State, Nigeria. 2Centre for Information and Telecommunications Engineering (CITE), University of Port-Harcourt, Rivers State, Nigeria.) Corresponding Author: Abarugo Goodness Chidinma. ABSTRACT: Customers are so important in business that every firm should put great effort into retaining them. To achieve that with some measure of success, the firm needs to be able to predict the behaviour of their customers with respect to churn or attrition. There are many machine learning algorithms that may be used to predict attrition, but this paper considers only four of them. Logistic regression, k-nearest neighbour, random forest and XGBoost machine learning algorithms were applied in different ways to the dataset gotten from Kaggle in order to decide the best algorithm to suggest to the company for customer attrition prediction. Results showed that the logistic regression or random forest algorithm may be adopted by the telecom company to predict which of their customers may leave in the future based on their recall and precision scores as well as the AUC values of approximately 75%. The logistic regression algorithm gave metrics of 73% accuracy and 59% f1-score while the random forest algorithm yielded 70% accuracy and a 58% f1-score.
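The selection criteria used in this paper, recall, precision, f1-score and AUC on a held-out split, can be reproduced in a few lines. The sketch below uses a synthetic stand-in for the Kaggle churn dataset and only two of the four algorithms the paper compares.

```python
# Evaluating churn classifiers on precision, recall, f1 and AUC (synthetic stand-in data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                 # stand-in for tenure, charges, plan flags ...
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=2000) > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    print(f"{name}: precision {precision_score(y_te, pred):.2f}, "
          f"recall {recall_score(y_te, pred):.2f}, f1 {f1_score(y_te, pred):.2f}, "
          f"AUC {roc_auc_score(y_te, prob):.2f}")
```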