<<


CROSS VALIDATION METHOD ON WEIGHTED ISOTONIC REGRESSION FOR NONPARAMETRIC REGRESSION FITTING

ZHANG YIWEN

NATIONAL UNIVERSITY OF SINGAPORE

2017

CROSS VALIDATION METHOD ON WEIGHTED ISOTONIC REGRESSION FOR NONPARAMETRIC REGRESSION FITTING

ZHANG YIWEN (A0123849X)

SUPERVISOR: ASSOCIATE PROFESSOR YU TAO

EXAMINERS: ASSOCIATE PROFESSOR LI JIALIANG, DR YAO ZHIGANG

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE, DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY, NATIONAL UNIVERSITY OF SINGAPORE

2017

Declaration

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis.

This thesis has also not been submitted for any degree in any university previously.

Zhang Yiwen
June 2017

Acknowledgement

I would like to express my deepest appreciation to my supervisor, Associate Professor Yu Tao. He is knowledgeable, kind and patient. Whenever I had trouble with the project, he always provided constructive advice and professional guidance. Without his support and guidance, I could not have finished the project and this thesis.

Besides, I am grateful to my friends who have accompanied and helped me a great deal during my study at NUS. I want to thank Yue Mu for her time and patience, Yaxian for her recommendation, Wen Jun for sharing his notes, my housemates Xueqi and Li Yun for their support and care, Chin Hau for his advice, and Huang Yi, Zhang Liang, Gaoshuang, Jiejie and others for their encouragement and help.

Moreover, I want to express my greatest gratitude to my parents for their long-lasting support and understanding.

Lastly, I thank the examiners, Associate Professor Li Jialiang and Assistant Professor Yao Zhigang. Their professional and detailed comments helped me learn more and improve further.

Contents

1 Introduction
  1.1 Motivation
  1.2 Organization

2 Literature Review
  2.1 Local Polynomial Regression
  2.2 Classical and Generalized Continuous Isotonic Regression
  2.3 Pool Adjacent Violators Algorithm (PAVA)
  2.4 Bandwidth Selection for Nonparametric Regression

3 Methodology
  3.1 Local Polynomial Method Estimation
  3.2 Continuous Isotonic Method Estimation
  3.3 Discussion on the Weight Function
  3.4 Computing Isotonic Regression Estimation
  3.5 Cross-Validation Bandwidth Selection

4 Simulation Study
  4.1 Fitting the Model
  4.2 Numerical Performance Comparisons
    4.2.1 L1 Distance Comparison
    4.2.2 L1 Percentage Improvement
    4.2.3 Confidence Interval
  4.3 Numerical performance for other simulations
    4.3.1 Simulated Data from Uniform Distribution
    4.3.2 Numerical Performance for Mixture Uniform Distribution
  4.4 Numerical performance for other functions
  4.5 Numerical performance for increasing functions

5 Application to Real World Data Set

6 Conclusions

Summary

This thesis discusses the local polynomial regression method and the isotonic regression method for solving the nonparametric regression problem $Y = g(X) + \varepsilon$, subject to the condition that $g(\cdot)$ is a smooth and non-decreasing function. To solve the continuous isotonic regression problem, the Pool Adjacent Violators Algorithm (PAVA) is used to update the local polynomial regression estimates at a large number of quantiles of $X$. Considering the importance of bandwidth selection for nonparametric fitting, we propose a new cross validation method that incorporates the monotone constraint that $g(\cdot)$ is non-decreasing. Furthermore, simulations are conducted to evaluate the performance of this cross validation method in comparison with the plug-in method for bandwidth selection, in both the local polynomial regression and the weighted isotonic regression estimation. Based on the outcomes of the L1 distance, the L1 distance percentage improvement, and the confidence bands, we conclude that the proposed cross validation method for bandwidth selection improves the performance when sufficient data is provided for the nonparametric regression estimation. The proposed cross-validation bandwidth selection combined with the weighted isotonic regression estimation outperforms the other settings in this nonparametric estimation problem, because both components factor in the monotone constraint.

List of Figures

4.1 True function g(x)
4.2 Scatter Plot of Simulated Data (Normal Distribution)
4.3 Local Polynomial Estimates $\tilde{g}(x)$
4.4 Isotonic Regression Estimates $\hat{g}(x)$
4.5 Boxplot of L1 Distance 1 (Normal Distribution)
4.6 Scatter Plot of Simulated Data (Truncated Normal Distribution)
4.7 Boxplot of L1 Distance 2 (Truncated Normal Distribution)
4.8 Partial plot of confidence bands 1
4.9 Scatter Plot of Simulated Data from Uniform Distribution
4.10 Boxplot of L1 Distance 3 (Uniform Distribution)
4.11 Partial plot of confidence bands 2
4.12 Scatter Plot of Simulated Data (Mixture Uniform Distribution)
4.13 Boxplot of L1 Distance 4 (Mixture Uniform Distribution)
4.14 Partial plot of confidence bands 3
4.15 True function f(x)
4.16 True function h(x)
4.17 Boxplot of L1 Distance 5
4.18 Boxplot of L1 Distance 6
4.19 Partial plot of confidence bands 4
4.20 Partial plot of confidence bands 5
4.21 True function l(x)
4.22 True function e(x)
4.23 Boxplot of L1 Distance 7
4.24 Boxplot of L1 Distance 8
4.25 Partial plot of confidence bands 6
4.26 Partial plot of confidence bands 7
5.1 Two Regressions Comparison Case 1
5.2 Two Regressions Comparison Case 2

List of Tables

4.1 Improvement Percentage (Truncated Normal Distribution)
4.2 Improvement Percentage (Uniform Distribution)
4.3 Improvement Percentage (Mixture Uniform Distribution)
4.4 Improvement Percentage for f(x)
4.5 Improvement Percentage for h(x)
4.6 Improvement Percentage for l(x)
4.7 Improvement Percentage for e(x)

Chapter 1

Introduction

1.1 Motivation

Regression analysis, as one of the most popular techniques in statistics, is of great importance for exploring the relationship between dependent and independent variables. A large number of regression models have been defined and proposed in research studies so far.

From the perspective of parameters, regression can be divided into three categories: parametric regression, non-parametric regression, and semi-parametric regression. Parametric regression depends on assumptions about the shape of the distribution; that is, the functional form of a parametric model is specified beforehand, and the purpose is to estimate the parameters of that function. In contrast, non-parametric regression relaxes these assumptions: few assumptions are made about the distribution of the observed data, and the predictive model requires a large sample size to estimate the regression function directly. Semi-parametric regression, as the name shows, combines both parametric and non-parametric regression methods.

In various statistical contexts, local modelling techniques are data-analytic: the data determine the regression function. Local polynomial regression is one of the most popular methods for solving non-parametric problems. The non-parametric problem can simply be considered as

$$Y = g(X) + \varepsilon, \qquad (1.1)$$
where the $\varepsilon_i$'s are independent and identically distributed (i.i.d.) random variables from a certain distribution with $E(\varepsilon) = 0$, and $g(\cdot)$ is a smooth function.

Based on the bivariate data $(X_i, Y_i)_{i=1}^{n}$ observed with sample size $n$, the local polynomial regression builds a model to fit the data. Using the strip of data around $X$, the method fits a low-order polynomial by weighted least squares. In this way, the explanatory variable $X$ and the response variable $Y$ are connected by the local polynomial regression estimate, denoted $\tilde{g}(X)$ (Fan and Gijbels, 1996).

One motivation of this thesis is that, although the local polynomial regression is widely used, it may not be the best solution when the observed data is subject to some condition. For instance, if the condition that $g(\cdot)$ is non-decreasing is added, the estimate $\tilde{g}(X)$ would probably not be preferable, because the local polynomial regression does not fulfill the requirement of monotonicity. Considering the monotone constraint, the isotonic regression can be introduced because it achieves a non-decreasing fit. With different weights assigned to the observed data, a weighted continuous isotonic method is proposed, based on the local polynomial method, to solve the non-parametric problem $Y = g(X) + \varepsilon$ subject to the condition that $g(\cdot)$ is a non-decreasing function.

Since the fitted non-parametric regression is based on the data in a local neighborhood, it is essential to choose an appropriate size for that local neighborhood, which is called the bandwidth. As the most critical smoothing parameter, the bandwidth is usually chosen either from the data or by experienced data analysts. The kernel function $K\!\left(\frac{X_i - x}{h}\right)$ then assigns a weight to the point $(X_i, Y_i)$; it decays fast so as to eliminate the impact of remote data points. For each given $x$, a polynomial is fitted to the data points contained in the strip $x \pm h$, using the kernel weights. The whole regression curve is obtained by estimating the regression function over a moving strip of data points (Fan and Gijbels, 1996).

The Cross-Validation (CV) technique is quite useful for solving a number of statistical problems. It is also a good way to choose the bandwidth because it is data-dependent. Therefore, the cross validation method is adopted to obtain the bandwidth by locating the minimizer of an objective function built from the weighted isotonic regression estimates. More details of the strategy will be explained and discussed in the following chapters.

Therefore, the motivation of this project is to propose a cross validation method for bandwidth selection so as to improve the performance of the non-parametric regression estimates under the condition that the regression function is non-decreasing.

1.2 Organization

The thesis is organized as follows. In chapter 2, the concepts and methods related to the project are reviewed, including the local polynomial regression, isotonic regression, Pool Adjacent Violators Algorithm (PAVA), plug-in method, and cross-validation method for bandwidth selection. The methodology of the project is then discussed in detail in chapter 3. After that, simulations are conducted in chapter 4. In chapter 5, the proposed cross validation method is applied to a real data set, and the real data analysis, discussion and interpretation are presented. Finally, in chapter 6, conclusions are drawn and suggestions are made for future study.

Chapter 2

Literature Review

2.1 Local Polynomial Regression

Considering the non-parametric model (1.1), the local polynomial method is often used to approximate the unknown regression function $g(x)$ using data from a neighborhood; denote the order of the local polynomial by $p$. A Taylor expansion shows that, for $x$ in a neighborhood of $x_0$,
$$g(x) \approx g(x_0) + g'(x_0)(x - x_0) + \cdots + \frac{g^{(p)}(x_0)}{p!}(x - x_0)^p.$$
This polynomial can be fitted by a weighted least squares regression. Denote by $\hat{\beta}_j$, $j = 0, 1, \cdots, p$, the solution to the least squares regression problem; then $\hat{\beta}_p$ estimates $\frac{g^{(p)}(x_0)}{p!}$, and $g(x)$ can be estimated by finding
$$\hat{\beta} = \operatorname{argmin}_{\beta_0, \beta_1, \cdots, \beta_p} \sum_{i=1}^{n} \Big\{Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_0)^j\Big\}^2 K_h(X_i - x_0),$$
where $h$ is the bandwidth, i.e. the size of the local neighborhood, and $K_h(x) = K(x/h)/h$ with $K(\cdot)$ a kernel function assigning weights to the data points.

It is more convenient to write the weighted least squares problem in matrix form (Fan and Gijbels, 1996). Here we denote by $X$ the design matrix centered at $x_0$:
$$X = \begin{pmatrix} 1 & X_1 - x_0 & \cdots & (X_1 - x_0)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x_0 & \cdots & (X_n - x_0)^p \end{pmatrix},$$
and
$$y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \qquad \hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \vdots \\ \hat{\beta}_p \end{pmatrix},$$
and let $W$ be the $n \times n$ diagonal matrix of weights:
$$W = \operatorname{diag}\{K_h(X_i - x_0)\}.$$
The minimization problem can be written as
$$\min_{\beta}\, (y - X\beta)^T W (y - X\beta),$$
whose solution is
$$\hat{\beta} = (X^T W X)^{-1} X^T W y.$$
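To make the closed-form solution above concrete, here is a minimal R sketch that computes the local polynomial estimate at a single point by weighted least squares. It is an illustration rather than the thesis code; the Gaussian kernel, the default order p = 1 and the name local_poly_fit are assumptions.

    # Minimal sketch: local polynomial estimate of g(x0) by weighted least squares.
    # Gaussian kernel and all names are illustrative assumptions.
    local_poly_fit <- function(x, y, x0, h, p = 1) {
      k <- dnorm((x - x0) / h) / h          # kernel weights K_h(X_i - x0)
      X <- outer(x - x0, 0:p, `^`)          # design matrix, columns (X_i - x0)^j
      W <- diag(k)
      beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
      beta[1]                               # first component estimates g(x0)
    }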

2.2 Classical and Generalized Continuous Isotonic Regression

In the classical isotonic regression, a vector in $\mathbb{R}^n$ provides the solution to a least squares fit under the non-decreasing constraint. It solves the following problem: for a vector $y = (y_1, y_2, \cdots, y_n) \in \mathbb{R}^n$, the isotonic regression of $y$ with weights $w = (w_1, w_2, \cdots, w_n) \in (0, +\infty)^n$ minimizes the objective quadratic function
$$\phi(x) = \frac{1}{2} \sum_{i=1}^{n} w_i (y_i - x_i)^2$$
subject to $x_1 \le x_2 \le \cdots \le x_n$ (Barlow, 1972). The Pool Adjacent Violators Algorithm (PAVA) offers a solution to this classical isotonic regression problem (Ayer et al., 1955): it repeatedly modifies the $y_i$ which violate the ordering constraint to obtain a new vector $x$ satisfying the non-decreasing constraint $x_1 \le x_2 \le \cdots \le x_n$. The details of PAVA will be illustrated in the following section.

This discrete isotonic regression problem can be constructed graphically in two dimensions. Defining the set of points $P_i$, $i = 0, 1, \cdots, n$, by $P_0 = (0, 0)$ and $P_i = \left(\sum_{j=1}^{i} w_j,\; \sum_{j=1}^{i} w_j y_j\right)$, the $k$-th component of the isotonic regression of $y$ is derived from the left derivative of the greatest convex minorant of the set of points $\{P_i : 0 \le i \le n\}$, evaluated at $P_k$ (Robertson et al., 1988).

Groeneboom and Jongbloed (2010) proposed the generalized continuous version of isotonic regression, which is also concerned with minimizing an objective quadratic function, namely with $L_2$ projections on an interval $[a, b]$. Let $w > 0$ be an integrable weight function on the interval $[a, b]$ and $g \in L_2(w)$, and define $\mathcal{K}$ as the subset of $L_2(w)$ consisting of the non-decreasing functions on $[a, b]$. The continuous isotonic regression of $g$ on $\mathcal{K}$ minimizes
$$\varphi(f) = \frac{1}{2} \int (f(x) - g(x))^2 w(x)\,dx. \qquad (2.1)$$

Similarly, the solution can also be constructed graphically. In the case $w = 1$, the solution is the right continuous slope of the greatest convex minorant of the parameterized graph in $\mathbb{R}^2$ given by (Mammen, 1991)
$$\left\{ \left( \int_a^t w(x)\,dx,\; \int_a^t g(x)w(x)\,dx \right) : t \in [a, b] \right\}.$$

2.3 Pool Adjacent Violators Algorithm (PAVA)

PAVA is an iterative solution to monotonic regression problems (Ayer et al., 1955). Suppose a data set of sample size $n$, $(X_i, Y_i)$, $i = 1, 2, \cdots, n$. After sorting this data set by $X$, we obtain $(X_{(i)}, Y_{(i)})$, in which $X_{(i)}$ is non-decreasing. By PAVA, $\hat{g}(X_{(i)})$, $i = 1, \cdots, n$, can be obtained as
$$\{\hat{g}(X_{(i)})\}_{i=1}^{n} = \operatorname{argmin}_{g} \frac{\sum_{i=1}^{n} (Y_{(i)} - g(X_{(i)}))^2}{n},$$
subject to $\hat{g}(X_{(1)}) \le \hat{g}(X_{(2)}) \le \cdots \le \hat{g}(X_{(n)})$. The algorithmic description of the PAVA implementation is given below (Barlow, 1972).

Firstly, starting from the first pair $Y_{(1)}, Y_{(2)}$, check the non-decreasing constraint for every adjacent pair of data, moving forward and stopping at the first pair where $Y_{(i)} > Y_{(i+1)}$. This pair is pooled and replaced by its average:
$$Y^*_{(i)} = Y^*_{(i+1)} = \frac{Y_{(i)} + Y_{(i+1)}}{2}.$$

Secondly, check backward over every adjacent pair of data. Once $Y_{(i-1)} > Y^*_{(i)}$, the values $Y_{(i-1)}, Y_{(i)}, Y_{(i+1)}$ are pooled and replaced by their average,
$$Y^*_{(i-1)} = Y^*_{(i)} = Y^*_{(i+1)} = \frac{Y_{(i-1)} + Y_{(i)} + Y_{(i+1)}}{3}.$$
The backward checking continues until the monotonic requirement is satisfied.

Iterating these two steps, we obtain the outcome $\hat{g}(X_{(i)})$, $i = 1, 2, \cdots, n$.
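The two pooling steps above amount to a single backward-pooling pass over the sorted responses. The following R code is a minimal sketch under the assumption that y is already sorted by X; R's built-in isoreg() provides an equivalent unweighted fit, and the name pava is an assumption.

    # Minimal weighted PAVA sketch; y must be sorted by x beforehand.
    pava <- function(y, w = rep(1, length(y))) {
      n <- length(y)
      if (n <= 1) return(y)
      vals <- y[1]; wts <- w[1]; sizes <- 1          # one entry per pooled block
      for (i in 2:n) {
        vals <- c(vals, y[i]); wts <- c(wts, w[i]); sizes <- c(sizes, 1)
        k <- length(vals)
        while (k > 1 && vals[k - 1] > vals[k]) {     # pool adjacent violators
          wsum <- wts[k - 1] + wts[k]
          vals[k - 1] <- (wts[k - 1] * vals[k - 1] + wts[k] * vals[k]) / wsum
          wts[k - 1] <- wsum
          sizes[k - 1] <- sizes[k - 1] + sizes[k]
          vals <- vals[-k]; wts <- wts[-k]; sizes <- sizes[-k]
          k <- k - 1
        }
      }
      rep(vals, times = sizes)                       # expand blocks to n fitted values
    }

For example, pava(c(1, 3, 2, 4)) returns 1.0, 2.5, 2.5, 4.0: the violating pair (3, 2) is pooled into its average.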

16 2.4 Bandwidth Selection for Nonparametric Regression

There is no doubt that the bandwidth parameter h plays an essential role in nonparametric regression. To some extent, the bandwidth dominates the complexity of the model. If a very small h is taken, the model bias would be small but the variance would be large. On the contrary, if the bandwidth is very large, the number of data points falling in the local neighborhood is large, which produces a small variance; however, the modelling bias then tends to be large. Therefore, there is a tradeoff between bias and variance in bandwidth selection. In other words, it is crucial to find a practical way of selecting the bandwidth, and it is very useful and necessary to find a data-driven bandwidth which makes the regression estimate smooth. A well known bandwidth choice for nonparametric regression is the cross validation method (Rice, 1984).

Review the nonparametric regression model (1.1), and consider estimating the regression function $g(x) = E[Y \mid X = x]$ using the independent identically distributed data $(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)$.

The locally weighted average of the $Y_i$ is known as the kernel estimator (Watson, 1964). Pioneering work on effective nonparametric estimators has been proposed and summarized by Wand and Jones (1995). For simplicity of presentation, $h$ is treated as a scalar here:
$$g_h(x) = n^{-1} \sum_{i=1}^{n} h^{-1} K\!\left(\frac{x - X_i}{h}\right) Y_i \Big/ f_h(x),$$
where $K(\cdot)$ is a kernel function, $h$ is the bandwidth, and $f_h(x)$ is the Rosenblatt-Parzen kernel density estimator of the marginal density $f(x)$ of $X$,
$$f_h(x) = (nh)^{-1} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right).$$
The bandwidth selection rule is to minimize an estimate of the Integrated Squared Error (ISE):
$$\mathrm{ISE}(h) = \int [g_h(x) - g(x)]^2 w(x) f(x)\,dx = \int g_h^2(x) w(x) f(x)\,dx - 2\int g_h(x) g(x) w(x) f(x)\,dx + \int g^2(x) w(x) f(x)\,dx,$$

which is frequently proposed for estimating the optimal smoothing parameter (Hardle and Marron, 1985); here $w(x)$ is a nonnegative weight function. The last term does not contain $h$, so the minimization of $\mathrm{ISE}(h)$ is equivalent to the minimization of the cross-validation criterion $CV(h)$ (Rice, 1984),
$$CV(h) = \int g_h^2(x) w(x) f(x)\,dx - 2 \int g_h(x) g(x) w(x) f(x)\,dx.$$

This cannot be computed in practice because it depends on the unknown $g(x)$ and $f(x)$. However, the second term can be estimated, since
$$\int g_h(x) g(x) w(x) f(x)\,dx = E_{(X,Y)}[g_h(X)\, Y\, w(X)] \approx n^{-1} \sum_{j=1}^{n} g_{h,j}(X_j)\, Y_j\, w(X_j).$$

Here $g_{h,j}(\cdot)$ is the leave-one-out estimator based on the observations $X_1, \cdots, X_n$ except $X_j$; this leave-one-out method predicts $g(X_j)$ from the subsample $X_1, \cdots, X_{j-1}, X_{j+1}, \cdots, X_n$:
$$g_{h,j}(x) = (n-1)^{-1} \sum_{i \ne j} h^{-1} K\!\left(\frac{x - X_i}{h}\right) Y_i \Big/ f_{h,j}(x),$$
$$f_{h,j}(x) = (n-1)^{-1} \sum_{i \ne j} h^{-1} K\!\left(\frac{x - X_i}{h}\right).$$
In a similar way, the first term can be approximated by
$$\int g_h^2(x) w(x) f(x)\,dx \approx n^{-1} \sum_{j=1}^{n} g_{h,j}^2(X_j)\, w(X_j).$$
Therefore, $CV(h)$ can be estimated by
$$CV(h) = n^{-1} \sum_{j=1}^{n} g_{h,j}^2(X_j)\, w(X_j) - \frac{2}{n} \sum_{j=1}^{n} g_{h,j}(X_j)\, Y_j\, w(X_j).$$
Since adding a term independent of $h$ does not change the bandwidth selection rule, $CV(h)$ can equivalently be given as
$$CV(h) = n^{-1} \sum_{j=1}^{n} [Y_j - g_{h,j}(X_j)]^2\, w(X_j).$$
Choosing the bandwidth $h$ then amounts to finding the minimizer of the cross-validation function: $h_{CV} = \operatorname{argmin}_h CV(h)$.

In the proposed cross validation method for bandwidth selection in this project, $\hat{g}$ denotes the isotonic regression estimate. The empirical cross validation function is
$$\widehat{CV}(h) = n^{-1} \sum_{j=1}^{n} [Y_j - \hat{g}_{h,j}(X_j)]^2\, w(X_j).$$
In practice, the estimates near the boundary may have a large bias, which leads to overfitting; therefore the marginal data are ignored when calculating the cross validation score. The empirical cross-validation bandwidth is denoted $\hat{h}_{\widehat{CV}} = \operatorname{argmin}_h \widehat{CV}(h)$. In this way, the proposed cross validation method incorporates the non-decreasing constraint into the bandwidth selection.
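As a concrete illustration, the R sketch below scores one bandwidth by the leave-one-out criterion for the kernel estimator above, with an interior-only indicator standing in for the weight function w(x). It assumes data vectors x and y; the trimming fraction, the search interval and all names are assumptions.

    # Minimal leave-one-out CV sketch for the kernel regression estimator g_h.
    cv_score <- function(x, y, h, trim = 0.05) {
      lo <- quantile(x, trim); hi <- quantile(x, 1 - trim)  # ignore marginal data
      sq_err <- vapply(seq_along(x), function(j) {
        k <- dnorm((x[j] - x[-j]) / h)    # Gaussian kernel weights, j-th point left out
        g_j <- sum(k * y[-j]) / sum(k)    # leave-one-out estimate at X_j
        (y[j] - g_j)^2 * (x[j] >= lo && x[j] <= hi)
      }, numeric(1))
      mean(sq_err)
    }
    # h_cv <- optimize(function(h) cv_score(x, y, h), interval = c(0.1, 5))$minimum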

Another reliable method for bandwidth selection is the direct plug-in method. Plug-in bandwidth selection rules perform well both in theory and in practice (Ruppert et al., 1995). The direct plug-in method was first proposed by Barlow (1972). The plug-in method is fast, flexible and applicable to a variety of situations, from estimating bandwidths for derivatives, for probability and spectral densities, for smooth and nonsmooth regression problems, and even for higher dimensional problems (Gasser et al., 1991).

Chapter 3

Methodology

In this chapter, we focus on the strategies for using the cross validation bandwidth selection method to solve the non-parametric regression problem subject to the constraint that the nonparametric estimate of $g(\cdot)$ is non-decreasing. To fulfill the monotone requirement, we adopt the popular Pool Adjacent Violators Algorithm (PAVA), which is usually used for solving classical isotonic regression problems. Using PAVA, the local polynomial regression estimates at a large number of quantiles of $X$ are updated to non-decreasing isotonic regression estimates. In a similar way, PAVA can be used within the cross validation method for bandwidth selection to optimize the regression estimates.

3.1 Local Polynomial Method Estimation

As shown in the previous chapter, the local polynomial regression model involves the local neighborhood bandwidth $h$, a kernel function $K(\cdot)$, and the fitting of a polynomial of degree $p$. Choosing the order of the local polynomial is an issue in local polynomial fitting: a high order reduces the bias but increases the variability. In this thesis, we assume $X$ is univariate and the order of the local polynomial is equal to one, i.e. $p = 1$. Therefore, the objective function simplifies to finding the minimizer (Fan and Gijbels, 1992)
$$\{\tilde{\beta}_0, \tilde{\beta}_1\} = \operatorname{argmin}_{\{\beta_0, \beta_1\}} \sum_{i=1}^{n} \{Y_i - [\beta_0 + \beta_1(X_i - x_0)]\}^2 K_h(X_i - x_0).$$
It then follows that $\tilde{g}(x_0) = \tilde{\beta}_0$ and $\tilde{g}'(x_0) = \tilde{\beta}_1$. Undoubtedly, it is necessary to choose an appropriate bandwidth so that the fitted regression function is neither under-parametrized nor over-parametrized; we focus on this in later sections. Besides, the kernel function also needs to be confirmed. It is acknowledged that the choice of kernel function is not so important for estimating the regression function, so we use the popular Gaussian kernel function $K(x) = \exp(-x^2/2)/\sqrt{2\pi}$ in the following simulation studies.

3.2 Continuous Isotonic Method Estimation

As discussed before, $\tilde{g}(\cdot)$ obtained by the local polynomial regression is probably not non-decreasing. To update $\tilde{g}(\cdot)$ into a non-decreasing function, we proceed as follows. Following the least squares method, the goal is to minimize the objective function
$$M_0(g) = \frac{1}{n} \sum_{i=1}^{n} \{g(X_i) - \tilde{g}(X_i)\}^2 = \int \{g(x) - \tilde{g}(x)\}^2\,dF_n(x),$$
subject to the condition that $g(x)$ is a non-decreasing function, where $F_n(x)$ is defined as
$$F_n(x) = \frac{\sum_{i=1}^{n} I\{X_i \le x\}}{n}.$$

Here $F_n(x)$ is the empirical cumulative distribution function (c.d.f.) of $X$: if $F(x)$ denotes the c.d.f. of $X$, then $F_n(x)$ is its nonparametric estimate. Besides, since $X$ follows a continuous distribution, the kernel density estimate $\tilde{f}(x)$ estimates the probability density function (p.d.f.) of $X$:
$$\tilde{f}(x) = \sum_{i=1}^{n} K_h(X_i - x)/n.$$

In the continuous version, $M_0$ can be expressed as
$$M_0(g) = \int \{g(x) - \tilde{g}(x)\}^2 \tilde{f}(x)\,dx. \qquad (3.1)$$

The objective is now to minimize $M_0$ subject to the condition that $g(x)$ is a non-decreasing function; the updated estimate $\hat{g}(x)$ can be expressed as
$$\hat{g}(x) = \operatorname{argmin}_g M_0(g).$$

The formula for $M_0$ has the form of the weighted continuous isotonic regression objective in (2.1). Hence, a weighted continuous isotonic regression using the kernel density $\tilde{f}(x)$ as the weight solves the problem of updating $\tilde{g}(x)$ to the non-decreasing $\hat{g}(x)$. In this way, the objective weighted continuous isotonic estimate $\hat{g}(x)$ is associated with the local polynomial estimate $\tilde{g}(x)$.

3.3 Discussion on the Weight Function

Before computing $\hat{g}(x)$, we discuss the weight function. As mentioned before, to fit a classical weighted isotonic regression, the weight vector $w$ should be confirmed before using PAVA. In the later simulations, $X$ follows a mixture normal distribution:
$$X \sim 0.4 \times N(\mu_1 = 12, \sigma_1 = 3) + 0.6 \times N(\mu_2 = 24, \sigma_2 = 3).$$

The simulated data form a scatter plot of the paired data $(X_i, Y_i)_{i=1}^{n}$, with the data points distributed unevenly across the plot. Suppose $X$ is simulated from this mixture normal distribution, which will be discussed further in chapter 4: clearly, more points cluster in the regions around $X = 12$ and $X = 24$. The probability density function $f(x)$ can thus serve as the weight function, since where the data has a high probability density, more data points cluster together. Considering the objective function (3.1), the data values assigned larger weights lie in the more informative regions when updating from $\tilde{g}(x)$ to $\hat{g}(x)$.

3.4 Computing Isotonic Regression Estimation

In the previous section, the objective estimate is given as
$$\hat{g}(x) = \operatorname{argmin}_g \int \{g(x) - \tilde{g}(x)\}^2 \tilde{f}(x)\,dx, \qquad (3.2)$$
subject to $g(x)$ being a non-decreasing function. This continuous isotonic regression problem can be solved by transferring it to a discrete isotonic regression problem, so that PAVA can be used to update the estimates, via the following steps.

Firstly, calculate $N$ quantiles of the random variable $X$, where $X$ comes from the bivariate data $(X_i, Y_i)$, $i = 1, 2, \cdots, n$, $n$ is the sample size of the data, and $N$ is a sufficiently large integer. The weighted continuous isotonic regression problem can then be transferred to a discrete isotonic regression problem by introducing $N$: $t_q$ is defined as the $\frac{q - 0.5}{N}$-th quantile of the distribution with density $\tilde{f}$, where $q = 1, \cdots, N$. According to the definition of the quantile, it follows that
$$\int_a^{t_q} \tilde{f}(y)\,dy = \frac{q - 0.5}{N}.$$

Therefore, the objective function $M_0$ can be calculated in the following way:
$$M_0(g) = \int \{g(x) - \tilde{g}(x)\}^2 \tilde{f}(x)\,dx = \int \{g(x) - \tilde{g}(x)\}^2\, d\left\{\int_a^x \tilde{f}(y)\,dy\right\} \approx \sum_{q=1}^{N} \{g(t_q) - \tilde{g}(t_q)\}^2 / N.$$

Therefore, the objective estimate $\hat{g}(x)$ can be obtained as
$$\hat{g}(x) = \operatorname{argmin}_g \sum_{q=1}^{N} \{g(t_q) - \tilde{g}(t_q)\}^2, \qquad (3.3)$$
subject to the condition that $\hat{g}(t_1) \le \hat{g}(t_2) \le \cdots \le \hat{g}(t_N)$. So far, the continuous isotonic regression of $\tilde{g}(x)$ can be approximated by the discrete isotonic regression on $\tilde{g}(t_q)$ with the weight vector $w = \vec{1}$.

Therefore, the solution is to obtain the local polynomial estimates $\tilde{g}(t_q)$, $q = 1, 2, \cdots, 2000$, and implement PAVA on $\tilde{g}(t_q)$ with the weight vector $w = \vec{1}$ to finally obtain the weighted isotonic estimate $\hat{g}(x)$.
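The discretization step can be sketched in R as below, assuming data vectors x and y and reusing the hypothetical local_poly_fit() and pava() sketches from chapter 2. The dpik() and bkde() calls are documented KernSmooth functions; the grid handling and the bandwidth value h = 1 are illustrative assumptions.

    # Sketch: quantiles of the kernel density, then the PAVA update.
    library(KernSmooth)
    N    <- 2000
    dens <- bkde(x, bandwidth = dpik(x))       # kernel density estimate f~ on a grid
    cdf  <- cumsum(dens$y) / sum(dens$y)       # numerical c.d.f. of f~
    tq   <- approx(cdf, dens$x, xout = ((1:N) - 0.5) / N,
                   ties = "ordered", rule = 2)$y   # quantiles t_1, ..., t_N
    gtilde <- sapply(tq, function(t0) local_poly_fit(x, y, t0, h = 1))
    ghat   <- pava(gtilde)                     # non-decreasing update of g~(t_q)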

3.5 Cross-Validation Bandwidth Selection

The optimal choice of the bandwidth should depend on the data at hand. To optimize the estimation of the nonparametric regression, the cross-validation bandwidth selection also incorporates the monotone constraint. Reviewing the estimated objective function (3.2), where $g(\cdot)$ is non-decreasing,
$$\hat{g}(x) = \operatorname{argmin}_g \int \{g(x) - \tilde{g}(x)\}^2 \tilde{f}(x)\,dx,$$
we can see that there are two bandwidths in the formula: one from the kernel density function $\tilde{f}(x)$ and the other from $\tilde{g}(x)$. The bandwidths for estimating $\tilde{f}(x)$ and $\tilde{g}(x)$ are different terms. In the previous section, $\hat{g}(x)$ is computed from the two thousand quantiles $t_q$ and the local polynomial estimates $\tilde{g}(t_q)$ in (3.3),
$$\hat{g}(x) = \operatorname{argmin}_g \sum_{q=1}^{N} \{g(t_q) - \tilde{g}(t_q)\}^2,$$
subject to the non-decreasing constraint. The bandwidth for estimating the kernel density $\tilde{f}(x)$ only helps determine the $t_q$; it is not as important as the bandwidth for $g(\cdot)$, so it is not discussed further in this thesis, and we obtain it by the classical direct plug-in methodology for selecting the bandwidth of a kernel density estimate. We focus on the bandwidth selection for $\tilde{g}(x)$ by the newly proposed cross-validation method; in the following, $h$ refers to the bandwidth for the estimated function $g(\cdot)$. Considering that $g(\cdot)$ is non-decreasing, the new cross-validation bandwidth selection also factors this monotone requirement into obtaining the bandwidth $h$.

In the cross-validation method for bandwidth selection, the smoothing parameter $h$ is selected by the data. The estimate $\hat{g}_h(x)$ is represented as
$$\hat{g}_h(x) = \operatorname{argmin}_g \int \{g(x) - \tilde{g}_h(x)\}^2 \tilde{f}(x)\,dx,$$
under the constraint that $g(\cdot)$ is non-decreasing. We use the leave-one-out cross validation method, in which the function estimate is trained on all the data except for one point and the prediction is evaluated at that point. The leave-one-out method predicts $\hat{g}_h(X_i)$ from the subsample $X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n$; that is, the local polynomial estimate $\tilde{g}(X_i)$ of the regression function is computed without using the $i$-th observation, before $\hat{g}_h(X_i)$ is obtained through PAVA. After obtaining all the estimates $\hat{g}_h(X_i)$, the cross validation function $\widehat{CV}(h)$ can be calculated:
$$\widehat{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} (\hat{g}_h(X_i) - Y_i)^2.$$
The cross-validation method selects the bandwidth by minimizing this estimate of the prediction error of the regression; the bandwidth $h_g$ is the minimizer of the cross-validation function $\widehat{CV}(h)$:
$$\hat{h}_{\widehat{CV}} = \operatorname{argmin}_h \widehat{CV}(h). \qquad (3.4)$$

In this way, $h_g$ is confirmed by (3.4), by finding the minimizer of $\widehat{CV}(h)$. In practice, there would be an overfitting problem when using the leave-one-out cross validation method, so deleting some marginal observations is necessary. In addition, the bandwidth obtained by the conventional cross-validation method may be very small, so setting a minimum value is also necessary. Using this CV bandwidth selection method, we need to avoid selecting an inappropriate bandwidth that is either too small or too large.
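A minimal sketch of the proposed selector is given below, again assuming data vectors x and y and the hypothetical local_poly_fit() and pava() sketches. For each left-out point it recomputes the PAVA-updated fit and trims marginal observations from the score; the trimming fraction and search interval are assumptions, and no minimum-bandwidth safeguard is shown.

    # Sketch: leave-one-out CV for the isotonic (PAVA-updated) estimate.
    cv_isotonic <- function(x, y, h, trim = 0.05) {
      ord <- order(x); x <- x[ord]; y <- y[ord]
      lo <- quantile(x, trim); hi <- quantile(x, 1 - trim)
      pred <- vapply(seq_along(x), function(i) {
        gt <- sapply(x, function(t0) local_poly_fit(x[-i], y[-i], t0, h))
        pava(gt)[i]                # isotonic update, then read off the i-th point
      }, numeric(1))
      mean(((y - pred)^2)[x >= lo & x <= hi])
    }
    # h_cv <- optimize(function(h) cv_isotonic(x, y, h), interval = c(0.5, 5))$minimum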

The cross validation method for bandwidth selection here is data driven and considers the monotone constraint on $g(\cdot)$. To assess the improvement, we evaluate the performance of the CV bandwidth by comparing it with the bandwidth of the plug-in method, which is also a widely used and useful method for bandwidth selection. The details are presented in the simulation chapter.

Chapter 4

Simulation Study

4.1 Fitting the Model

Following the methodology proposed in the previous chapter, simulations are conducted and the outcomes are discussed and analyzed in this chapter. The simulation is based on the pre-specified nonparametric problem (1.1).

Our objective is to estimate the unknown smooth function $g(x)$ based on the observations $(X_1, Y_1), \cdots, (X_n, Y_n)$. $X_i$ is generated from a mixture normal distribution,
$$X \sim 0.4 \times N(\mu_1 = 12, \sigma_1 = 3) + 0.6 \times N(\mu_2 = 24, \sigma_2 = 3),$$
with sample size $n = 300$. We assume that the noise term follows a normal distribution, $\varepsilon \sim N(0, 0.03)$, which also applies to the following simulations. Considering the study of the isotonic regression, $g(x)$ is set as a non-decreasing function, shown in Figure 4.1. Graphically, it is continuous, piecewise linear and non-decreasing:
$$g(x) = \begin{cases} 0, & x < 5, \\ 0.1x - 0.5, & 5 \le x < 10, \\ 0.5, & 10 \le x < 15, \\ 0.05x - 0.25, & 15 \le x < 25, \\ 1, & x \ge 25. \end{cases} \qquad (4.1)$$

Figure 4.1: True function g(x)
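A minimal R sketch of this setup is given below. The seed is arbitrary, and N(0, 0.03) is read as a variance of 0.03; both are assumptions.

    # Simulate the mixture-normal design and the true function (4.1).
    set.seed(1)                                 # arbitrary seed
    n <- 300
    comp <- rbinom(n, 1, 0.6)                   # 1 = the N(24, 3) component
    x <- ifelse(comp == 1, rnorm(n, 24, 3), rnorm(n, 12, 3))
    g <- function(x)
      ifelse(x < 5,  0,
      ifelse(x < 10, 0.1 * x - 0.5,
      ifelse(x < 15, 0.5,
      ifelse(x < 25, 0.05 * x - 0.25, 1))))
    y <- g(x) + rnorm(n, 0, sqrt(0.03))         # N(0, 0.03) read as variance 0.03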

Figure 4.2 is a sample scatter plot of the 300 simulated data points. As shown in the figure, the randomness of both $X$ and $\varepsilon$ shapes the simulated data: the randomness from $X$ produces an unequal scatter of the data points in the horizontal direction, while $\varepsilon$ contributes the noise vertically.

Figure 4.2: Scatter Plot of Simulated Data (Normal Distribution)

Following the methodology discussed in chapter 3, we first fit the local polynomial regression estimate $\tilde{g}(x)$. We choose the Gaussian kernel, one of the most popular kernel functions. As for the selection of the bandwidth $h$, we choose the bandwidth by the default plug-in method, to be compared with the proposed cross-validation method. The plug-in bandwidth selection can be realized by calling the dpill function in the R package KernSmooth; dpill uses direct plug-in methodology to select the bandwidth of a local linear Gaussian kernel regression estimate (Wand, 2015).
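The plug-in baseline uses the documented KernSmooth interface (Wand, 2015); only the variable names below are assumptions.

    # Plug-in bandwidth and a local linear fit via KernSmooth.
    library(KernSmooth)
    h_pi <- dpill(x, y)                                  # direct plug-in bandwidth
    fit  <- locpoly(x, y, degree = 1, bandwidth = h_pi)  # local linear fit on a grid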

Following the methodology in chapter 3, we transform the continuous problem into the discrete problem by calculating the two thousand quantiles $t_1$ up to $t_{2000}$. Correspondingly, $\tilde{g}(t_q)$ are obtained by the local polynomial regression method. The plot of $(t_q, \tilde{g}(t_q))$, which represents $\tilde{g}(x)$, is shown in Figure 4.3 below.

Figure 4.3: Local Polynomial Estimates $\tilde{g}(x)$

Clearly, $\tilde{g}(x)$ is not a non-decreasing function, and we can see some fluctuations at the constant parts, at $x \le 5$, $10 \le x < 15$, and $x \ge 25$.

After that, we apply PAVA to the $\tilde{g}(t_q)$ values so as to obtain the non-decreasing $\hat{g}(t_q)$. The plot of $\hat{g}(t_q)$ versus $t_q$ is presented in Figure 4.4; it shows the estimated weighted continuous isotonic function $\hat{g}(x)$.

Figure 4.4: Isotonic Regression Estimates $\hat{g}(x)$

From the figure, $\hat{g}(x)$ is a non-decreasing function which is closer to the true function $g(x)$. At the fluctuating parts of the curve, it is significantly improved compared with $\tilde{g}(x)$.

32 4.2 Numerical Performance Comparisons

4.2.1 L1 Distance Comparison

In a similar way, we choose the bandwidth by the proposed cross-validation method in place of the plug-in bandwidth; the subsequent steps to obtain the estimated isotonic regression curve by the PAVA algorithm are the same. We compare the estimation performance of both bandwidth selection methods in the local polynomial regression and the isotonic regression estimation. Based on the procedures mentioned in the previous section, we further generate one thousand simulations to analyze the numerical performance of the estimates. Since we know the true function in the simulation study, we can evaluate the performance of the estimation by calculating the L1 distance, which is commonly used to measure the sum of the absolute differences between the estimated values and the target values.

L1 distance, also known as Manhattan distance, is the sum of the absolute coordinate differences between two points; the L2-norm, which underlies least squares, minimizes the sum of the squared differences between the target and estimated values. Why is L1 distance chosen instead of L2 distance? Simply speaking, L2 distance squares the error, so errors that are not small are inflated. If an observation is an outlier, the fit would be pulled toward minimizing this single outlier case, while the many common examples, whose errors are small compared with that outlier, would be weakened. L1 distance is therefore relatively robust, which is the direct reason it is preferred here.

According to the definition, the L1 distance of the isotonic regression estimate $\hat{g}(x)$ can be defined as
$$L1_{\hat{g}} = \int |\hat{g}(x) - g(x)|\,dx,$$
and this non-negative continuous integral can be approximated by
$$L1_{\hat{g}} \approx \sum_{q=1}^{N} |\hat{g}(t_q) - g(t_q)|(t_q - t_{q-1}).$$
In a similar way, the L1 distance of the local polynomial estimate $\tilde{g}(x)$ can be written as
$$L1_{\tilde{g}} \approx \sum_{q=1}^{N} |\tilde{g}(t_q) - g(t_q)|(t_q - t_{q-1}).$$
For the L1 distance there are four cases altogether, combining the two regression methods with the two bandwidth selection methods: local polynomial regression with the plug-in method (L1.gtilde1), local polynomial regression with the cross validation method (L1.gtilde2), isotonic regression with the plug-in method (L1.ghat1), and isotonic regression with the cross validation method (L1.ghat2).
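The Riemann-sum approximation above translates directly into a short R sketch, assuming the tq grid and the estimates from the earlier sketches; taking the first spacing term as zero is a simplifying assumption.

    # L1 distance on the quantile grid.
    l1_dist <- function(est, truth, tq) sum(abs(est - truth) * c(0, diff(tq)))
    L1_ghat   <- l1_dist(ghat,   g(tq), tq)     # isotonic estimate
    L1_gtilde <- l1_dist(gtilde, g(tq), tq)     # local polynomial estimate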

After one thousand repetitions of the simulation, we obtain one thousand L1 distance values in each of the four cases. The results are summarized in the boxplot in Figure 4.5. The figure clearly shows that the isotonic regression method performs better than the local polynomial method; however, it also shows that the proposed cross validation method is no better than the plug-in method here. To explore the reasons, we check the simulations and the estimates carefully and find that the marginal estimates are not accurate: the errors are relatively large at the marginal area.

Figure 4.5: Boxplot of L1 Distance 1 (Normal Distribution)

It can also be seen that the data points are relatively sparse at the margins. The data is generated from mixed normal distributions, and according to the properties of the normal distribution, 95% of the data concentrates between $\mu - 2\sigma$ and $\mu + 2\sigma$; few points fall outside that range. That is why the data is sparse at the boundary of the simulation. With sparse data at the boundary, insufficient information is provided for the regression. Granted that there is a small performance improvement in the dense area, the boundary errors are so large that they impair the overall improvement; therefore the outcomes are not good.

To further test whether the proposed cross-validation method outperforms the plug-in method, we can modify the simulation to truncate the very sparse data and thus improve the overall performance. To weaken the boundary effects, we truncate the parts outside $\mu \pm 2\sigma$ when simulating the data, which guarantees that the boundary region is no longer sparse. Then we can test whether the new cross validation bandwidth selection method is better than the plug-in method. We still use the previous normal distributions, $N_1(\mu_1 = 12, \sigma_1 = 3)$ and $N_2(\mu_2 = 24, \sigma_2 = 3)$, but we truncate the data smaller than $\mu - 2\sigma$ and bigger than $\mu + 2\sigma$, so the simulated data will not be sparse at the boundary.

Following similar steps, $X_i$ is generated from a mixed truncated normal distribution with a sample size of 300, shown in Figure 4.6.

Then we obtain the bandwidths by both the plug-in method and the proposed cross-validation method. After confirming the bandwidth, the local polynomial regression is used for estimation. The next step is to apply PAVA to $\tilde{g}(t_q)$ to obtain $\hat{g}(t_q)$, realizing the weighted isotonic regression; this step is repeated with the bandwidth calculated by the cross validation method. The last step is to calculate the L1 distance to evaluate the improvement of the isotonic regression with the cross-validation method. After one simulation is done, one thousand simulations are generated to test the performance of the two regression methods under the two bandwidth selection methods. There are four categories for the performance evaluation, as before. The boxplot of the outcomes is shown in Figure 4.7.

36 Figure 4.6: Scatter Plot of Simulated Data (Truncated Normal Distribution)


Figure 4.7: Boxplot of L1 Distance 2 (Truncated Normal Distribution)

4.2.2 L1 Percentage Improvement

Based on one thousand simulations, the proposed cross-validation method for bandwidth selection is better than the plug-in method, as we expected; furthermore, the weighted isotonic regression method is undoubtedly a better choice than the local polynomial method. We can further calculate the percentage improvement of the comparisons. The improvement percentage of the isotonic regression compared with the local polynomial regression is
$$P1 = \frac{L1_{\tilde{g}} - L1_{\hat{g}}}{L1_{\tilde{g}}} \times 100\%.$$
Similarly, the improvement percentage of the cross validation method versus the plug-in method for bandwidth selection is
$$P2 = \frac{L1_{PI} - L1_{CV}}{L1_{PI}} \times 100\%.$$
In this scenario, the table of improvement percentages is as follows.

        Median    Mean
P1      9.63%     15.72%
P2      2.5%      3.85%

Table 4.1: Improvement Percentage (Truncated Normal Distribution)

From the perspective of the regression methods, the isotonic regression makes a big improvement of around 10%, because the monotonic constraint is fulfilled when $\hat{g}(x)$ is computed. That is to say, more information is used in the isotonic regression, so the performance is better. From the view of the bandwidth selection methods, the proposed cross-validation produces a smaller improvement of around 3%, and thus also contributes to the better performance of the estimates: compared with the plug-in method, the cross-validation makes full use of the information that the simulated data offers.

4.2.3 Confidence Interval

Next, we evaluate the performance via confidence bands based on the one thousand simulations. Confidence bands usually indicate where the true function should be located. In the simulation study, the bounds of the 95% confidence band are constructed from the 2.5% and 97.5% quantiles of $\hat{g}(t_q)$ and $\tilde{g}(t_q)$; in this way, 95% confidence bands are constructed from these bounds at the two thousand $t_q$ values. Because the confidence bands require a fixed set of horizontal ordinates, we take the average of the one thousand replicates of $t_1, t_2, \cdots, t_{1999}, t_{2000}$ and estimate $\hat{g}(x)$ on this common set of points; the confidence intervals for $\tilde{g}(x)$ and $\hat{g}(x)$ are constructed in the same way. On the whole, the confidence band of $\hat{g}(x)$ is very close to that of $\tilde{g}(x)$, and when comparing the bandwidth selection methods, the small improvement makes the curves even closer to each other. Therefore, the following figure (Figure 4.8) shows only part of the curve to present the distinction. We choose to plot $\tilde{g}_{PI}$ (gtilde1), $\hat{g}_{PI}$ (ghat1), and $\hat{g}_{CV}$ (ghat2) to evaluate the estimation further.
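A minimal sketch of the band construction, assuming a hypothetical matrix est_mat holding one row per simulation replicate of an estimate evaluated on the common t_q grid:

    # Pointwise 95% bands across the simulation replicates.
    band_lo <- apply(est_mat, 2, quantile, probs = 0.025)
    band_hi <- apply(est_mat, 2, quantile, probs = 0.975)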

In Figure 4.8, the black solid line locates the true function. The dotted line represents $\hat{g}_{PI}$ (ghat1) and the dotdash line represents $\hat{g}_{CV}$ (ghat2), while the red line represents $\tilde{g}_{PI}$ (gtilde1). A further figure takes a closer look at the confidence bands of $\hat{g}_{PI}$ and $\hat{g}_{CV}$; it is not difficult to spot that $\hat{g}_{CV}$ estimates the true function more precisely.

Figure 4.8: Partial plot of confidence bands 1

As can be seen from Figure 4.8, the isotonic regression outperforms the local polynomial regression because the former fulfills the monotone constraint: the isotonic estimates provide an improvement especially in the constant and slightly fluctuating regions. As for the strictly increasing part, the isotonic regression does not perform significantly better than the local polynomial regression. The reason is that the local polynomial regression already makes full use of the data in the neighborhood for estimation, while the weighted isotonic regression only adds a non-decreasing constraint, which is not needed when the data has a clear increasing trend; in this situation the isotonic regression does not contribute much. However, as for the bandwidth selection, a slight improvement can be seen when using the proposed cross validation selection method, because it also incorporates the monotone constraint. Since the improvement percentage in the bandwidth selection is relatively small, the curves of ghat1 and ghat2 almost overlap.

In summary, the comparisons of the L1 distance, the L1 improvement percentage and the confidence bands indicate that the isotonic regression outperforms the local polynomial regression for a non-decreasing data trend, and that the proposed cross-validation method is better than the plug-in method when sufficient data is provided for bandwidth selection. The better performance comes from factoring the monotone constraint into both the isotonic regression and the proposed cross validation method.

4.3 Numerical performance for other simulations

4.3.1 Simulated Data from Uniform Distribution

In the previous section, the simulation study was carried out based on mixture normal distribution data, and the outcome of the proposed cross validation method for bandwidth selection looks good after the marginal data is truncated. To weaken the effects of data points with extreme observations, we next adopt a uniform distribution for $X$, because uniformly distributed data is spread evenly over the specified interval. We take the same steps to test the performance of the estimates and the improvement. The simulated non-decreasing function $g(x)$ and the noise term $\varepsilon$ distribution are still the same; the only difference is that $X_i$ is simulated from a uniform distribution $X \sim U(\min = 5, \max = 30)$.

41 Figure 4.9: Scatter Plot of Simulated Data from Uniform Distribution

Following the same procedures as in the previous section, we obtain the following results. Figure 4.9 shows the simulated data points from the uniform distribution.

Again, the L1 distance performances are based on the results from 1000 simulations. The performance of the four categories' L1 distance is shown in the figure below.

42 Figure 4.10: Boxplot of L1 Distance 3 (Uniform distribution)

As we expected, the proposed cross validation method improves the estimation for both the local polynomial regression and the isotonic regression method, and the isotonic regression method combined with the proposed cross-validation method outperforms the others. The improvement percentage table is as follows.

        Median    Mean
P1      1.40%     3.67%
P2      5.04%     5.27%

Table 4.2: Improvement Percentage (Uniform Distribution)

It is easy to see that in this scenario the bandwidth selection by the proposed cross validation method still improves the performance, since sufficient information is offered over the whole area. However, the isotonic regression does not improve the performance as much as in the previous scenario. This is due to the fact that the isotonic regression improves fluctuating data regions significantly, especially when insufficient data appear around them. Since the simulated data in this scenario is evenly distributed, sufficient data surrounds the flat areas, so the local polynomial regression already obtains enough information to estimate accurately; hence no big improvement is seen from the isotonic regression method.

The confidence bands also come from the 1000 simulations; they correspond to the conclusions above.

44 Figure 4.11: Partial plot of confidence bands 2

4.3.2 Numerical Performance for Mixture Uniform Distribution

In this part, $X_i$ is simulated from a mixture uniform distribution, which comprises three pieces of uniform distributions.

The simulated non-decreasing function $g(x)$ and the noise term $\varepsilon$ distribution remain the same.

(X1,X2, ··· ,X30) ∼ U(10, 15)

(X31,X32, ··· ,X300) ∼ 0.4 × U(3, 10) + 0.6 × U(15, 30)

45 Again, the simulation size is 1000. The outcomes of one thousand L1 distance values for the four categories are summarized in the boxplot in the figure below.

Figure 4.12: Scatter Plot of Simulated Data (Mixture Uniform Distribution)

46 Figure 4.13: Boxplot of L1 Distance 4 (Mixture Uniform Distribution)

Besides, the improvement percentage is in the table below.

        Median    Mean
P1      1.00%     2.99%
P2      4.02%     3.68%

Table 4.3: Improvement Percentage (Mixture Uniform Distribution)

The confidence bands plot is in Figure 4.14.

47 Figure 4.14: Partial plot of confidence bands 3

From the figures above, we can draw similar conclusions. Therefore, we conclude that the proposed least squares cross validation selection method is effective in improving the performance of the regression methods, especially the isotonic regression.

4.4 Numerical performance for other functions

In this section, other functions are utilized for the simulations, and the numerical performances are examined to verify the inferences. We further define two functions:
$$f(x) = \begin{cases} 0, & x < 5, \\ \log(x/5), & 5 \le x < 10, \\ \log(2), & 10 \le x < 20, \\ \log(x/10), & 20 \le x < 30, \\ \log(3), & x \ge 30, \end{cases} \qquad (4.2)$$

$$h(x) = \begin{cases} \dfrac{\exp(x/10)}{8}, & x < 10, \\[4pt] \dfrac{\exp(1)}{8}, & 10 \le x < 20, \\[4pt] \dfrac{1}{8}\exp\!\left(\dfrac{x - 10}{10}\right), & 20 \le x < 30, \\[4pt] \dfrac{\exp(2)}{8}, & x \ge 30. \end{cases} \qquad (4.3)$$
The true functions f(x) and h(x) are shown in Figures 4.15 and 4.16.

Figure 4.15: True function f(x)

49 Figure 4.16: True function h(x)

The simulated data are again from a uniform distribution, $X_i \sim U(5, 35)$, and the noise term $\varepsilon$ still follows the same normal distribution $\varepsilon \sim N(0, 0.03)$.

After one thousand simulations, the numerical performances are presented below.

50 Figure 4.17: Boxplot of L1 Distance 5

51 Figure 4.18: Boxplot of L1 Distance 6

From the performance comparisons above, it can be seen that the proposed cross-validation method performs better than the plug-in method. The L1 distance of the local polynomial estimate is larger than that of the isotonic regression on the whole; the isotonic regression method performs well in estimating the constant parts of the piecewise constant function in comparison with the local polynomial regression.

Furthermore, the improvement percentages are shown below.

        Median    Mean
P1      5.09%     7.21%
P2      7.37%     7.15%

Table 4.4: Improvement Percentage for f(x)

        Median    Mean
P1      2.45%     4.46%
P2      6.71%     6.73%

Table 4.5: Improvement Percentage for h(x)

The confidence bands illustrate the performance further.

Figure 4.19: Partial plot of confidence bands 4

53 Figure 4.20: Partial plot of confidence bands 5

In these cases, the confidence bands for the cross-validation bandwidth selection are thinner than those for the plug-in method. Moreover, the confidence bands for the local polynomial estimation are slightly wider than those for the isotonic regression, which shows that the isotonic method outperforms the local polynomial method in estimating the true function. In summary, the isotonic regression with the cross-validation method is closest to the true function.

4.5 Numerical performance for increasing functions

To simplify the steps, we still use the simulated data $X$ from the uniform distribution in subsection 4.3.1. The increasing functions are as follows.

Logarithmic function:
$$l(x) = \log(x/10).$$

Exponential function:
$$e(x) = \frac{\exp(x/10)}{30}.$$

Figure 4.21: True function l(x)

55 Figure 4.22: True function e(x)

To guarantee that the simulated data points scatter reasonably around the true function, the variance of the normal noise term $\varepsilon$ varies between the scenarios. For these scenarios, the L1 distances are computed for the four categories of methods.

56 Figure 4.23: Boxplot of L1 Distance 7

57 Figure 4.24: Boxplot of L1 Distance 8

Apparently, the isotonic regression method shows no improvement in this scenario; however, the cross validation method still improves the performance, as can be seen in the improvement percentage tables below.

        Median    Mean
P1      0%        0.0014%
P2      6.597%    5.714%

Table 4.6: Improvement Percentage for l(x)

        Median    Mean
P1      0%        0.208%
P2      5.237%    3.485%

Table 4.7: Improvement Percentage for e(x)

Confidence bands are summarized below in Figure 4.25 and Figure 4.26. As discussed before, the figures show that the isotonic regression has almost no improvement over the local polynomial regression when estimating data with an increasing trend. However, the cross validation bandwidth selection method still contributes to the estimation; therefore, we focus only on the improvement of the cross validation method.

Figure 4.25: Partial plot of confidence bands 6

59 Figure 4.26: Partial plot of confidence bands 7

In practice, the cross-validation bandwidth selection matters in improving the nonparametric regression. In the overall evaluation of all the simulations, the proposed cross-validation bandwidth selection in the isotonic regression outperforms the other settings. Undoubtedly, the isotonic estimates here are obtained by updating the local polynomial estimates through PAVA; the monotone constraint is fulfilled, so the isotonic regression estimates are more accurate. Where the simulated functions are strictly increasing, PAVA has no effect, so the two regression methods are almost the same, and the critical point lies in the bandwidth selection.

Chapter 5

Application to Real World Data Set

After the detailed discussion of bandwidth selection for the regression problem, we apply the proposed cross validation method to real data. The data comes from Data Set C.11 IPO in Appendix C of Kutner (2005).

It is commonly known that private companies usually issue shares of stock to go public, which is referred to as an Initial Public Offering (IPO). Here is a data set which consists of information on 482 IPOs. The data set originally consists of five variables; we focus on the relationship between the face value of the company and the number of shares offered. On average, the intuition is that the number of shares offered is non-decreasing as the company face value increases, so we apply the cross validation method with the weighted isotonic regression fitting to the data set. It is reasonable to estimate the number of shares offered from the face value of the company on the whole. Specifically, the data set is divided into two parts, according to whether the company obtained venture capital funding, and regression fitting is done separately for each part.

We apply the method mentioned above to the two different cases, with or without venture funding. For the part with venture funding, there are 212 data points for regression fitting. Given the intuition that the number of shares offered tends to increase with the face value of the company, the weighted isotonic regression is a better choice for the fitting. We also use the proposed cross validation method to confirm the bandwidth. Figure 5.1 presents the local polynomial regression fit and the cross validation isotonic regression fit.

Figure 5.1: Two Regressions Comparison Case 1

For those companies that have no venture funding, 270 records can be found. From the scatter plot, we can see that the data concentrates in the area with relatively low company face value. As discussed in the previous chapter, the cross validation bandwidth selection works better in the dense area of the data points, and the marginal data should be ignored when calculating the cross validation scores to confirm the bandwidth; otherwise the sparse data would bias the cross validation selection, just like the marginal area of the normal distribution.

Figure 5.2: Two Regressions Comparison Case 2

Chapter 6

Conclusions

In this thesis, we propose a new cross validation method for bandwidth selection which considers the monotone constraint that $g(\cdot)$ is non-decreasing. We evaluate the estimation performance of the cross validation method by comparing it with the plug-in method, for both the local polynomial and the isotonic regression estimation, via one thousand simulations. Besides, the weighted isotonic regression is used to solve the non-parametric problem by updating the classical local polynomial estimates, so that the resulting estimate satisfies the requirement of non-decreasing monotonicity. Since both the weighted isotonic regression and the proposed bandwidth selector fulfill the monotone constraint that $g(\cdot)$ is a non-decreasing smooth function, the cross-validation bandwidth selection combined with the weighted isotonic regression outperforms the other choices in the non-parametric estimation.

Furthermore, more simulations with various distributions and functions are carried out to examine the numerical performance. Based on the boxplots of the L1 distance, it is clear that, for the bandwidth selection of the regression estimates, the proposed cross-validation performs better than the plug-in method if sufficient data can be used for estimation. This is due to the fact that the cross-validation method makes full use of the data in the neighbourhood and incorporates the non-decreasing constraint on $g(x)$. Besides, the isotonic estimates outperform the local polynomial estimates because the isotonic regression fulfils the monotone requirement. Therefore, the isotonic regression with cross-validation bandwidth selection has the best performance when the simulated data follows the truncated normal distribution or the uniform distribution. For the strictly increasing functions, the isotonic regression does not outperform the local polynomial regression, but the proposed cross-validation bandwidth method is still better than the plug-in method.

Although the proposed cross validation method works well to some extent, there is still much research work to be done in the future. This thesis only compares the numerical performance in L1 distance for two bandwidth selection methods; other bandwidth selection methods could be included in the comparisons. The proposed cross validation method may not be the optimal choice in certain cases, and further examination can be done to evaluate the numerical performance using more varied functions. Besides, the cross-validation method for bandwidth selection may not apply to all kinds of simulated data distributions, so simulations from more distributions can be conducted; similarly, more real data cases could be used to test the proposed cross validation method. Moreover, rigorous theoretical proofs could be developed, since only intuitive explanations are provided in this thesis. In addition, since the constraint in the isotonic regression is non-decreasing, the non-increasing case can still be handled with slight revisions; if other constraints are imposed, what improvements could be made to the proposed cross validation method? All in all, it is strongly advised that some future work be done to improve the performance further.

Bibliography

Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., and Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4):641–647.

Barlow, R. E. (1972). Statistical inference under order restrictions: the theory and application of isotonic regression. J. Wiley, New York; London.

Fan, J. and Gijbels, I. (1992). Variable bandwidth and local linear regression smoothers. The Annals of Statistics, 20(4):2008–2036.

Fan, J. and Gijbels, I. (1996). Local polynomial modelling and its applications, volume 66. Chapman & Hall, London, 1st edition.

Gasser, T., Kneip, A., and Kohler, W. (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86(415):643–652.

Groeneboom, P. and Jongbloed, G. (2010). Generalized continuous isotonic regression. Statistics and Probability Letters, 80(3):248–253.

Hardle, W. and Marron, J. S. (1985). Optimal bandwidth selection in nonparametric regression function estimation. The Annals of Statistics, 13(4).

Kutner, M. H., et al. (2005). Applied linear statistical models. McGraw-Hill Irwin, Boston, 5th edition.

66 Mammen, E. (1991). Estimating a smooth monotone regression function. The Annals of Statistics, 19(2):724–740.

Rice, J. (1984). Bandwidth choice for nonparametric regression. The Annals of Statistics, 12(4):1215–1230.

Robertson, T., Wright, F. T., and Dykstra, R. (1988). Order restricted statistical inference. Wiley, Chichester; New York.

Ruppert, D., Sheather, S. J., and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90(432):1257–1270.

Wand, M. (2015). KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995). R package version 2.23-15.

Wand, M. P. and Jones, M. C. (1995). Kernel smoothing, volume 60. Chapman & Hall, London, 1st edition.

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 26(4):359–372.
