Nonlinear Mixed Effects Models for Longitudinal Data Master Thesis in Statistics and Data Mining

LINKÖPING UNIVERSITY

Ra'id Mahbouba
6/3/2015

Abstract

The main objective of this master thesis is to explore the effectiveness of nonlinear mixed effects models for longitudinal data. Mixed effects models make it possible to investigate the nature of the relationship between time-varying covariates and the response while also capturing the variation between subjects. I investigate the robustness of the longitudinal models by building up the complexity of the models, starting from multiple linear models and ending with additive nonlinear mixed models. I use a dataset in which firms’ leverage is explained by four explanatory variables in addition to a grouping factor, the firm. The models are compared using comparison statistics such as AIC and BIC and by visual inspection of residuals. The likelihood ratio test is used for some nested models only. The models are estimated by maximum likelihood and restricted maximum likelihood. The most efficient model is the nonlinear mixed effects model, which has the lowest AIC and BIC. The multiple linear regression model failed to explain the relation and produced unrealistic statistics.

Keywords: Longitudinal data, machine learning techniques, splines, mixed effects, leverage.

Acknowledgements

First and foremost I would like to thank the great country of Sweden for everything. I am fortunate to have decided to come to Sweden, and it was the best decision I ever made. So, thank you very much Sweden, and many thanks to the Swedish people. I would also like to express my deepest appreciation to my beloved family; your support, encouragement and love pushed me forward towards success. I would like to greatly thank my supervisor, Professor Mattias Villani, who has helped and guided me throughout my thesis work and other courses, especially with R programming.
Thank you Mattias, you are to me more a friend than a teacher, and you mean a lot to me. I would also like to thank Professors Anders Grimvall, Oleg Sysoev, Anders Nordgaard and Lotta Hallberg for having been a great support throughout my Master's programme at Linköping University. Furthermore, I would like to thank my opponent Jithu Viswanath for his suggestions for improvement and the discussions about the thesis. Thank you Jithu, they were really good comments. Lastly, I would like to thank my friends who have supported me in different ways during this thesis work, both with encouragement and with opinions about my work. Thank you all.

Table of Contents

1. Introduction
    Background
    Objective
2. Data
    Cleansing of data
    Variables
3. Methods
    Multiple linear regression
    Mixed effects models
    Non-linear Mixed Models
    Interclass correlation
    Estimation methods
    Model Comparison
4. Results
    Multiple linear model vs longitudinal model
    Linear Mixed Model
    Non-linear Mixed Models
5. Discussion
    Multiple linear model vs longitudinal model
    Linear Mixed Model
    Non-linear Mixed Models
6. Conclusions and future work
Bibliography
Appendix A Tables
Appendix B Figures

We are what we repeatedly do. Excellence, therefore, is not an act, but a habit. – Aristotle.

1. Introduction

In a single-level dataset, investigating how the relation between covariates changes over time involves studying the relation both between and within subjects. The repeated measurements of financial performance taken from firms direct our attention to two types of information: changes between subjects (i.e. firms in this study) and changes within individual firms. To carry out this investigation, we first need to explain some important terms, such as longitudinal and cross-sectional. According to the Oxford Dictionary, the term longitudinal means “running lengthwise rather than across”. Longitudinal information is therefore a synonym for “within-subject information”, i.e. information that comes from repeated measures collected on one firm over time. Another important term is cross-sectional, which according to the Oxford Dictionary means “cut through something, especially at right angles to an axis”. Cross-sectional information is therefore a synonym for “between-firm information”, i.e. information that comes from measures collected from different firms at any given time point. We will, however, refer to the combination of these two approaches as “Longitudinal Analysis” and to the models used as longitudinal models, following other researchers who have worked on the same type of analysis.

A longitudinal data approach has several advantages over a cross-sectional approach; see e.g. Chen & Hammes (2003). It is often used in empirical financial research because, as the number of data points increases, the degrees of freedom increase and collinearity among explanatory variables is reduced, which improves the efficiency of econometric estimates. Another advantage of longitudinal data is that it can control for individual heterogeneity due to hidden factors, which, if neglected in time-series or cross-section estimation, leads to biased results.
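The bias from neglected heterogeneity can be made concrete with a small simulation. The following is a minimal sketch, not the thesis's own R code and not one of its models: it assumes a hypothetical panel of 200 firms observed for 10 years, with a hidden firm effect `alpha` that is correlated with a single covariate and a true slope of 1.0. It compares a pooled regression that ignores firms with a regression on firm-demeaned data (the within, or fixed-effects, transformation, used here as a simple stand-in for the random effects models discussed in the thesis). All names and parameter values are illustrative assumptions.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n_firms=200, n_years=10, beta=1.0, seed=1):
    """Simulate a panel in which a hidden firm effect is correlated with x."""
    rng = random.Random(seed)
    firm, x, y = [], [], []
    for i in range(n_firms):
        alpha = rng.gauss(0.0, 1.0)  # hidden firm-specific effect (assumed)
        for _ in range(n_years):
            x_it = alpha + rng.gauss(0.0, 1.0)  # covariate correlated with alpha
            y_it = alpha + beta * x_it + rng.gauss(0.0, 0.5)
            firm.append(i)
            x.append(x_it)
            y.append(y_it)
    return firm, x, y


def demean_by_firm(firm, values):
    """Subtract each firm's own mean, leaving only within-firm variation."""
    totals, counts = {}, {}
    for f, v in zip(firm, values):
        totals[f] = totals.get(f, 0.0) + v
        counts[f] = counts.get(f, 0) + 1
    return [v - totals[f] / counts[f] for f, v in zip(firm, values)]


firm, x, y = simulate()

# Pooled regression: treats all firm-year observations as one cross-section.
pooled_slope = ols_slope(x, y)

# Within regression: uses only variation inside each firm over time.
within_slope = ols_slope(demean_by_firm(firm, x), demean_by_firm(firm, y))

print(f"pooled slope: {pooled_slope:.2f}, within slope: {within_slope:.2f}")
```

Under these assumptions the pooled slope absorbs part of the hidden firm effect and overstates the true value, while the within slope stays close to 1.0, because demeaning removes everything that is constant within a firm.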
In this thesis, heterogeneity is captured by firm-specific or random effects, depending on the characteristics of the dataset.

In Chapter 2, I describe the dataset and how I clean it. In Chapter 3, I detail the models and methods used in the thesis. The models are multiple linear regression, mixed effects models and non-linear mixed models. Each section gives a detailed description of the method and describes how the model is estimated in the R language. Chapter 3 also contains additional sections that describe concepts such as the interclass correlation, parameter estimation, the likelihood ratio, the AIC and BIC criteria and the standard error. Chapter 4 describes the results. The most challenging part is visualizing the performance of the longitudinal models, because of the huge number of groups/firms. I therefore depend mainly on AIC, BIC and residual plots when comparing models. All the results in this thesis were produced in R.

Background

The motivation for analysing leverage comes from research based on measurements of financial performance taken from the earlier study by Rajan & Zingales (1995). That study took a comparative and investigative approach aimed at determining whether capital structure in other countries is related to factors similar to those that appear to influence capital structure in the USA. Rajan and Zingales were the first to conceive of and collect this information on firms from seven countries. Later researchers have continued to use and update the dataset. Rajan & Zingales (1995) carried out empirical tests for the period 1990 through 1995 on firms in Canada, Denmark, Germany, Italy, Sweden, the UK and the USA. Other researchers refer to this dataset as the Rajan & Zingales dataset, and I will use the same terminology in this thesis. Villani et al. (2012) analyzed
