Partial Variable Selection and Its Applications in Biostatistics
Jingwen Gu1, Ao Yuan1,2*, Chunxiao Zhou2, Leighton Chan2 and Ming T Tan1*

International Journal of Clinical Biostatistics and Biometrics
ISSN: 2469-5831
Gu et al. Int J Clin Biostat Biom 2018, 4:017
DOI: 10.23937/2469-5831/1510017
Volume 4 | Issue 1
Open Access
RESEARCH ARTICLE

1Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, USA
2Epidemiology and Biostatistics Section, Rehabilitation Medicine Department, Clinical Center, National Institutes of Health, USA

*Corresponding author: Ao Yuan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington DC 20057, USA; Epidemiology and Biostatistics Section, Rehabilitation Medicine Department, Clinical Center, National Institutes of Health, Bethesda MD 20892, USA; Ming T Tan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington DC 20057, USA

Citation: Gu J, Yuan A, Zhou C, Chan L, Tan MT (2018) Partial Variable Selection and its Applications in Biostatistics. Int J Clin Biostat Biom 4:017. doi.org/10.23937/2469-5831/1510017
Received: January 08, 2018: Accepted: April 12, 2018: Published: April 14, 2018
Copyright: © 2018 Gu J, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

We propose and study a method for partial covariate selection, which selects only those covariate values that fall in their effective ranges. The coefficient estimates based on the resulting data are more interpretable, since they rest on the effective covariate values. This is in contrast to existing methods of variable selection, in which variables are selected or deleted as a whole. To test the validity of the partial variable selection, we extend the Wilks theorem to handle this case. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data analysis as illustration.

Keywords

Covariate, Effective range, Partial variable selection, Linear model, Likelihood ratio test

Introduction

Variable selection is a common practice in biostatistics, and there is a vast literature on the topic. Commonly used methods include the likelihood ratio test [1], the Akaike information criterion (AIC) [2], the Bayesian information criterion (BIC) [3], the minimum description length [4,5], stepwise regression, and the Lasso [6]. The principal components method models linear combinations of the original covariates and reduces a large number of covariates to a handful of major principal components, but the result is not easy to interpret in terms of the original covariates. Stepwise regression starts from the full model and deletes covariates one by one according to some measure of statistical significance. May, et al. [7] addressed variable selection in artificial neural network models, and Mehmood, et al. [8] reviewed variable selection with partial least squares models. Wang, et al. [9] addressed variable selection in generalized additive partial linear models. Liu, et al. [10] addressed variable selection in semiparametric additive partial linear models. The Lasso [6,11] and its variations [12,13] are used to select a few significant variables in the presence of a large number of covariates.

However, existing methods only select whole variable(s) to enter the model, which may not be the most desirable approach in some biomedical practice. For example, in two heart disease studies [14,15], more than ten risk factors were identified by medical researchers over long-term investigations. With the existing variable selection methods, some of these risk factors would be deleted wholly from the investigation. This is not desirable, since risk factors are really risky only when they fall into certain risk ranges. Deleting the whole variable(s) therefore seems unreasonable in this case; a more reasonable way is to find the risk ranges of these variables and delete the variable values in the non-risky ranges. In some other studies, some of the covariate values may be just random errors that do not contribute to the responses, and removing these values will make the model interpretation more accurate. In this sense, we select a variable only when its value falls within some range. To our knowledge, a method for this kind of partial variable selection has not appeared in the literature, and developing one is the goal of our study.

Note that in existing methods of variable selection, variables are selected or deleted as a whole, while in our method some variable(s) are partially selected or deleted, i.e., only some proportion of the observations of a variable is selected or deleted. The latter is very different from the existing methods. In summary, with traditional variable selection methods such as stepwise regression or the Lasso, a covariate is either removed wholly from the analysis or not at all. This is not very reasonable: some of the removed covariates may be partially effective, so removing all their values may yield misleading results, or at least lose information, while for the variables remaining in the model, not all their values are necessarily effective for the analysis. With the proposed method, only the non-effective values of the covariates are removed, and the effective values are kept in the analysis. This is more reasonable than the existing all-or-nothing removal.

For the existing methods of deleting whole variable(s), the validity of the selection can be justified using the Wilks result: under the null hypothesis of no effect of the deleted variable(s), twice the log-likelihood ratio is asymptotically chi-squared distributed. We extend the Wilks theorem to the proposed partial variable deletion and use it to justify the partial deletion procedure. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to analyze a real data set as illustration.

The Proposed Method

The observed data are $(y_i, \mathbf{x}_i)$ $(i = 1, \ldots, n)$, where $y_i$ is the response and $\mathbf{x}_i \in R^d$ is the vector of covariates of the $i$-th subject. Denote $\mathbf{y}_n = (y_1, \ldots, y_n)'$ and $X_n = (\mathbf{x}_1, \ldots, \mathbf{x}_n)'$, the $n \times d$ matrix whose $i$-th row is $\mathbf{x}_i'$. Consider the linear model

$$\mathbf{y}_n = X_n \boldsymbol{\beta} + \boldsymbol{\epsilon}_n, \qquad (1)$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)'$ is the vector of regression parameters and $\boldsymbol{\epsilon}_n = (\epsilon_1, \ldots, \epsilon_n)'$ is the vector of random errors, or residual departures from the linear model assumption. Without loss of generality we consider the case in which the $\epsilon_i$'s are independent and identically distributed (iid), so that $\mathrm{Var}(\boldsymbol{\epsilon}_n) = \sigma^2 I_n$, where $I_n$ is the $n$-dimensional identity matrix. When the $\epsilon_i$'s are not iid, it is often assumed that $\mathrm{Var}(\boldsymbol{\epsilon}_n) = \Omega$ for some known positive-definite $\Omega$. Making the transformation $\tilde{\mathbf{y}}_n = \Omega^{-1/2} \mathbf{y}_n$, $\tilde{X}_n = \Omega^{-1/2} X_n$ and $\tilde{\boldsymbol{\epsilon}}_n = \Omega^{-1/2} \boldsymbol{\epsilon}_n$, we get the model $\tilde{\mathbf{y}}_n = \tilde{X}_n \boldsymbol{\beta} + \tilde{\boldsymbol{\epsilon}}_n$, in which the $\tilde{\epsilon}_i$'s are uncorrelated with $\mathrm{Var}(\tilde{\boldsymbol{\epsilon}}_n) = I_n$.

Let $f(\cdot)$ be the density of the errors $\epsilon_i$, possibly with unknown parameter(s). For simplicity of discussion we assume there are no unknown parameters. Then the log-likelihood is

$$\ell_n(\boldsymbol{\beta}) = \sum_{i=1}^{n} \log f(y_i - \mathbf{x}_i' \boldsymbol{\beta}).$$

Let $\hat{\boldsymbol{\beta}}$ be the Maximum Likelihood Estimate (MLE) of $\boldsymbol{\beta}$ (when $f(\cdot)$ is the standard normal density, $\hat{\boldsymbol{\beta}}$ is just the least squares estimate). If we delete $k\ (\le d)$ columns of $X_n$ and the corresponding components of $\boldsymbol{\beta}$, denote the remaining covariate matrix by $X_n^-$, the resulting $\boldsymbol{\beta}$ by $\boldsymbol{\beta}^-$, and the corresponding MLE by $\hat{\boldsymbol{\beta}}^-$. Under the hypothesis $H_0$ that the deleted columns of $X_n$ have no effect, or equivalently that the deleted components of $\boldsymbol{\beta}$ are all zero, we have asymptotically [1]

$$2\left[\ell_n(\hat{\boldsymbol{\beta}}) - \ell_n(\hat{\boldsymbol{\beta}}^-)\right] \xrightarrow{D} \chi^2_k,$$

where $\chi^2_k$ is the chi-squared distribution with $k$ degrees of freedom. For a given nominal level $\alpha$, let $\chi^2_k(1-\alpha)$ be the $(1-\alpha)$-th quantile of the $\chi^2_k$ distribution. If $2[\ell_n(\hat{\boldsymbol{\beta}}) - \ell_n(\hat{\boldsymbol{\beta}}^-)] \ge \chi^2_k(1-\alpha)$, then $H_0$ is rejected at significance level $\alpha$, and these columns of $X_n$ should not be deleted; otherwise we accept $H_0$ and delete these columns of $X_n$.

There are other methods to select columns of $X_n$, such as AIC, BIC and their variants, as in the model selection field. In these methods, the optimal deletion of columns of $X_n$ corresponds to the best model, namely the one that optimizes the AIC or BIC (minimizes it under the usual sign convention). These methods are less rigorous than the likelihood ratio test above, since choosing the model that optimizes the AIC or BIC may sometimes depend on visual inspection.

All the above methods require the models under consideration to be nested within each other, i.e., one is a sub-model of the other. A more general model selection criterion is the Minimum Description Length (MDL) criterion, a measure of complexity developed by Kolmogorov [4], Wallace and Boulton (1968) [16], and others. The Kolmogorov complexity is closely related to entropy: for the output of a Markov information source, the Kolmogorov complexity normalized by the length of the output converges almost surely (as the length of the output goes to infinity) to the entropy of the source. Let $\{g(\cdot,\cdot)\}$ be a finite set of candidate models under consideration, and let $\Theta = \{\theta_j : j = 1, \ldots, h\}$ be the set of parameters of interest. $\theta_i$ may or may not be nested within some other $\theta_j$, or $\theta_i$ and $\theta_j$, both in $\Theta$, may have the same dimension but different parametrizations. Next consider a fixed density, with parameter running
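The likelihood ratio test for deleting whole columns of $X_n$, reviewed in this section, can be illustrated with a short sketch. It assumes the special case mentioned in the text where $f(\cdot)$ is the standard normal density, so the MLE is the least squares estimate. The function names (`solve`, `max_loglik`, `chi2_sf`, `lrt_delete_columns`) and the simulated data are hypothetical and only for illustration; this is the classical whole-column test, not the authors' partial selection procedure.

```python
import math
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def max_loglik(X, y):
    """Maximized log-likelihood, assuming f is the standard normal density.

    Under that assumption the MLE is the least squares estimate, obtained
    here from the normal equations X'X beta = X'y.
    """
    d = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(d)] for a in range(d)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(d)]
    beta = solve(XtX, Xty)
    rss = sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    return -0.5 * len(y) * math.log(2 * math.pi) - 0.5 * rss


def chi2_sf(x, k):
    """P(chi-squared with integer k degrees of freedom > x)."""
    z = max(x, 0.0) / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)             # closed form for k = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # closed form for k = 1
    while a < k / 2.0:                       # step up two degrees at a time
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt_delete_columns(X, y, drop, alpha=0.05):
    """Test H0: the columns listed in `drop` have zero coefficients."""
    keep = [j for j in range(len(X[0])) if j not in drop]
    X_minus = [[r[j] for j in keep] for r in X]
    stat = 2.0 * (max_loglik(X, y) - max_loglik(X_minus, y))
    p_value = chi2_sf(stat, len(drop))
    return stat, p_value, p_value <= alpha   # True means: reject H0, keep columns


# Simulated data (illustrative only): the third coefficient is truly zero.
rng = random.Random(0)
beta_true = [1.0, 2.0, 0.0]
X = [[rng.gauss(0, 1) for _ in beta_true] for _ in range(200)]
y = [sum(b * x for b, x in zip(beta_true, r)) + rng.gauss(0, 1) for r in X]
for drop in ([2], [1]):
    stat, p_value, reject = lrt_delete_columns(X, y, drop)
    print(drop, round(stat, 3), p_value, "keep" if reject else "delete")
```

Comparing the p-value with $\alpha$ is equivalent to comparing the statistic with the quantile $\chi^2_k(1-\alpha)$; the p-value form is used here only to avoid needing a quantile routine.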