Daniel Zelterman Applied Multivariate Statistics with R Statistics for Biology and Health
Statistics for Biology and Health

Series Editors
Mitchell Gail
Jonathan M. Samet
Anastasios Tsiatis
Wing Wong

More information about this series at http://www.springer.com/series/2848

Daniel Zelterman
School of Public Health
Yale University
New Haven, CT, USA

ISSN 1431-8776
ISSN 2197-5671 (electronic)
Statistics for Biology and Health
ISBN 978-3-319-14092-6
ISBN 978-3-319-14093-3 (eBook)
DOI 10.1007/978-3-319-14093-3
Library of Congress Control Number: 2015942244

Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Barry H Margolin
1943–2009
Teacher, mentor, and friend

Permissions

R is Copyright © 2014 The R Foundation for Statistical Computing. R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/.

We are grateful for the use of data sets, used with permission as follows:

Public domain data in Table 1.4 was obtained from the U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2007 Incidence and Mortality Web-based Report. Atlanta: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2010. Available online at www.cdc.gov/uscs.

The data of Table 1.5 is used with permission of S&P Dow Jones.

The data of Table 1.6 was collected from www.fatsecret.com and used with permission. The idea for this table was suggested by the article originally appearing in The Daily Beast, available online at http://www.thedailybeast.com/galleries/2010/06/17/40-unhealthiest-burgers.html.

The data in Table 7.1 is used with permission of S&P Dow Jones Indices.

The data in Table 7.4 was gathered from www.fatsecret.com and used with permission. The idea for this table was suggested by the article originally appearing in The Daily Beast, available online at http://www.thedailybeast.com/galleries/2010/10/18/halloween-candy.html.

The data in Table 7.5 was obtained from https://health.data.ny.gov/.

The data in Table 8.2 is used with permission of Dr. Deepak Nayaran, Yale University School of Medicine.

The data in Table 8.3 was reported by the National Vital Statistics System (NVSS): http://www.cdc.gov/nchs/deaths.htm.

The data in Table 9.3 was generated as part of the Medicare Current Beneficiary Survey: http://www.cms.gov/mcbs/ and available on the cdc.gov website.

The data in Table 9.4 was collected by the United States Department of Labor, Mine Safety and Health Administration.
This data is available online at http://www.msha.gov/stats/centurystats/coalstats.asp.

The data of Table 9.6 is used with permission of Street Authority and David Sterman.

The data in Table 11.1 is adapted from Statistics Canada (2009) “Health Care Professionals and Official-Language Minorities in Canada 2001 and 2006,” Catalog no. 91-550-X, Text Table 1.1, appearing on page 13.

The data source in Table 11.5 is copyright 2010, Morningstar, Inc., Morningstar Bond Market Commentary September, 2010. All Rights Reserved. Used with permission.

The data in Table 11.6 is cited from US Department of Health and Human Services, Administration on Aging. The data is available online at http://www.aoa.gov/AoARoot/Aging_Statistics/index.aspx.

The website http://www.gastonsanchez.com provided some of the code appearing in Output 11.1.

Global climate data appearing in Table 12.2 is used with permission: Climate Charts & Graphs, courtesy of Kelly O’Day.

Tables 13.1 and 13.6 are reprinted from Stigler (1994) and used with permission of the Institute of Mathematical Statistics.

In Addition

Several data sets referenced and used as examples were obtained from the UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (Bache and Lichman 2013).

Grant Support

The author acknowledges support from grants from the National Institute of Mental Health, National Cancer Institute, and the National Institute of Environmental Health Sciences. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

Preface

MULTIVARIATE STATISTICS is a mature field with many different methods. Many of these are mathematical. Fortunately, these methods have been programmed so you should be able to run these on your computer without much difficulty.
This book is targeted to a graduate-level practitioner who may need to use these methods but does not necessarily know about the mathematical derivations. For example, we use the sample average of the multivariate distribution to estimate the population mean but do not need to prove optimal properties of such an estimator when sampled from a normal parent population. Readers may want to analyze their data, motivated by discipline-specific questions. They will discover ways to get at some important results without a degree in statistics. Similarly, those well trained in statistics will likely be familiar with many of the univariate topics covered here, but now can learn about new methods.

The reader should have taken at least one course in statistics previously and have some familiarity with such topics as the t-test, degrees of freedom (df), p-values, statistical significance, and the chi-squared test of independence in a 2 × 2 table. He or she should also know the basic rules of probability such as independence and conditional probability. The reader should have some basic computing skills including data editing. It is not necessary to have experience with R or with programming languages, although these are good skills to develop.

We will assume that the reader has a rudimentary acquaintance with the univariate normal distribution. We begin a discussion of multivariate models with an introduction to the bivariate normal distribution in Chap. 6. These are used to make the leap from the scalar notation to the use of vectors and matrices used in Chap. 7 on the multivariate normal distribution. A brief review of linear algebra appears in Chap. 4, including the corresponding computations in R. Other multivariate distributions include models for extremes, described in Sect. 13.3.

We frequently include the necessary software to run the programs in R because we need to be able to perform these methods with real data.
In some cases we need to manipulate the data in order to get it to fit into the proper format. Readers may want to produce some of the graphical displays given in Chap. 3 for their own data. For these readers, the full programs to produce the figures are listed in that chapter.

The field of statistics has developed many useful methods for analyzing data, and many of these methods are already programmed for you and readily available in R. What’s more, R is free, widely available, open source, flexible, and the current fashion in statistical computing. Authors of new statistical methods are regularly contributing to the many libraries in R so many new results are included as well.

As befitting the Springer series in Life Sciences, Medicine & Health, a large portion of the examples given here are health related or biologically oriented. There are also a large number of examples from other disciplines. There are several reasons for this, including the abundance of good examples that are available. Examples from other disciplines do a good job of illustrating the method without a great deal of background knowledge of the data. For example, Chap. 9 on multivariable linear regression methods begins with an example of data for different car models. Because the measurements on the cars are readily understood by the reader with little or no additional explanation, we can concentrate on the statistical methods rather than spending time on the example details. In contrast, the second example, presented in Sect. 9.3, is about a large health survey and requires a longer introduction for the reader to appreciate the data. But at that point in Chap. 9, we are already familiar with the statistical tools and can address issues raised by the survey data.

New Haven, CT, USA
Daniel Zelterman

Acknowledgments

Special thanks are due to Michael Kane and Forrest Crawford, who together taught me enough R to fill a book, and Rob Muirhead, who taught multivariate statistics out of TW Anderson’s text.