Statistical Modeling: the Two Cultures

Total Page:16

File Type:pdf, Size:1020Kb

Statistical Modeling: the Two Cultures Statistical Science 2001, Vol. 16, No. 3, 199–231 Statistical Modeling: The Two Cultures Leo Breiman Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated bya given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical communityhas been committed to the almost exclusive use of data models. This commit- ment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current prob- lems. Algorithmic modeling, both in theoryand practice, has developed rapidlyin fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move awayfrom exclusive dependence on data models and adopt a more diverse set of tools. 1. INTRODUCTION The values of the parameters are estimated from Statistics starts with data. Think of the data as the data and the model then used for information being generated bya black box in which a vector of and/or prediction. Thus the black box is filled in like input variables x (independent variables) go in one this: side, and on the other side the response variables y linear regression come out. Inside the black box, nature functions to y logistic regression x associate the predictor variables with the response Cox model variables, so the picture is like this: Model validation. Yes–no using goodness-of-fit tests and residual examination. y nature x Estimated culture population. 98% of all statisti- cians. There are two goals in analyzing the data: The Algorithmic Modeling Culture Prediction. To be able to predict what the responses are going to be to future input variables; The analysis in this culture considers the inside of Information. To extract some information about the box complex and unknown. Their approach is to how nature is associating the response variables find a function fx—an algorithm that operates on to the input variables. x to predict the responses y. Their black box looks like this: There are two different approaches toward these goals: y unknown x The Data Modeling Culture The analysis in this culture starts with assuming decision trees a stochastic data model for the inside of the black neural nets box. For example, a common data model is that data are generated byindependent draws from Model validation. Measured bypredictive accuracy. Estimated culture population. 2% of statisticians, response variables = f(predictor variables, manyin other fields. random noise, parameters) In this paper I will argue that the focus in the statistical communityon data models has: Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, California 94720- • Led to irrelevant theoryand questionable sci- 4735 (e-mail: [email protected]). entific conclusions; 199 200 L. BREIMAN • Kept statisticians from using more suitable between inputs and outputs than data models. This algorithmic models; is illustrated using two medical data sets and a • Prevented statisticians from working on excit- genetic data set. A glossaryat the end of the paper ing new problems; explains terms that not all statisticians maybe familiar with. I will also review some of the interesting new developments in algorithmic modeling in machine learning and look at applications to three data sets. 3. PROJECTS IN CONSULTING As a consultant I designed and helped supervise 2. ROAD MAP surveys for the Environmental Protection Agency It maybe revealing to understand how I became a (EPA) and the state and federal court systems. Con- member of the small second culture. After a seven- trolled experiments were designed for the EPA, and year stint as an academic probabilist, I resigned and I analyzed traffic data for the U.S. Department of went into full-time free-lance consulting. After thir- Transportation and the California Transportation teen years of consulting I joined the Berkeley Statis- Department. Most of all, I worked on a diverse set tics Department in 1980 and have been there since. of prediction projects. Here are some examples: Myexperiences as a consultant formed myviews Predicting next-dayozone levels. about algorithmic modeling. Section 3 describes two Using mass spectra to identifyhalogen-containing of the projects I worked on. These are given to show compounds. how myviews grew from such problems. Predicting the class of a ship from high altitude When I returned to the universityand began radar returns. reading statistical journals, the research was dis- Using sonar returns to predict the class of a sub- tant from what I had done as a consultant. All marine. articles begin and end with data models. Myobser- Identityof hand-sent Morse Code. vations about published theoretical research in Toxicityof chemicals. statistics are in Section 4. On-line prediction of the cause of a freewaytraffic Data modeling has given the statistics field many breakdown. successes in analyzing data and getting informa- Speech recognition tion about the mechanisms producing the data. But The sources of delayin criminal trials in state court there is also misuse leading to questionable con- systems. clusions about the underlying mechanism. This is reviewed in Section 5. Following that is a discussion To understand the nature of these problems and (Section 6) of how the commitment to data modeling the approaches taken to solve them, I give a fuller has prevented statisticians from entering new sci- description of the first two on the list. entific and commercial fields where the data being 3.1 The Ozone Project gathered is not suitable for analysis by data models. In the past fifteen years, the growth in algorith- In the mid- to late 1960s ozone levels became a mic modeling applications and methodologyhas serious health problem in the Los Angeles Basin. been rapid. It has occurred largelyoutside statis- Three different alert levels were established. At the tics in a new community—often called machine highest, all government workers were directed not learning—that is mostlyyoungcomputer scientists to drive to work, children were kept off playgrounds (Section 7). The advances, particularlyover the last and outdoor exercise was discouraged. five years, have been startling. Three of the most The major source of ozone at that time was auto- important changes in perception to be learned from mobile tailpipe emissions. These rose into the low these advances are described in Sections 8, 9, and atmosphere and were trapped there byan inversion 10, and are associated with the following names: layer. A complex chemical reaction, aided by sun- light, cooked awayand produced ozone two to three Rashomon: the multiplicityof good models; hours after the morning commute hours. The alert Occam: the conflict between simplicityand warnings were issued in the morning, but would be accuracy; more effective if theycould be issued 12 hours in Bellman: dimensionality—curse or blessing? advance. In the mid-1970s, the EPA funded a large Section 11 is titled “Information from a Black effort to see if ozone levels could be accuratelypre- Box” and is important in showing that an algo- dicted 12 hours in advance. rithmic model can produce more and more reliable Commuting patterns in the Los Angeles Basin information about the structure of the relationship are regular, with the total variation in anygiven STATISTICAL MODELING: THE TWO CULTURES 201 daylight hour varying only a few percent from field. The molecules of the compound split and the one weekdayto another. With the total amount of lighter fragments are bent more bythe magnetic emissions about constant, the resulting ozone lev- field than the heavier. Then the fragments hit an els depend on the meteorologyof the preceding absorbing strip, with the position of the fragment on days. A large data base was assembled consist- the strip determined bythe molecular weight of the ing of lower and upper air measurements at U.S. fragment. The intensityof the exposure at that posi- weather stations as far awayas Oregon and Ari- tion measures the frequencyof the fragment. The zona, together with hourlyreadings of surface resultant mass spectra has numbers reflecting fre- temperature, humidity, and wind speed at the quencies of fragments from molecular weight 1 up to dozens of air pollution stations in the Basin and the molecular weight of the original compound. The nearbyareas. peaks correspond to frequent fragments and there Altogether, there were dailyand hourlyreadings are manyzeroes. The available data base consisted of over 450 meteorological variables for a period of of the known chemical structure and mass spectra seven years, with corresponding hourly values of of 30,000 compounds. ozone and other pollutants in the Basin. Let x be The mass spectrum predictor vector x is of vari- the predictor vector of meteorological variables on able dimensionality. Molecular weight in the data the nth day. There are more than 450 variables in base varied from 30 to over 10,000. The variable to x since information several days back is included. be predicted is Let y be the ozone level on the n + 1st day. Then the problem was to construct a function fx such y = 1: contains chlorine, that for anyfuture dayand future predictor vari- y = 2: does not contain chlorine. ables x for that day, fx is an accurate predictor of the next day’s ozone level y. The problem is to construct a function fx that To estimate predictive accuracy, the first five is an accurate predictor of y where x is the mass years of data were used as the training set. The spectrum of the compound. last two years were set aside as a test set.
Recommended publications
  • Statistics Making an Impact
    John Pullinger J. R. Statist. Soc. A (2013) 176, Part 4, pp. 819–839 Statistics making an impact John Pullinger House of Commons Library, London, UK [The address of the President, delivered to The Royal Statistical Society on Wednesday, June 26th, 2013] Summary. Statistics provides a special kind of understanding that enables well-informed deci- sions. As citizens and consumers we are faced with an array of choices. Statistics can help us to choose well. Our statistical brains need to be nurtured: we can all learn and practise some simple rules of statistical thinking. To understand how statistics can play a bigger part in our lives today we can draw inspiration from the founders of the Royal Statistical Society. Although in today’s world the information landscape is confused, there is an opportunity for statistics that is there to be seized.This calls for us to celebrate the discipline of statistics, to show confidence in our profession, to use statistics in the public interest and to champion statistical education. The Royal Statistical Society has a vital role to play. Keywords: Chartered Statistician; Citizenship; Economic growth; Evidence; ‘getstats’; Justice; Open data; Public good; The state; Wise choices 1. Introduction Dictionaries trace the source of the word statistics from the Latin ‘status’, the state, to the Italian ‘statista’, one skilled in statecraft, and on to the German ‘Statistik’, the science dealing with data about the condition of a state or community. The Oxford English Dictionary brings ‘statistics’ into English in 1787. Florence Nightingale held that ‘the thoughts and purpose of the Deity are only to be discovered by the statistical study of natural phenomena:::the application of the results of such study [is] the religious duty of man’ (Pearson, 1924).
    [Show full text]
  • School of Social Sciences Economics Division University of Southampton Southampton SO17 1BJ, UK
    School of Social Sciences Economics Division University of Southampton Southampton SO17 1BJ, UK Discussion Papers in Economics and Econometrics Professor A L Bowley’s Theory of the Representative Method John Aldrich No. 0801 This paper is available on our website http://www.socsci.soton.ac.uk/economics/Research/Discussion_Papers ISSN 0966-4246 Key names: Arthur L. Bowley, F. Y. Edgeworth, , R. A. Fisher, Adolph Jensen, J. M. Keynes, Jerzy Neyman, Karl Pearson, G. U. Yule. Keywords: History of Statistics, Sampling theory, Bayesian inference. Professor A. L. Bowley’s Theory of the Representative Method * John Aldrich Economics Division School of Social Sciences University of Southampton Southampton SO17 1BJ UK e-mail: [email protected] Abstract Arthur. L. Bowley (1869-1957) first advocated the use of surveys–the “representative method”–in 1906 and started to conduct surveys of economic and social conditions in 1912. Bowley’s 1926 memorandum for the International Statistical Institute on the “Measurement of the precision attained in sampling” was the first large-scale theoretical treatment of sample surveys as he conducted them. This paper examines Bowley’s arguments in the context of the statistical inference theory of the time. The great influence on Bowley’s conception of statistical inference was F. Y. Edgeworth but by 1926 R. A. Fisher was on the scene and was attacking Bayesian methods and promoting a replacement of his own. Bowley defended his Bayesian method against Fisher and against Jerzy Neyman when the latter put forward his concept of a confidence interval and applied it to the representative method. * Based on a talk given at the Sample Surveys and Bayesian Statistics Conference, Southampton, August 2008.
    [Show full text]
  • History of Health Statistics
    certain gambling problems [e.g. Pacioli (1494), Car- Health Statistics, History dano (1539), and Forestani (1603)] which were first of solved definitively by Pierre de Fermat (1601–1665) and Blaise Pascal (1623–1662). These efforts gave rise to the mathematical basis of probability theory, The field of statistics in the twentieth century statistical distribution functions (see Sampling Dis- (see Statistics, Overview) encompasses four major tributions), and statistical inference. areas; (i) the theory of probability and mathematical The analysis of uncertainty and errors of mea- statistics; (ii) the analysis of uncertainty and errors surement had its foundations in the field of astron- of measurement (see Measurement Error in omy which, from antiquity until the eighteenth cen- Epidemiologic Studies); (iii) design of experiments tury, was the dominant area for use of numerical (see Experimental Design) and sample surveys; information based on the most precise measure- and (iv) the collection, summarization, display, ments that the technologies of the times permit- and interpretation of observational data (see ted. The fallibility of their observations was evident Observational Study). These four areas are clearly to early astronomers, who took the “best” obser- interrelated and have evolved interactively over the vation when several were taken, the “best” being centuries. The first two areas are well covered assessed by such criteria as the quality of obser- in many histories of mathematical statistics while vational conditions, the fame of the observer, etc. the third area, being essentially a twentieth- But, gradually an appreciation for averaging observa- century development, has not yet been adequately tions developed and various techniques for fitting the summarized.
    [Show full text]
  • Milton Friedman Papers 1931-2006
    http://oac.cdlib.org/findaid/ark:/13030/tf7t1nb2hx Online items available Register of the Milton Friedman Papers 1931-2006 Processed by Linda Bernard, Dana M. Harris, and Elizabeth Konzak Hoover Institution Archives Stanford University Stanford, California 94305-6010 Phone: (650) 723-3563 Fax: (650) 725-3445 Email: [email protected] URL: http://www.hoover.org/library-and-archives © 1997, 2008 Hoover Institution Archives. All rights reserved. Register of the Milton Friedman 77011 1 Papers 1931-2006 Register of the Milton Friedman Papers 1931-1991 Hoover Institution Archives Stanford University Stanford, California Contact Information Hoover Institution Archives Stanford University Stanford, California 94305-6010 Phone: (650) 723-3563 Fax: (650) 725-3445 Email: [email protected] URL: http://www.hoover.org/library-and-archives/ Processed by: Linda Bernard, Dana M. Harris, and Elizabeth Konzak Date Completed: 1992, revised 2008 Encoded by: Brooke Dykman Dockter and Elizabeth Konzak © 1997, 2008 Hoover Institution Archives. All rights reserved. Descriptive Summary Title: Milton Friedman papers Date (inclusive): 1931-2006 Collection number: 77011 Creator: Friedman, Milton, 1912-2006 Creator: Friedman, Rose D. Physical Description: 228 manuscript boxes, 2 oversize boxes, 4 card file boxes, 1 slide box, 1 envelope(94 linear feet) Repository: Hoover Institution Archives Stanford, California 94305-6010 Abstract: Speeches and writings, correspondence, notes, statistics, printed matter, sound recordings, videotapes, and photographs, relating to economic theory, economic conditions in the United States, and governmental economic policy. Digitized copies of many of the sound and video recordings in this collection, as well as some of Friedman's writings, are available at http://hoohila.stanford.edu/friedman/ .
    [Show full text]
  • Professor A. L. Bowley's Theory of the Representative Method *
    Key names: Arthur L. Bowley, F. Y. Edgeworth, , R. A. Fisher, Adolph Jensen, J. M. Keynes, Jerzy Neyman, Karl Pearson, G. U. Yule. Keywords: History of Statistics, Sampling theory, Bayesian inference. Professor A. L. Bowley’s Theory of the Representative Method * John Aldrich Economics Division School of Social Sciences University of Southampton Southampton SO17 1BJ UK e-mail: [email protected] Abstract Arthur. L. Bowley (1869-1957) first advocated the use of surveys–the “representative method”–in 1906 and started to conduct surveys of economic and social conditions in 1912. Bowley’s 1926 memorandum for the International Statistical Institute on the “Measurement of the precision attained in sampling” was the first large-scale theoretical treatment of sample surveys as he conducted them. This paper examines Bowley’s arguments in the context of the statistical inference theory of the time. The great influence on Bowley’s conception of statistical inference was F. Y. Edgeworth but by 1926 R. A. Fisher was on the scene and was attacking Bayesian methods and promoting a replacement of his own. Bowley defended his Bayesian method against Fisher and against Jerzy Neyman when the latter put forward his concept of a confidence interval and applied it to the representative method. * Based on a talk given at the Sample Surveys and Bayesian Statistics Conference, Southampton, August 2008. I am grateful to Andrew Dale for his comments on an earlier draft. December 2008 1 Introduction “I think that if practical statistics has acquired something valuable in the represen- tative method, this is primarily due to Professor A.
    [Show full text]
  • David Donoho's Gauss Award
    Volume 47 • Issue 7 IMS Bulletin October/November 2018 David Donoho’s Gauss Award David Donoho is the recipient of the 2018 Carl Friedrich Gauss Prize, the major prize CONTENTS in applied mathematics awarded jointly by the International Mathematical Union 1 Donoho receives Gauss (IMU) and the German Mathematical Union. Award at ICM Bestowed every four years since 2006, the prize 2–3 Members’ news: COPSS honors scientists whose mathematical research Awards: Richard Samworth, has generated important applications beyond the Bin Yu, Susan Murphy; CR Rao, mathematical field—in technology, in business, Herman Chernoff, DJ Finney or in people’s everyday lives—and this award 4 Interview: Richard Samworth acknowledges David’s impact on a whole genera- tion of mathematical scientists. 5 COPSS Award nominations David Donoho was commended by the IMU David Donoho. Photo: IMU 6 How to nominate a Fellow President Shigefumi Mori for his “fundamental 7 Other IMS Award contribution to mathematics” during the opening ceremony of ICM 2018 in Rio de nominations Janeiro, Brazil. After the award was announced, David spoke of the joy he has experienced when 9 More nominations: Parzen Prize, Newbold Prize theories he has developed earlier in his career are applied to everyday life. “There are things I’ve done decades ago, and when I see things happen in the real world, it makes 10 Report on WNAR/IMS me so proud. The power we have in moving the world gives me a great deal of satisfac- Meeting tion in my career choice.” 11 Student Puzzle Corner He said that a career in math is not limited to pure math, and publication in 12 Recent papers: Annals of journals.
    [Show full text]
  • 45 June 2008 Editor: David W
    International Society for Clinical News Biostatistics Number 45 June 2008 Editor: David W. Warne Executive Committee 2007-08 Editorial Officers It’s not long until 2008’s conference in Copenhagen, Denmark, and this issue contains lots of information about President: Emmanuel Lesaffre (B) the conference from Philip Hougaard and Bjarne Nielsen. Vice-President: Norbert Victor (D) Next year’s conference will take place in Prague in Secretary: Harbajan Chadha-Boreham (CH) the Czech Republic and there’s an update from Zdenek Valenta. Treasurer: Koos Zwinderman (NL) John Whitehead has prepared a list of positions which will need to be filled for 2009-10, so please do consider Members volunteering to serve on the ExCom. The deadline is the end News Editor: David W. Warne (CH) of August. As announced at last year’s AGM and in the Webmaster: Bjarne Nielsen (DK) December News, we are planning to revise the Constitution: the proposed changes were placed on the Society’s website for 2007-08: Adriano Decarli (I) consultation and review from January-February 2008 and KyungMann Kim (USA) were generally considered to be a good idea. They will be put Rumana Omar (UK) to the membership for approval later this year at the same time as the postal vote for the ExCom. Catherine Quantin (F) Thanks to the contributors to this News: Emmanuel Jenö Reiczigel (H) Lesaffre, Harbajan Chadha-Boreham, Julia Singer, KyungMann Kim, Mike Campbell and Rumana Omar, Koos Marie Reilly (S) Zwinderman, John Whitehead, Sylvain Larroque, Bjarne Martin Schumacher (D) Nielsen, Philip Hougaard, Andrew Dunning, Ewa Kawalec, Vana Sypsa (GR) Zdenek Valenta and the 6 book reviewers.
    [Show full text]
  • 84483 Stats News May.Indd
    statistical society of australia incorporated newsletter May 2005 Number 111 Registered by Australia Post Publication No. NBH3503 issn 0314-6820 ‘Truth, damned truth and statistics’ It’s not every day that we can see the head of Australia’s official statistical agency in action under the scrutiny of Australia’s political press gallery. Dennis confidently faced the blowtorch for the Australian Bureau of Statistics – and the profession – in March. His talk kicks off ABS’s centenary celebrations and gave him an opportunity to air several issues. He argued that ABS has a strong and key role in a democracy by providing impartial, relevant and comprehensive information to inform public debate and decision making. As he said, ‘I am always gratified to see public debate that uses ABS statistics without qualification or question. For the fact is, the public and Australia can have strong faith in their official statistics. The same cannot be said for many other countries where pressure and influence can impact on what is collected, how it is collected and how it is released.’ Later, he said, ‘Because of this trust [in ABS’s data], discussions can focus on what the statistics mean for policy rather than on the integrity of the statistics themselves’. The reliability of ABS statistics was raised in the questions that followed his talk because the Reserve Bank had just raised interest rates for the first time after a long period, and in doing so publicly doubted the ABS statistics that the economy was already slowing. Dennis Dennis Trewin, the Australian Statistician, addresses the National Press Club displayed a keen grasp of the issues, In this issue Editorial ...........................................4 ANZJS Discussion Paper ................7 ACSPRI Winter Program ..............10 Statistics and Schooling ................
    [Show full text]
  • AP® Statistics
    About AP Statistics AP® Statistics Teacher’s Guide Michael Legacy Greenhill School Addison, Texas ™® connect to college success www.collegeboard.com i 72770-00003 • AP Statistics TG • INDC2 • Fonts: Helvetica, Minion, Serifa, Univers, Universal Greek & Math, Wingdings • D1 5/6/08 RI60953 • D1-1 5/30/08 RI60953 • D1-2 6/17/08 RI60953 • D1-3 7/10/08 RI60953 • D2 8/22/08 RI60953 • D3 9/19/08 RI60953 • D5 10/1/08 RI60953 • D6 10/7/08 RI60953 The College Board: Connecting Students to College Success The College Board is a not-for-profit membership association whose mission is to connect students to college success and opportunity. Founded in 1900, the association is composed of more than 5,400 schools, colleges, universities, and other educational organizations. Each year, the College Board serves seven million students and their parents, 23,000 high schools, and 3,500 colleges through major programs and services in college admissions, guidance, assessment, financial aid, enrollment, and teaching and learning. Among its best-known programs are the SAT®, the PSAT/NMSQT®, and the Advanced Placement Program® (AP®). The College Board is committed to the principles of excellence and equity, and that commitment is embodied in all of its programs, services, activities, and concerns. For further information visit www.collegeboard.com. © 2008 The College Board. All rights reserved. College Board, Advanced Placement Program, AP, AP Central, AP Vertical Teams, connect to college success, Pre-AP, SAT, and the acorn logo are registered trademarks of the College Board. AP Potential is a trademark owned by the College Board.
    [Show full text]
  • Model Selection and Model Averaging
    P1: SFK/UKS P2: SFK/UKS QC: SFK/UKS T1: SFK CUUK244-Claeskens 978-0-521-85225-8 January 13, 2008 8:41 Model Selection and Model Averaging Given a data set, you can fit thousands of models at the push of a button, but how do you choose the best? With so many candidate models, overfitting is a real danger. Is the monkey who typed Hamlet actually a good writer? Choosing a suitable model is central to all statistical work with data. Selecting the variables for use in a regression model is one important example. The past two decades have seen rapid advances both in our ability to fit models and in the theoretical understanding of model selection needed to harness this ability, yet this book is the first to provide a synthesis of research from this active field, and it contains much material previously difficult or impossible to find. In addition, it gives practical advice to the researcher confronted with conflicting results. Model choice criteria are explained, discussed and compared, including Akaike’s information criterion AIC, the Bayesian information criterion BIC and the focused information criterion FIC. Importantly, the uncertainties involved with model selec- tion are addressed, with discussions of frequentist and Bayesian methods. Finally, model averaging schemes, which combine the strength of several candidate models, are presented. Worked examples on real data are complemented by derivations that provide deeper insight into the methodology. Exercises, both theoretical and data-based, guide the reader to familiarity with the methods. All data analyses are compati- ble with open-source R software, and data sets and R code are available from a companion website.
    [Show full text]
  • Measuring the Human Gut Microbiome: New Tools and Non Alcoholic Fatty Liver Disease
    Western University Scholarship@Western Electronic Thesis and Dissertation Repository 6-13-2016 12:00 AM Measuring the human gut microbiome: new tools and non alcoholic fatty liver disease Ruth G. Wong The University of Western Ontario Supervisor Dr. Gregory B. Gloor The University of Western Ontario Graduate Program in Biochemistry A thesis submitted in partial fulfillment of the equirr ements for the degree in Master of Science © Ruth G. Wong 2016 Follow this and additional works at: https://ir.lib.uwo.ca/etd Part of the Bioinformatics Commons Recommended Citation Wong, Ruth G., "Measuring the human gut microbiome: new tools and non alcoholic fatty liver disease" (2016). Electronic Thesis and Dissertation Repository. 3839. https://ir.lib.uwo.ca/etd/3839 This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected]. Abstract With the advent of next generation DNA and RNA sequencing, scientists can obtain a more comprehensive snapshot of the bacterial communities on the human body (known as the ‘human microbiome’), leading to information about the bacterial composition, what genes are present, and what proteins are produced. The scientific community is in a phase of developing the experiments and accompanying statistical techniques to investigate the mechanisms by which the human microbiome affects health and disease. In this thesis, I explore alternatives to the standard weighted and unweighted UniFrac distance metric that measure the difference between microbiome samples. These alternative weightings allow for the extraction of subtle differences between samples and identification of outliers not visible with traditional methods.
    [Show full text]
  • The University of North Carolina System Author(S): E
    Statistical Training and Research: The University of North Carolina System Author(s): E. Shepley Nourse, Bernard G. Greenberg, Gertrude M. Cox, David D. Mason, James E. Grizzle, Norman L. Johnson, Lyle V. Jones, John Monroe, Gordon D. Simons, Jr. Source: International Statistical Review / Revue Internationale de Statistique, Vol. 46, No. 2 (Aug., 1978), pp. 171-207 Published by: International Statistical Institute (ISI) Stable URL: http://www.jstor.org/stable/1402812 Accessed: 01/06/2010 14:44 Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/action/showPublisher?publisherCode=isi. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. International Statistical Institute (ISI) is collaborating with JSTOR to digitize, preserve and extend access to International Statistical Review / Revue Internationale de Statistique.
    [Show full text]