Statistics: The Compass for Navigating a Data-Centric World

Marie Davidian

Department of Statistics, North Carolina State University

January 11, 2013

Statistics2013 Video: available at http://statistics2013.org

Triumph of the geeks

Nate Silver predicted the outcome of the 2012 US presidential election in all 50 states using . . . Statistics

http://fivethirtyeight.blogs.nytimes.com/

Silver used a statistical model to combine the results of state-by-state polls, weighting them according to their previous accuracy, and to simulate many elections and estimate probabilities of the outcome
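To make the idea concrete, here is a minimal sketch of poll aggregation plus election simulation; the states, electoral votes, poll margins, and pollster weights are invented for illustration, and this is not Silver's actual model:

```python
import random

random.seed(2012)

# Hypothetical illustration of poll aggregation + election simulation.
# States, electoral votes, poll margins, and accuracy weights are invented.
states = {
    # state: (electoral votes, [(poll margin for candidate A, pollster weight), ...])
    "StateX": (29, [(0.02, 1.0), (0.04, 0.5)]),
    "StateY": (18, [(-0.01, 0.8), (0.01, 1.2)]),
    "StateZ": (6,  [(-0.05, 1.0)]),
}

def weighted_margin(polls):
    """Combine a state's polls into one margin, weighting by past accuracy."""
    total_w = sum(w for _, w in polls)
    return sum(m * w for m, w in polls) / total_w

def simulate_election(sd=0.03):
    """One simulated election: add random polling error to each state's margin."""
    ev = 0
    for votes, polls in states.values():
        if weighted_margin(polls) + random.gauss(0, sd) > 0:
            ev += votes
    return ev

n_sim = 10_000
total_ev = sum(v for v, _ in states.values())
wins = sum(simulate_election() > total_ev / 2 for _ in range(n_sim))
print(f"Estimated P(candidate A wins) = {wins / n_sim:.3f}")
```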

Others did, too. . .

“Dynamic Bayesian forecasting of presidential elections in the states,” by Drew A. Linzer, Journal of the American Statistical Association, in press

Triumph of the geeks

• “Nate Silver-led statistics men crush pundits in election” – Bloomberg Businessweek
• “Nate Silver has made statistics sexy again” – Associated Press
• “Drew Linzer: The stats man who predicted Obama’s win” – BBC News Magazine
• “The allure of the statistics field grows” – Boston Globe

But the interest in statistics didn’t start with the US elections. . .

Statistics in the news

• New York Times, August 6, 2009: “I keep saying that the sexy job in the next 10 years will be statisticians” – Hal Varian, Chief Economist, Google
• New York Times, January 26, 2012: “I went to parties and heard a little groan when people heard what I did. Now they’re all excited to meet me” – Rob Tibshirani, Department of Statistics, Stanford University
• New York Times, February 11, 2012: “Statistics are interesting and fun. It’s cool now” – Andrew Gelman, Department of Statistics, Columbia University
• The Wall Street Journal, December 28, 2012: Carl Bialik, The Numbers Guy

Data, data, and more data

Why is there so much talk of statistics and statisticians? Data:

• Administrative (e.g., tax records), government surveys
• Genomic, meteorological, air quality, seismic, . . .
• Electronic medical records, health care databases
• Credit card transactions, point-of-sale, mobile phone
• Online search, social networks
• Polls, voter registration records

A veritable tsunami/deluge/avalanche of data

Demand

2011 McKinsey Global Institute report: Big data: The next frontier for innovation, competition, and productivity

“A significant constraint. . . will be a shortage of . . . people with deep expertise in statistics and data mining. . . a talent gap of 140K - 190K positions in 2018 (in the US)”

http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

Opportunities and challenges

• Our ability to collect, store, access, and manipulate vast and complex data is ever-improving
• The potential benefits to science and society of learning from these data are enormous
• However, Big Data does not automatically mean Big Information
• Science, decision-making, and policy formulation require not only prediction and finding associations and patterns, but uncovering causal relationships
• Which, as we’ll discuss later, is not so easy. . .

Perils

From “The Age of Big Data”: With huge data sets and fine-grained measurement, . . . there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.”

Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions. It offers a high-tech twist on an old trick: I know the facts, now let’s find ’em. That is, says Rebecca Goldin, a mathematician at George Mason University, “one of the most pernicious uses of data.”
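A quick simulation makes Hastie's point vivid. Under an assumed setup where no feature is truly related to the outcome, naive testing of thousands of candidate features at the 5% level still flags hundreds of "needles":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)

# Purely random data: no feature is truly related to the outcome
n, p = 200, 5000
features = rng.normal(size=(n, p))
outcome = rng.normal(size=n)

# Test each feature's correlation with the outcome at the 5% level
pvals = np.array([stats.pearsonr(features[:, j], outcome)[1] for j in range(p)])
print(f"'Significant' features at p < 0.05: {(pvals < 0.05).sum()} of {p}")
# Expect ~250 false discoveries: straw that looks like needles
```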

Statistics

While Big Data have inspired considerable current interest in statistics, statistics has been fundamental in numerous areas of science, business, and government for decades

Critical need

Sound, objective methods for modeling, analysis, and interpretation ⇒ Statistics

Roadmap

• A brief history
• Statistical stories
• Our data-rich future

What is statistics?

Statistics: The science of learning from data and of measuring, controlling, and communicating uncertainty

The path to what is now the formal discipline of statistical science is long and winding. . .

Origins – pre-1700

• Sporadic accounts of measurement and data collection and interpretation date back as early as 5 B.C.
• But it was not until the mid-1600s that the mathematical notions of probability began to be developed by (mainly) mathematicians and physicists (e.g., Blaise Pascal), often inspired by games of chance
• The first formal attempt to summarize and learn from data was by John Graunt, who created a precursor to modern life tables used in demography
• Christiaan Huygens was among the first to connect such data analysis to probability

Origins – 1700-1750

From 1700 to 1750, many key results in classical probability that underlie statistical theory were derived:
• Jakob Bernoulli – law of large numbers, the Bernoulli and binomial probability distributions
• Abraham de Moivre – The Doctrine of Chances, precursor to the central limit theorem
• Daniel Bernoulli – expected utility, applications of probability to measurement problems in astronomy

Milestone events – 1750-1820

• Thomas Bayes’ 1763 An essay towards solving a problem in the Doctrine of Chances presented a special case of Bayes’ theorem (posthumously)
• Adrien-Marie Legendre described the method of least squares in 1805
• Carl Friedrich Gauss connected least squares to Bayes’ theorem in 1809
• Pierre-Simon Laplace derived the central limit theorem and connected the normal probability distribution to least squares in 1810

More milestones – 1820-1900

• Adolphe Quetelet pioneered the statistical analysis of social science data – the “average man” (1835) and the normal distribution as a model for measurements (1842)
• The Royal Statistical Society (1834) and American Statistical Association (1839) were founded
• Francis Galton introduced regression analysis (1885) and correlation (1888)
• Karl Pearson established the field of biometry and developed fundamental methods, and founded the first statistical journal, Biometrika (1901)

Modern statistics – 1900-1950s

The modern discipline of statistics was really established only in the twentieth century
• William Gosset (“Student”), a brewer for Guinness in Dublin, derived the Student’s t distribution in 1908
• In the 1920s, Ronald Fisher developed many fundamental concepts, including the ideas of statistical models and randomization, theory of experimental design, the method of analysis of variance, and tests of significance
• In the 1930s, Jerzy Neyman and Egon Pearson developed the theory of sampling, the competing approach of hypothesis testing, and the concept of confidence intervals
• Experimental design became a mainstay of agricultural research
• Fisher/Neyman-Pearson established the paradigm of frequentist statistical inference that is used today
• Also in the 1930s, Bayesian statistical inference was developed by Bruno de Finetti and others
• In the 1940s, many departments of statistics were established at universities in the US and Europe
• And fundamental theory of statistical inference was pursued by Wald, Cramér, Rao, and many others

Modern statistics to the present

From the 1950s on, there were numerous advances in theory, methods, and application
• The advent of medical statistics and epidemiological methods
• The development of methods for analysis of censored time-to-event data (Paul Meier, D.R. Cox)
• The use of the theory of sampling to design surveys and the US census (Jerzy Neyman, Morris Hansen)
• The adoption of statistical quality control and experimental design in industry (W. Edwards Deming, George Box)
• Exploratory data analysis (John Tukey)
• And many, many more. . .

Computing fundamentally altered the field of statistics forever
• Complex calculations became feasible
• Much larger and more complicated data sets could be created and analyzed
• Sophisticated models and methods could be applied
• Statistical software implementing popular methods became widespread (e.g., SAS, developed at NC State in the 1960s/70s)
• Simulation to investigate performance of statistical methods became possible
• Bayesian statistical methods became feasible in complex settings (Markov chain Monte Carlo – MCMC); a toy sketch follows below
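As a toy sketch of the MCMC idea in that last bullet, here is a minimal Metropolis sampler for the posterior of a normal mean; the observations and prior are invented for illustration:

```python
import random, math

def log_post(theta):
    """Unnormalized log-posterior: N(0,1) prior times N(theta,1) likelihood
    for a few hypothetical observations."""
    data = [0.8, 1.3, 0.2, 1.1]  # invented observations
    return -0.5 * theta**2 - 0.5 * sum((y - theta) ** 2 for y in data)

def metropolis(n_draws=10_000, step=0.5):
    """Random-walk Metropolis: propose, then accept with prob min(1, ratio)."""
    theta, draws = 0.0, []
    for _ in range(n_draws):
        proposal = theta + random.gauss(0, step)
        if math.log(random.random()) < log_post(proposal) - log_post(theta):
            theta = proposal
        draws.append(theta)
    return draws

draws = metropolis()
print(f"Posterior mean ~ {sum(draws) / len(draws):.2f}")  # exact answer here is 0.68
```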

Today

Statistical methods are used routinely in science, industry/business, and government
• Pharmaceutical companies employ statisticians, who work in all stages of drug development
• Statisticians are ubiquitous in medical and public health research, working with health sciences researchers to design studies, analyze data, and draw conclusions
• Google, Facebook, LinkedIn, credit card companies, global retailers employ statisticians to develop and implement methods to mine their vast data
• Government science, regulatory, and statistical agencies employ statisticians to design surveys, make forecasts, develop estimates of income, review new drug applications, assess evidence of health effects of pollutants, . . .

Statistical stories

Some diverse examples where statistics and statisticians are essential. . .

The controlled clinical trial

The gold standard study for comparison of treatments (a question of cause and effect)
• An experiment designed to compare a new treatment to a control treatment
• Subjects are randomized to receive one treatment or the other ⇒ unbiased, fair comparison using statistical methods (hypothesis testing)
• In addition, blinding, placebo
• The first such clinical trial was conducted in the UK by the Medical Research Council in 1948, comparing streptomycin+bed rest to bed rest alone in tuberculosis
• In 1954, 800K children in the US were randomized to the Salk polio vaccine or placebo to assess the vaccine’s effectiveness in preventing paralytic polio

The controlled clinical trial

• In 1969, evidence from a randomized clinical trial became mandatory for a new product to receive approval from the US Food and Drug Administration (FDA)
• Because a trial involves only a sample of patients from the entire population, the results are subject to uncertainty
• Statistical methods are critical for determining the sample size required to ensure that a real difference can be detected with a specified degree of confidence (a sketch follows below)
• Which is why regulatory bodies like the FDA employ 100s of statisticians
• In the last 4 decades, statisticians have developed new methods to handle ethical and practical considerations
• E.g., group sequential trials that allow interim analyses at which the trial can be stopped early without compromising the ability to make a valid comparison
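As a rough illustration of that sample-size calculation, here is the textbook normal-approximation formula for comparing two means; the effect size and SD below are hypothetical, not from any particular trial:

```python
from scipy.stats import norm

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-arm comparison of means
    (two-sided level-alpha z-test, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for type I error
    z_beta = norm.ppf(power)           # quantile for desired power
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Hypothetical numbers: detect a 5-unit mean difference, SD 15
print(round(n_per_arm(delta=5, sigma=15)))  # about 141 patients per arm
```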

National forest inventory

Next stop, Bhutan
• The Kingdom of Bhutan, in South Asia, transitioned to a constitutional democracy in 2008
• The new constitution mandates that Bhutan maintain 60% forest cover in perpetuity
• A National Forest Inventory was called for. . .
• My friend Tim Gregoire of Yale University, an expert in forest biometry, was consulted to help plan and implement Bhutan’s comprehensive NFI

National forest inventory

An NFI is an assessment based on statistical sampling and estimation of the forest resources of a nation:
• Set policy on forest resource management
• Monitor biodiversity, habitat type and extent, land conversion rates
• Measure quantity/quality of wood fiber for commodities
• Measure non-wood forest products
• Measure carbon storage and change
• Reference spatially where resources are located

Statistics is critical to developing the sampling plan for both remote sensing and field data and to estimation of abundance of resources based on 100s of measurements
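A toy sketch of the estimation step, assuming a simple random sample of field plots with invented biomass values; a real NFI design is far more elaborate (stratification, remote sensing, model-assisted estimation):

```python
import math
import random

random.seed(1)

# Hypothetical population: biomass (tonnes) on 10,000 forest plots
population = [random.lognormvariate(3, 0.5) for _ in range(10_000)]

# Field crews can only visit a simple random sample of plots
sample = random.sample(population, 200)
n = len(sample)

mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
se = math.sqrt(var / n)  # SE of the sample mean (ignoring finite-population correction)

print(f"Estimated mean biomass per plot: {mean:.1f} t "
      f"(95% CI ~ {mean - 1.96 * se:.1f} to {mean + 1.96 * se:.1f})")
```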

Pharmacokinetics

What’s behind a drug label?
• A drug should be safe and effective
• Labeling provides guidance on dose, conditions under which a drug should/should not be taken
• Partly behind this – pharmacokinetics (PK), the science of “what the body does to the drug”
• Key: Understanding Absorption, Distribution, Metabolism, Excretion in the population and how these processes vary across patients and are altered by conditions
• Statistical modeling is an integral part of the science

Pharmacokinetics

A hierarchical statistical model that allows these processes to vary across patients and conditions is fitted to drug concentration-time data, such as the one-compartment model

$$\mathrm{Conc}(t) = \frac{k_a\,\mathrm{Dose}}{V\,(k_a - Cl/V)}\left[\exp\{-(Cl/V)\,t\} - \exp(-k_a t)\right]$$

where $k_a$ = absorption rate, $V$ = volume of distribution, $Cl$ = clearance
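A minimal sketch of fitting the one-compartment model above to one subject's concentration-time data by nonlinear least squares; the data and "true" parameter values are simulated and invented, and the full population (hierarchical) analysis is considerably more involved:

```python
import numpy as np
from scipy.optimize import curve_fit

def conc(t, ka, V, Cl, dose=100.0):
    """One-compartment model with first-order absorption (single oral dose)."""
    ke = Cl / V  # elimination rate
    return (ka * dose) / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Simulated concentration-time data for one subject (invented true values)
rng = np.random.default_rng(0)
t = np.array([0.25, 0.5, 1, 2, 4, 6, 8, 12, 24])
y = conc(t, ka=1.5, V=30.0, Cl=5.0) * np.exp(rng.normal(0, 0.1, t.size))

# Nonlinear least squares fit of (ka, V, Cl); dose is held at its default
(ka_hat, V_hat, Cl_hat), _ = curve_fit(conc, t, y, p0=[1.0, 20.0, 3.0])
print(f"ka={ka_hat:.2f}/h, V={V_hat:.1f} L, Cl={Cl_hat:.2f} L/h")
```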

Forensic science

An area where statisticians and better statistics are desperately needed!
• Fingerprints, DNA analysis, bite marks, firearm toolmarks, hair specimens, writing samples, toxicological analysis, . . .
• Laboratory- or expert interpretation-based
• 2009 US National Academy of Sciences report
• The report cites examples of lack of sufficient recognition of sources of variability and their effects on uncertainties in many types of forensic science analyses. . .

Forensic science

“With the exception of nuclear DNA analysis, however, no forensic method has been rigorously shown to have the capacity to consistently, and with a high degree of certainty, demonstrate a connection between evidence and a specific individual or source.”

“A body of research is required to establish the limits and measures of performance and to address the impact of sources of variability and potential bias.”

“The development of quantifiable measures of uncertainty in the conclusions of forensic analyses . . . and of quantifiable measures of the reliability and accuracy of forensic analyses (are needed).”

Basically, the report recommends that current and new forensic practices should be developed and assessed using properly designed experiments and statistical methods!

The hazards of haphazard data

When data are simply observed and collected, without a principled design and randomization, be wary!
• Investigations of causal relationships can be compromised by confounding
• E.g., comparison of the effects of competing treatments
• When individual patients and their providers decide which treatment to take, there may be factors that are associated with both the choice of treatment and outcome
• Failure to recognize/identify such confounding factors can lead to misleading conclusions

Simpson’s paradox

Data on 2 treatments from a healthcare database

[Figure: average outcome for Trt A and Trt B, shown overall and separately for males and females; Trt A patients are 80% male / 20% female, Trt B patients are 20% male / 80% female, so the pooled comparison and the within-sex comparisons point in opposite directions]
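A small numeric sketch of the reversal in the figure; the success rates and the 80/20 and 20/80 sex mixes below are hypothetical:

```python
# Hypothetical counts illustrating Simpson's paradox
# (Trt A is 80% male / 20% female; Trt B is 20% male / 80% female)
data = {
    # (treatment, sex): (successes, patients)
    ("A", "M"): (24, 80), ("A", "F"): (16, 20),   # A: 30% in males, 80% in females
    ("B", "M"): (4, 20),  ("B", "F"): (56, 80),   # B: 20% in males, 70% in females
}

def rate(trt, sexes):
    """Success rate for one treatment, pooled over the given sexes."""
    s = sum(data[(trt, x)][0] for x in sexes)
    n = sum(data[(trt, x)][1] for x in sexes)
    return s / n

for sex in ("M", "F"):
    print(f"{sex}: A={rate('A', [sex]):.0%} vs B={rate('B', [sex]):.0%}")  # A wins in each sex
print(f"Pooled: A={rate('A', 'MF'):.0%} vs B={rate('B', 'MF'):.0%}")      # yet B wins overall
```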

Confounding and other threats

• Statistical methods are available to take confounding into appropriate account
• . . . but the confounding factors must be recorded in the database!

Other threats:
• Missing information – why are some factors not recorded for some individuals?
• Drop out – sicker patients may disappear sooner in a longitudinal study
• Etc.

Comparative effectiveness research, which strives to recommend best uses for existing treatments through analyses of such databases, requires statistics!

Confronting our data-rich future

I hope I have convinced you that statistics and statisticians are essential to our data-rich future!

Big Data have enormous potential for generating new knowledge and improving human welfare. However, Big Data without statistics have enormous potential to mislead.

“The future demands that scientists, policy-makers, and the public be able to interpret increasingly complex information and recognize both the benefits and pitfalls of statistical analysis. Embedding statistics in science and society will pave the route to a data-informed future, and statisticians must lead this charge.” – Davidian and Louis, Science, April 6, 2012

2013 – the International Year of Statistics

A celebration of the contributions of statistics is long overdue!

http://statistics2013.org

References and further reading

Aldrich, J. Figures from the history of probability and statistics. http://www.economics.soton.ac.uk/staff/aldrich/Figures.htm

Davidian, M. and Louis, T. A. (2012). Why statistics? Science, 336, 12.

Fienberg, S. E. (1992). A brief history of statistics in three and one-half chapters: A review essay. Statistical Science, 7, 208–225.

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press.