A Light Introduction to Statistics
Primer Hypothesis Prior to testing Tests Corr & regress
Pavol Jancura
ESCMID Workshop, 8-10-2012
© by author. ESCMID Online Lecture Library

Sources & references

Articles and books
P Driscoll et al. An introduction to statistics. J Accid Emerg Med 2000;17:4-6 and 18:1-4. (a)
H Motulsky. Intuitive Biostatistics, 2010 (2nd edition).
R A Donnelly. The Complete Idiot's Guide to Statistics, 2007 (2nd edition). (b)
B Illowsky, S Dean. Collaborative Statistics, freely available. (c)
(a) First chapter at http://emj.bmj.com/content/17/3/205.2.full
(b) Errata at http://www.stat-guide.com
(c) At http://cnx.org/content/col10522/latest/ or http://cnx.org/content/col10522/1.40/pdf

Web
http://en.wikipedia.org/wiki/Portal:Statistics
http://en.wikibooks.org/wiki/Statistics
http://en.wikiversity.org/wiki/Statistics
http://www.graphpad.com/guides/prism/6/statistics/
http://www.sjsu.edu/faculty/gerstman/StatPrimer/
http://www.statsoft.com/textbook/
http://stattrek.com/tutorials/statistics-tutorial.aspx

Outline
1 Primer
2 Hypothesis
    Research hypothesis
    Statistical hypothesis
    Hypothesis test
3 Prior to testing
    Student's t-distribution
    95% CI
    Test validity
4 Comparative tests
    One sample
    Two samples
5 Correlation and regression
    Correlation
    Linear regression
    Non-linear regression

Primer

Data collection
When we design an experiment:
the population represents all possible subjects (individuals) relevant to the experiment (e.g. people).
a sample represents a set of subjects selected from the entire population (e.g. a group of people).
data are the collection of measurements (values) taken from/on the sample.
a variable is one type of measurement taken for all or a subset of the subjects within the sample (age, gender, ...).
an observation is the set of measurements (values of some variables) for a single subject of our sample; e.g. when you visit a physician, the physician makes an observation on you (temperature, heart rate, blood pressure, ...).
an outlier is an (outlying) observation that is numerically distant (deviates markedly) from the rest of the data.

[Figure: Outlier]

Sampling
We have a collection of measurements (data) on a set of subjects (sample) selected from all possible subjects (population).
Sampling is the process of selecting subjects from a population for investigation.
A proper sample is critical to the accuracy of the statistical analysis; it should be representative of the population from which it was taken.
Sampling bias: a sample is collected in such a way that some members of the intended population are less likely to be included than others, which yields a biased sample.
Sampling error: the sample measurement differs from the population measurement. It results from selecting a sample (biased or not) that is not a perfect match to the entire population.

Displaying data I
Scatter plot: a type of mathematical diagram using Cartesian coordinates to display values of two variables for a set of data.
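
The difference between sampling error and sampling bias described under Sampling can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (10,000 normally distributed body heights), the sample size of 100, and all variable names are invented for illustration and do not come from the lecture.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: body heights in cm for N = 10,000 people.
population = [rng.gauss(170, 10) for _ in range(10_000)]
population_mean = mean(population)

# Simple random sample: every subject is equally likely to be selected.
random_sample = rng.sample(population, 100)

# Biased sample: subjects from the shorter half can never be selected.
taller_half = sorted(population)[len(population) // 2:]
biased_sample = rng.sample(taller_half, 100)

# Sampling error: sample measurement minus population measurement.
random_error = mean(random_sample) - population_mean
biased_error = mean(biased_sample) - population_mean

print(f"population mean:     {population_mean:.1f}")
print(f"random sample error: {random_error:+.1f}")
print(f"biased sample error: {biased_error:+.1f}")
```

Both samples show some sampling error, but only the biased sample is systematically off in one direction, because part of the population had no chance of being selected.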
Histogram: a graphical representation showing a visual impression of the distribution of data; an estimate of the probability distribution of a continuous variable.

Distribution
Probability distribution
Probability density function (PDF)
Cumulative distribution function (CDF)
A probability distribution assigns a probability to each of the possible outcomes of a random experiment (on your sample or population).

Summarizing data
How can we summarize our data in a few numbers?
Summary statistics:
centrality (average)
spread (dispersion, variation)

Centrality measures
Measures of central tendency describe the centre point of a data set with a single value.
median: the (middle) value in the data set for which half the observations are higher and half the observations are lower; when there is an even number of data points, the median is half of the sum of the two centre points.
mode: the most frequent value in the data set.
When the data on a sample are considered, we talk about the sample median or mode. When the data on the whole population are considered, we talk about the population median or mode.

Let $N$ be the number of all possible subjects (the size of the population) and $n$ be the number of subjects selected for an experiment (the size of the sample), with $n \le N$.
Let $x_1, x_2, \ldots$ be numeric values of one variable $X$ (e.g. salary) measured on the subjects (people) of a population.

arithmetic mean:
sample mean
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
population mean
\[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i \]

geometric mean:
sample
\[ \sqrt[n]{x_1 x_2 \cdots x_n} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \]
population
\[ \sqrt[N]{x_1 x_2 \cdots x_N} = \left( \prod_{i=1}^{N} x_i \right)^{1/N} \]
In practice, you almost always work with a sample from a population (a group of mice, a group of people, ...).

Geometric mean vs arithmetic mean
Geometric mean ≤ arithmetic mean.
The geometric mean changes when the proportions among the values change, even if their overall sum stays the same (e.g. 5 and 5 have geometric mean 5, whereas 2 and 8 have geometric mean 4).
The geometric mean works only with positive numbers (> 0).
The arithmetic mean represents data well when there are no significant outliers.
The arithmetic mean is used for numbers whose values are meant to be added together (to get a total gain).
The geometric mean is used for numbers whose values are meant to be multiplied together (to get a total gain).
The geometric mean is often used to evaluate data covering several orders of magnitude. If your data cover a narrow range, the geometric mean may not be appropriate.
The geometric mean is used in cases where the differences among data points are logarithmic or exponential in nature (e.g.
population growth, interest rates).
The geometric mean is more appropriate than the arithmetic mean for describing proportional growth, both exponential growth (constant proportional growth) and varying growth.
The geometric mean of growth over several periods gives the equivalent constant growth rate that would yield the same final amount.
Do not use the geometric mean on log-transformed data: the log of the geometric mean is the arithmetic mean of the log-transformed data,
\[ \log \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \frac{1}{n} \sum_{i=1}^{n} \log x_i \]

Spread measures (Dispersion)
Measures of dispersion describe how far the individual data values have strayed from the centre point (mean).
Let $X = \{x_1, x_2, \ldots\}$ be sample or population data. Let $N$ be the size of a population and $n < N$ be the size of a sample from that population.
range: $\mathrm{Range} = \max\{X\} - \min\{X\}$.
quartiles: three values $(Q_1, Q_2, Q_3)$ that divide the data set, after it has been arranged in ascending order, into four segments covering approximately 25% of the data each.
\[ Q_2 = \mathrm{median}\{X\} \]
\[ Q_1 = \mathrm{median}\{x_i \in (\min\{X\}, Q_2)\} \]
\[ Q_3 = \mathrm{median}\{x_i \in (Q_2, \max\{X\})\} \]
Then each of the intervals $(\min\{X\}, Q_1)$, $(Q_1, Q_2)$, $(Q_2, Q_3)$ and $(Q_3, \max\{X\})$ covers approximately 25% of the data.
interquartile range: measures the spread of the centre half of the data (covers 50% of the data).
\[ \mathrm{IQR} = Q_3 - Q_1 \]
The interquartile range is used to identify outliers, whose accuracy may be questionable and which can distort statistical results. Outliers are identified as data points outside the following interval:
\[ (Q_1 - 1.5 \times \mathrm{IQR},\; Q_3 + 1.5 \times \mathrm{IQR}) \]

Displaying data II
Box plot (box-and-whisker diagram)
box: represents the data within the IQR.
whiskers: usually correspond to the range, ±1.5 IQR, or other dispersion measures.
[Figure: identifying an outlier on a box plot]

Spread measures (Dispersion) II
Measures of dispersion, continued.
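
The quartile definitions and the 1.5 × IQR outlier rule from the preceding slides can be sketched in a few lines of Python. The slides leave the handling of the median itself open, so this sketch assumes one common convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, with the middle value excluded when the number of points is odd. Statistical packages use several different quartile conventions and may return slightly different values. The function names and the sample data are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3): medians of the lower half, all data, upper half."""
    xs = sorted(values)
    if len(xs) < 2:
        raise ValueError("need at least two values")
    half = len(xs) // 2
    q2 = median(xs)
    q1 = median(xs[:half])    # values below the median position
    q3 = median(xs[-half:])   # values above the median position
    return q1, q2, q3


def iqr_outliers(values, k=1.5):
    """Return the values lying outside (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Assumption: values exactly on a fence are not flagged.
    return [x for x in values if x < low or x > high]


data = [2, 3, 4, 5, 6, 7, 8, 9, 40]  # made-up sample
q1, q2, q3 = quartiles(data)
print("Q1, Q2, Q3:", q1, q2, q3)
print("outliers:", iqr_outliers(data))
```

For this made-up sample the quartiles are 3.5, 6 and 8.5, so IQR = 5 and the interval is (−4, 16); only the value 40 falls outside it.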