
<p>Introduction to Biostatistics</p><p>Jie Yang, Ph.D.</p><p>Associate Professor<br>Department of Family, Population and Preventive Medicine</p><p>Director<br>Biostatistical Consulting Core</p><p>Director<br>Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center</p><p>In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).</p>
<p>OUTLINE</p><p>• What is biostatistics</p><p>• What does a biostatistician do</p><p>– Experiment design, clinical trial design</p><p>– Descriptive and inferential analysis</p><p>– Result interpretation</p><p>• What you should bring when consulting with a biostatistician</p>
<p>WHAT IS BIOSTATISTICS</p><p>• The science of (bio)statistics encompasses</p><p>– the design of biological/clinical experiments</p><p>– the collection, summarization, and analysis of data from those experiments</p><p>– the interpretation of, and inference from, the results</p><p><em>How to Lie with Statistics</em> (1954) by Darrell Huff.</p><p><a href="/goto?url=http://www.youtube.com/watch?v=PbODigCZqL8" target="_blank">http://www.youtube.com/watch?v=PbODigCZqL8 </a></p>
<p>GOAL OF STATISTICS</p><p>[Diagram relating POPULATION and SAMPLE: sampling (probability theory) leads from the population to the sample, inference leads from the sample back to the population, and descriptive statistics apply to each.]</p><p><strong>Population parameters:</strong> \(\mu\), \(\sigma\), \(\pi\), …</p><p><strong>Sample statistics:</strong> \(\bar{X}\), \(s\), \(\hat{p}\), …</p><p>Inferential statistics use the sample statistics to draw conclusions about the population parameters.</p>
<p>PROPERTIES OF A “GOOD” SAMPLE</p><p>• Adequate sample size (statistical power)</p><p>• Random selection (representative)</p><p>Sampling techniques:</p><p>1. Simple random sampling<br>2. Stratified sampling<br>3. Systematic sampling<br>4. Cluster sampling<br>5. Convenience sampling</p>
<p>STUDY DESIGN</p>
<p>EXPERIMENT DESIGN</p><p><strong>Completely Randomized Design (CRD)</strong>: randomly assign the experimental units to the treatments.</p><p><strong>Design with Blocking</strong>: deals with a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, the large unexplained error reduces the power to detect treatment effects.</p><p>1. Randomized Complete Block Design (RCBD): one single blocking factor<br>2. Latin Square Design: two blocking factors<br>3. Crossover Design: each subject is a block<br>4. Balanced Incomplete Block Design</p>
<p>EXPERIMENT DESIGN</p><p><strong>Factorial Design</strong>: similar to a randomized block design, but it also allows testing the interaction between two treatment effects. A significant interaction between <em>A</em> and <em>B</em> means that:</p><p>• the effect of <em>A</em> differs across the levels of <em>B</em>; equivalently, the effect of <em>B</em> differs across the levels of <em>A</em>.
</p><p>• it is not very sensible to talk about the main effects of <em>A</em> and <em>B</em></p><p><strong>Experiment with random factors</strong>: the <em>n</em> levels studied are randomly selected from the possible levels of the factor of interest. Random factors are typically categorical.</p><p><strong>Split-plot Design</strong>: a main effect is confounded with blocks.</p>
<p>EVIDENCE PYRAMID</p><p>IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on <a href="/goto?url=https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368" target="_blank">ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019] </a></p>
<p>WHAT A STATISTICIAN CAN HELP WITH DURING THE STUDY DESIGN PHASE</p><p>• Blinding/masking and randomization</p><p>• The number and combination of experimental interventions</p><p>• The timing of measurements or visits</p><p>• Whether to collect information on a larger sample or on the same sample over time</p><p>• Ways to maximize the efficient use of the available resources</p><p>• Data management: how measures are coded and what is computerized directly affect the ease, and even the feasibility, of subsequent analysis</p>
<p>How a biostatistician analyzes data</p>
<p>TYPES OF DATA</p><p>1. <strong>Nominal data</strong>: unordered categories or classes<br>e.g. gender, blood type, transplant type</p><p>2. <strong>Ordinal data</strong>: the order among categories is important<br>e.g. disease severity, AE level</p><p>3. <strong>Discrete data</strong>: both ordering and magnitude are important; values are often integers or counts, and no intermediate values are possible<br>e.g. number of accidents within a month, number of kids in a family</p><p>4. <strong>Continuous data</strong>: the difference between two possible data values can be arbitrarily small<br>e.g. height, weight, body temperature, serum level, BP</p><p>5. 
<strong>Time-to-event data</strong>: censoring is present<br>e.g. overall survival</p>
<p>DESCRIPTIVE STATISTICS</p><p>• The general goal is to describe the distribution of a single variable (center, spread, shape, functional form)</p><p>• Helpful for checking data and assumptions</p><p>• Stratified (by group) analysis can be done for groups of interest</p><p>• Values and comparisons can be visualized and “estimated”, but descriptive statistics alone provide no information about our level of confidence in conclusions</p>
<p>DESCRIPTIVE STATISTICS</p><p><strong>1. Measures of central tendency</strong></p><p>• Mean: the average</p><p>• Median: the 50<sup style="top: -0.7em;">th</sup> percentile point (middle value)</p><p>• Mode: the value that occurs most frequently; a distribution can be unimodal or multimodal</p>
<p>DESCRIPTIVE STATISTICS</p><p>• Reporting a measure of center gives only partial information about a data set.</p><p>– Example: consider the following three datasets:</p><p>Dataset 1: 4 5 5 5 6<br>Dataset 2: 1 3 5 7 9<br>Dataset 3: 1 5 5 5 9</p><p>– All three datasets have identical means and medians (both equal to 5).</p><p>– Datasets 2 and 3 are more variable than the 1<sup style="top: -0.5em;">st</sup> one.</p><p>• It is also important to describe the spread of values about the center.</p>
<p>DESCRIPTIVE STATISTICS</p><p><strong>2. 
Measures of variability</strong></p><p>• Range = Max − Min</p><p>• Inter-Quartile Range (IQR) = Q3 − Q1</p><p>• Variance, Sample Variance</p><p>• Standard Deviation, Sample Standard Deviation</p>
<p>IDENTIFYING POTENTIAL OUTLIERS</p><p>• An Outlying Value is a value, <em>X</em>, such that</p><p><strong>X &gt; Q3 + 1.5(IQR)</strong> or <strong>X &lt; Q1 − 1.5(IQR)</strong></p><p>• An Extreme Outlying Value is a value, <em>X</em>, such that</p><p><strong>X &gt; Q3 + 3(IQR)</strong> or <strong>X &lt; Q1 − 3(IQR)</strong></p>
<p>EFFECTS OF OUTLIERS</p><p>• Median and IQR are generally unaffected by the removal of outliers, but minor changes are possible.</p><p>• Mean and Standard Deviation will be affected by the outlying values.</p><p>• The apparent shape of the distribution can also be affected by outlying values.</p><p>• One should never simply remove data values from a dataset.</p><p>• In practice, if the outliers are not errors, a sensitivity analysis will often be conducted or robust statistical methods will be used.</p>
<p>WAYS OF PRESENTING DATA</p><p>• Summary table</p><p>• Bar/Pie chart</p><p>• Histogram</p><p>• Scatter plot</p><p>• Boxplot<br>1. Outlier<br>2. Extreme Outlier<br>3. Modified Boxplot</p>
<p>SUMMARY TABLE</p><p>1. By one variable</p>
<table><tr><th>Side</th><th>N</th><th>Mean</th><th>SD</th><th>Median</th><th>Min</th><th>Max</th></tr><tr><td>left</td><td>14</td><td>18.83</td><td>6.04</td><td>18.25</td><td>8.00</td><td>30.10</td></tr><tr><td>right</td><td>14</td><td>18.61</td><td>5.48</td><td>17.75</td><td>8.80</td><td>28.21</td></tr></table>
<table><tr><th>Variable</th><th>N_missing</th><th>Level</th><th>Total (N=25)</th><th>Case (N=12)</th><th>Control (N=13)</th></tr><tr><td>Cancer</td><td>0</td><td>No</td><td>24 (96.00%)</td><td>11 (91.67%)</td><td>13 (100.00%)</td></tr><tr><td></td><td></td><td>Yes</td><td>1 (4.00%)</td><td>1 (8.33%)</td><td>0 (0.00%)</td></tr></table>
<p>2. 
By multiple categorical variables</p><p>Radiation Sequence with Surgery, by Location of Tumor and time period</p>
<table><tr><th>Location of Tumor</th><th>Period</th><th>Preoperative</th><th>Postoperative</th></tr><tr><td>Lower (n = 972)</td><td>Before 2002</td><td>107 (19.21%)</td><td>450 (80.79%)</td></tr><tr><td></td><td>After 2002</td><td>65 (15.66%)</td><td>350 (84.34%)</td></tr><tr><td>Upper (n = 283)</td><td>Before 2002</td><td>20 (13.16%)</td><td>132 (86.84%)</td></tr><tr><td></td><td>After 2002</td><td>21 (16.03%)</td><td>110 (83.97%)</td></tr></table>
<p>BAR CHART AND PIE CHART</p><p>[Figures: a chart titled <strong>Bariatric surgeries, 2010-2013</strong> (percentage axis from 0% to 14%) and a chart titled "Diagnosis for Cholecystectomy patients, 2006-2013". Labels shown: AGB, LSG, RYGB; Admitted from ED, ED revisit, Discharged from ED.]</p>
<p>HISTOGRAM AND SCATTER PLOT</p><p>[Figure: <strong>DISC measurement by group</strong>; both axes run from 0 to 70, with one axis labeled <strong>Right</strong>; groups: Control, No Surgery, Gamma-knife, Resection.]</p>
<p>BOX-PLOT</p><p>One continuous variable and one categorical variable</p><p>OTHER</p>
<p>THEORETICAL DISTRIBUTIONS</p><p><strong>Type of outcome variable: theoretical distributions</strong></p><p>Continuous numeric: Normal, Log-normal, Exponential, …</p><p>Discrete numeric: Poisson, Negative Binomial, …</p><p>Binary: Bernoulli, Binomial, …
</p><p>Categorical with multiple categories: Multinomial, Hypergeometric, …</p>
<p>CONFIDENCE INTERVALS</p><p><strong>Population parameters:</strong> \(\mu\), \(\sigma\), \(\pi\), …</p><p><strong>Sample statistics:</strong> \(\bar{X}\), \(s\), \(\hat{p}\), …</p><p><strong>A point estimate</strong> alone is not enough: it gives us no way to judge how accurate it is as an estimator. <strong>A confidence interval</strong> provides a better estimate by combining the point estimate with its standard error to define a range of values that is likely to cover the true value of the parameter.</p><p><strong>A confidence interval starts</strong> with the point estimate and adds a “margin of error.” A confidence interval is defined as: <strong>point estimate ± margin of error</strong>.</p>
<p>CONFIDENCE INTERVALS</p><p>95% CI for μ: \(P(\,?? &lt; \mu &lt; ??\,) = 0.95\)</p><p>By the central limit theorem, approximately \(\bar{x} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^{2})\), where \(\mu_{\bar{x}} = \mu\) and \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\). Therefore</p><p>\[P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95\]</p><p>\[P\left(\bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) = 0.95\]</p>
<p>CONFIDENCE INTERVALS</p><p><strong>95% Confidence Interval (CI) for μ:</strong> \(\bar{x} \pm 1.96 \cdot \dfrac{\sigma}{\sqrt{n}}\)</p><p><strong>Interpretation 1: </strong>You can be 95% sure that the true mean (μ) falls between the lower and upper bounds.</p><p><strong>Interpretation 2: </strong>95% of the intervals constructed in this way from sample means (\(\bar{x}\)) will contain the true population mean (μ).
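</p><p>As a rough illustration of the interval \(\bar{x} \pm 1.96 \cdot \sigma/\sqrt{n}\), the following Python sketch computes a z-based confidence interval from summary values. The function name <code>z_confidence_interval</code> and the input numbers are hypothetical and chosen only for illustration; the sketch assumes that σ is known and that the sample mean is approximately normal.</p>

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, level=0.95):
    """Return (lower, upper) for xbar +/- z * sigma / sqrt(n).

    Assumes sigma is the known population standard deviation and that
    the sample mean is approximately normal (central limit theorem).
    """
    alpha = 1 - level
    # Standard normal quantile; about 1.96 when level = 0.95.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical inputs: sample mean 120, known sigma 15, sample size 36.
low, high = z_confidence_interval(xbar=120.0, sigma=15.0, n=36)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # prints "95% CI: (115.10, 124.90)"
```

<p>When σ is unknown and replaced by the sample standard deviation, a t quantile is commonly used in place of the normal quantile, especially for small samples.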
</p><p>100(1−α)% CI: \(\bar{x} \pm Z_{1-\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}\)</p><p>A good link for simulation of CI: <a href="/goto?url=http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html" target="_blank">http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html </a></p><p><em>The Cartoon Guide to Statistics</em> by Gonick and Smith</p>
<p>HYPOTHESIS TESTING</p><p>• Using data to test specific hypotheses</p><p>• Making decisions based on probability (instead of subjective impressions)</p><p>• A distribution is usually assumed</p><p>• Methods that require no distributional assumptions are called non-parametric or distribution-free</p>
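<p>To make the idea of a distribution-free method concrete, here is a minimal sketch of an exact two-sample permutation test for a difference in means. The slides do not name this particular test; it is used here only as one example of a non-parametric method. The function name <code>exact_permutation_p_value</code> and the data are hypothetical, and the exhaustive enumeration is only practical for small samples.</p>

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Every way of relabeling the pooled observations into groups of the
    original sizes is enumerated; no distributional form is assumed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical measurements for two small groups.
treatment = [12.1, 14.3, 11.8, 15.2]
control = [9.9, 10.4, 11.1, 10.8]
print(f"p-value: {exact_permutation_p_value(treatment, control):.3f}")
```

<p>The p-value is the fraction of relabelings whose mean difference is at least as large as the observed one, so the decision rests on probability rather than on an assumed distribution.</p>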