Introduction to Biostatistics

<p>Introduction to Biostatistics </p>
<p>Jie Yang, Ph.D. </p>
<p>Associate Professor <br>Department of Family, Population and Preventive Medicine </p>
<p>Director <br>Biostatistical Consulting Core </p>
<p>Director <br>Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center </p>
<p>In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC). </p>
<p>OUTLINE </p>
<p>• What is biostatistics </p>
<p>• What does a biostatistician do </p>
<p>– Experiment design, clinical trial design </p>
<p>– Descriptive and inferential analysis </p>
<p>– Result interpretation </p>
<p>• What you should bring when consulting with a biostatistician </p>
<p>WHAT IS BIOSTATISTICS </p>
<p>• The science of (bio)statistics encompasses </p>
<p>– the design of biological/clinical experiments </p>
<p>– the collection, summarization, and analysis of data from those experiments </p>
<p>– the interpretation of, and inference from, the results </p>
<p><em>How to Lie with Statistics </em>(1954) by Darrell Huff. </p>
<p><a href="/goto?url=http://www.youtube.com/watch?v=PbODigCZqL8" target="_blank">http://www.youtube.com/watch?v=PbODigCZqL8 </a></p>
<p>GOAL OF STATISTICS </p>
<p>[Diagram: POPULATION and SAMPLE, linked by sampling (probability theory) in one direction and by inference (inferential statistics) in the other, with descriptive statistics applied to each.] </p>
<p><strong>Population parameters: </strong>\(\mu, \sigma, \pi, \ldots\) </p>
<p><strong>Sample statistics: </strong>\(\bar{X}, s, \hat{p}, \ldots\) </p>
<p>PROPERTIES OF A “GOOD” SAMPLE </p>
<p>• Adequate sample size (statistical power) </p>
<p>• Random selection (representative) </p>
<p>Sampling Techniques: </p>
<p>1. Simple random sampling </p>
<p>2. Stratified sampling </p>
<p>3. Systematic sampling </p>
<p>4. Cluster sampling </p>
<p>5. Convenience sampling </p>
<p>STUDY DESIGN </p>
<p>EXPERIMENT DESIGN </p>
<p>• <strong>Completely Randomized Design (CRD)</strong>: randomly assign the experimental units to the treatments. </p>
<p>• <strong>Design with Blocking</strong>: handles a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, the large unexplained error reduces the power to detect treatment effects. </p>
<p>1. Randomized Complete Block Design (RCBD): one single blocking factor </p>
<p>2. Latin Square Design: two blocking factors </p>
<p>3. Crossover Design: each subject acts as a block </p>
<p>4. Balanced Incomplete Block Design </p>
<p>• <strong>Factorial Design</strong>: similar to a randomized block design, but it also allows testing the interaction between two treatment effects. A significant interaction between <em>A </em>and <em>B </em>means that: </p>
<p>• the effect of <em>A </em>is different at each level of <em>B</em>, or equivalently the effect of <em>B </em>differs at each level of <em>A</em>. 
</p>
<p>• it is then not very sensible to interpret the main effects of <em>A </em>and <em>B </em>on their own. </p>
<p>• <strong>Experiment with random factors</strong>: randomly select <em>n </em>of the possible levels of the factor of interest. Random factors are typically categorical. </p>
<p>• <strong>Split-plot Design</strong>: a main effect is confounded with blocks. </p>
<p>EVIDENCE PYRAMID </p>
<p>IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on <a href="/goto?url=https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368" target="_blank">ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019] </a></p>
<p>HOW A STATISTICIAN CAN HELP DURING THE STUDY DESIGN PHASE </p>
<p>• Blinding/masking and randomization </p>
<p>• The number and combination of experimental interventions </p>
<p>• The timing of measurements or visits </p>
<p>• Whether to collect information on a larger sample or on the same sample over time </p>
<p>• Ways to maximize the efficient use of the available resources </p>
<p>• Data management: how measures are coded and what is computerized directly affect the ease, and even the feasibility, of subsequent analysis </p>
<p>How a biostatistician analyzes data </p>
<p>TYPES OF DATA </p>
<p>1. <strong>Nominal data</strong>: unordered categories or classes </p>
<p>e.g. gender, blood type, transplant type </p>
<p>2. <strong>Ordinal data</strong>: the order among categories is important </p>
<p>e.g. disease severity, adverse event (AE) level </p>
<p>3. <strong>Discrete data</strong>: both ordering and magnitude are important; values are often integers or counts, and no intermediate values are possible </p>
<p>e.g. number of accidents within a month, number of children in a family </p>
<p>4. <strong>Continuous data</strong>: the difference between two possible data values can be arbitrarily small </p>
<p>e.g. height, weight, body temperature, serum level, blood pressure (BP) </p>
<p>5. 
<strong>Time-to-event data</strong>: censoring is present </p>
<p>e.g. overall survival </p>
<p>DESCRIPTIVE STATISTICS </p>
<p>• The general goal is to describe the distribution of a single variable (center, spread, shape, functional form). </p>
<p>• Helpful for checking data and assumptions. </p>
<p>• Stratified (by group) analysis can be done for groups of interest. </p>
<p>• Values and comparisons can be visualized and “estimated”, but descriptive statistics alone provide no information about our level of confidence in the conclusions. </p>
<p><strong>1. Measure&nbsp;of central tendency </strong></p>
<p>• Mean: the arithmetic average </p>
<p>• Median: the 50<sup>th</sup> percentile (the middle value) </p>
<p>• Mode: the value that occurs most frequently; a distribution can be unimodal or multimodal </p>
<p>• Reporting a measure of center gives only partial information about a data set. </p>
<p>– Example: Consider the following three datasets: </p>
<p>Dataset 1:&nbsp;4 5 5 5 6 </p>
<p>Dataset 2:&nbsp;1 3 5 7 9 </p>
<p>Dataset 3:&nbsp;1 5 5 5 9 </p>
<p>• All three datasets have identical means and medians (both equal 5). </p>
<p>• Datasets 2 and 3 are more variable than Dataset 1. </p>
<p>• It is also important to describe the spread of values about the center. </p>
<p><strong>2. 
Measure of variability </strong></p>
<p>• Range = Max – Min </p>
<p>• Inter-Quartile Range (IQR) = Q3 – Q1 </p>
<p>• Variance, Sample Variance </p>
<p>• Standard Deviation, Sample Standard Deviation </p>
<p>IDENTIFYING POTENTIAL OUTLIERS </p>
<p>• An Outlying Value is a value, <em>X</em>, such that </p>
<p><strong>X &gt; Q3 + 1.5(IQR) </strong>or <strong>X &lt; Q1 – 1.5(IQR) </strong></p>
<p>• An Extreme Outlying Value is a value, <em>X</em>, such that </p>
<p><strong>X &gt; Q3 + 3(IQR) </strong>or <strong>X &lt; Q1 – 3(IQR) </strong></p>
<p>EFFECTS OF OUTLIERS </p>
<p>• Median and IQR are generally unaffected by the removal of outliers but minor changes are possible. </p>
<p>• Mean and Standard Deviation will be affected by the outlying values. </p>
<p>• Apparent shape of the distribution can also be affected by outlying values. </p>
<p>• One should never simply remove data values from a dataset. </p>
<p>• In practice, if the outliers are not errors, sensitivity analysis will often be conducted or robust statistical methods will be used. </p>
<p>WAYS OF PRESENTING DATA </p>
<p>• Summary table </p>
<p>• Bar/Pie chart </p>
<p>• Histogram </p>
<p>• Scatter plot </p>
<p>• Boxplot </p>
<p>1. Outlier </p>
<p>2. Extreme Outlier </p>
<p>3. Modified Boxplot </p>
<p>SUMMARY TABLE </p>
<p>1. By one variable </p>
<table>
<tr><th>Side</th><th>N</th><th>Mean</th><th>SD</th><th>Median</th><th>Min</th><th>Max</th></tr>
<tr><td>Left</td><td>14</td><td>18.83</td><td>6.04</td><td>18.25</td><td>8.00</td><td>30.10</td></tr>
<tr><td>Right</td><td>14</td><td>18.61</td><td>5.48</td><td>17.75</td><td>8.80</td><td>28.21</td></tr>
</table>
<table>
<tr><th>Variable</th><th>N_missing</th><th>Level</th><th>Total (N=25)</th><th>Case (N=12)</th><th>Control (N=13)</th></tr>
<tr><td>Cancer</td><td>0</td><td>No</td><td>24 (96.00%)</td><td>11 (91.67%)</td><td>13 (100.00%)</td></tr>
<tr><td></td><td></td><td>Yes</td><td>1 (4.00%)</td><td>1 (8.33%)</td><td>0 (0.00%)</td></tr>
</table>
<p>2. 
By multiple categorical variables </p>
<table>
<tr><th>Location of Tumor</th><th>Radiation Sequence with Surgery</th><th>Before 2002</th><th>After 2002</th></tr>
<tr><td>Lower (n = 972)</td><td>Preoperative</td><td>107 (19.21%)</td><td>65 (15.66%)</td></tr>
<tr><td></td><td>Postoperative</td><td>450 (80.79%)</td><td>350 (84.34%)</td></tr>
<tr><td>Upper (n = 283)</td><td>Preoperative</td><td>20 (13.16%)</td><td>21 (16.03%)</td></tr>
<tr><td></td><td>Postoperative</td><td>132 (86.84%)</td><td>110 (83.97%)</td></tr>
</table>
<p>BAR CHART AND PIE CHART </p>
<p>[Bar chart: <strong>Bariatric surgeries, 2010-2013</strong>. Percentages (0% to 14%) by procedure (AGB, LSG, RYGB); legend: ED revisit, Admitted from ED, Discharged from ED.] </p>
<p>[Pie chart: Diagnosis for Cholecystectomy patients, 2006-2013.] </p>
<p>HISTOGRAM AND SCATTER PLOT </p>
<p>[Scatter plot: <strong>DISC measurement by group</strong>. Both axes run from 0 to 70; x-axis label: Right; groups: Control, No Surgery, Gamma-knife, Resection.] </p>
<p>BOX-PLOT </p>
<p>One continuous variable and one categorical variable </p>
<p>OTHER </p>
<p>THEORETICAL DISTRIBUTIONS </p>
<p><strong>Variable Type of Outcome: Theoretical Distribution </strong></p>
<p>Continuous numeric: Normal, Log-normal, Exponential, … </p>
<p>Discrete numeric: Poisson, Negative Binomial, … </p>
<p>Binary: Bernoulli, Binomial, … 
</p>
<p>Categorical with multiple categories: Multinomial, Hypergeometric, … </p>
<p>CONFIDENCE INTERVALS </p>
<p><strong>Population parameters: </strong>\(\mu, \sigma, \pi, \ldots\) </p>
<p><strong>Sample statistics: </strong>\(\bar{X}, s, \hat{p}, \ldots\) </p>
<p><strong>A point estimate </strong>alone is not enough: it gives us no way to judge how accurate it is as an estimator. </p>
<p><strong>A confidence interval </strong>provides a better estimate by combining the point estimate with its standard error to define a range of values that is likely to cover the true value of the parameter. </p>
<p><strong>A confidence interval starts </strong>with the point estimate and adds a “margin of error.” A confidence interval is defined as: <strong>point estimate +/- margin of error</strong>. </p>
<p>95% CI for μ: P(-?? &lt; μ &lt; ??) = 0.95 </p>
<p>By the central limit theorem, </p>
<p>\[ \bar{x} \sim N\!\left(\mu_{\bar{x}},\ \sigma_{\bar{x}}^{2}\right), \qquad \mu_{\bar{x}} = \mu, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \] </p>
<p>so that </p>
<p>\[ P\!\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95 \] </p>
<p>\[ P\!\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95 \] </p>
<p><strong>95% Confidence Interval (CI) for µ: </strong></p>
<p>\[ \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \] </p>
<p><strong>Interpretation 1: </strong>You can be 95% sure that the true mean (μ) will fall within the upper and lower bounds. 
</p>
<p><strong>Interpretation 2: </strong>95% of the intervals constructed using sample means (\(\bar{x}\)) from repeated samples will contain the true population mean (μ). </p>
<p>100(1-α)% CI: </p>
<p>\[ \bar{x} \pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \] </p>
<p>A good link for simulation of CI: <a href="/goto?url=http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html" target="_blank">http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html </a></p>
<p><em>The Cartoon Guide to Statistics </em>by Gonick and Smith </p>
<p>HYPOTHESIS TESTING </p>
<p>• Using data to test specific hypotheses </p>
<p>• Making decisions based on probability (instead of subjective impressions) </p>
<p>• A distribution is usually assumed for the data </p>
<p>• Methods that require no distributional assumptions are called non-parametric or distribution-free </p>
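<p>The descriptive summaries, the 1.5(IQR) outlier rule, and the confidence interval formula above can be tried out with a short Python sketch. It is an illustration under stated assumptions, not part of the original slides: the function names are hypothetical, the quartiles use the "inclusive" interpolation rule (other conventions give slightly different Q1 and Q3), σ is treated as known, and the simulated data are normal. The coverage function mirrors Interpretation 2 by counting how often simulated intervals contain the true mean. </p>

```python
import math
import random
import statistics


def summarize(values):
    """Return the mean, median, and sample SD (n - 1 denominator)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


def tukey_fences(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR); values outside are potential outliers.

    Assumption: quartiles follow the 'inclusive' interpolation rule.
    Use k=3 for the "extreme outlying value" rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def z_interval(xbar, sigma, n, level=0.95):
    """Known-sigma interval: xbar +/- z_{1 - alpha/2} * sigma / sqrt(n)."""
    z = statistics.NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


def coverage(mu, sigma, n, reps, level=0.95, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(statistics.mean(sample), sigma, n, level)
        hits += low <= mu <= high
    return hits / reps


if __name__ == "__main__":
    # The three example datasets from the descriptive statistics slide.
    datasets = {
        "Dataset 1": [4, 5, 5, 5, 6],
        "Dataset 2": [1, 3, 5, 7, 9],
        "Dataset 3": [1, 5, 5, 5, 9],
    }
    for name, data in datasets.items():
        print(name, summarize(data), tukey_fences(data))

    # Hypothetical inputs: sample mean 10, known sigma 2, n = 16.
    print(z_interval(10.0, 2.0, 16))

    # Proportion of 2000 simulated 95% intervals that cover mu = 0.
    print(coverage(mu=0.0, sigma=1.0, n=25, reps=2000))
```

<p>With real data σ is rarely known, so the sample standard deviation and a t quantile would normally replace σ and the normal quantile; the sketch keeps the known-σ form only to match the formula on the slides. </p>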
