Introduction to Biostatistics

Jie Yang, Ph.D.
Associate Professor, Department of Family, Population and Preventive Medicine
Director, Biostatistical Consulting Core
Director, Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center

In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

OUTLINE
• What is biostatistics
• What does a biostatistician do
  - Experiment design, clinical trial design
  - Descriptive and inferential analysis
  - Result interpretation
• What you should bring while consulting with a biostatistician

WHAT IS BIOSTATISTICS
• The science of (bio)statistics encompasses
  - the design of biological/clinical experiments
  - the collection, summarization, and analysis of data from those experiments
  - the interpretation of, and inference from, the results

Reference: How to Lie with Statistics (1954) by Darrell Huff.
http://www.youtube.com/watch?v=PbODigCZqL8

GOAL OF STATISTICS
[Diagram: a SAMPLE is drawn from the POPULATION by sampling (probability theory). Descriptive statistics summarize the sample and the population. Inferential statistics use sample statistics ($\bar{X}$, $s$, $\hat{p}$, …) to draw inference about population parameters ($\mu$, $\sigma$, $\pi$, …).]

PROPERTIES OF A “GOOD” SAMPLE
• Adequate sample size (statistical power)
• Random selection (representative)

Sampling techniques:
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
5. Convenience sampling

STUDY DESIGN

EXPERIMENT DESIGN
• Completely Randomized Design (CRD): randomly assign the experimental units to the treatments.
• Design with blocking: deals with a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, the large unexplained error reduces the power to detect treatment effects.
  1. Randomized Complete Block Design (RCBD): one single blocking factor
  2. Latin Square Design: two blocking factors
  3. Crossover Design: each subject is a blocking factor
  4. Balanced Incomplete Block Design

EXPERIMENT DESIGN
• Factorial Design: similar to a randomized block design, but allows testing the interaction between two treatment effects. A significant interaction between A and B means:
  - the effect of A is different at each level of B, or equivalently the effect of B differs at each level of A;
  - it is not very sensible to even talk about the main effects of A and B.
• Experiment with random factors: randomly select n of the possible levels of the factor of interest. Typically random factors are categorical.
• Split-plot Design: confounding a main effect with blocks.

EVIDENCE PYRAMID
[Figure: evidence pyramid.] Source: IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity. Scientific figure on ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]

HOW A STATISTICIAN CAN HELP DURING THE STUDY DESIGN PHASE
• Blinding/masking and randomization
• The number and combination of experimental interventions
• The timing of measurements or visits
• Whether to collect information on a larger sample or on the same sample over time
• Ways to maximize the efficient use of the available resources
• Even data management: how measures are coded and what is computerized directly affect the ease, and even the feasibility, of subsequent analysis

HOW A BIOSTATISTICIAN ANALYZES DATA

TYPES OF DATA
1. Nominal data: unordered categories or classes, e.g. gender, blood type, transplant type
2. Ordinal data: order among categories is important, e.g. disease severity, AE level
3. Discrete data: both ordering and magnitude are important; often integers or counts, with no intermediate values possible, e.g. number of accidents within a month, number of kids in a family
4. Continuous data: the difference between two possible data values can be arbitrarily small, e.g. height, weight, body temperature, serum level, BP
5. Time-to-event data: censoring is present, e.g. overall survival

DESCRIPTIVE STATISTICS
• The general goal is to describe the distribution of a single variable (center, spread, shape, functional form)
• Helpful for checking data and assumptions
• Stratified (by group) analysis can be done for groups of interest
• Values and comparisons can be visualized and “estimated”, but descriptive statistics alone provide no information about our level of confidence in conclusions

DESCRIPTIVE STATISTICS
1. Measures of central tendency
• Mean: average
• Median: the 50th percentile point (middle value)
• Mode: value that occurs most frequently; a distribution can be unimodal or multimodal

DESCRIPTIVE STATISTICS
• Reporting a measure of center gives only partial information about a data set.
  - Example: consider the following three datasets:
    Dataset 1: 4 5 5 5 6
    Dataset 2: 1 3 5 7 9
    Dataset 3: 1 5 5 5 9
  - All three datasets have identical means and medians (both equal 5).
  - Datasets 2 and 3 are more variable than Dataset 1.
• It is also important to describe the spread of values about the center.

DESCRIPTIVE STATISTICS
2. Measures of variability
• Range = Max - Min
• Interquartile range (IQR) = Q3 - Q1
• Variance, sample variance
• Standard deviation, sample standard deviation

IDENTIFYING POTENTIAL OUTLIERS
• An outlying value is a value X such that X > Q3 + 1.5(IQR) or X < Q1 - 1.5(IQR)
• An extreme outlying value is a value X such that X > Q3 + 3(IQR) or X < Q1 - 3(IQR)

EFFECTS OF OUTLIERS
• Median and IQR are generally unaffected by the removal of outliers, but minor changes are possible.
• Mean and standard deviation are affected by outlying values.
• The apparent shape of the distribution can also be affected by outlying values.
• One should never simply remove data values from a dataset.
• In practice, if the outliers are not errors, a sensitivity analysis is often conducted or robust statistical methods are used.
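The measures of center and spread above, and the 1.5(IQR) and 3(IQR) outlier rules, can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names (summarize, outlier_fences, flag_outliers) are hypothetical, and the quartiles use the "inclusive" interpolation of Python's statistics module, which is only one of several quartile conventions, so other software may report slightly different Q1 and Q3.

```python
import statistics

# The three example datasets from the slide.
datasets = {
    "Dataset 1": [4, 5, 5, 5, 6],
    "Dataset 2": [1, 3, 5, 7, 9],
    "Dataset 3": [1, 5, 5, 5, 9],
}


def quartiles(values):
    """Return (Q1, Q3); assumes the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def summarize(values):
    """Measures of center and spread for a list of numbers."""
    q1, q3 = quartiles(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sample_sd": statistics.stdev(values),  # divides by n - 1
    }


def outlier_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Values outside the fences; use k=3 for extreme outlying values."""
    low, high = outlier_fences(values, k)
    return [x for x in values if x < low or x > high]


summaries = {name: summarize(vals) for name, vals in datasets.items()}
for name, stats in summaries.items():
    print(name, stats)

# Hypothetical measurements with one suspicious value.
print(flag_outliers([10, 12, 11, 13, 12, 50]))
```

All three datasets report the same mean and median, while the range and sample standard deviation separate Dataset 1 from Datasets 2 and 3. Flagged values are candidates for checking or sensitivity analysis, not for automatic removal.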
WAYS OF PRESENTING DATA
• Summary table
• Bar/pie chart
• Histogram
• Scatter plot
• Boxplot
  1. Outlier
  2. Extreme outlier
  3. Modified boxplot

SUMMARY TABLE
1. By one variable

Side      Left     Right
N         14       14
Mean      18.83    18.61
SD        6.04     5.48
Median    18.25    17.75
Min       8.00     8.80
Max       30.10    28.21

Variable   N_missing   Level   Total (N=25)   Case (N=12)   Control (N=13)
Cancer     0           No      24 (96.00%)    11 (91.67%)   13 (100.00%)
                       Yes     1 (4.00%)      1 (8.33%)     0 (0.00%)

2. By multiple categorical variables

Radiation Sequence with Surgery
Location of Tumor    Before 2002                     After 2002
                     Preoperative   Postoperative    Preoperative   Postoperative
Lower (n = 972)      107 (19.21%)   450 (80.79%)     65 (15.66%)    350 (84.34%)
Upper (n = 283)      20 (13.16%)    132 (86.84%)     21 (16.03%)    110 (83.97%)

BAR CHART AND PIE CHART
[Figure: bar chart, "Bariatric surgeries, 2010-2013". Vertical axis: percentage, 0.00% to 14.00%. Procedures: AGB, LSG, RYGB. Series: ED revisit, Discharged from ED, Admitted from ED.]
[Figure: pie chart, "Diagnosis for Cholecystectomy patients, 2006-2013".]

HISTOGRAM AND SCATTER PLOT
[Figure: "DISC measurement by group". Both axes run from 0 to 70; horizontal axis labeled "Right". Groups: Control, No Surgery, Gamma-knife, Resection.]

BOX-PLOT
One continuous variable and one categorical variable

OTHER THEORETICAL DISTRIBUTIONS

Type of Outcome Variable                Theoretical Distribution
Continuous numeric                      Normal, Log-normal, Exponential, …
Discrete numeric                        Poisson, Negative Binomial, …
Binary                                  Bernoulli, Binomial, …
Categorical with multiple categories    Multinomial, Hypergeometric, …

CONFIDENCE INTERVALS
Population parameters: $\mu$, $\sigma$, $\pi$, …
Sample statistics: $\bar{X}$, $s$, $\hat{p}$, …

A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator. A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that are likely to cover the true value of the parameter. A confidence interval starts with the point estimate and adds a “margin of error.” A confidence interval is defined as: point estimate +/- margin of error.

CONFIDENCE INTERVALS
95% CI for $\mu$: find the bounds such that $P(?? < \mu < ??) = 0.95$.

By the central limit theorem, $\bar{x} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$, where $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. Therefore

$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$

$P\left(\bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) = 0.95$

CONFIDENCE INTERVALS
95% Confidence Interval (CI) for $\mu$: $\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}$

Interpretation 1: You can be 95% sure that the true mean ($\mu$) will fall within the upper and lower bounds.
Interpretation 2: 95% of the intervals constructed using sample means ($\bar{x}$) will contain the true population mean ($\mu$).

100(1-α)% CI: $\bar{x} \pm Z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$

A good link for simulation of CI: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html

Reference: The Cartoon Guide to Statistics by Gonick and Smith.

HYPOTHESIS TESTING
• Using data to test specific hypotheses
• Making decisions based on probability (instead of subjective impressions)
• A distribution is usually assumed
• Methods that require no distributional assumptions are called non-parametric or distribution-free
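Returning to the confidence interval formula from the CONFIDENCE INTERVALS slides, the sketch below computes the 100(1-α)% interval for a mean and then checks Interpretation 2 by simulation. It is an illustration under stated assumptions, not the author's code: the population standard deviation σ is treated as known (as in the formula), the data are simulated from a normal distribution, and the function names and parameter values (z_confidence_interval, coverage, mu=50, sigma=10, n=25) are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def z_confidence_interval(sample, sigma, confidence=0.95):
    """CI for the mean with known sigma: x_bar +/- z * sigma / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when confidence=0.95
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin


def coverage(mu, sigma, n, reps, seed=0, confidence=0.95):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_confidence_interval(sample, sigma, confidence)
        if low <= mu <= high:
            hits += 1
    return hits / reps


# Hypothetical population: mean 50, SD 10; samples of size 25.
observed = coverage(mu=50, sigma=10, n=25, reps=2000)
print(f"Share of intervals covering the true mean: {observed:.3f}")
```

The printed share should land close to 0.95, which is what Interpretation 2 describes: the 95% refers to the long-run behavior of the interval procedure over repeated samples. When σ is unknown and estimated by the sample standard deviation, a t-based interval is commonly used in place of this z-based one.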
