ggplot2 for Epi Studies


Leah McGrath, PhD
November 13, 2017

Introduction
· Know your data: data exploration is an important part of research
· Data visualization is an excellent way to explore data
· ggplot2 is an elegant R library that makes it easy to create compelling graphs
· Plots can be iteratively built up and easily modified

Learning objectives
· To create graphs used in manuscripts for epidemiology studies
· To review and incorporate previously learned aspects of formatting graphs
· To demonstrate novel data visualizations using Shiny

ggplot architecture review
· Aesthetics: specify the variables to display
  - what are x and y?
  - can also link variables to color, shape, size and transparency
· "geoms": specify the type of plot
  - do you want a scatter plot, line, bars, densities, or another type of plot?
· Scales: for transforming variables (e.g., log, square root)
  - also used to set the legend: title, breaks, labels
· Facets: create separate panels for different factors
· Themes: adjust appearance: background, fonts, etc.

Hemoglobin data
· Data from the National Health and Nutrition Examination Survey (NHANES), 1999-2000
· Contains data on n=3,990 patients
· The file was created by merging demographic data with the complete blood count file and the nutritional biochemistry lab file
· Contains measures of hemoglobin, iron status, and other anemia-related parameters

Anemia data codebook
· age = age of participant (years)
· sex = sex of participant (Male vs Female)
· tsat = transferrin saturation (%)
· iron = total serum iron (ug/dL)
· hgb = hemoglobin concentration (g/dL)
· ferr = serum ferritin (ng/mL)
· folate = serum folate (ng/mL)
· race = participant race (Hispanic, White, Black, Other)
· rdw = red cell distribution width (%)
· wbc = white blood cell count (SI)
· anemia = indicator variable for anemia (according to WHO definition)

Scatter plot review: hemoglobin by age, stratified by ethnicity and sex
    ggplot(data=anemia, aes(x=age, y=hgb, color=sex)) +
      geom_smooth() +
      geom_jitter(aes(size=1/iron), alpha=0.1) +
      xlab("Age") + ylab("Hemoglobin (g/dl)") +
      scale_size(name = "Iron Deficiency") +
      scale_color_discrete(name = "Sex") +
      facet_wrap(~race) + theme_bw()

Box plots
    ggplot(data=anemia, aes(x=race, y=hgb)) +
      geom_boxplot()

Box plots with points
    ggplot(data=anemia, aes(x=race, y=hgb, color=sex)) +
      geom_boxplot() +
      geom_jitter(alpha=0.1)

Box plots with coordinates flipped
    ggplot(data=anemia, aes(x=race, y=hgb, color=sex)) +
      geom_boxplot() +
      geom_jitter(alpha=0.1) +
      coord_flip()

Violin plots
· Kernel density estimates that are placed on each side and mirrored to form a symmetrical shape
· Easy to compare several distributions

Violin plots
    ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
      geom_violin()

Violin plots with underlying data points
    ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
      geom_violin() +
      geom_jitter(alpha=0.1)

Violin plots stratified by 2 variables
    ggplot(data=anemia, aes(x=sex, y=hgb, color=race)) +
      geom_violin()

Violin plots & boxplot with no outliers
    ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
      geom_violin() +
      geom_boxplot(width=.1, fill="black", outlier.color=NA) +
      # mark the median of each group with a white point
      stat_summary(fun=median, geom="point", fill="white", shape=21, size=2.5)

Practice
· Use the anemia dataset to practice making scatterplots, boxplots, and violin plots
· Try faceting, flipping orientation, changing colors and labels

    str(anemia)
    ## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
    ##  $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
    ##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
    ##  $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
    ##  $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
    ##  $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
    ##  $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
    ##  $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
    ##  $ vite  : num 1488 1897 1311 528 3092 ...
    ##  $ vita  : num 74.9 84.6 54 41.9 72.5 ...
    ##  $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
    ##  $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
    ##  $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
    ##  $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
    ##  - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
    ##   .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...

Forest plots
· First gather the data into the proper format, including the following variables:
  - Estimate
  - Lower CI
  - Upper CI
  - Grouping variable

Forest plots
· For this example, we take the mean of hemoglobin and calculate its upper and lower 95% confidence limits.
· We will stack the row observations into one variable called "Type".
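
For reference, the interval calculated in this example is the usual normal-approximation 95% confidence interval for a group mean. The notation below is introduced here for illustration and does not appear in the original slides:

```latex
\[
\text{lower} = \bar{x} - 1.96\,\frac{s}{\sqrt{n}}, \qquad
\text{upper} = \bar{x} + 1.96\,\frac{s}{\sqrt{n}}
\]
```

Here $\bar{x}$ is the group mean hemoglobin, $s$ is the group standard deviation, and $n$ is the number of observations in the group.
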
    library(dplyr)

    # Mean hemoglobin and 95% CI by sex
    anemia1 <- anemia %>%
      group_by(sex) %>%
      summarise(mean = mean(hgb), n = n(),
                lower = mean - 1.96 * sd(hgb) / sqrt(n),
                upper = mean + 1.96 * sd(hgb) / sqrt(n))
    colnames(anemia1)[1] <- "Type"

    # Mean hemoglobin and 95% CI by race
    anemia2 <- anemia %>%
      group_by(race) %>%
      summarise(mean = mean(hgb), n = n(),
                lower = mean - 1.96 * sd(hgb) / sqrt(n),
                upper = mean + 1.96 * sd(hgb) / sqrt(n))
    colnames(anemia2)[1] <- "Type"

    # Stack both summaries so "Type" holds the sex and race categories
    anemia3 <- rbind(anemia1, anemia2)

Forest plots
    ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
      geom_pointrange()

Forest plots: flip the axes, add labels
    ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
      geom_pointrange(shape=20) +
      coord_flip() +
      xlab("Demographics") +
      ylab("Mean Hemoglobin (95% CI)") +
      theme_bw()

Forest plots: calculating mean and CI within ggplot
· ggplot can calculate the mean and CI using stat_summary
· Further data manipulation would be needed to stack multiple variables

Calculating mean and CI within ggplot
    # mean_cl_normal requires the Hmisc package to be installed
    ggplot(anemia, aes(x=race, y=hgb)) +
      stat_summary(fun.data=mean_cl_normal) +
      coord_flip() +
      theme_bw() +
      xlab("Demographics") +
      ylab("Mean Hemoglobin (95% CI)")

Forest plots: adding faceting
    # any.fit3 is a data frame of model estimates that is not defined in these slides
    ggplot(any.fit3, aes(x=V3, y=A1, ymin=lower, ymax=upper)) +
      geom_pointrange(shape=20) +
      coord_flip() +
      xlab("Predictor Variable") +
      ylab("Adjusted Risk Difference per 100 (95% CI)") +
      scale_y_continuous(breaks=c(-20,-15,-10,-5,0,5,10,15,20,25),
                         limits = c(-21,26)) +
      theme_bw() +
      geom_hline(yintercept=0, lty=2) +
      facet_grid(setting~., scales='free', space='free')

Practice
· Use the anemia dataset to practice making forest plots using other continuous variables
· Use dplyr to create a new, categorized age variable (hint: factor this before graphing). Create a forest plot of mean hemoglobin by age category.
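
One possible approach to the second practice item is sketched below. Because the anemia file is not distributed with these slides, the sketch runs on a simulated stand-in called anemia_sim. That data frame, its values, and the age cut points are all assumptions made for illustration, not part of the original material. With the real data, anemia_sim would be replaced by anemia.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the anemia data (NOT real NHANES values);
# only the age and hgb columns are mimicked.
set.seed(42)
anemia_sim <- data.frame(
  age = round(runif(500, min = 18, max = 85)),
  hgb = rnorm(500, mean = 14, sd = 1.5)
)

# cut() returns a factor, so the categories keep their order on the axis.
# The cut points are arbitrary choices for this illustration.
age_summary <- anemia_sim %>%
  mutate(age_cat = cut(age,
                       breaks = c(18, 40, 60, Inf),
                       labels = c("18-39", "40-59", "60+"),
                       right = FALSE)) %>%
  group_by(age_cat) %>%
  summarise(mean  = mean(hgb),
            n     = n(),
            se    = sd(hgb) / sqrt(n),
            lower = mean - 1.96 * se,
            upper = mean + 1.96 * se)

# Forest plot of mean hemoglobin (95% CI) by age category
forest_age <- ggplot(age_summary,
                     aes(x = age_cat, y = mean, ymin = lower, ymax = upper)) +
  geom_pointrange(shape = 20) +
  coord_flip() +
  xlab("Age category (years)") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()
print(forest_age)
```

Creating the category with cut() covers the "factor this before graphing" hint, since the result is already an ordered set of factor levels rather than a character column that would be sorted alphabetically.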

Kaplan-Meier plots - WIHS data
· The Women's Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US
· Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)
· Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death

Kaplan-Meier plots
· MANY package options to plot survival functions
· All use the survival package to calculate survival over time
  - survfit (survival) + survplot (rms)
  - ggkm (sachsmc/ggkm) & ggplot2
  - ggkm (michaelway/ggkm)
· Allows for multiple treatments and subgroups
· Does not take into account competing risks

Kaplan-Meier example 1
· Calculate KM within ggplot
· https://github.com/sachsmc/ggkm
· Prep data

    wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)
    wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)
    wihs <- wihs %>% mutate(time = ifelse(is.na(time), study_end, time))

KM plot within ggplot2
    devtools::install_github("sachsmc/ggkm")
    library(ggkm)
    ggplot(wihs, aes(time = time, status = outcome)) +
      geom_km()

KM by treatment group
    ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
      geom_km()

Add confidence bands
    ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
      geom_km() +
      geom_kmband()

KM example 2
· Calculated using the survival package
· Plots KM curve with numbers at risk
· Same package name as previous example!
· https://github.com/michaelway/ggkm

    remove.packages("ggkm")
    devtools::install_github("michaelway/ggkm")
    library(ggkm)

KM example 2
    library(survival)
    fit <- survfit(Surv(time, outcome) ~ idu, data=wihs)
    ggkm(fit)

KM with numbers at risk
    ggkm(fit, table=TRUE, marks=FALSE,
         ystratalabs = c("No IDU", "History of IDU"))

Cumulative incidence plots
· 1 - survival probability
· ipwrisk package - coming soon!
  - calculates adjusted cumulative incidence curves using IPTW
  - addresses censoring (IPCW) and competing risks
  - produces tables and graphics

Sankey diagram
· Visualization that shows the flow of patients between states (over time)
· States, or nodes, can be treatments, comorbidities, hospitalizations, etc.
· Paths connecting states are called links
  - the thickness of each line corresponds to the proportion of patients
· Example: https://vizhub.healthdata.org/dex/

Basic sankey diagrams in R
    library(networkD3)
    library(reshape2)
    library(magrittr)

    nodes <- data.frame(name=c("Renal Failure", "Hemodialysis at 6m",
                               "Transplant at 6m", "Death by 6m",
                               "Hemodialysis at 12m", "Transplant at 12m",
                               "Death by 12m"))
    # source and target are zero-based row indices into nodes
    links <- data.frame(source=c(0,0,0,1,1,1,2,2,2,3),
                        target=c(1,2,3,4,5,6,4,5,6,6),
                        value=c(70,20,10,40,20,10,15,4,1,10))
    sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
                  Target = "target", Value = "value", NodeID = "name",
                  fontSize = 22, nodeWidth = 30, nodePadding = 5)

[Figure: Sankey diagram with nodes Renal Failure, Hemodialysis at 6m, Transplant at 6m, Death by 6m, Hemodialysis at 12m, Transplant at 12m, Death by 12m]

Final Tips
· Spend time planning your graph
· Make sure to have the data in the correct structure before you start graphing
· Start with a simple graph, gradually build in complexity

Further reading
· ggplot2: http://docs.ggplot2.org/current/
· Cookbook for R: http://www.cookbook-r.com/Graphs/
· Quick-R: http://www.statmethods.net/index.html

Wrap-up
· Questions?
· Acknowledgements: Alan Brookhart, Sara Levintow
· Contact info: [email protected]
