ggplot2 for Epi Studies

Leah McGrath, PhD
November 13, 2017

Introduction

· Know your data: data exploration is an important part of research
· Data visualization is an excellent way to explore data
· ggplot2 is an elegant library that makes it easy to create compelling graphs
· Plots can be iteratively built up and easily modified

Learning objectives

· To create graphs used in manuscripts for epidemiology studies
· To review and incorporate previously learned aspects of formatting graphs
· To demonstrate novel data visualizations using Shiny

ggplot architecture review

· Aesthetics: specify the variables to display
  - what are x and y?
  - can also link variables to color, shape, size and transparency
· “geoms”: specify the type of plot
  - do you want a scatter plot, line, bars, densities, or another type of plot?
· Scales: for transforming variables (e.g., log, square root)
  - also used to set the legend: title, breaks, labels
· Facets: create separate panels for different factors
· Themes: adjust appearance: background, fonts, etc.
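As a minimal sketch of how these five pieces map onto code, the example below builds one plot with each layer labeled in a comment. The data frame `toy` and its columns are simulated stand-ins invented for illustration, not the NHANES file used later in the deck.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical; not the anemia dataset)
set.seed(1)
toy <- data.frame(
  age = runif(200, 20, 80),
  sex = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  grp = factor(sample(c("A", "B"), 200, replace = TRUE))
)
toy$hgb <- 13 + 1.5 * (toy$sex == "Male") + rnorm(200)

p <- ggplot(toy, aes(x = age, y = hgb, color = sex)) +  # aesthetics
  geom_point(alpha = 0.5) +                              # geom
  scale_x_log10(name = "Age (years, log scale)") +       # scale: transform and axis title
  scale_color_discrete(name = "Sex") +                   # scale: legend title
  facet_wrap(~grp) +                                     # facets
  theme_bw()                                             # theme
```

Calling `print(p)` draws the plot. Because each layer is added with `+`, any one of them can be dropped or swapped without touching the rest.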

Hemoglobin data

· Data from the National Health and Nutrition Examination Survey (NHANES), 1999-2000
· Contains data on n = 3,990 participants
· The file was created by merging the demographic data with the complete blood count file and the nutritional biochemistry lab file
· Contains measures of hemoglobin, iron status, and other anemia-related parameters

Anemia data codebook

· age = age of participant (years)
· sex = sex of participant (Male vs Female)
· tsat = transferrin saturation (%)
· iron = total serum iron (ug/dL)
· hgb = hemoglobin concentration (g/dL)
· ferr = serum ferritin (ng/mL)
· folate = serum folate (ng/mL)
· race = participant race (Hispanic, White, Black, Other)
· rdw = red cell distribution width (%)
· wbc = white blood cell count (SI)
· anemia = indicator variable for anemia (according to WHO definition)

Scatter plot review: hemoglobin by age, stratified by ethnicity and sex

    ggplot(data = anemia, aes(x = age, y = hgb, color = sex)) +
      geom_smooth() +
      geom_jitter(aes(size = 1/iron), alpha = 0.1) +
      xlab("Age") + ylab("Hemoglobin (g/dl)") +
      scale_size(name = "Iron Deficiency") +
      scale_color_discrete(name = "Sex") +
      facet_wrap(~race) +
      theme_bw()

Box plots

    ggplot(data = anemia, aes(x = race, y = hgb)) +
      geom_boxplot()

Box plots with points

    ggplot(data = anemia, aes(x = race, y = hgb, color = sex)) +
      geom_boxplot() +
      geom_jitter(alpha = 0.1)

Box plots with coordinates flipped

    ggplot(data = anemia, aes(x = race, y = hgb, color = sex)) +
      geom_boxplot() +
      geom_jitter(alpha = 0.1) +
      coord_flip()

Violin plots

· Kernel density estimates that are mirrored on each side of a central axis to form a symmetrical shape
· Make it easy to compare several distributions
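To make the "mirrored kernel density" idea concrete, here is a minimal base-R sketch that computes a density estimate and reflects it around a vertical axis. The data are simulated, and the scaling is an illustrative choice; `geom_violin()` has its own defaults for bandwidth, trimming, and scaling, so its shape will not match exactly.

```r
# Simulated hemoglobin-like values (hypothetical data)
set.seed(2)
hgb <- rnorm(300, mean = 14, sd = 1.5)

d <- density(hgb)                    # Gaussian kernel density estimate
half_width <- d$y / max(d$y) / 2     # rescale so the widest point spans 1 unit

# Left and right edges are mirror images of the same density curve
violin <- data.frame(value = d$x, left = -half_width, right = half_width)

# Trace the outline: up the left edge, then back down the right edge
plot(c(violin$left, rev(violin$right)),
     c(violin$value, rev(violin$value)),
     type = "l", xlab = "", ylab = "Hemoglobin (g/dL)", xaxt = "n")
```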

Violin plots

    ggplot(data = anemia, aes(x = race, y = hgb, color = race)) +
      geom_violin()

Violin plots with underlying data points

    ggplot(data = anemia, aes(x = race, y = hgb, color = race)) +
      geom_violin() +
      geom_jitter(alpha = 0.1)

Violin plots stratified by 2 variables

    ggplot(data = anemia, aes(x = sex, y = hgb, color = race)) +
      geom_violin()

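Violin plots & boxplot with no outliers

    ggplot(data = anemia, aes(x = race, y = hgb, color = race)) +
      geom_violin() +
      # narrow boxplot inside each violin, with outlier points hidden
      geom_boxplot(width = .1, fill = "black", outlier.color = NA) +
      # white dot at the median; this argument was named fun.y before ggplot2 3.3.0
      stat_summary(fun = median, geom = "point", fill = "white", shape = 21, size = 2.5)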

Practice

· Use the anemia dataset to practice making scatterplots, boxplots, and violin plots
· Try faceting, flipping orientation, changing colors and labels

    str(anemia)

    ## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
    ##  $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
    ##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
    ##  $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
    ##  $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
    ##  $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
    ##  $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
    ##  $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
    ##  $ vite  : num 1488 1897 1311 528 3092 ...
    ##  $ vita  : num 74.9 84.6 54 41.9 72.5 ...
    ##  $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
    ##  $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
    ##  $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
    ##  $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
    ##  - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
    ##   .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...

Forest plots

· First gather the data into the proper format, including the following variables:
  - Estimate
  - Lower CI
  - Upper CI
  - Grouping variable
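The four-column layout listed above can be sketched in base R as follows. The helper `mean_ci()` and the simulated data are hypothetical names introduced only for this illustration; the interval uses the normal approximation (mean ± 1.96 × SD/√n).

```r
# Hypothetical helper: mean with a normal-approximation 95% CI
mean_ci <- function(x) {
  se <- sd(x) / sqrt(length(x))
  c(estimate = mean(x),
    lower    = mean(x) - 1.96 * se,
    upper    = mean(x) + 1.96 * se)
}

# Simulated data (not the anemia dataset)
set.seed(3)
toy <- data.frame(
  group = rep(c("Male", "Female"), each = 50),
  hgb   = c(rnorm(50, 15, 1.2), rnorm(50, 13.5, 1.1))
)

# One row per group: grouping variable, estimate, lower CI, upper CI
rows <- lapply(split(toy$hgb, toy$group), mean_ci)
forest_data <- data.frame(group = names(rows), do.call(rbind, rows),
                          row.names = NULL)
forest_data
```

A data frame shaped like `forest_data` can be passed directly to `geom_pointrange()` with `y`, `ymin`, and `ymax` mapped to the three numeric columns.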

Forest plots

· For this example, we take the mean and calculate the upper and lower confidence limits for hemoglobin.
· We then stack the sex and race summaries, storing the group label in one variable called "Type".

    library(dplyr)

    # Mean hemoglobin with a normal-approximation 95% CI, by sex
    # (summarise() replaces summarise_all(funs(...)); funs() is deprecated)
    anemia1 <- anemia %>%
      group_by(sex) %>%
      summarise(mean = mean(hgb), n = n(),
                lower = mean - 1.96 * sd(hgb) / sqrt(n),
                upper = mean + 1.96 * sd(hgb) / sqrt(n))
    colnames(anemia1)[1] <- "Type"

    # Same summary, by race
    anemia2 <- anemia %>%
      group_by(race) %>%
      summarise(mean = mean(hgb), n = n(),
                lower = mean - 1.96 * sd(hgb) / sqrt(n),
                upper = mean + 1.96 * sd(hgb) / sqrt(n))
    colnames(anemia2)[1] <- "Type"

    # Stack the two summaries into one data frame
    anemia3 <- rbind(anemia1, anemia2)

Forest plots

    ggplot(data = anemia3, aes(x = Type, y = mean, ymin = lower, ymax = upper)) +
      geom_pointrange()

Forest plots: flip the axes, add labels

    ggplot(data = anemia3, aes(x = Type, y = mean, ymin = lower, ymax = upper)) +
      geom_pointrange(shape = 20) +
      coord_flip() +
      xlab("Demographics") +
      ylab("Mean Hemoglobin (95% CI)") +
      theme_bw()

Forest plots: calculating mean and CI within ggplot

· ggplot can calculate the mean and CI using stat_summary
· Further data manipulation would be needed to stack multiple variables
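One detail worth checking when the CI is computed inside ggplot: `mean_cl_normal` wraps `Hmisc::smean.cl.normal()`, which uses a t quantile rather than a fixed 1.96 multiplier. The base-R sketch below, on simulated data, shows how the two intervals differ for a small sample; with thousands of observations per group the difference is negligible.

```r
# Simulated small sample (hypothetical data)
set.seed(4)
x  <- rnorm(25, mean = 14, sd = 1.5)
n  <- length(x)
se <- sd(x) / sqrt(n)

# Normal-approximation interval (fixed multiplier of 1.96)
z_ci <- mean(x) + c(-1, 1) * 1.96 * se

# t-based interval (multiplier depends on the degrees of freedom)
t_ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

rbind(z_ci, t_ci)
```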

Calculating mean and CI within ggplot

    # mean_cl_normal requires the Hmisc package to be installed
    ggplot(anemia, aes(x = race, y = hgb)) +
      stat_summary(fun.data = mean_cl_normal) +
      coord_flip() +
      theme_bw() +
      xlab("Demographics") +
      ylab("Mean Hemoglobin (95% CI)")

Forest plots: adding faceting

    # any.fit3 is a data frame of model results that is not defined in these slides
    ggplot(any.fit3, aes(x = V3, y = A1, ymin = lower, ymax = upper)) +
      geom_pointrange(shape = 20) +
      coord_flip() +
      xlab("Predictor Variable") +
      ylab("Adjusted Risk Difference per 100 (95% CI)") +
      scale_y_continuous(breaks = c(-20, -15, -10, -5, 0, 5, 10, 15, 20, 25),
                         limits = c(-21, 26)) +
      theme_bw() +
      geom_hline(yintercept = 0, lty = 2) +   # dashed reference line at no difference
      facet_grid(setting ~ ., scales = "free", space = "free")

Practice

· Use the anemia dataset to practice making forest plots using other continuous variables
· Create a new, categorized age variable (hint: convert it to a factor before graphing), then create a forest plot of mean hemoglobin by age category
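One possible way to build the categorized age variable is base R's `cut()`, which returns a factor directly. This is only a sketch: the cut points and labels below are arbitrary choices for illustration, and the data are simulated.

```r
# Simulated ages (hypothetical data)
set.seed(5)
toy <- data.frame(age = sample(18:90, 100, replace = TRUE))

# Left-closed intervals: [18, 40), [40, 65), [65, Inf)
toy$age_cat <- cut(toy$age,
                   breaks = c(18, 40, 65, Inf),
                   right  = FALSE,
                   labels = c("18-39", "40-64", "65+"))

table(toy$age_cat)
```

Because `age_cat` is already an ordered set of factor levels, it can be used as the grouping variable in a forest plot and the categories will appear in age order.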

Kaplan-Meier plots - WIHS data

· The Women’s Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US
· Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)
· Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death

Kaplan-Meier plots

· Many packages can plot survival functions
· All use the survival package to calculate survival over time
  - survfit (survival) + survplot (rms)
  - ggkm (sachsmc/ggkm) & ggplot2
  - ggkm (michaelway/ggkm)
· Kaplan-Meier plots allow for multiple treatments and subgroups
· The Kaplan-Meier estimator does not take into account competing risks
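Since every option above relies on the survival package for the estimate itself, a minimal sketch of that shared step may help. The data below are simulated with random censoring; the variable names `toy`, `grp`, `event_time`, and `censor_time` are invented for this example.

```r
library(survival)

# Simulated two-group data with random censoring (hypothetical)
set.seed(6)
n   <- 200
grp <- rep(c(0, 1), each = n / 2)
event_time  <- rexp(n, rate = ifelse(grp == 1, 0.15, 0.08))
censor_time <- runif(n, 0, 15)

toy <- data.frame(
  time   = pmin(event_time, censor_time),           # observed follow-up time
  status = as.integer(event_time <= censor_time),   # 1 = event, 0 = censored
  grp    = grp
)

# Kaplan-Meier estimate by group, drawn with base graphics
fit <- survfit(Surv(time, status) ~ grp, data = toy)
plot(fit, col = 1:2, xlab = "Time", ylab = "Survival probability")
legend("topright", legend = c("grp = 0", "grp = 1"), col = 1:2, lty = 1)
```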

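Kaplan-Meier example 1

· Calculate KM within ggplot
· https://github.com/sachsmc/ggkm
· Prep data

    library(dplyr)

    # Event indicator: 1 if art is recorded, 0 otherwise
    wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)
    # Follow-up time: event time if available, otherwise dropout time
    wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)
    # Anyone still without a time is followed to the end of the study
    wihs <- wihs %>%
      mutate(time = ifelse(is.na(time), study_end, time))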
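The data-prep logic can be checked on a tiny hand-made table. The four rows below are hypothetical values chosen so that each branch is exercised once; only the column names are taken from the slide.

```r
# Hypothetical four-row stand-in for the wihs data
wihs_toy <- data.frame(
  art            = c(1, NA, NA, 1),
  aids_death_art = c(2.5, NA, NA, 4.0),
  dropout        = c(NA, 3.2, NA, NA),
  study_end      = c(10, 10, 10, 10)
)

# Event indicator: 1 if art is recorded, 0 otherwise
wihs_toy$outcome <- ifelse(is.na(wihs_toy$art), 0, 1)

# Follow-up time: event time, else dropout time, else end of study
wihs_toy$time <- ifelse(!is.na(wihs_toy$aids_death_art), wihs_toy$aids_death_art,
                 ifelse(!is.na(wihs_toy$dropout), wihs_toy$dropout,
                        wihs_toy$study_end))

wihs_toy[, c("time", "outcome")]
```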

KM plot within ggplot2

    devtools::install_github("sachsmc/ggkm")
    library(ggkm)

    ggplot(wihs, aes(time = time, status = outcome)) +
      geom_km()

KM by treatment group

    ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
      geom_km()

Add confidence bands

    ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
      geom_km() +
      geom_kmband()

KM example #2

· Calculated using the survival package
· Plots KM curve with numbers at risk
· Same package name as previous example!
· https://github.com/michaelway/ggkm

    # Remove the sachsmc version first, since both packages are named ggkm
    remove.packages("ggkm")
    devtools::install_github("michaelway/ggkm")
    library(ggkm)

KM example 2

    library(survival)

    fit <- survfit(Surv(time, outcome) ~ idu, data = wihs)
    ggkm(fit)

KM with numbers at risk

    ggkm(fit, table = TRUE, marks = FALSE,
         ystratalabs = c("No IDU", "History of IDU"))

Cumulative incidence plots

· 1 - survival probability
· ipwrisk package - coming soon!
  - calculates adjusted cumulative incidence curves using IPTW
  - addresses censoring (IPCW) and competing risks
  - produces tables and graphics
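The "1 - survival probability" relationship can be sketched with the survival package alone. This example uses simulated data with a single event type; it is an unadjusted curve and does not handle competing risks or weighting, which are the cases the bullets above assign to ipwrisk.

```r
library(survival)

# Simulated single-group data with random censoring (hypothetical)
set.seed(7)
event_time  <- rexp(150, rate = 0.1)
censor_time <- runif(150, 0, 12)
toy <- data.frame(
  time   = pmin(event_time, censor_time),
  status = as.integer(event_time <= censor_time)
)

fit <- survfit(Surv(time, status) ~ 1, data = toy)

# Cumulative incidence as the complement of the Kaplan-Meier estimate
cum_inc <- 1 - fit$surv

plot(fit$time, cum_inc, type = "s",
     xlab = "Time", ylab = "Cumulative incidence")
```

`plot(fit, fun = "event")` draws the same complement directly from the fitted object.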

Sankey diagram

· Visualization that shows the flow of patients between states (over time)
· States, or nodes, can be treatments, comorbidities, hospitalizations, etc.
· Paths connecting states are called links
  - the thickness of each line corresponds to the proportion of patients
· Example: https://vizhub.healthdata.org/dex/

Basic sankey diagrams in R

    library(networkD3)
    library(reshape2)
    library(magrittr)

    nodes <- data.frame(name = c("Renal Failure", "Hemodialysis at 6m",
                                 "Transplant at 6m", "Death by 6m",
                                 "Hemodialysis at 12m", "Transplant at 12m",
                                 "Death by 12m"))

    # source and target are 0-based row positions in nodes
    links <- data.frame(source = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 3),
                        target = c(1, 2, 3, 4, 5, 6, 4, 5, 6, 6),
                        value  = c(70, 20, 10, 40, 20, 10, 15, 4, 1, 10))

    sankeyNetwork(Links = links, Nodes = nodes,
                  Source = "source", Target = "target", Value = "value",
                  NodeID = "name", fontSize = 22, nodeWidth = 30, nodePadding = 5)
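In practice the links table is usually derived from patient-level data rather than typed by hand. The base-R sketch below counts transitions in a small hypothetical table and converts state names to the 0-based indices that networkD3 expects; the five patients and the object names `pts` and `counts` are invented for illustration.

```r
# Hypothetical patient-level data: one row per patient transition
pts <- data.frame(
  from = rep("Renal Failure", 5),
  to   = c("Hemodialysis at 6m", "Hemodialysis at 6m", "Transplant at 6m",
           "Death by 6m", "Hemodialysis at 6m")
)

# Count patients for each from/to pair and drop empty pairs
counts <- as.data.frame(table(from = pts$from, to = pts$to),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Node table, then links expressed as 0-based positions in that table
nodes <- data.frame(name = unique(c(counts$from, counts$to)))
links <- data.frame(
  source = match(counts$from, nodes$name) - 1,
  target = match(counts$to, nodes$name) - 1,
  value  = counts$Freq
)
links
```

The resulting `nodes` and `links` can be passed to `sankeyNetwork()` with the same `Source`, `Target`, `Value`, and `NodeID` arguments used in the slide.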

[Figure: Sankey diagram showing flows from Renal Failure to Hemodialysis, Transplant, or Death at 6 months, and on to Hemodialysis, Transplant, or Death at 12 months]

Final Tips

· Spend time planning your graph
· Make sure to have the data in the correct structure before you start graphing
· Start with a simple graph, then gradually build in complexity

Further reading

· ggplot2: http://docs.ggplot2.org/current/
· Cookbook for R: http://www.cookbook-r.com/Graphs/
· Quick-R: http://www.statmethods.net/index.html

Wrap-up

· Questions?
· Acknowledgements: Alan Brookhart, Sara Levintow
· Contact info: [email protected]
