Multinomial and Ordinal Logistic Regression Analyses with Multi- Categorical Variables Using R
Total Page:16
File Type:pdf, Size:1020Kb
982 Editorial Page 1 of 8 Multinomial and ordinal Logistic regression analyses with multi- categorical variables using R Jiaqi Liang, Guoshu Bi, Cheng Zhan Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai, China Correspondence to: Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, 180 Fenglin Road, Shanghai, China. Email: [email protected]. Provenance and Peer Review: This article was commissioned by the editorial office, Annals of Translational Medicine. The article did not undergo external peer review. Comment on: Zhou ZR, Wang WW, Li Y, et al. In-depth mining of clinical data: the construction of clinical prediction model with R. Ann Transl Med 2019;7:796. Submitted Mar 17, 2020. Accepted for publication Apr 28, 2020. doi: 10.21037/atm-2020-57 View this article at: http://dx.doi.org/10.21037/atm-2020-57 It is essential in a standard linear regression analysis that calculating the points of each variable together with survival dependent variables are continuous. However, using the probabilities. standard linear regression for the analysis of a double-level The general objective of logistic regression models is to or multi-level outcome can lead to unsatisfactory results predict outcomes using variables based on certain existing because the validity of this regression model relies on the data, which have been applied in medical research for various variability of the outcome being the same for all values of diseases (8). In the study by Zhou and his colleagues (3), predictors, which is contrary to the nature of double-level the authors converted multi-categorical outcomes into or multi-level outcomes (1). Therefore, when the dependent dichotomous ones and introduced a dichotomous logistic variable consists of several categories, a maximum likelihood regression using R codes. However, multi-categorical estimator, such as multinomial logit or probit, should be outcomes can be directly applied in multinomial or ordinal used instead of the ordinary least square estimator (2). logistic regression analyses in the R software, although Logistic regression can be used to describe the relationship the results might be difficult to be interpreted with more between an independent variable(s) (either continuous or complicated steps. This study aimed to display the methods not) and a dichotomous or multi-categorical dependent and processes used to apply multi-categorical variables in variable as a supplementary variable to the standard linear logistic regression models in the R software environment. regression. The sample data was made up of patients registered Zhou et al. (3) elaborated a series of reliable methodologies in the SEER database in 2015 with diagnoses of lung using the R software to construct clinical prediction models adenocarcinomas. Patients with unclear race, primary with detailed steps and operable code examples, according site(s) of their tumors, differentiation grade of their to different types of clinical data and categories of variables. tumors, tumor stage (AJCC, 6th edition), or cause of They summarized the process of the construction of their death were excluded. Finally, 6,483 patients met practical clinical prediction models (nomograms), including the inclusion and exclusion criteria, of which 3,000 were data screening, primary model training, and internal and randomly selected for the following analysis. Age, sex, race, external validations, which was an extraordinary work and primary site of the tumor, cell differentiation grade, AJCC a practical reference in the field of statistics (4-6). Based stage of the tumor, and history of chemotherapy, were on this study, a simpler and more accurate prediction chosen as independent variables. The ages of included model was introduced as an extension by Bi et al. (7), which patients, ranging from 20 to 100 years, were divided was designed for extracting polynomial equations and into eight groups, labeled as “AgeGroup”. Also, several © Annals of Translational Medicine. All rights reserved. Ann Transl Med 2020;8(16):982 | http://dx.doi.org/10.21037/atm-2020-57 Page 2 of 8 Liang et al. Logistic regression with multi-categorical variables in R subgroup variables were defined for each of the other six seer_multi$Grade <- factor(seer_multi$Grade, ordered = T, variables: “Sex” (male and female), “Race” (white, black, levels = c("GradeI", "GradeII", and others), “PrimarySite” (main, upper lobe, middle "GradeIII", "GradeIV")) lobe, lower lobe, and overlapped), “Grade” (I–IV), “Stage” seer_multi$Stage <- factor(seer_multi$Stage, ordered = T, (I–IV), and “Chemotherapy” (yes and no or unknown). levels = c("I", "II", "III", "IV")) Survival status labeled as “Causespecificdeathclassifica ## Specify the baseline level of the outcomes tion” was defined as the dependent variable consisting of three categories: alive, dead due to lung cancer, and seer_multi$Causespecificdeathclassification <- relevel( dead due to other causes. Survival months were extracted seer_multi$Causespecificdeathclassification, ref = "Alive") for constructing an ordinal outcome. Using the dplyr package in R, the survival months were converted into Step 2: model running “SurvivalStatus” based on the tertiles of survival months. Patient outcomes were defined as “0 = no response” for Second, the model was run using the multinorm function those whose survival months are in the first third tertile; in the nnet package in R as follows. In the outputs of the “1 = partial response” and “2 = complete response” were summary of the model (Figure 1A), a block of coefficients defined as outcomes for patients whose survival months displayed as logged odds was shown followed by their are in the middle and last third tertiles, respectively. standard errors. Multinomial logistic regressions can be applied for multi- Each of these blocks had one row of values corresponding categorical outcomes, whereas ordinal variables should be to a model equation. In the first row, for instance, preferentially analyzed using an ordinal logistic regression coefficients (logged odds) of each independent variable model. Besides, if the ordinal model does not meet the comparing “Dead (attributable to this cancer dx)” with parallel regression assumption, the multinomial one will “Alive” were shown. The coefficients were pulled out still be an alternative (9). A multinomial regression analysis from the model and converted into interpretable odds was started as follows with four main steps. ratios additionally by exponential analyses using the exp() command. Multinomial regression analysis ## Install and library the R package nnet Step 1: data preparation install.packages("nnet") First, the sample data set was imported into R, and the library(nnet) ordinal categorical variables (“Grade” and “Stage”) in ## Run the multinomial model with the multinom function and the data were rewritten as ordered factors using the summarize it factor function. The level of the outcome to be used as multimodel <- multinom(Causespecificdeathclassification ~ the baseline was selected and specified using the relevel AgeGroup + Sex + Race + PrimarySite + function. Here, “Alive” was defined as a baseline to be Grade + Stage + Chemotherapy, compared with “Dead (attributable to this cancer dx)” and data = seer_multi) “Dead of other cause” in the regression. The codes in R are summary(multimodel) shown below. The data we used in the analyses can be found ## Turn coefficients into interpretable odds ratios in http://cdn.amegroups.cn/static/application/949e8411be9 oddsratio <- exp(coef(multimodel)) 1d730d1670c07b0d01072/10.21037atm-2020-57-1.pdf. ## Import the data Step 3: tests of the coefficients seer_multi <- read.csv("D://Table S1.csv") After constructing the regression model, the next thing ## Display part of the data was to calculate P values of the regression coefficients head(seer_multi) using Wald tests (z tests used in this study). R codes and tail(seer_multi) part of the outputs are shown below. The coefficients were ## Rewrite the ordinal categorical variables considered to be of significance with a two-tailed value © Annals of Translational Medicine. All rights reserved. Ann Transl Med 2020;8(16):982 | http://dx.doi.org/10.21037/atm-2020-57 Annals of Translational Medicine, Vol 8, No 16 August 2020 Page 3 of 8 A B Figure 1 Outputs of the multinomial logistic regression model. (A) Summary of the model. (B) Results of ANOVA. Grade. L: Grade II; Grade. Q: Grade III; Grade. C: Grade IV; Stage. L: Stage II; Stage. Q: Stage III; Stage. C: Stage IV. of P<0.05, and no significance was attributed otherwise. zvalues <- summary(multimodel)$coefficients / Focusing on the block of P values below, variables with summary(multimodel)$standard.errors significant P values of their coefficients were determined as pvalues <- pnorm(abs(zvalues), lower.tail = FALSE) * 2 significant prognostic factors that contributed significantly pvalues differently to cancer-specific death or death of other causes rather than to survival in patients with lung ## (Intercept) AgeGroup30-39 adenocarcinoma. ##Dead (attributable to this cancer dx) 0.000000e+00 0 ##Dead of other cause 1.206239e-197 0 ## Wald tests/z tests of the coefficients … © Annals of Translational Medicine. All rights reserved. Ann Transl Med 2020;8(16):982 | http://dx.doi.org/10.21037/atm-2020-57 Page 4 of 8 Liang et al. Logistic regression with multi-categorical variables in R ## PrimarySiteUpper Grade.L multimodelfit5 <- multinom (data = seer_multi, ##Dead (attributable to this