Case Study 3: Surviving the Titanic Disaster

Description

The Titanic disaster, which occurred on 15 April 1912, still captures the interest of film producers, historians and scientists. The carefully designed ship became the tomb of more than 1500 people. Several characteristics of the passengers and crew are recorded in the dataset used for this analysis. The dataset contains 2201 subjects. The data available are coded as follows:

SURVIVED: 0 = Not survived, 1 = Survived
AGE: 0 = Child, 1 = Adult
GENDER: 0 = Male, 1 = Female
CLASS: 1 = First class, 2 = Second class, 3 = Third class, 4 = Crew members


- Which factor(s) are most important in predicting survival?
- Which subgroup has the highest survival rate? For instance:
  - Is "women and children first" true in emergencies?
  - Did the crew leave the ship last, resulting in a low survival rate in this group?

Suggested approaches:

Data restructuring

  Approach: Create a new variable for women and children
  Reason: To compare the survival rates of adult males with the combination of women and children
  Type of questions addressed: "Is it women and children first?"

  Approach: Create dummy variables for the class variable
  Reason: To use in regression

Summary statistics

  Approach: Survival rates for each group (e.g. male vs female, or first class vs second class)
  Reason: To compare survival rates of different groups
  Type of questions addressed: "Which subgroup has the highest survival rate?"

  Approach: Odds ratio between survival status and age; between survival status and gender. Cross-table of survival status and class
  Reason: To quantify the association between survival and other variables
  Type of questions addressed: "Is survival independent of the characteristics of the passenger?"

Visual displays

  Approach: Bar charts of survival for all variables
  Reason: To compare survival rates of different subgroups
  Type of questions addressed: "Which subgroup has the highest survival rate?"

  Approach: Mosaic plots of all categorical variables
  Reason: To check independence among variables, and to explore multivariate relations
  Type of questions addressed: "Are variables independent?" "Are there any unusually small or large subgroups?"

Regression

  Approach: Logistic regression of survival status on other variables
  Reason: To determine the most significant factors in survival from the Titanic
  Type of questions addressed: "Which factor(s) are most important in predicting/estimating survival rate?"
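The data-restructuring step for the "women and children" variable can be sketched in Python. This is a minimal sketch, assuming each subject is stored as a dict with the AGE and GENDER codes from the description; the function name `women_or_child` and the four sample records are hypothetical illustrations, not part of the dataset.

```python
def women_or_child(age: int, gender: int) -> int:
    """Return 1 for women and children, 0 for adult males.

    Coding from the description: AGE 0 = child, 1 = adult;
    GENDER 0 = male, 1 = female.
    """
    return 1 if (age == 0 or gender == 1) else 0


# Hypothetical records: one per combination of AGE and GENDER.
subjects = [
    {"AGE": 1, "GENDER": 0},  # adult male
    {"AGE": 1, "GENDER": 1},  # adult female
    {"AGE": 0, "GENDER": 0},  # male child
    {"AGE": 0, "GENDER": 1},  # female child
]

flags = [women_or_child(s["AGE"], s["GENDER"]) for s in subjects]
print(flags)  # [0, 1, 1, 1]
```

Only the adult male record maps to 0, which makes adult males the reference group in any model that uses this indicator.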

Partial solutions

Raw counts

Not survived   Survived
1490           711

Adult   Child
2092    109

Female  Male
470     1731

1st   2nd   3rd   Crew
325   285   706   885

Survival rate (group total)

                   1st          2nd          3rd          Crew         Total
Total              0.625 (325)  0.414 (285)  0.252 (706)  0.24 (885)
Women & Children   0.973 (150)  0.89 (117)   0.422 (244)  0.87 (23)    0.698 (534)
Adult Males        0.326 (175)  0.083 (168)  0.162 (462)  0.223 (862)  0.203 (1667)
Adult Female       0.972 (144)  0.86 (93)    0.46 (165)   0.869 (23)   0.744 (425)
Children           1 (6)        1 (24)       0.34 (79)    (0)          0.523 (109)
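A quick consistency check on the table: the rate in the Total column should be the size-weighted average of the per-class rates in the same row. The sketch below does this for the Women & Children row using only the numbers printed in the table; the helper name `pooled_rate` is hypothetical. Because the per-class rates are rounded, the result is expected to match the reported 0.698 only approximately.

```python
# (survival rate, group size) per class for women and children, from the table.
women_children = {
    "1st": (0.973, 150),
    "2nd": (0.89, 117),
    "3rd": (0.422, 244),
    "crew": (0.87, 23),
}


def pooled_rate(groups):
    """Size-weighted average of per-class survival rates, plus the total size."""
    survivors = sum(rate * n for rate, n in groups.values())
    total = sum(n for _, n in groups.values())
    return survivors / total, total


rate, total = pooled_rate(women_children)
print(f"pooled rate = {rate:.4f} over {total} subjects")
```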

               Male Adults   Women & Children   Total
Survived       338           373                711
Not survived   1329          161                1490
Total          1667          534                2201

Estimated OR = (338*161) / (1329*373) = 0.11

The odds of surviving the Titanic disaster were about 90% lower for adult males than for women and children.
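The odds ratio calculation can be reproduced in a few lines of Python. This is a minimal sketch using the four counts from the 2x2 table; the function name `odds_ratio` is a hypothetical helper.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for the 2x2 table [[a, b], [c, d]].

    Rows: survived / not survived. Columns: group 1 / group 2.
    Returns odds(group 1) / odds(group 2) = (a / c) / (b / d).
    """
    return (a * d) / (b * c)


# Group 1 = adult males, group 2 = women and children.
or_males = odds_ratio(a=338, b=373, c=1329, d=161)
print(round(or_males, 2))      # 0.11
print(round(1 / or_males, 2))  # 9.11
```

The reciprocal gives the same comparison from the other side: the odds of survival for women and children were about 9 times the odds for adult males.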

Logistic regression

logit(P(Yi=1)) = β0 + β1 * I(Women&Child)

where I(Women&Child) = 0 for adult males and 1 for women and children.

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.36914    0.06092  -22.48   <2e-16 ***
I(Women&Child)   2.20931    0.11226   19.68   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
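To connect these coefficients with the earlier tables, the fitted logits can be converted back to probabilities and the slope to an odds ratio. This is a minimal sketch using the two estimates printed above; `inv_logit` is a hypothetical helper name for the standard inverse-logit function.

```python
import math


def inv_logit(x: float) -> float:
    """Inverse of the logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


b0 = -1.36914  # intercept: log-odds of survival for adult males
b1 = 2.20931   # change in log-odds for women and children

p_adult_male = inv_logit(b0)        # about 0.20, as in the survival table
p_women_child = inv_logit(b0 + b1)  # about 0.70, as in the survival table
or_women_child = math.exp(b1)       # about 9.1, the reciprocal of OR = 0.11

print(f"{p_adult_male:.3f} {p_women_child:.4f} {or_women_child:.2f}")
```

With a single binary predictor the model simply reproduces the two observed group survival rates, so this regression carries the same information as the 2x2 table.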

AIC: 2338.8

logit(P(Yi=1)) = β0 + β1*AGE + β2*SEX + β3*I(c1) + β4*I(c2) + β5*I(c3)

where c1 = 0 if crew and 1 otherwise; c2 = 0 if 2nd class or crew and 1 otherwise; c3 = 0 if 3rd class or crew and 1 otherwise.
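The dummy coding for class is unusual enough that it helps to tabulate it. The sketch below is a minimal illustration that follows the definitions of c1, c2 and c3 given above, with CLASS coded 1 to 4 as in the description; the function name `class_dummies` is hypothetical.

```python
def class_dummies(cls: int) -> tuple:
    """Map CLASS (1, 2, 3 = passenger classes, 4 = crew) to (c1, c2, c3)."""
    c1 = 0 if cls == 4 else 1         # 0 if crew, 1 otherwise
    c2 = 0 if cls in (2, 4) else 1    # 0 if 2nd class or crew, 1 otherwise
    c3 = 0 if cls in (3, 4) else 1    # 0 if 3rd class or crew, 1 otherwise
    return c1, c2, c3


for cls, label in [(1, "1st"), (2, "2nd"), (3, "3rd"), (4, "crew")]:
    print(label, class_dummies(cls))
# 1st (1, 1, 1)
# 2nd (1, 0, 1)
# 3rd (1, 1, 0)
# crew (0, 0, 0)
```

Crew is the only group with all three dummies equal to 0, so it acts as the reference level, and each class effect is a sum of the coefficients whose dummy equals 1.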

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)         -0.1724     0.2567  -0.671    0.502
titanic2[, "AGE"]   -1.0615     0.2440  -4.350 1.36e-05 ***
titanic2[, "SEX"]    2.4201     0.1404  17.236  < 2e-16 ***
titanic2[, c1]      -1.9382     0.2535  -7.645 2.09e-14 ***
titanic2[, c2]       1.0181     0.1960   5.194 2.05e-07 ***
titanic2[, c3]       1.7778     0.1716  10.362  < 2e-16 ***

AIC: 2222.1

Akaike Information Criterion (AIC)

This measure indicates a better fit when it is smaller. It is not standardized and has no interpretation for a single model on its own. For two models estimated from the same data set, the model with the smaller AIC is preferred. Here 2222.1 < 2338.8, so this model is a better fit than the previous one.
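For intuition, the AIC of the one-indicator model can be rebuilt by hand from the 2x2 table, using AIC = 2k - 2 log L with k estimated parameters. This is a sketch under one stated assumption: because that model has a separate fitted probability for each of its two groups, its maximum-likelihood fitted probabilities are taken to be the observed group survival rates. The helper name `bernoulli_loglik` is hypothetical.

```python
import math


def bernoulli_loglik(survived: int, not_survived: int) -> float:
    """Log-likelihood of a group whose fitted probability is its observed rate."""
    p = survived / (survived + not_survived)
    return survived * math.log(p) + not_survived * math.log(1 - p)


# Counts from the 2x2 table: adult males, then women and children.
loglik = bernoulli_loglik(338, 1329) + bernoulli_loglik(373, 161)

k = 2  # intercept and one slope
aic = 2 * k - 2 * loglik
print(f"AIC = {aic:.1f}")  # should be close to the reported 2338.8
```

The same formula applies to the larger model, but its log-likelihood cannot be recovered from the tables in this case study, so only the reported value 2222.1 is used for the comparison.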

The AGE coefficient is negative (-1.06), so moving from child (AGE = 0) to adult (AGE = 1) decreases the odds of survival, or equivalently, the probability of survival (odds = p/(1-p), so odds ↓ implies p ↓ and (1-p) ↑). So, an adult has lower odds of survival than a child of the same sex and class.

Odds ratio of survival for females to males = exp(2.42) = 11.25, so the odds of surviving the Titanic were 11.25 times higher for females than for males of the same age group and class.
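Both odds ratios follow from exponentiating the corresponding coefficient. The sketch below is a minimal illustration using the AGE and SEX estimates from the coefficient table; the dictionary names are hypothetical.

```python
import math

# Estimates copied from the coefficient table of the larger model.
coef = {"AGE": -1.0615, "SEX": 2.4201}

# exp(coefficient) = odds ratio for a one-unit change (here 0 -> 1),
# holding the other variables in the model fixed.
odds_ratios = {name: math.exp(b) for name, b in coef.items()}

for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
# AGE: OR = 0.35
# SEX: OR = 11.25
```

An odds ratio below 1 for AGE means the odds of survival for adults are roughly a third of the odds for children with the same sex and class.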

Compare odds of survival for 1st class with 2nd class: OR=exp(1.018)=2.7676

Compare odds of survival for 2nd class with 3rd class: OR=exp(1.778-1.018)=2.14

Compare odds of survival for 3rd class passengers with crew: OR = exp(-1.94 + 1.018) = 0.4 (the odds of survival for 3rd class passengers are 60% lower than for crew!)
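The three class comparisons all come from the same rule: add up the coefficients of the dummies that equal 1 for each class, take the difference between the two classes, and exponentiate. The sketch below assumes the dummy coding stated with the model (crew has c1 = c2 = c3 = 0) and uses the estimates from the coefficient table; the function names are hypothetical.

```python
import math

# Estimates for c1, c2, c3 from the coefficient table.
B_C1, B_C2, B_C3 = -1.9382, 1.0181, 1.7778


def class_effect(cls: int) -> float:
    """Contribution of CLASS (1, 2, 3, or 4 = crew) to the log-odds of survival."""
    c1 = 0 if cls == 4 else 1
    c2 = 0 if cls in (2, 4) else 1
    c3 = 0 if cls in (3, 4) else 1
    return B_C1 * c1 + B_C2 * c2 + B_C3 * c3


def class_odds_ratio(cls_a: int, cls_b: int) -> float:
    """Odds ratio of survival for class a versus class b, same age and sex."""
    return math.exp(class_effect(cls_a) - class_effect(cls_b))


print(f"1st vs 2nd:  {class_odds_ratio(1, 2):.2f}")  # 2.77
print(f"2nd vs 3rd:  {class_odds_ratio(2, 3):.2f}")  # 2.14
print(f"3rd vs crew: {class_odds_ratio(3, 4):.2f}")  # 0.40
```

Writing the contrast as a difference of linear predictors avoids having to remember which coefficients cancel for each pair of classes.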

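Why does a comparison of the odds of survival for crew and 3rd class not make much sense?

Chisq = 349.9, df = 3, p-value = 1.557e-75, so reject H0: class and sex are independent. One reason for caution is visible in the survival table: 862 of the 885 crew members are adult males, so class and sex are strongly associated.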
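The chi-square statistic can be reproduced with a short Pearson test of independence. Note the assumption: the class-by-sex counts below are not listed in this case study. They are taken from the commonly distributed 2201-person Titanic table (the one shipped as R's built-in `Titanic` dataset) and agree with the class totals (325, 285, 706, 885) and sex totals (1731 males, 470 females) given in the raw counts. The function name `chi_square` is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# ASSUMED counts (not in this case study): columns are 1st, 2nd, 3rd, crew.
sex_by_class = [
    [180, 179, 510, 862],  # male
    [145, 106, 196, 23],   # female
]

stat, df = chi_square(sex_by_class)
print(f"Chisq = {stat:.1f}, df = {df}")
```

The largest contributions to the statistic come from the crew column, where far more males and far fewer females are observed than independence would predict.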
