# Case Study 3: Surviving the Titanic Disaster

Description

The Titanic disaster, which occurred on 15 April 1912, still captures the interest of film producers, historians and scientists. The carefully designed ship became the tomb of more than 1500 people. Several characteristics of the passengers and crew are recorded in the dataset used for this analysis, which contains 2201 subjects. The data are coded as follows:

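| Variable | Code | Meaning |
|---|---|---|
| SURVIVED | 0 | Not survived |
| | 1 | Survived |
| AGE | 0 | Child |
| | 1 | Adult |
| GENDER | 0 | Male |
| | 1 | Female |
| CLASS | 1 | First class |
| | 2 | Second class |
| | 3 | Third class |
| | 4 | Crew members |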

- Which factor(s) are most important in predicting the survival rate?
- Which subgroup has the highest survival rate? For instance:
    - Does “women and children first” hold in emergencies?
    - Did the crew leave the ship last, resulting in a low survival rate in this group?

**Suggested approaches:**

**Data Restructuring**

- Approach: create a new variable for women and children; create dummy variables for the class variable.
- Purpose: to compare the survival rates of adult males with those of women and children combined; to use in regression.
- Type of questions addressed: “Is it women and children first?”

**Summary statistics**

- Approach: survival rates for each group (e.g. male vs. female, or first class vs. second class); odds ratio between survival status and age, and between survival status and gender; cross-table of survival status and class.
- Purpose: to compare survival rates of different groups; to quantify the association between survival and other variables.
- Type of questions addressed: “Which subgroup has the highest survival rate?” “Is survival independent of the characteristics of the passenger?”

**Visual displays**

- Approach: bar charts of survival for all variables; mosaic plots of all categorical variables.
- Purpose: to compare survival rates of different subgroups; to check independence among variables and to explore multivariate relations.
- Type of questions addressed: “Which subgroup has the highest survival rate?” “Are variables independent?” “Are there any unusually small or large subgroups?”

**Regression**

- Approach: logistic regression of survival status on the other variables.
- Purpose: to determine the most significant factors in surviving the Titanic disaster.
- Type of questions addressed: “Which factor(s) are most important in predicting/estimating the survival rate?”
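The "women and children" variable from the data restructuring step can be sketched in a few lines of Python. This is only an illustration under the coding given in the description (AGE: 0 = child, 1 = adult; GENDER: 0 = male, 1 = female). The function name `women_or_child` and the sample records are hypothetical and not part of the original analysis.

```python
# Coding assumed from the case study description:
#   AGE:    0 = child, 1 = adult
#   GENDER: 0 = male,  1 = female

def women_or_child(age, gender):
    """Return 1 for women and children, 0 for adult males."""
    is_adult_male = (age == 1 and gender == 0)
    return 0 if is_adult_male else 1

# Hypothetical records (AGE, GENDER), one per combination of codes.
records = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [women_or_child(age, gender) for age, gender in records]

for (age, gender), label in zip(records, labels):
    print(f"AGE={age} GENDER={gender} -> women_or_child={label}")
```

Only the adult male record gets the value 0, which matches the indicator used in the first logistic regression model.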

**Partial solutions**

Raw counts

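| Variable | Level | Count |
|---|---|---|
| Survival | Not survived | 1490 |
| | Survived | 711 |
| Age | Adult | 2092 |
| | Child | 109 |
| Gender | Female | 470 |
| | Male | 1731 |
| Class | 1st | 325 |
| | 2nd | 285 |
| | 3rd | 706 |
| | Crew | 885 |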

**Survival rate (group size in parentheses)**

| Group | 1st | 2nd | 3rd | Crew | Total |
|---|---|---|---|---|---|
| Total | 0.625 (325) | 0.414 (285) | 0.252 (706) | 0.24 (885) | 0.323 (2201) |
| Women & Children | 0.973 (150) | 0.89 (117) | 0.422 (244) | 0.87 (23) | 0.698 (534) |
| Adult Males | 0.326 (175) | 0.083 (168) | 0.162 (462) | 0.223 (862) | 0.203 (1667) |
| Adult Female | 0.972 (144) | 0.86 (93) | 0.46 (165) | 0.869 (23) | 0.744 (425) |
| Children | 1 (6) | 1 (24) | 0.34 (79) | (0) | 0.523 (109) |
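As a consistency check on the survival rate table, each rate in the Total column should be the size-weighted average of the per-class rates in the same row. The sketch below recomputes it for two rows from the rounded rates and group sizes above. The helper name `pooled_rate` is hypothetical, and small differences from the printed totals are expected because the inputs are rounded.

```python
# (survival rate, group size) per class, copied from the survival rate table.
women_children = {"1st": (0.973, 150), "2nd": (0.89, 117),
                  "3rd": (0.422, 244), "crew": (0.87, 23)}
adult_males = {"1st": (0.326, 175), "2nd": (0.083, 168),
               "3rd": (0.162, 462), "crew": (0.223, 862)}

def pooled_rate(cells):
    """Size-weighted average of per-class rates, plus the total group size."""
    total_n = sum(n for _, n in cells.values())
    approx_survivors = sum(rate * n for rate, n in cells.values())
    return approx_survivors / total_n, total_n

wc_rate, wc_n = pooled_rate(women_children)
am_rate, am_n = pooled_rate(adult_males)
print(f"Women & children: {wc_rate:.3f} ({wc_n})")
print(f"Adult males:      {am_rate:.3f} ({am_n})")
```

Both results agree with the table's Total column to within rounding.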

| | Male Adults | Women & Children | Total |
|---|---|---|---|
| Survived | 338 | 373 | 711 |
| Not survived | 1329 | 161 | 1490 |
| Total | 1667 | 534 | 2201 |

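Odds ratio of survival, adult males relative to women and children:

$$\text{OR} = \frac{338/1329}{373/161} = \frac{338 \times 161}{1329 \times 373} \approx 0.11$$

The odds of surviving the Titanic disaster were roughly 90% lower for adult males than for women and children (1 − 0.11 = 0.89).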
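The odds ratio can be reproduced directly from the 2×2 table of survival by group. The following is a minimal sketch using only the four counts reported above; the variable and function names are illustrative.

```python
# Counts from the 2x2 table of survival by group.
survived = {"adult_males": 338, "women_children": 373}
not_survived = {"adult_males": 1329, "women_children": 161}

def odds(group):
    """Odds of survival = survivors / non-survivors within a group."""
    return survived[group] / not_survived[group]

# Odds ratio of adult males relative to women and children.
odds_ratio = odds("adult_males") / odds("women_children")
print(f"odds (adult males)      = {odds('adult_males'):.3f}")
print(f"odds (women & children) = {odds('women_children'):.3f}")
print(f"odds ratio              = {odds_ratio:.2f}")
```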

**Logistic regression**

$$\text{logit}(P(Y_i = 1)) = \beta_0 + \beta_1 \, I(\text{Women\&Child})$$

where I(Women&Child) = 0 for an adult male and 1 for a woman or child.

Coefficients:

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | -1.36914 | 0.06092 | -22.48 | <2e-16 | \*\*\* |
| I(Women&Child) | 2.20931 | 0.11226 | 19.68 | <2e-16 | \*\*\* |

Signif. codes: 0 '\*\*\*' 0.001 '\*\*' 0.01 '\*' 0.05 '.' 0.1 ' ' 1

AIC: 2338.8
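With a single binary predictor, the logistic regression estimates can be recovered by hand from the 2×2 table: the intercept is the log odds of survival for adult males, the slope is the log odds ratio of women and children relative to adult males, and the standard errors follow from the usual large-sample formula based on reciprocal cell counts. The sketch below is a check under those standard results, not a reproduction of the original fitting code.

```python
import math

# Cell counts from the 2x2 table.
a, b = 338, 1329   # adult males: survived, not survived
c, d = 373, 161    # women & children: survived, not survived

# Intercept: log odds of survival in the reference group (adult males).
beta0 = math.log(a / b)
# Slope: log odds ratio of women & children relative to adult males.
beta1 = math.log(c / d) - beta0

# Large-sample standard errors from reciprocal cell counts.
se_beta0 = math.sqrt(1 / a + 1 / b)
se_beta1 = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

print(f"beta0 = {beta0:.5f}, SE = {se_beta0:.5f}, z = {beta0 / se_beta0:.2f}")
print(f"beta1 = {beta1:.5f}, SE = {se_beta1:.5f}, z = {beta1 / se_beta1:.2f}")
```

These values agree with the reported coefficient table, and exp(−beta1) gives back the odds ratio of about 0.11 computed earlier.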

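$$\text{logit}(P(Y_i = 1)) = \beta_0 + \beta_1\,\text{AGE} + \beta_2\,\text{SEX} + \beta_3\,I(c_1) + \beta_4\,I(c_2) + \beta_5\,I(c_3)$$

where SEX is the GENDER variable coded above and

- c1 = 0 if crew and 1 otherwise;
- c2 = 0 if 2nd class or crew and 1 otherwise;
- c3 = 0 if 3rd class or crew and 1 otherwise.

Crew is therefore the reference level (all three indicators equal 0), while first class has all three indicators equal to 1.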

Coefficients:

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | -0.1724 | 0.2567 | -0.671 | 0.502 | |
| titanic2[, "AGE"] | -1.0615 | 0.2440 | -4.350 | 1.36e-05 | \*\*\* |
| titanic2[, "SEX"] | 2.4201 | 0.1404 | 17.236 | < 2e-16 | \*\*\* |
| titanic2[, c1] | -1.9382 | 0.2535 | -7.645 | 2.09e-14 | \*\*\* |
| titanic2[, c2] | 1.0181 | 0.1960 | 5.194 | 2.05e-07 | \*\*\* |
| titanic2[, c3] | 1.7778 | 0.1716 | 10.362 | < 2e-16 | \*\*\* |

AIC: 2222.1
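The class indicators in this model are not the usual one-dummy-per-level coding, so it can help to spell out which pattern each class receives. The sketch below encodes the definitions of c1, c2 and c3 as stated with the model, using the CLASS codes from the description (1, 2, 3 = passenger classes, 4 = crew). The function name `class_dummies` is hypothetical.

```python
def class_dummies(passenger_class):
    """Return (c1, c2, c3) for a CLASS code: 1, 2, 3 = classes, 4 = crew."""
    c1 = 0 if passenger_class == 4 else 1        # 0 only for crew
    c2 = 0 if passenger_class in (2, 4) else 1   # 0 for 2nd class or crew
    c3 = 0 if passenger_class in (3, 4) else 1   # 0 for 3rd class or crew
    return c1, c2, c3

for code, label in [(1, "1st"), (2, "2nd"), (3, "3rd"), (4, "crew")]:
    print(label, class_dummies(code))
```

The patterns are (1, 1, 1) for 1st class, (1, 0, 1) for 2nd class, (1, 1, 0) for 3rd class and (0, 0, 0) for crew. This is why a comparison between two classes involves sums and differences of the class coefficients rather than a single coefficient.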

**Akaike Information Criterion (AIC)**

A smaller AIC indicates a better fit. The measure is not standardized and has no interpretation for a single model on its own. For two models estimated from the same data set, the model with the smaller AIC is preferred.

Here the second model (AIC = 2222.1) is a better fit than the first (AIC = 2338.8).
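For the first model, the AIC can be checked by hand using AIC = 2k − 2 log L, where k is the number of estimated parameters and L the maximized likelihood. Because that model has one parameter per group, its fitted survival probabilities equal the observed proportions in the two groups. The sketch below assumes a Bernoulli likelihood with one observation per person; the reported value appears consistent with that assumption.

```python
import math

# (survived, not survived) per group, from the 2x2 table.
groups = {"adult_males": (338, 1329), "women_children": (373, 161)}

def log_likelihood(cells):
    """Maximized Bernoulli log-likelihood when each group has its own rate."""
    total = 0.0
    for n_survived, n_died in cells.values():
        p = n_survived / (n_survived + n_died)   # fitted = observed proportion
        total += n_survived * math.log(p) + n_died * math.log(1 - p)
    return total

n_params = 2   # intercept and the women & children indicator
aic = 2 * n_params - 2 * log_likelihood(groups)
print(f"AIC = {aic:.1f}")
```

The same shortcut does not apply to the second model, whose fitted probabilities are not simply the observed cell proportions.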

AGE is coded 0 = child and 1 = adult, and its coefficient is negative (−1.0615). Moving from child to adult therefore decreases the odds of survival or, equivalently, the probability of survival (odds = p/(1 − p), so lower odds imply lower p and higher 1 − p).

So, holding sex and class fixed, an adult has lower odds of survival than a child: OR = exp(−1.0615) ≈ 0.35.

Odds ratio of survival for females to males = exp(2.42) = 11.25.

The odds of surviving the Titanic were 11.25 times higher for females than for males.

Compare odds of survival for 1st class with 2nd class: OR = exp(1.018) = 2.77

Compare odds of survival for 2nd class with 3rd class: OR = exp(1.778 − 1.018) = 2.14

Compare odds of survival for 3rd class passengers with crew: OR = exp(−1.94 + 1.018) = 0.4 (the odds of survival for 3rd class passengers are 60% lower than for crew!)

Why does comparing the odds of survival for crew and 3rd class not make much sense?
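The odds ratios above are exponentials of differences in the linear predictor. A short sketch, assuming the reported coefficients and the c1, c2, c3 coding defined with the model, makes the arithmetic explicit. The dictionary and function names are illustrative.

```python
import math

# Estimates copied from the coefficient table of the second model.
coef = {"AGE": -1.0615, "SEX": 2.4201, "c1": -1.9382, "c2": 1.0181, "c3": 1.7778}

# Class contribution to the log odds: sum of coefficients whose indicator is 1.
class_logit = {
    "1st": coef["c1"] + coef["c2"] + coef["c3"],
    "2nd": coef["c1"] + coef["c3"],
    "3rd": coef["c1"] + coef["c2"],
    "crew": 0.0,
}

def class_odds_ratio(group, reference):
    """Odds ratio of survival for `group` relative to `reference`."""
    return math.exp(class_logit[group] - class_logit[reference])

or_adult_vs_child = math.exp(coef["AGE"])
or_female_vs_male = math.exp(coef["SEX"])

print(f"adult vs child : {or_adult_vs_child:.2f}")
print(f"female vs male : {or_female_vs_male:.2f}")
print(f"1st vs 2nd     : {class_odds_ratio('1st', '2nd'):.2f}")
print(f"2nd vs 3rd     : {class_odds_ratio('2nd', '3rd'):.2f}")
print(f"3rd vs crew    : {class_odds_ratio('3rd', 'crew'):.2f}")
```

Each ratio is adjusted for the other variables in the model, so it compares people of the same age group and sex.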

Chi-square test of independence between class and sex: Chisq = 349.9, df = 3, p-value = 1.557e-75

Reject H0 (class and sex are independent): class and sex are associated.
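The chi-square statistic can be recomputed from a class-by-sex table of counts. The case study does not list these counts, so the table below is an assumption: it uses the sex-by-class breakdown from the commonly distributed 2201-record Titanic data, which agrees with the margins reported in the raw counts (1731 males, 470 females, and 325 / 285 / 706 / 885 per class). Only the statistic and degrees of freedom are computed; the p-value would need a chi-square distribution function, which is left out to keep the sketch dependency-free.

```python
# ASSUMED counts (male, female) per class; not stated in the case study,
# but consistent with the reported margins.
counts = {"1st": (180, 145), "2nd": (179, 106), "3rd": (510, 196), "crew": (862, 23)}

male_total = sum(m for m, _ in counts.values())
female_total = sum(f for _, f in counts.values())
n = male_total + female_total

chi2 = 0.0
for male, female in counts.values():
    row_total = male + female
    for observed, col_total in ((male, male_total), (female, female_total)):
        expected = row_total * col_total / n   # expected count under independence
        chi2 += (observed - expected) ** 2 / expected

df = (len(counts) - 1) * (2 - 1)   # (rows - 1) * (columns - 1)
print(f"Chisq = {chi2:.1f}, df = {df}")
```

The largest contributions come from the crew row, where far fewer women are observed than independence would predict.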
